More on masak’s p5

So, I left the benchmark code running with a couple more strings last night before I went to bed. Here are the results I got:

taaatatattgtttcatagtcgtaccacccaacacattgtcctcacg
cataaccggcggctcgtggctttctgtagaccgaatcttcgctgtttgctctg
p5-moritz.pl: 5.3499312
p5-fox.pl: 5.5914252
p5-colomon.pl: 9.0103538
p5-util.pl: 12.8984396
p5-matthias.pl: 26.1100836
There were nobles, who made war against each other; there was the king,
who made war against the cardinal; there was Spain, which made war against the king.
p5-fox.pl: 9.0724477
p5-moritz.pl: 10.4737434
p5-colomon.pl: 12.4736412
p5-matthias.pl: 25.980024
p5-util.pl: 32.2906812

(That’s the same test as the previous post, but I re-ran it again.)

In those times panics were common, and few days passed without some city or other registering in its archives an event of this kind.
There were nobles, who made war against each other; there was the king, who made war against the cardinal; there was Spain, which made war against the king.
p5-colomon.pl: 20.1296963
p5-fox.pl: 29.2432603
p5-moritz.pl: 33.6600319
p5-util.pl: 119.7950674
p5-matthias.pl: 139.7882468

If people have suggestions for more strings to run, I’d be happy to test them, but bear in mind that the benchmark script runs each test 10 times and averages the results, so it may take some time to get results.

UPDATE: I’ve added three more tests. I also added a new and somewhat novel routine from mberends which wasn’t part of masak’s competition. First, mberends’s results on the above tests:

dna, p5-mberends.pl: 25.846477
dumas-1, p5-mberends.pl: 52.3829865
dumas-2, p5-mberends.pl: 208.6451586
accgccaaccggaaagaatgtcctcccaccacaaatgtacgctcgcatggcggttgtcgagtctatgtcggttgcgctatactacatgataatggacgcc
ttacttacccaaagtagaaataagctcgtctttgagaaccgtggactggtactacctatttttagtcaaactcatgactcgcgcctagcccacatacaat
p5-moritz.pl: 15.4340812
p5-colomon.pl,: 16.0729917
p5-fox.pl: 19.6679714
p5-util.pl: 49.7325891
p5-mberends.pl: 107.7859086
p5-matthias.pl: 174.7841558
accgccaaccggaaagaatgtcctcccaccacaaatgtacgctcgcatggcggttgtcgagtctatgtcggttgcgctatactacatgataatggacgccttacttacccaaagtagaaataagctcgtctttgagaaccgtggactggtactacctatttttagtcaaactcatgactcgcgcctagcccacatacaat
tttgtccctggccacgacgttctactatatgttaatgaaacgtaaggaattgcgttggccaagaaacgtccttttcacagatacccgtcgtacctgattaccgctgtagggcgctttttccggctggggcgcgcgtgtctgttggccgggccctacgtaggcctataacggaaagatttgtaccaaattctactacgagg
p5-colomon.pl: 40.274496
p5-moritz.pl: 64.2675568
p5-fox.pl: 88.9493287

(Apologies that I didn’t time all the codes for that one, but it was very obvious the others were much slower, and I needed the Linux box for $work.)

In those times panics were common, and few days passed without some city or other registering in its archives an event of this kind. There were nobles, who made war against each other; there was the king, who made war against the cardinal; there was Spain, which made war against the king.
Then, in addition to these concealed or public, secret or open wars, there were robbers, mendicants, Huguenots, wolves, and scoundrels, who made war upon everybody. The citizens always took up arms readily against thieves, wolves or scoundrels, often against nobles or Huguenots, sometimes against the king, but never against cardinal or Spain. It resulted, then, from this habit that on the said first Monday of April, 1625, the citizens, on hearing the clamor, and seeing neither the red-and-yellow standard nor the livery of the Duc de Richelieu, rushed toward the hostel of the Jolly Miller. When arrived there, the cause of the hubbub was apparent to all.
p5-colomon.pl: 167.9458936
p5-fox.pl: 467.1575598
p5-moritz.pl: 858.1724499
p5-util.pl: 2079.24607
p5-mberends.pl: 3054.501836
p5-matthias.pl: 4872.900568

4 Responses to “More on masak’s p5”

  1. Aaron Sherman Says:

    It would be nice if you could link to mberends solution. I’ve got one myself that seems to perform better than any of your examples under certain circumstances, but it’s impossible to tell without running my own trials…

  2. Aaron Sherman Says:

    So, it looks like my solution doesn’t beat colomon’s for all inputs (it does on the very long English text given above), but it does beat everyone else, and it’s radically simpler than colomon’s solution. I’ll post it when I have a chance, but the algorithm is pretty trivial I just keep two lists: known solutions and “active” solutions. An active solution is any sub-string that I’ve encountered in both source strings, but which might only be the root of a longer solution. I walk forward from 0 to whichever string length is larger (i, below) and do 8 things (there’s a lot of bounds checking in my code that I’m ignoring, below):

    0) Store the current position, i, in both strings as an “active solution” of length 0 (an active solution is a datastructure containing an offset in string1, an offset in string2 and a current length).

    1) check to see if string1[i] eq string2[x] where x is set to the last tested string2 position in each active solution plus one, if so increment its length by 1.

    2) Same as above, but testing string2[i] against active solutions

    3) At this point, all active solutions not matched in steps 1 or 2, above are removed from the list of active solutions and added to known solutions.

    4) For all positions in string2 which are saved in the lookup table (see below) under string1[i], store a new, active solution of length 1.

    5) Same as #4, but reversing string1/string2.

    6) Append the index i to the list at lookup1[string1[i]]

    7) as #6, but save to lookup2[string2[i]]

    Once done with the outer loop, return the longest known solution.

    Practical concerns include the fact that this approach could be memory intensive for very large strings.

    • Aaron Sherman Says:

      Since I fixed a small bug (code uploaded here https://github.com/ajs/perl6-scripts ) I’m no longer beating colomon. However, I am still producing very reasonable times, and beating most of the other contenders for most of the inputs. My worst performance is on inputs that have low numbers of symbols (e.g. the genetics examples), and I shine best on English text.

  3. Martin Berends Says:

    Aaron, this is p5-mberends.pl: http://pastebin.com/rK4tz0UE with the disclaimer that it is not being developed to be competitive and is hobbled by poor chr(0) handling in regexes and strings. It would be a helpful yak shave to remove the workarounds (after eliminating the need for them) and then compare performance. I made a Perl 5 version without workarounds and it runs much faster, but that may just be Perl 5.

Leave a reply to Aaron Sherman Cancel reply