So, I left the benchmark code running with a couple more strings last night before I went to bed. Here are the results I got:
taaatatattgtttcatagtcgtaccacccaacacattgtcctcacg cataaccggcggctcgtggctttctgtagaccgaatcttcgctgtttgctctg p5-moritz.pl: 5.3499312 p5-fox.pl: 5.5914252 p5-colomon.pl: 9.0103538 p5-util.pl: 12.8984396 p5-matthias.pl: 26.1100836
There were nobles, who made war against each other; there was the king, who made war against the cardinal; there was Spain, which made war against the king. p5-fox.pl: 9.0724477 p5-moritz.pl: 10.4737434 p5-colomon.pl: 12.4736412 p5-matthias.pl: 25.980024 p5-util.pl: 32.2906812
(That’s the same test as the previous post, but I re-ran it again.)
In those times panics were common, and few days passed without some city or other registering in its archives an event of this kind. There were nobles, who made war against each other; there was the king, who made war against the cardinal; there was Spain, which made war against the king. p5-colomon.pl: 20.1296963 p5-fox.pl: 29.2432603 p5-moritz.pl: 33.6600319 p5-util.pl: 119.7950674 p5-matthias.pl: 139.7882468
If people have suggestions for more strings to run, I’d be happy to test them, but bear in mind that the benchmark script runs each test 10 times and averages the results, so it may take some time to get results.
UPDATE: I’ve added three more tests. I also added a new and somewhat novel routine from mberends which wasn’t part of masak’s competition. First, mberends’s results on the above tests:
dna, p5-mberends.pl: 25.846477 dumas-1, p5-mberends.pl: 52.3829865 dumas-2, p5-mberends.pl: 208.6451586
accgccaaccggaaagaatgtcctcccaccacaaatgtacgctcgcatggcggttgtcgagtctatgtcggttgcgctatactacatgataatggacgcc ttacttacccaaagtagaaataagctcgtctttgagaaccgtggactggtactacctatttttagtcaaactcatgactcgcgcctagcccacatacaat p5-moritz.pl: 15.4340812 p5-colomon.pl,: 16.0729917 p5-fox.pl: 19.6679714 p5-util.pl: 49.7325891 p5-mberends.pl: 107.7859086 p5-matthias.pl: 174.7841558
accgccaaccggaaagaatgtcctcccaccacaaatgtacgctcgcatggcggttgtcgagtctatgtcggttgcgctatactacatgataatggacgccttacttacccaaagtagaaataagctcgtctttgagaaccgtggactggtactacctatttttagtcaaactcatgactcgcgcctagcccacatacaat tttgtccctggccacgacgttctactatatgttaatgaaacgtaaggaattgcgttggccaagaaacgtccttttcacagatacccgtcgtacctgattaccgctgtagggcgctttttccggctggggcgcgcgtgtctgttggccgggccctacgtaggcctataacggaaagatttgtaccaaattctactacgagg p5-colomon.pl: 40.274496 p5-moritz.pl: 64.2675568 p5-fox.pl: 88.9493287
(Apologies that I didn’t time all the codes for that one, but it was very obvious the others were much slower, and I needed the Linux box for $work.)
In those times panics were common, and few days passed without some city or other registering in its archives an event of this kind. There were nobles, who made war against each other; there was the king, who made war against the cardinal; there was Spain, which made war against the king. Then, in addition to these concealed or public, secret or open wars, there were robbers, mendicants, Huguenots, wolves, and scoundrels, who made war upon everybody. The citizens always took up arms readily against thieves, wolves or scoundrels, often against nobles or Huguenots, sometimes against the king, but never against cardinal or Spain. It resulted, then, from this habit that on the said first Monday of April, 1625, the citizens, on hearing the clamor, and seeing neither the red-and-yellow standard nor the livery of the Duc de Richelieu, rushed toward the hostel of the Jolly Miller. When arrived there, the cause of the hubbub was apparent to all. p5-colomon.pl: 167.9458936 p5-fox.pl: 467.1575598 p5-moritz.pl: 858.1724499 p5-util.pl: 2079.24607 p5-mberends.pl: 3054.501836 p5-matthias.pl: 4872.900568
March 16, 2011 at 10:38 pm |
It would be nice if you could link to mberends solution. I’ve got one myself that seems to perform better than any of your examples under certain circumstances, but it’s impossible to tell without running my own trials…
March 16, 2011 at 11:55 pm |
So, it looks like my solution doesn’t beat colomon’s for all inputs (it does on the very long English text given above), but it does beat everyone else, and it’s radically simpler than colomon’s solution. I’ll post it when I have a chance, but the algorithm is pretty trivial I just keep two lists: known solutions and “active” solutions. An active solution is any sub-string that I’ve encountered in both source strings, but which might only be the root of a longer solution. I walk forward from 0 to whichever string length is larger (i, below) and do 8 things (there’s a lot of bounds checking in my code that I’m ignoring, below):
0) Store the current position, i, in both strings as an “active solution” of length 0 (an active solution is a datastructure containing an offset in string1, an offset in string2 and a current length).
1) check to see if string1[i] eq string2[x] where x is set to the last tested string2 position in each active solution plus one, if so increment its length by 1.
2) Same as above, but testing string2[i] against active solutions
3) At this point, all active solutions not matched in steps 1 or 2, above are removed from the list of active solutions and added to known solutions.
4) For all positions in string2 which are saved in the lookup table (see below) under string1[i], store a new, active solution of length 1.
5) Same as #4, but reversing string1/string2.
6) Append the index i to the list at lookup1[string1[i]]
7) as #6, but save to lookup2[string2[i]]
Once done with the outer loop, return the longest known solution.
Practical concerns include the fact that this approach could be memory intensive for very large strings.
March 17, 2011 at 6:07 am |
Since I fixed a small bug (code uploaded here https://github.com/ajs/perl6-scripts ) I’m no longer beating colomon. However, I am still producing very reasonable times, and beating most of the other contenders for most of the inputs. My worst performance is on inputs that have low numbers of symbols (e.g. the genetics examples), and I shine best on English text.
March 17, 2011 at 5:46 am |
Aaron, this is p5-mberends.pl: http://pastebin.com/rK4tz0UE with the disclaimer that it is not being developed to be competitive and is hobbled by poor chr(0) handling in regexes and strings. It would be a helpful yak shave to remove the workarounds (after eliminating the need for them) and then compare performance. I made a Perl 5 version without workarounds and it runs much faster, but that may just be Perl 5.