Timing Text File Reads

I’ve been worried about the performance of .lines when reading text files, due to iterator overhead, so I decided to time it tonight. It is a problem, but it appears not to be the biggest problem!

As usual, I probably did things backwards. I started off by creating a ten-thousand-line text file from Violet Jacob poems and Rakudo’s core.pm. Then I ran it against this:

my $count = 0;
for $*IN.lines {
    $count++;
}

say :$count.perl;

That takes 55s on my MacBook Pro.

The .lines method is just an iterator wrapper around .get, so next I tried a version that calls .get directly:

my $count = 0;
loop {
    $*IN.get // last;
    $count++;
}

say :$count.perl;

That takes 44s on the MBP, so using the iterator adds about 25% to the execution time. That’s bad, but in a certain sense it’s less bad than the fact that even the straight .get version is so incredibly slow.

Next attempt: .slurp. What’s the overhead of doing things line by line, with an autochomp?

$*IN.slurp;

That takes 5.6 seconds. So the line-by-line overhead is terrible AND even the slurp version is crazy slow.
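
For what it’s worth, there’s a middle ground I didn’t time: slurp everything in one go and then split it into lines in memory. This is just a sketch (it leans on Str.lines, which current Rakudo provides), not one of the measured variants above:

my $text  = $*IN.slurp;
my $count = $text.lines.elems;   # split the slurped text into lines in memory
say :$count.perl;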

jnthn coded up two PIR versions for comparison. Version 1 reads the entire file at once in PIR; that took 0.84 seconds on my MBP. Version 2 reads the file line by line in PIR and is drastically faster: 0.03 seconds. (pmichaud++ reported this discrepancy to the Parrot team.)

It looks like .chomp might be the best point of attack; it looks grotesquely inefficient at the moment. But it’s time for bed now….
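
Before I go, here’s a rough idea of the kind of difference I suspect matters. This is purely hypothetical code, not Rakudo’s actual .chomp, and it only worries about plain \n endings:

# Hypothetical sketch, not Rakudo's implementation.
# A chomp that goes through the regex engine pays that cost on every line...
sub chomp-via-regex(Str $s) {
    $s.subst(/\n$/, '')
}

# ...while one that just peeks at the trailing character does constant work.
sub chomp-via-substr(Str $s) {
    $s.ends-with("\n") ?? $s.substr(0, $s.chars - 1) !! $s
}

If the real .chomp is doing anything like the first version on every line, that alone would account for a lot of the line-by-line cost.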

Update: see the next post for reports on how pmichaud’s .chomp optimization (mentioned in the comments) improved the situation.

One Response to “Timing Text File Reads”

  1. Pm Says:

    After updating .chomp to be far more efficient, the time to read a 10,000 line text file went from 75 seconds to 21 seconds on my system.

    Of course, there’s still more we can do to improve efficiency — it just shows that some of our builtins aren’t well-factored yet.

    Pm
