Timing Text File Reads, Part 3

The last two posts suggest there is major overhead in the lazy iterator approach used by .lines. I decided to explore this by rolling my own iterator to read a file. I suspected that gather / take carries a big overhead, so I tried a basic custom iterator first:

class LinesIter is Iterator {
    has $!filehandle;
    has $!value;
    
    method infinite() { False } # or should this be True?
    
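    method reify() {
        unless $!value.defined {
            # Build the Parcel once: one line, followed by the iterator for the rest of the file.
            $!value := pir::new('Parcel');
            my $line = $!filehandle.get;
            if $line.defined {
                pir::push($!value, $line);
                pir::push($!value, LinesIter.new(:filehandle($!filehandle)));
            }
            # At end of file nothing is pushed, so the Parcel stays empty.
        }
        $!value;
    }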
}

That takes 21.7 seconds, very slightly more than the standard gather / take version. So much for the theory that gather / take is inefficient here!

I'm told that the spec requires .lines to be strictly lazy, so that you can mix in calls to .get. I don't know where it says that, and it seems a bit crazy to me. But anyway, by those lights the following potential optimizations are actually illegal, because they break the strict connection between .lines and .get.
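To make the "strict connection" concrete, here is a small sketch in Python rather than Rakudo (it is an analogy, not the Rakudo implementation). The names lazy_lines, batched_lines and interleave are made up for illustration, and fh.readline() stands in for .get. It shows how a read-ahead iterator changes what a direct read from the handle sees.

```python
import io

def lazy_lines(fh):
    """Strictly lazy: read exactly one line from the handle per item."""
    while True:
        line = fh.readline()
        if not line:          # empty string means end of file
            return
        yield line.rstrip("\n")

def batched_lines(fh, batch=3):
    """Read-ahead variant: pull up to `batch` lines before yielding any."""
    while True:
        chunk = []
        for _ in range(batch):
            line = fh.readline()
            if not line:
                break
            chunk.append(line.rstrip("\n"))
        if not chunk:
            return
        yield from chunk

def interleave(make_iter):
    """Take one line via the iterator, one directly, then one via the iterator."""
    fh = io.StringIO("a\nb\nc\nd\ne\n")
    it = make_iter(fh)
    first = next(it)
    direct = fh.readline().rstrip("\n")   # stands in for a direct .get call
    second = next(it)
    return first, direct, second

strict = interleave(lazy_lines)      # ('a', 'b', 'c')
ahead = interleave(batched_lines)    # ('a', 'd', 'b')
print(strict)
print(ahead)
```

With the strictly lazy version the direct read gets the very next line. With read-ahead it skips to whatever the iterator has not yet consumed from the handle, so the lines come out of order.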

Here's one that does three .gets at a time, cutting the number of iterator objects created by two-thirds.

class LinesIter is Iterator {
    has $!filehandle;
    has $!value;
    
    method infinite() { False }
    
    method reify() {
        unless $!value.defined {
            $!value := pir::new('Parcel');
            # Read up to three lines, stopping at the first undefined one (end of file).
            my $line = $!filehandle.get;
            if $line.defined {
                pir::push($!value, $line);
                $line = $!filehandle.get;
                if $line.defined {
                    pir::push($!value, $line);
                    $line = $!filehandle.get;
                    if $line.defined {
                        pir::push($!value, $line);
                        # Only a full batch of three gets a follow-on iterator.
                        pir::push($!value, LinesIter.new(:filehandle($!filehandle)));
                    }
                }
            }
        }
        $!value;
    }
}

Clocking in at 15.6 seconds -- significantly better than the current .lines implementation, significantly worse than the .get version -- this was actually the best variant I came up with.

I tried upping the count to 8, but that actually ran a touch slower. jnthn also suggested a version that tried to optimize creation of the iterator objects, but it ran significantly slower than the naive version.
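For anyone who wants to play with the batch size without unrolling the nested ifs by hand, here is a rough Python model of the same shape (again an analogy, not Rakudo code). LinesNode, walk and count_nodes are hypothetical names. Each node reifies to up to `batch` lines plus a successor node, and a counter tracks how many nodes get created. It says nothing about timing, only about object counts.

```python
import io

class LinesNode:
    """Models one iterator object: reifies to some lines plus the next node."""
    created = 0  # number of nodes constructed since the last reset

    def __init__(self, fh, batch):
        LinesNode.created += 1
        self.fh = fh
        self.batch = batch
        self.value = None

    def reify(self):
        if self.value is None:
            self.value = []
            for _ in range(self.batch):
                line = self.fh.readline()
                if not line:
                    # End of file: keep what we have, no successor node.
                    return self.value
                self.value.append(line.rstrip("\n"))
            # A full batch was read, so chain on a node for the rest.
            self.value.append(LinesNode(self.fh, self.batch))
        return self.value

def walk(node):
    """Flatten the chain of nodes into a stream of lines."""
    while node is not None:
        items = node.reify()
        node = None
        for item in items:
            if isinstance(item, LinesNode):
                node = item
            else:
                yield item

def count_nodes(text, batch):
    """Return (number of lines read, number of nodes created)."""
    LinesNode.created = 0
    lines = list(walk(LinesNode(io.StringIO(text), batch)))
    return len(lines), LinesNode.created

text = "".join(f"line {i}\n" for i in range(9))
for batch in (1, 3, 8):
    print(batch, count_nodes(text, batch))
```

On nine lines this creates 10 nodes with a batch of 1, 4 with a batch of 3, and 2 with a batch of 8, so the object count keeps falling even though my timings above did not keep improving.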

I'm not really sure where this leaves us...
