Archive for August, 2010

The Series Operator and Memoization

August 28, 2010

Masak recently posted on the series operator solution to Pascal’s Triangle. I’d like to focus on an important property of his final solution that hasn’t been much discussed yet.

But first, let me back up and consider another well-known problem with similar properties: the Fibonacci series. A straightforward recursive solution looks like this in Perl 6:

sub Fibonacci($n) {
    return $n if $n == 0 | 1;
    Fibonacci($n - 1) + Fibonacci($n - 2);
}

This nicely implements the definition, but it has one critical problem — it is incredibly inefficient. (And not just in Rakudo — in any language!) The reason is that calculating Fibonacci(N) requires all the work of calculating Fibonacci(N-1) and Fibonacci(N-2). The number of subroutine calls involved grows as fast as the series does.

The standard recursive technique for dealing with this problem is called memoization. Basically, you cache the values which have already been computed somewhere, so you only need to compute them once. That’s a bit awkward to implement by hand — it roughly doubles the complexity of the routine, I’d say. Luckily, lots of languages have automatic ways to apply memoization. Perl 6 is supposed to, with the “is cached” modifier to a subroutine definition, but Rakudo doesn’t have it implemented yet.
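To see what I mean about doubling the complexity, here is a sketch of what memoizing the recursive version by hand might look like (the %seen cache and its handling are my illustration, not code from anywhere in particular):

```raku
my %seen;
sub Fibonacci($n) {
    return $n if $n == 0 | 1;
    %seen{$n} //= Fibonacci($n - 1) + Fibonacci($n - 2);
}
```

The //= operator only evaluates its right-hand side when the cache entry is undefined, so each value gets computed at most once.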

That’s okay, because there is more than one way to do it, and there is a great way which is available already in Rakudo.

> my @Fibonacci := (0, 1, -> $a, $b { $a + $b } ... *); 1;
1
> say @Fibonacci[5]
5
> say @Fibonacci[10]
55
> say @Fibonacci[30]
832040

(I’ve used parentheses instead of do because that’s how I roll, but otherwise this is exactly the same idea as Masak’s solution. The additional 1; is needed only because I typed this into the REPL, and without it the REPL will try to evaluate the entire infinite series so it can print it out.)

You can’t see it here, of course, but calculating these numbers was very quick. That’s because this solution using the series operator is equivalent to writing a very smart memoization. Infinite lists in Perl 6 are implemented (roughly speaking) with an array of values that have been already calculated, and an iterator which knows how to calculate further values as needed.

So when you say @Fibonacci[5] the first time, the list calls the iterator generated by the series operator to get the first six values, storing them in @Fibonacci[0] through @Fibonacci[5]. (Those values are generated in a straightforward iterative way, so it is very efficient.) When you then say @Fibonacci[10] those first six values are still there, and only the next five values must be calculated. If at that point you said @Fibonacci[8], that is just a simple array lookup of the already calculated value!

Think about it: it would take a very smart automatic memoizer to figure out that the function would only ever be called with a non-negative integer $n, so that an array could be used to efficiently cache the results. Using the series operator this way gets you that kind of performance automatically, in Rakudo today.

So it’s a double win. Using the series operator is not only the most succinct way to express this series in Perl 6. It’s also an extremely efficient way of calculating the series.

Quick ABC fixes

August 26, 2010

SF’s ABC project continues to be dead in the water, but that hasn’t stopped me from needing some quick ABC scripts. I’m preparing a mini-book of tunes for an Irish Traditional Dance Music workshop, and needed a couple quick programming tricks to make my life easier.

First, a file of ABC tunes should have a unique "X:" index field for each tune. I'd gathered these tunes from around the Internet; correspondingly, about half had "X: 1", and the rest had some random number. This was easily fixed with a simple Perl 6 script:

my $x = 1;
for $*IN.lines {
    when /^X\:/ { say "X: { $x++ }"; }
    say $_;
}

The only thing at all odd here is that I’ve used when in a for loop. It just smartmatches the topic ($_) against the regex there, and if it’s a match, executes the code that follows and then skips to the next iteration of the loop.
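Spelled out without when, the loop is equivalent to this (same behavior, just more verbose):

```raku
my $x = 1;
for $*IN.lines {
    if $_ ~~ /^X\:/ { say "X: { $x++ }"; next; }  # what `when` does for us
    say $_;
}
```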

The trickier thing I needed to do was transpose a couple of tunes, because the versions I found on the Internet were in the wrong key. Cutely, this is pretty easily done with the .trans method in Perl 6.

for $*IN.lines -> $line {
    # say (~$line).trans("C..GABc..gab" => "FGABc..gab");  # from D to G
    say (~$line).trans("C..GABc..gab" => "D..GABc..gabh"); # from G to A
}

This needs to be applied only to the notes of the tune, and if you use the “G to A” transposition and you have an “h” as a note, you need to switch it to “c'”. (You also need to change the declared key signature in the tune’s header. And accidentals might need work. This is definitely crude ABC-handling code, but it suited my purposes perfectly last night.)
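The h-to-c' fixup can be chained right onto the .trans call with .subst (this assumes, as held for my tunes, that "h" never appears in a line except as this placeholder):

```raku
for $*IN.lines -> $line {
    # transpose G to A, then rewrite the placeholder h as c'
    say (~$line).trans("C..GABc..gab" => "D..GABc..gabh").subst('h', "c'", :g);
}
```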

There is one big gotcha here: why (~$line) instead of just $line? Well, internally .lines calls .chomp, and apparently .chomp no longer returns a Rakudo Str, which seems to mean some other .trans method gets dispatched instead, causing no end of problems. Stringifying with ~ forces the value back to a plain Str first.

Anyway, at some point I intend to try to get the real ABC module working in Rakudo (if no one else beats me to it). In the meantime, I’ve got to survive a weekend’s worth of folk festival first.

Timing Text File Reads, Part 3

August 21, 2010

So, the last two posts suggest there is major overhead to using the lazy iterator approach with .lines. I decided to explore this by rolling my own iterator to read a file. First, I suspected that gather / take had a big overhead, so I tried a basic custom iterator:

class LinesIter is Iterator {
    has $!filehandle;
    has $!value;
    
    method infinite() { False } # or should this be True?
    
    method reify() {
        unless $!value.defined {
            $!value := pir::new('Parcel');
            my $line = $!filehandle.get;
            if $line.defined {
                pir::push($!value, $line);
                pir::push($!value, LinesIter.new(:filehandle($!filehandle)));
            }
        }
        $!value;
    }
}

That takes 21.7 seconds, very slightly more than the standard gather / take version. So much for the theory that gather / take is inefficient here!

I’m told that the spec requires that .lines be strictly lazy, so that you can mix in calls to .get. I don’t know where that’s specified, and it seems a bit crazy to me. But by those lights, the following potential optimizations are actually illegal, because they break the strict connection between .lines and .get.
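As I understand it, what strict laziness buys you is that a hypothetical mix like this is guaranteed to work: reading one line through the lazy list must not read ahead and swallow the line that .get expects.

```raku
my @lines = $*IN.lines;  # lazy: nothing has been read yet
say @lines[0];           # reads exactly one line from the handle
say $*IN.get;            # must return the second line, not the third
```

The three-at-a-time version below breaks exactly this guarantee, because indexing @lines[0] would pull three lines off the handle.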

Here’s one that does three .gets at a time, cutting the number of iterator objects created by two-thirds.

class LinesIter is Iterator {
    has $!filehandle;
    has $!value;
    
    method infinite() { False }
    
    method reify() {
        unless $!value.defined {
            $!value := pir::new('Parcel');
            my $line = $!filehandle.get;
            if $line.defined {
                pir::push($!value, $line);
                $line = $!filehandle.get;
                if $line.defined {
                    pir::push($!value, $line);
                    $line = $!filehandle.get;
                    if $line.defined {
                        pir::push($!value, $line);
                        pir::push($!value, LinesIter.new(:filehandle($!filehandle)));
                    }
                }
            }
        }
        $!value;
    }
}

Clocking in at 15.6 seconds — significantly better than the current .lines implementation, significantly worse than the .get version — this was actually the best variant I came up with.

I tried upping the count to 8, but that actually ran a touch slower. And jnthn suggested a version which tried to optimize creation of the iterator objects, but it too ran significantly slower than the naive version.

I’m not really sure where this leaves us…

Timing Text File Reads, Part 2

August 21, 2010

Overnight pmichaud rewrote .chomp in PIR to optimize it, leading to huge improvements in performance. The .lines version of the code has gone from 55s to 21.5s; the .get version from 44s to 7.2s. Weirdly enough, the .slurp version has gone from 5.6s to 9s. (No idea why.)

The good news is that the performance of the .get version is clearly much, much better. (Still with plenty of room for improvement!) The bad news is that the penalty for using the iterator-based version is now much more significant — it adds nearly 200% to the read time!

But then, that’s another potential spot for optimization…

Timing Text File Reads

August 21, 2010

I’ve been worried about the performance of .lines reading text files due to iterator overhead, so I decided to time it tonight. It is a problem, but it appears to not be the biggest problem!

As usual, I probably did things backwards. I started off by creating a ten thousand line text file from Violet Jacob poems and Rakudo’s core.pm. Then I ran it against this:

my $count = 0;
for $*IN.lines {
    $count++;
}

say :$count.perl;

That takes 55s on my MacBook Pro.

The .lines method is just an iterator wrapper around .get, so I tried a version which calls .get directly next.

my $count = 0;
loop {
    $*IN.get // last;
    $count++;
}

say :$count.perl;

That takes 44s on the MBP. So using the iterator adds 25% to the execution time. That’s bad, but in a certain sense, less bad than the straight .get version being so incredibly slow.

Next attempt, .slurp. What’s the overhead of doing things line-by-line, with an autochomp?

$*IN.slurp;

That takes 5.6 seconds. So the line-by-line overhead is terrible AND even the slurp version is crazy slow.

jnthn coded up two PIR versions for comparison. Version 1 reads the entire file in PIR — that took 0.84 seconds on my MBP. Version 2 reads the file line-by-line in PIR, and is drastically faster — 0.03 seconds. (pmichaud++ reported this discrepancy to the Parrot team.)

It looks like .chomp might be the best point of attack. It looks grotesquely inefficient at the moment. But it’s time for bed now….

Update: see the next post for reports on how pmichaud’s .chomp optimization (mentioned in the comments) improved the situation.

Factorial and Memoizing

August 10, 2010

Just saw this post. Tried to comment there, but it failed.

As I understand it, Perl 6 is supposed to have a memoizer built in, something like sub fact($n) is cached. But that definitely isn’t implemented in Rakudo yet.

There is an interesting way to do this that does work in Rakudo:

> my $fact = gather { take 1; my $a = 1; for 1..* { $a *= $_; take $a + 0 } }; say $fact[3];
6
> say $fact[3]
6
> say $fact[4]
24
> say $fact[10]
3628800
> say $fact.munch(20).perl
(1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800, 39916800, 479001600, 6227020800, 87178291200, 1307674368000, 20922789888000, 355687428096000, 6.402373705728e+15, 1.21645100408832e+17)

That defines $fact as a lazy list of the factorials, which you can then access as, say, $fact[10] (for 10!). The list will only contain actual values for the factorials you have asked for (and any others needed to compute those).

Edited to add: Just realized there is (almost?) an easier way of doing this.

> my $fact = [\*] 1..*; say $fact[3]
24
> $fact.munch(20).perl
(1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800, 39916800, 479001600, 6227020800, 87178291200, 1307674368000, 20922789888000, 355687428096000, 6.402373705728e+15, 1.21645100408832e+17, 2.43290200817664e+18)

As you can see, this is a much cleaner way of generating the list, but the indices are now off by one.
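If the off-by-one bothers you, seeding the triangle reduction with an extra 1 should realign the indices, since the running product then starts 1, 1, 2, 6, 24, … (a sketch; I haven’t timed or tested this on current Rakudo, and it relies on the reduce metaop flattening its argument list):

```raku
my $fact = [\*] 1, 1..*;  # 1, 1, 2, 6, 24, ...
say $fact[4];             # 4!
```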

