Parsing STEP

A month ago I started perl6-ISO_10303-21. In plain English, it’s the beginnings of a toolset for working with CAD files in the STEP format. All it does is parse so far.

The STEP (that’s ISO-10303) Part 21 standard provides a BNF (I think) grammar for ASCII STEP files. It was downright trivial to translate it to a Perl 6 grammar — looking at the commits, it took me about two days to get it to read the first file. THen it sat there for a long time until I starting throwing files at it.

Since then I’ve made a couple of simple changes. First, I replace C-style comments in the file with spaces before parsing; the spec’s grammar doesn’t mention them at all.

Second, I discovered three of the basic sample files I was using were actually illegal, with newlines embedded in strings. I fretted about what to do for a while: should I expand the grammar to accept illegal but “standard” files, or keep the grammar strict so that I could use it to test files my own code exports for correctness?

Then I realized I could have my cake and eat it too:

grammar ISO_10303_21::LooseGrammar is ISO_10303_21::Grammar {
    token non_q_char { <special> | <digit> | <space> | <lower> | <upper> | \v }
}

I just defined a new subclass, changing the non_q_char token to also accept vertical whitespaces, too. Now ISO_10303_21::Grammar is strict, and ISO_10303_21::LooseGrammar is loose.

The next obstacle I’ve run into was speed. The 1.3 meg sample file took about five minutes to prase in Rakudo, and 10+ hours (!) in Niecza. Luckily jnthn++ was on the job, and sent me a patch about fifteen minutes after I sent him the profile of the parse today. (Rakudo’s got --profile built in.) That shaved off 1/6th of the running time.

He also pointed out that tokens like

token standard_keyword { <upper> [ <upper> | <digit> ]* }

were unnecessarily creating a Match object for each character in the keyword. Since all we want is the full text of the keyword, this could be changed to

token standard_keyword { <.upper> [ <.upper> | <.digit> ]* }

and it would suppress those Match objects. That reduced the running time by another 10%.

Another few days like this and it will be fast enough to do some useful work for me…

6 Responses to “Parsing STEP”

  1. Jakub Narebski (@jnareb) Says:

    Have you tried using Perl 5’s Marpa?

  2. Bruce Says:

    If the files are ASCII and not UTF-8, be sure to flag it as such. Parsing involves many substr-type operations which are much more expensive on UTF-8 strings than strings of fixed-width characters.

    • colomon Says:

      That’s actually an interesting question. By spec, the files are ASCII. In practice, you frequently see files in…. err, I don’t know the correct technical term, but 8-bit characters with high bit clear meaning ASCII and high bit set meaning ISO 8859 page 1 I think. Being able to read both would be a good thing. I would not be surprised to someday find UTF-8 files as well. Really, only characters in STEP strings can sensibly be non-ASCII, so it’s not too hard to suss out the meaning intended. On the other hand, I have no idea how to usefully handle those differences in Perl 6 input.

  3. skids Says:

    When faced with this same problem in defining an IPv4 address parser that understands masks/wildcards, I settled on a very similar solution. Rather than define a “loose” grammar though, I decided it was best just to document the tokens that users could override in the manpage, because there were several ways to loosen the grammar up and a user might want any given combination of them.

    Note that one can also do the same thing with the companion “actions” class of production rules. You can also override the grammar’s parse method to link the default production rules with the class:

    our grammar Strict {

    method parse($target, :$rule?, :$actions = Actions, *%opt) {
    nextwith($target, :rule($rule), :actions($actions), %opt);
    }

    method emit($target, :$rule = 'TOP', :$emit = Emit, *%opt) {
    # for now just call designated rule.
    $emit."$rule"($target);
    }
    }

    …the second “emit” method is something I added that mirrors the “actions” and produces the correct text to serialize an AST after modifying it. I think that’s a pattern we will see a lot of so it might be worthy of some spec work…

Leave a reply to Bruce Cancel reply