Generate and Execute
A Paradigm for Perl Programs
Wolfgang Laun
Thales Rail Signalling Solutions GesmbH

Introduction

Historic Background

The genesis of this paper was my discovery of a project that became known as "Perl Power Tools". Its declared goal was the reimplementation of all Unix tools, or at least as many as possible. A look at the site hosting the collection as it then was proved somewhat disappointing: not because the more sophisticated utilities were missing - that was to be expected - but because some of the simpler programs weren't good enough. The general tenor seemed to be that Perl is bound to be slower than C, and so even a considerable loss of speed was taken for granted.

I had already ventured to implement sed, the Stream Editor, in Perl, and this eventually made its way into the Perl 5 distribution. So I decided to test the implementation techniques I'd used in sed with a few other Unix utilities. This paper collects some lessons learned from these experiments.

Compile vs. Interpret

The fact that many tasks just take longer if programmed in Perl than in, say, C is correctly attributed to the fact that the Perl compiler does not produce machine code but a byte code, which is interpreted. Combined with the way Perl stores variables, this is bound to create a noticeable execution time penalty.

Now consider a program like sed ([1]). Any sed invocation is controlled by a sed program, written in a somewhat peculiar language, whose basic implied execution cycle repeats the program for each input line. A sed implementation is confronted with the problem of how to represent this program during its execution.

A sed implementation that interprets would use some suitable representation of the sed commands and an interpreter executing this command sequence. It's easy to see that this incurs the typical interpretation overhead.

Alternatively, a sed implementation may choose to compile the sed program by translating its commands into some other language X, and then compile and execute the generated program. The problem with this approach is that there is no guarantee that a compiler for language X is available when sed is executed. And even if there is, writing the generated text to a file or into a pipe, feeding this text to the compiler running as a subprocess, and then executing the compiled program as yet another subprocess creates a noticeable overhead that makes this approach impractical in most cases. Perl, however, is compiler and byte code interpreter rolled into one, with compile time and run time blending into each other. The Perl builtin function eval (with an expression as its argument, not a block) is, of course, the essential function for compiling and running Perl source code from within a Perl program.

Do not confuse the compile-and-execute performed by eval with spawning another perl process (e.g., via system) to run some generated script. There are cases where such a more or less independent process has its merits, but here we're interested in the faster approach using eval.
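
As a minimal illustration of the compile-and-execute idiom (the names and values are invented for this sketch):

my $factor = 3;        # a parameter known at generation time
my $code   = "sub scaled { return $factor * \$_[0] }\n"
           . "print scaled( 14 ), \"\\n\";\n";
eval $code;            # parse, compile and run the generated text
die( "generated code failed: $@" ) if $@;

The backslash in \$_[0] keeps the runtime variable from being interpolated, while $factor is interpolated when the string is built, so the generated text reads "return 3 * $_[0]".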

The Techniques

Preparing to Call eval

To call eval on an expression, you have to build "a little Perl program" (quote from perlfunc.pod). Perl parses, compiles and executes the string resulting from the expression. Here the term "program" is to be understood in a purely syntactic way.

Of course you can compose a complete program and use a single evaluation call to compile and execute, but there is another option that may come in handy. In any case, we should be aware that the code submitted to eval is "executed in the lexical context of the calling program". The more important implications of this are:

  1. The evaluated code may access all the resources of the calling program. This enables you to provide a set of subroutines without having to pass them through the eval call. (See the sketch after this list.)
  2. The evaluated code might affect your code in some unexpected and undesired way.
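
As a toy example of the first point, the eval'd text below calls a subroutine of the generator and even sees one of its lexical variables (all names invented):

sub logLine { print STDERR "gen: @_\n" }    # a resource of the calling program
my $count   = 10;                           # a lexical of the calling program
my $snippet = 'logLine( "bumping count" ); $count += 1;';
eval $snippet;
die( $@ ) if $@;
print "$count\n";                           # prints 11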

Generate Code Piecemeal

If you find it more convenient to create your program in instalments, you can call eval repeatedly. Each piece of code (except, possibly, the last one) would contain one or more subroutine definitions. Structuring the generated code in subroutines is a good idea anyway, as local variables required in the generated code won't interfere with variables used during the translation stage.

Another useful technique for separating the two program levels is the use of a special package for the generated code. Both ideas combine as sketched below.
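
A self-contained sketch with invented names:

my @subChunks = (
    'sub run { print greeting(), "\n" }',
    'sub greeting { return "generated at " . localtime() }',
);
for my $chunk ( @subChunks ){        # each chunk defines one or more subs
    eval "package Gen;\n" . $chunk;  # keep generated names out of our own package
    die( "bad chunk: $@" ) if $@;
}
Gen::run();    # the entry point, defined by one of the chunks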

Generate a Full Perl Program

This is the approach used in the implementation of sed. The code generator may operate in one of two modes:

  1. Generate the program and write it to standard output (and terminate).
  2. Generate the program and execute it.

The first alternative is useful if the generated program is likely to be kept for repeated execution. (With Perl's sed, people were also interested in how some sed command or pattern could be implemented in Perl.) If, however, you have a set of subroutines that are to be called by the generated code, they'll have to be included in the generated text.
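
A sketch of how the two modes might be wired up ($printOnly is an invented option variable; the real sed differs in detail):

my $program = join( '', @Code );
if( $printOnly ){    # mode 1: write the program to standard output
    print $program;
} else {             # mode 2: compile and execute right away
    eval $program;
    die( $@ ) if $@;
}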

Preparing Perl Code

Variations in the generated code that depend on the input program will require not only conditional statements for selecting the next piece of object (Perl) code but also the insertion of parameter values taken from the input statement.

Constant chunks can be added to the generated code by using a Perl here-document with a quoted delimiter to avoid interpolating the document text. I use comment markers so that such a section of code stands out from the surrounding text.

#-->>>
$Code .= <<'TheEnd';   # append another chunk
    if( $oldData eq $theData ){
        print "*\n" unless $inSame;
        # ...
    }
    $oldData = $theData;
TheEnd
#--<<<

If you have to interpolate some parameter, you have to escape the sigils of all runtime variables with a backslash. In the example below, the contents of $trailoff are interpolated into the generated text. (It also illustrates the idea of collecting the code chunks in an array, to be joined prior to evaluation.)
#-->>>
push( @Code, <<TheEnd ); # collect chunks in an array
    }
    \$theOffset += \$got;
    $trailoff
}
TheEnd
#--<<<

Writing lots of Perl code in this way is no fun. It's risky, too, because a forgotten backslash may not result in an error the Perl compiler could throw back at you. For sed, a table-driven approach helped to avoid these problems. A hash maps sed command codes to table rows of four entries:

Entry                  Usage
Number of addresses    Check input
Statement type         Handling of command operand
Generator subroutine   Called to generate code
Code to be generated   Passed to generator subroutine

All generator subroutines have the same signature, and they are called with the actual command addresses, the last entry in the table row and other information common to all commands.

The simple entry for dealing with the sed command "g" (which replaces the contents of the pattern space with the hold space) is the one shown below.

$ComTab{g} = [ 2, '', \&Emit, '{ $_ = $Hold };' ];

You can still use here-documents for nice formatting of the code chunks, but the sed implementation avoids interpolation. The entry below implements the sed command "d", which deletes the pattern space (and starts the next cycle).
$ComTab{d} = [ 2, '', \&Emit,  <<'-X-' ];
{ $doPrint = 0;
  goto EOS;
}
-X-

A simple substitution rule built into a generator subroutine permits the insertion of a command parameter. This is not done by interpolation; the Branch generator function simply does s/XXX/$label/ on the code template to insert the actual label argument of the sed command. ("b" and "t" are sed's branching commands.)
$ComTab{b} = [ 2, 'str', \&Branch, '{ goto XXX; }' ];
$ComTab{t} = [ 2, 'str', \&Branch, '{ goto XXX if _t() }' ];
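
A sketch of how such a table might be driven; the real sed code differs in detail, Emit and Branch are simplified stand-ins for the functions named in the table, and addressCode() is an invented helper that turns the command's addresses into a guard expression:

sub Emit {      # plain commands: emit the code template as is
    my( $guard, $template, $operand ) = @_;
    push( @Code, $guard . $template . "\n" );
}
sub Branch {    # "b" and "t": patch the label into the template
    my( $guard, $template, $operand ) = @_;
    $template =~ s/XXX/$operand/;
    push( @Code, $guard . $template . "\n" );
}

my( $maxAddr, $argType, $genSub, $template ) = @{$ComTab{$cmd}};
die( "too many addresses for '$cmd'\n" ) if @addrs > $maxAddr;
$genSub->( addressCode( @addrs ), $template, $operand );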

Managing Variants for the Generated Code

Writing a generator is a little like writing many programs at the same time. Depending on the options and commands that must be represented in the resulting code, the generated program may vary considerably. There may be alternatives for reading the program's input data; a chunk of that input data may have to be processed in one of many possible ways, and sometimes several steps may have to be generated, one after the other.

One way to come to grips with this challenge is to begin your design by writing a program that interprets the input program. Although this is a step in the wrong direction, it is (as so often) easier. Then you analyze this program, separating the decisions that depend on the command and options from those that depend on the data. This shows you what can be done in the generator, and what has to go into the generated code.

We'll illustrate this with a typical sequence of decisions as it is required for handling an invocation of cut ([1]), such as

cut -f3-5

Expressed in pseudocode, the processing logic would be

   while readline( $line )   # read, strip newline
       if $fOption then
           if index( $line, $delim ) < 0
               if not $sOption then
                   printline $line
               next
           @fields = split( $delim, $line )
           @result = ()
           for $index in @indexlist
               if $index < @fields then
                   push( @result, $fields[$index] )
           printline join( $delim, @result )
       else
           ...
It is obvious that $fOption is a loop invariant for the while loop, so that this test can be made at compile time. Also, the code required to handle a line without a field separator does not entirely depend on the current line. The set of index values defining the requested fields is another invariant, but this has to be processed considering the number of fields in the current line. (But can you think of a way of avoiding the loop over the index values?)

How to Avoid Loops

Even though all Perl programmers, sooner or later, come across the builtin functions grep and map, not all of them realize the potential of these powerful functions. (This may be due to some ingrained dislike for having to write a transformation from right to left, where we have been educated to write left-to-right.)

The loop hidden within a grep or map call is much faster than the equivalent explicit for loop because the iteration over the argument list is handled internally by the grep or map operation. Here is a comparison of three ways of filtering an array.

Code                                                           Time
@b = grep($_ % 2, @a)                                          22.08
for(@a){ push(@b, $_) if $_ % 2;}                              29.15
for($i = 0; $i < @a; $i++){ push(@b, $a[$i]) if $a[$i] % 2;}   49.13
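
(Timings like these can be obtained with the standard Benchmark module. A sketch, with an arbitrary array size:)

use Benchmark qw( cmpthese );
my @a = ( 1 .. 10_000 );
cmpthese( -3, {    # run each variant for about 3 CPU seconds
    grep_   => sub { my @b = grep( $_ % 2, @a ) },
    foreach => sub { my @b; for ( @a ){ push( @b, $_ ) if $_ % 2 } },
    cstyle  => sub { my @b;
                     for( my $i = 0; $i < @a; $i++ ){
                         push( @b, $a[$i] ) if $a[$i] % 2;
                     } },
} );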

We'll not discuss whether writing as many loops as possible with grep or map is good programming style. But we're planning to generate code that is as fast as possible, a no-holds-barred enterprise.

Let's reconsider the loop for handling the fields in a cut -f run.

   @fields = split( $delim, $line )
   @result = ()
   for $index in @indexlist
       if $index < @fields then
           push( @result, $fields[$index] )
A straightforward implementation might use for:
for my $index ( @indexlist ){
    push( @result, $fields[$index] ) if $index < @fields;
}
But we want to avoid for loops, and so we use grep to eliminate index values beyond the end of the current field list and map to obtain the field value from the index:
@result = map( $fields[$_], grep( $_ < @fields, @indexlist ) );
So, let's benchmark our gain and celebrate:

Code                                                         Time
for $index (@indexlist){
  push(@result, $fields[$index]) if $index < @fields;
}                                                            18.36
@result = map($fields[$_], grep($_ < @fields, @indexlist))   23.05

Goodness gracious me - what went wrong? We have replaced a for loop with the implicit loops of map and grep - loops, mind the plural. Even though these aren't nested loops in the usual sense, they still create too much overhead. So, let's try a single map:

my @result = map($_ < @fields ? ($fields[$_]):(), @indexlist);
Although this is faster than the map-grep combination, it's still a little slower than the for loop. The cumbersome test of each index value against the length of the current field list, and the necessity of returning a list in every case, appear to slow us down.

But is it really necessary to loop over the index values in @indexlist? We know that this array is a loop invariant, and we know that Perl lets us write something like @a[@b], resulting in a selection of @a according to the index values in @b. We still need a grep to eliminate the undefined "phantom fields" beyond the end of the field list obtained from the current line:

@result = grep( defined($_), @fields[@indexlist] );
This filtering isn't necessary if the field list is long enough. We could test for this once, and use the plain slice expression if there are enough fields. The table below summarizes all variants.

Code                                                          Time
for my $index (@indexlist){
  last if $index >= @fields;
  push(@result, $fields[$index]);
}                                                             18.20
@result = map($_ < @fields ? ($fields[$_]):(), @indexlist);   21.66
@result = map($fields[$_], grep($_ < @fields, @indexlist));   23.05
@result = grep(defined($_), @fields[@indexlist]);             10.96
@result =
  $maxindex < @fields ? @fields[@indexlist]
                      : grep(defined($_), @fields[@indexlist]);
                                                               9.14
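
(The last variant assumes that $maxindex holds the largest requested index, computed once, ahead of the loop; e.g. with the core module List::Util:)

use List::Util qw( max );
my $maxindex = max( @indexlist );   # a loop invariant, computed once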

String Processing

Processing a string character by character is a frequent task. A straightforward loop may be coded like this:

for( my $i = 0; $i < length($s); $i++ ){
    my $c = substr( $s, $i, 1 );
    # ...(do something with $c)
}
Having seen that a for loop iterating over the elements of an array can be beaten by using map or grep, we may very well conjecture that we can be faster here, too. The additional task required for dealing with a string is splitting the string into characters. In addition to substr, there is split, and let's not forget the almost chameleonic unpack.

The Perl implementation of od (the Unix utility for dumping files in octal and other formats, [1]) contains functions (for the options -ta and -tc) that require such a processing of a string: split into characters, and return the joined results of looking up the characters' ordinal values in a table. There is indeed more than one way to do this, so let's see...

Code                                                          Time
$res =
  join( '',
        map( $aTab[ord()&0x7f], split( '', $s ) ) );          22.48
$res =
  join( '',
        map( $aTab[$_&0x7f], unpack( 'C' . length($s), $s ))); 14.41
$res = '';
for( my $i = 0; $i < length($s); $i++ ){
  $res .= $aTab[ord(substr($s, $i, 1))&0x7f];
}                                                             11.92
$res = '';
for my $n ( split('', $s) ){
  $res .= $aTab[ord($n)&0x7f];
}                                                             17.82
$res = '';
for my $n ( unpack('C' . length($s), $s) ){
  $res .= $aTab[$n&0x7f];
}                                                              8.57

It's remarkable that the very Perlish and impressive looking combination of join, map and split comes in last. It seems to be a case of too many cooks spoiling the broth.

You might think that the variant using unpack is faster merely because it produces the ordinal values without the additional ord() function call, but that isn't the whole story. Even separating a string into individual characters and rebuilding this string is faster if the splitting is done with unpack:

# Faster than $res = join('', split('', $s)) 
my $res = '';
for my $n ( unpack('C' . length($s), $s) ){
  $res .= chr($n);
}

Divert Output

The macro processor m4 ([1]) has a builtin macro divert that can be used to create buffered output streams. Adding output to a diversion just adds to an in-memory buffer. Just prior to program termination all accumulated buffers are written to the main (standard) output file.

Such a feature is quite useful for any compiler or text generator as it lets you generate output and redirect it into a diversion. Details are hidden in some "emit" procedure that keeps track of the current diversion. There is a function to write some diversion into the output stream, and all pending diversions are automatically emitted before the program terminates.

For a simple example, imagine some option that requires code both at the top of the main processing loop and before its end. With a diversion feature you can emit code for both ends from within a single if statement.

if( $featureX ){
    emit( $code_for_loop_start );
    divert( 'LoopEnd' );          # divert to "loop end" code
    emit( $code_for_loop_end );
    divert();                     # return to mainline code
}
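
A minimal sketch of such a mechanism (not m4's actual implementation; all names invented):

my %Diversion = ( MAIN => '' );   # name => accumulated text
my @Order     = ( 'MAIN' );       # diversions in order of first use
my $Current   = 'MAIN';

sub divert {                      # switch the target; default: mainline code
    my $name = @_ ? shift : 'MAIN';
    unless( exists( $Diversion{$name} ) ){
        $Diversion{$name} = '';
        push( @Order, $name );
    }
    $Current = $name;
}

sub emit {                        # append to the current diversion
    $Diversion{$Current} .= join( '', @_ );
}

sub undivert {                    # write all pending diversions
    print $Diversion{$_} for @Order;
    %Diversion = ();
    @Order = ();
}

(The real m4 flushes diversions in numerical order; the sketch simply uses the order of first use.)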

Interpolation: Roll Your Own

In the section Preparing Perl Code we have seen a few examples of how Perl code can be generated by writing here-documents and letting Perl interpolate variables into the document's text. The inconvenient part of this is the necessity of quoting all references to variables (scalars, arrays and hashes) that are intended to remain verbatim. Since the number of compile-time variables that should be interpolated is usually small compared to the number of run-time variables, and interpolation is almost always done with scalars, you might consider coding the interpolation mechanism yourself. Basically, all you need to do is run each code snippet about to be emitted through a substitution such as the one shown below. Notice that the pattern also matches names enclosed in #{...}, which is provided with the same intent as Perl's own ${...}.

$snippet =~ s/\#(?: ([[:alpha:]]\w*)
                 | {([[:alpha:]]\w*)})/ lookup($1||$2) /egx;
The example shows that compile-time variables should be marked with a "sigil" that is distinct from anything that might occur in the processed code. (This isn't intended to keep you from generating comments - just make sure that you leave a space after the '#'.)

The important issue with this approach is the way the compile-time variables and the lookup function are implemented. With Perl's own interpolation, all visible variables can participate, and Perl will find a local lexical variable just as easily as a package variable. If you write your own lookup function, you'll have to provide a symbol table as well. The easiest way is to use a separate package (here: ctv) just for these compile-time variables, so that the lookup function simply accesses the stash storing the package variables:

sub lookup {
    my( $name ) = @_;
    no strict 'refs';
    my $res = ${"ctv::$name"};
    return $res if defined( $res );
    die( "no such compile time variable: $name\n" );
}
To avoid warnings about variables being used only once, you may use
no warnings 'once';

Notice that using a package-cum-stash lets you use local on the compile-time variables. By applying it within some block to one of the package variables you create the effect of a local variable, even within a recursive function.
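
Wrapped in a small helper, usage might look like this (interpolate() is an invented name around the substitution shown above):

sub interpolate {
    my( $snippet ) = @_;
    $snippet =~ s/\#(?: ([[:alpha:]]\w*)
                     | {([[:alpha:]]\w*)})/ lookup($1||$2) /egx;
    return $snippet;
}

$ctv::label = 'EOS';                  # set a compile-time variable
$Code .= interpolate( <<'TheEnd' );   # quoted delimiter: Perl interpolates nothing
{ $doPrint = 0;
  goto #label;
}
TheEnd

Here #label is replaced by EOS, while the runtime variable $doPrint passes through untouched - no backslashes required.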

Simple Pitfalls

There are a few places where literally any text might have to be generated, according to some unconstrained user input.

One such place is a string literal. We all know that surrounding a string with apostrophes ("'") requires only the apostrophe and the backslash ("\") to be escaped with a backslash. To generate code containing a string literal, one might come up with a simple function like this:

sub asString {
    my( $value ) = @_;
    $value =~ s/(['\\])/\\$1/g;
    return "'" . $value . "'";
}
Control characters won't be a problem, although the generated text will not render them in a legible way. To improve on this, you might split your string literal into concatenated sections, with printable substrings quoted with apostrophes and non-printable substrings quoted with quotes ('"') and the characters therein written in hexadecimal notation. (Perl performs the concatenation once, at compile time.)
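
A sketch of such a two-way quoting routine (the name and the exact split into runs are illustrative):

# Render a string as concatenated literal sections: printable runs
# in single quotes, everything else in double quotes as \x escapes.
sub asLiteral {
    my( $value ) = @_;
    my @parts;
    for my $run ( $value =~ /([ -~]+|[^ -~]+)/g ){
        if( $run =~ /^[ -~]/ ){           # printable ASCII
            $run =~ s/(['\\])/\\$1/g;
            push( @parts, "'$run'" );
        } else {                          # control or non-ASCII characters
            push( @parts, '"' . join( '',
                map( sprintf( '\x{%x}', ord ), split( //, $run ) ) ) . '"' );
        }
    }
    return @parts ? join( ' . ', @parts ) : "''";
}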

Using the builtin function quotemeta with quotes as delimiters is another option, but this will also leave the control characters themselves as they are, and you may get oodles of backslashes.

And don't forget that the format strings of printf and sprintf have yet another metacharacter, '%', that requires special treatment.

Guarding Against the Improbable

Below is a subroutine from sed. It modifies a line (in $$codref) that contains the start of a here-document, replacing "TheEnd" with a safe delimiter, and appends the same delimiter to the document text (in $$argref). Notice the while loop that makes sure that the delimiter does not occur within the here-document; it relies on Perl's magical string increment, which turns 'EOD000' into 'EOD001', and so on.

sub safeHere($$){
    my( $codref, $argref ) = @_;
    my $eod = 'EOD000';
    while( $$argref =~ /^$eod$/m ){
        $eod++;
    }
    $$codref =~ s/TheEnd/$eod/e;
    $$argref .= "$eod\n"; 
}

Problems

Not Impossible, But...

While I was doing my stunt in connection with the "Perl Power Tools" project, I came across a few requirements in the POSIX specifications that posed quite difficult problems. They are mainly due to the fact that Unix utilities may make use of libraries providing just the functions they need, whereas in Perl you have to make do with the features Perl provides. This section discusses three problems that result from such differences.

Regular Expressions

By definition, POSIX utilities use POSIX Basic or Extended Regular Expressions; typical examples are sed, m4 and lex ([1]). Perl's regular expressions are not only much more powerful, but also different in more or less subtle ways. This means that you can express a POSIX RE by an equivalent Perl RE, but you have to parse the POSIX RE, symbol by symbol, and compile an equivalent Perl RE.

Since the Perl implementation of sed is intended to be an exact replica of the Unix implementation, there was no way of avoiding this arduous task. But would a Perl implementation of sed have been less useful if it simply had adopted Perl regular expressions?

It's even worse if you look at the GNU implementations of these utilities. They extend POSIX REs with - admittedly useful - constructs, but this just complicates the issue for a Perl implementation.

Almost, But Not Quite the Same

An extremely subtle difference between POSIX sed and my Perl implementation is due to the way the empty pattern ("//") is interpreted in sed and the way it is implemented in Perl. Both differ from the POSIX definition.

Traditional sed implementations interpret the empty pattern to repeat the last regular expression match, both in an address and in the replacement part of the "s" command. In Perl, if the pattern evaluates to the empty string, the last successfully matched regular expression is used instead. The subtle difference makes it impossible to use Perl's empty RE correctly as a replacement for sed's empty RE. An exact duplication of sed's behaviour is astonishingly difficult to implement, and certainly not worth the effort. The traditional usage of the empty RE is in an "s" command, as in the following example where "today" is replaced by "tomorrow":

   /today/ s//tomorrow/
Since the last successful match is bound to be with /today/, this match is replaced. Only rather contrived examples exhibit the difference in behaviour:
   /append/ a \
   This is appended.
   /insert/ i \
   This is inserted.
   // a \
   Triggered by empty pattern.
Here, /bin/sed inserts the "Triggered" line only for an input line that matches /insert/ since this is the "last match" (although not always a successful one). The Perl implementation of sed emits the "Triggered" line for a match with either pattern.

Pattern Matching on an Input Stream

This problem arose during my Perl rewrite of lex, the Unix lexical analyzer generator. The generated code is supposed to match subsequent sections of its input stream. Regular expressions may, of course, match across any number of input lines. So, how do you implement this in Perl, where pattern matching operations act on a variable? It's not possible to extend the string with additional input data if the regex engine reaches the end of the string while still matching and hoping for more. (Not even one of the fantastic Perl 5.10 RE extensions would do the trick - I think.)

My solution was to let the generated program read the entire input before trying to match any patterns. This has certain disadvantages, but implementing a regex engine that can pull in more input while pattern matching is in progress would require hacking Perl's regex engine or something equally demanding, and that, admittedly, scared me off.

Results and Conclusion

Perl Is Slower, Except...

The table below compares the Perl implementation of od available on CPAN (ppt-0.12), my Perl implementation and Linux' /usr/bin/od. (The times were obtained with /usr/bin/time, from a run processing a 3MB file with approximately 110000 lines, with output diverted to /dev/null.)

Format       od/CPAN   od/Perl   /usr/bin/od
-ta             9.21      4.27      1.21
-tx2            3.28      2.43      0.48
-tdS            3.48      2.55      0.54
-tf4            4.48      3.60      1.23
-tax2dSf4      17.08      9.14      3.37

It appears that much of the speed difference between C and Perl is due to input and output. A rough estimate of the time required for formatting alone indicates that the relation between the C utility and the faster Perl implementation is only about 1:2.5.

Perl doesn't look so good in a similar comparison of Perl implementations of cut with Linux' /usr/bin/cut.

Options         cut/CPAN   cut/Perl   /usr/bin/cut
-d: -f1,3,5,7      57.55      21.84       0.74
-c1-10,21-30       13.33       4.69       0.65

A look at the C code reveals that it benefits from a byte-oriented input routine that reads up to the next field delimiter or newline, whichever comes first. In Perl, a line-oriented approach is more convenient but, apparently, much slower: processing a line requires an inner loop. (There is also a portability issue lurking around the corner...)

The Perl implementation of sed (distributed with Perl 5) provides a good chance for winning bets against people who think that a Perl implementation is bound to be slower than an equivalent program written in C. The table below shows the execution times for several sed scripts.

Script                        /bin/sed   Perl's sed
Centering lines                  51.19         3.46
Binary to decimal                16.99         1.89
Reverse characters of lines       9.51         7.94
Increment number                  0.71         2.15
Reverse lines in a file          15.51        17.64
-e s/[a-d]/X/g                    2.93         2.04

Lessons Learned

Generating Perl code that is tailored to an algorithm, with all parameters and the decisions resulting from them resolved in advance, is a highly recommended technique if you want to be fast.

Use the Benchmark module. Harmless-looking Perl code can be slow. Look for alternatives.

Mind your loops.

Trying to implement one of the more complex Unix utilities in Perl will make you very well acquainted with that utility, and it will teach you an incredible amount about some of Perl's features. In short, it is a highly recommended experience - if you can spare the time.

References

[1] The Open Group Technical Standard "Commands and Utilities", Issue 4, Version 2. Available from https://www.opengroup.org