The genesis of this paper was my discovery of a project that became known as "Perl Power Tools". Its declared goal was the reimplementation of all Unix tools, or at least as many as possible. A look at the site hosting the collection as it was then was somewhat disappointing - not because all the more sophisticated utilities were missing (that was to be expected), but because some of the simpler programs weren't good enough. The general tenor seemed to be that Perl is bound to be slower than C, and so even a considerable loss of speed was taken for granted.
I had already ventured to implement sed, the Stream Editor, in Perl, and this eventually made its way into the Perl 5 distribution. So I decided to test the implementation techniques I'd used in sed with a few other Unix utilities. This paper collects some lessons learned from these experiments.
The fact that many tasks take longer when programmed in Perl than in, say, C is correctly attributed to the Perl compiler producing not machine code but byte code, which is interpreted. Combined with the way Perl stores variables, this is bound to create a noticeable execution time penalty.
Now consider a program like sed ([1]). Any sed invocation is controlled by a sed program, written in a somewhat peculiar language, whose basic implied execution cycle repeats this program for each input line. A sed implementation is confronted with the problem of how to represent the program during its execution.
A sed implementation that interprets would use some suitable representation of the sed commands and an interpreter loop executing the command sequence. It's easy to see that this incurs the typical overhead of interpretation.
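To make that overhead concrete, here is a minimal sketch of such an interpreter loop; the command representation is made up for illustration and is not taken from any actual sed implementation:

```perl
use strict;
use warnings;

# Hypothetical command table: each entry pairs a code reference with
# its precompiled operands (here: one substitution command).
my @commands = (
    [ sub { ${ $_[0] } =~ s/$_[1][0]/$_[1][1]/g }, [ qr/foo/, 'bar' ] ],
);

while ( my $line = <STDIN> ) {
    # This dispatch loop runs again for every input line - the
    # typical overhead of interpretation.
    for my $cmd (@commands) {
        $cmd->[0]->( \$line, $cmd->[1] );
    }
    print $line;
}
```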
Alternatively, a sed implementation may choose to compile the sed program by translating its commands into some other language X, and then compile and execute the generated program. The problem with this approach is that you have no guarantee that a compiler for language X is available when you execute sed. And even if you have, writing the generated text to a file or into a pipe, feeding this text to the compiler running as a subprocess, and then executing the compiled program as yet another subprocess creates a noticeable overhead that makes this approach impractical in most cases. Perl, however, is compiler and byte code interpreter rolled into one, with compile time and run time blending into each other. The Perl builtin function eval (with an expression as its argument, not a block) is, of course, the essential function for compiling and running Perl source code from within a Perl program.

Do not confuse the compile-and-execute done by eval with a system call that invokes the program perl with some script as input. There are cases where another more or less independent process has its merits, but here we're interested in the faster approach using eval.
eval
To call eval on an expression, you have to build "a little Perl program" (quote from perlfunc.pod). Perl parses, compiles and executes the string resulting from the expression. Here the term "program" is to be understood in a purely syntactic way.
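As a concrete (toy) illustration - the strings here merely stand in for real generated code:

```perl
# Build "a little Perl program" as a string and hand it to eval.
my $defs = 'sub fact { my( $n ) = @_; return $n <= 1 ? 1 : $n * fact( $n - 1 ) }';
my $main = 'print "fact(5) = ", fact(5), "\n";';
eval "$defs\n$main";
die "generated code failed: $@" if $@;
```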
Of course you can compose a complete program and use a single evaluation call to compile and execute it, as in the example above, but there is another option that may come in handy. In any case, we should be aware that the code submitted to eval is "executed in the lexical context of the calling program". The most important implication of this is that lexical variables visible at the point of the eval call are also visible to the evaluated code, so the two program levels can interfere with each other.
If you think that it is more convenient to create your program in instalments, you can call eval repeatedly. Each piece of code (except, possibly, the last one) would contain one or more subroutine definitions. Structuring the generated code in subroutines is a good idea anyway, as the local variables required in the generated code won't interfere with variables used during the translation stage.
Another useful technique for separating the two program levels is to use a special package for the generated code.
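A minimal sketch of both techniques combined - instalments and a package of their own (the package name Gen and the generated pieces are, of course, made up):

```perl
my @instalments = (
    'sub greet { my( $who ) = @_; print "hello, $who\n" }',
    'greet( "world" );',    # the last piece: mainline code
);
for my $piece (@instalments) {
    # Each eval compiles into package Gen, so subroutines and package
    # variables of the generated code cannot collide with ours.
    eval "package Gen;\n$piece";
    die "generated code failed: $@" if $@;
}
```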
This is the approach used in the implementation of sed. The code generator may operate in one of two modes: it either compiles and runs the generated Perl code on the spot, or it writes it out as a complete, standalone Perl script.
Variations in the generated code that depend on the input program will not only require conditional statements to select the next chunk of object (Perl) code, but also the modification of chunks by inserting some parameter value taken from the input statement.
Constant chunks can be added to the generated code by using a Perl here-document with a quoted delimiter to avoid interpolating the document text. I use comment markers so that such a section of code stands out from the surrounding text.
```perl
#-->>>
$Code .= <<'TheEnd';    # append another chunk
    if( $oldData eq $theData ){
        print "*\n" unless $inSame;
        # ...
    }
    $oldData = $theData;
TheEnd
#--<<<
```

If you have to interpolate some parameter, you have to escape all sigils of runtime variables with a backslash. In the example below, the contents of $trailoff are interpolated into the generated text. (It also illustrates the idea of collecting the code chunks in an array, to be joined prior to evaluation.)
```perl
#-->>>
push( @Code, <<TheEnd );    # collect chunks in an array
    }
    \$theOffset += \$got;
    $trailoff
}
TheEnd
#--<<<
```
Writing lots of Perl code in this way is no fun. It's risky, too, because a forgotten backslash may not result in an error the Perl compiler could throw back at you. For sed, a table-driven approach helped to avoid these problems. A hash maps sed command codes to table rows of four entries:
| Entry | Usage |
|---|---|
| Number of addresses | Check input |
| Statement type | Handling of command operand |
| Generator subroutine | Called to generate code |
| Code to be generated | Passed to generator subroutine |
All generator subroutines have the same signature, and they are called with the actual command addresses, the last entry in the table row and other information common to all commands.
The simple entry for dealing with the sed command "g" (which replaces the contents of the pattern space with the hold space) is the one shown below.
```perl
$ComTab{g} = [ 2, '', \&Emit, '{ $_ = $Hold };' ];
```

You can still use here-documents for nice formatting of the code chunks, but the sed implementation avoids interpolation. The entry below implements the sed command "d", which deletes the pattern space (and starts the next cycle).
```perl
$ComTab{d} = [ 2, '', \&Emit, <<'-X-' ];
{ $doPrint = 0;
  goto EOS;
}
-X-
```

A simple substitution rule built into a generator subroutine permits the insertion of a command parameter. This is not done by interpolation; the Branch generator function simply does s/XXX/$label/ on the code template to insert the actual label argument of the sed command. ("b" and "t" are sed's branching operations.)

```perl
$ComTab{b} = [ 2, 'str', \&Branch, '{ goto XXX; }' ];
$ComTab{t} = [ 2, 'str', \&Branch, '{ goto XXX if _t() }' ];
```
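The generator subroutine itself can be as simple as the following sketch; the real Branch in sed takes more arguments (addresses and other information common to all commands), which are omitted here:

```perl
our $Code = '';    # collects the generated program text

# Hedged sketch of a Branch-style generator: paste the label into the
# code template and append the result.
sub Branch {
    my( $label, $template ) = @_;
    ( my $chunk = $template ) =~ s/XXX/$label/;
    $Code .= "$chunk\n";
}

Branch( 'L_top', '{ goto XXX; }' );    # appends "{ goto L_top; }"
```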
Writing a generator is a little like writing many programs at the same time. Depending on the options and commands that must be represented in the resulting code, the generated program may vary considerably. There may be alternatives for reading the program's input data; a chunk of that input data may have to be processed in one of many possible ways, and sometimes several steps may have to be generated, one after the other.
One way to come to grips with this challenge is to begin your design by first writing a program that interprets the input program. Although this is a step in the wrong direction, it is (as so often) easier. Then you analyze this program by separating the decisions that depend on the command and options from those that depend on the data. This shows you what can be done in the generator, and what has to go into the generated code.
We'll illustrate this with a typical sequence of decisions as it is required for handling an invocation of cut ([1]), such as

```
cut -f3-5
```

Expressed in pseudocode, the processing logic would be

```
while readline( $line )                # read, strip newline
    if $fOption then
        if index( $line, $delim ) < 0 then
            if not $sOption then printline $line
            next
        @fields = split( $delim, $line )
        @result = ()
        for $index in @indexlist
            if $index < @fields then push( @result, $fields[$index] )
        printline join( $delim, @result )
    else
        ...
```

It is obvious that $fOption is a loop invariant for the while loop, so that this test can be made at compile time. Also, the code required to handle a line without a field separator does not entirely depend on the current line. The set of index values defining the requested fields is another invariant, but it has to be processed considering the number of fields in the current line. (But can you think of a way of avoiding the loop over the index values?)
Even though all Perl programmers, sooner or later, come across the builtin functions grep and map, not all of them realize the potential of these powerful functions. (This may be due to some ingrained dislike for having to write a transformation from right to left, where we have been educated to write left-to-right.) The loop hidden within a grep or map call is much faster than the equivalent explicit for loop because the iteration over the argument list is handled internally by the grep or map operation. Here is a comparison of three ways of filtering an array.
| Code | Time |
|---|---|
| @b = grep($_ % 2, @a) | 22.08 |
| for(@a){ push(@b, $_) if $_ % 2;} | 29.15 |
| for($i = 0; $i < @a; $i++){ push(@b, $a[$i]) if $a[$i] % 2;} | 49.13 |
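Timings like these can be reproduced with the standard Benchmark module; the sketch below (with arbitrary data size and iteration count) shows one way to set up such a comparison:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

my @a = ( 1 .. 10_000 );
my @b;

# Compare the three filtering variants from the table above.
timethese( 1_000, {
    grep_call  => sub { @b = grep( $_ % 2, @a ) },
    for_list   => sub { @b = (); for (@a) { push( @b, $_ ) if $_ % 2 } },
    for_cstyle => sub {
        @b = ();
        for ( my $i = 0; $i < @a; $i++ ) {
            push( @b, $a[$i] ) if $a[$i] % 2;
        }
    },
} );
```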
We'll not discuss whether writing as many loops as possible with grep or map is good programming style. But we're planning to generate code that is as fast as possible, a no-holds-barred enterprise.
Let's reconsider the loop for handling the fields in a cut -f run.

```
@fields = split( $delim, $line )
@result = ()
for $index in @indexlist
    if $index < @fields then push( @result, $fields[$index] )
```

A straightforward implementation might use for:

```perl
for my $index ( @indexlist ){
    push( @result, $fields[$index] ) if $index < @fields;
}
```

But we want to avoid for loops, and so we use grep to eliminate index values beyond the end of the current field list and map to obtain the field value from the index:
```perl
@result = map( $fields[$_], grep( $_ < @fields, @indexlist ) );
```

So, let's benchmark our gain and celebrate:
| Code | Time |
|---|---|
| for $index (@indexlist){ push(@result, $fields[$index]) if $index < @fields; } | 18.36 |
| @result = map($fields[$_], grep($_ < @fields, @indexlist)) | 23.05 |
Goodness gracious me - what went wrong? We have replaced a for loop with implicit loops of map and grep - exactly: loops, mind the plural. Even though these loops aren't nested loops in the usual sense, they still create too much overhead. So, let's try a single map:
```perl
my @result = map( $_ < @fields ? ( $fields[$_] ) : (), @indexlist );
```

Although this is faster than the map-grep combination, it's still a little slower than the for loop. The cumbersome test of an index value against the length of the current field list and the necessity of returning a list in any case appear to slow us down.
But is it really necessary to loop over the index values in @indexlist? We know that this array is a loop invariant, and we know that Perl lets us write something like @a[@b], resulting in a selection of @a according to the index values in @b. We still need a grep to eliminate the undefined "phantom fields" beyond the end of the field list obtained from the current line:
```perl
@result = grep( defined($_), @fields[@indexlist] );
```

This filtering isn't necessary if the field list is long enough. We could test for this once, and use the plain slice expression if there are enough fields. The table below summarizes all variants.
| Code | Time |
|---|---|
| for my $index (@indexlist){ last if $index >= @fields; push(@result, $fields[$index]); } | 18.20 |
| @result = map($_ < @fields ? ($fields[$_]):(), @indexlist); | 21.66 |
| @result = map($fields[$_], grep($_ < @fields, @indexlist)); | 23.05 |
| @result = grep(defined($_), @fields[@indexlist]); | 10.96 |
| @result = $maxindex < @fields ? @fields[@indexlist] : grep(defined($_), @fields[@indexlist]); | 9.14 |
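In generated code we can go one step further: since @indexlist is fixed by the cut invocation, the generator knows the maximal index and can paste the guarded slice directly into the emitted program. A sketch of the idea (the variable names are illustrative, not those of the actual cut implementation):

```perl
my $Code      = '';
my @indexlist = ( 2, 3, 4 );              # e.g. from -f3-5
my $maxindex  = $indexlist[-1];
my $list      = join( ',', @indexlist );

# Only the cheap per-line guard survives in the generated code; the
# index list and the threshold are inserted as constants.
$Code .= <<TheEnd;
    \@result = $maxindex < \@fields
             ? \@fields[$list]
             : grep( defined(\$_), \@fields[$list] );
TheEnd
```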
Processing a string character by character is a frequent task. A straightforward loop may be coded like this:
```perl
for( my $i = 0; $i < length($s); $i++ ){
    my $c = substr( $s, $i, 1 );
    # ... (do something with $c)
}
```

Having seen that a for loop iterating over the elements of an array can be beaten by using map or grep, we may very well conjecture that we can be faster here, too. The additional task required for dealing with a string is splitting the string into characters. In addition to substr, there is split, and let's not forget the almost chameleonic unpack.
The Perl implementation of od (the Unix utility for dumping files in octal and other formats, [1]) contains functions (for the options -ta and -tc) that require such a processing of a string: split it into characters, and return the joined results of looking up the characters' ordinal values in a table. There is indeed more than one way to do this, so let's see...
| Code | Time |
|---|---|
| $res = join( '', map( $aTab[ord()&0x7f], split( '', $s ) ) ); | 22.48 |
| $res = join( '', map( $aTab[$_&0x7f], unpack( 'C' . length($s), $s ))); | 14.41 |
| $res = ''; for( my $i = 0; $i < length($s); $i++ ){ $res .= $aTab[ord(substr($s, $i, 1))&0x7f]; } | 11.92 |
| $res = ''; for my $n ( split('', $s) ){ $res .= $aTab[ord($n)&0x7f]; } | 17.82 |
| $res = ''; for my $n ( unpack('C' . length($s), $s) ){ $res .= $aTab[$n&0x7f]; } | 8.57 |
It's remarkable that the very Perlish and impressive looking combination of join, map and split comes in last. It seems to be a case of too many cooks spoiling the broth.
You might think that the variant using unpack is faster because it produces the ordinal values without the additional ord() function call, but this isn't true. Even separating a string into individual characters and rebuilding this string is faster if the splitting is done with unpack:
```perl
# Faster than: $res = join( '', split( '', $s ) )
my $res = '';
for my $n ( unpack('C' . length($s), $s) ){
    $res .= chr($n);
}
```
The macro processor m4 ([1]) has a builtin macro divert that can be used to create buffered output streams. Adding output to a diversion just adds to an in-memory buffer. Just prior to program termination, all accumulated buffers are written to the main (standard) output file.
Such a feature is quite useful for any compiler or text generator as it lets you generate output and redirect it into a diversion. Details are hidden in some "emit" procedure that keeps track of the current diversion. There is a function to write some diversion into the output stream, and all pending diversions are automatically emitted before the program terminates.
For a simple example, imagine some option that requires code both at the top of the main processing loop and before its end. With a diversion feature you can emit code for both ends from within a single if statement.
```perl
if( $featureX ){
    emit( $code_for_loop_start );
    divert( LoopEnd );      # divert to "loop end" code
    emit( $code_for_loop_end );
    divert();               # return to mainline code
}
```
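The mechanism behind emit and divert can be implemented with a hash of buffers and a record of their creation order; the following is only a sketch of one possible implementation, not the code of any of the utilities discussed here:

```perl
my %diversion = ( main => '' );
my @order     = ( 'main' );
my $current   = 'main';

# Append text to the current diversion.
sub emit { $diversion{$current} .= $_[0] }

# Switch to a named diversion; without an argument, return to 'main'.
sub divert {
    my $name = @_ ? $_[0] : 'main';
    push( @order, $name ) unless exists $diversion{$name};
    $diversion{$name} = '' unless defined $diversion{$name};
    $current = $name;
}

# Write all buffers, in order of creation, before the program exits.
sub flush_diversions { print $diversion{$_} for @order }
```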
In the section Preparing Perl Code we have seen a few examples of how Perl code can be generated by writing here-documents and letting Perl interpolate variables into the document's text. The inconvenient part of this is the necessity of quoting all references to variables (scalars, arrays and hashes) that are intended to remain verbatim. Since the number of compile-time variables that should be interpolated is usually small compared to the number of run-time variables, and interpolation is almost always done using scalars, you might consider coding the interpolation mechanism yourself. Basically, all you need to do is run the code snippet about to be emitted through a substitution such as the one shown below. Notice that the pattern also matches names enclosed in #{...}, which is provided with the same intent as Perl's own ${...}.
```perl
$snippet =~ s/\#(?: ([[:alpha:]]\w*) | {([[:alpha:]]\w*)})/ lookup($1||$2) /egx;
```

The example shows that compile-time variables should be marked with a "sigil" that is distinct from anything that might occur in the processed code. (This isn't intended to keep you from generating comments - just make sure that you leave a space after the '#'.)
The important issue with this approach is the way the compile-time variables and the lookup function are implemented. With Perl's own interpolation, all visible variables can participate, and Perl will find a local lexical variable just as easily as a package variable. If you write your own lookup function, you'll have to provide a symbol table as well. The easiest way would be to use a separate package (here: ctv) just for these compile-time variables, so that the lookup function simply accesses the stash storing the package variables:
```perl
sub lookup {
    my( $name ) = @_;
    no strict 'refs';
    my $res = ${"ctv::$name"};
    return $res if defined( $res );
    die( "no such compile time variable: $name\n" );
}
```

To avoid warnings about variables being used only once, you may use no warnings 'once';.
Notice that using a package-cum-stash lets you use local on the compile-time variables. By applying it within some block to one of the package variables, you create the effect of a local variable, even within a recursive function.
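A hypothetical usage example, pulling the pieces together (the variable name and the snippet are made up):

```perl
$ctv::label = 'L_top';    # set a compile-time variable

my $snippet = 'goto #label if \$again;';
$snippet =~ s/\#(?: ([[:alpha:]]\w*) | {([[:alpha:]]\w*)})/ lookup($1||$2) /egx;
# $snippet now reads: goto L_top if \$again;

{
    local $ctv::label = 'L_inner';   # temporary value, e.g. while
    # ... recursing into a sub-generator
}                                    # old value restored here
```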
There are a few places where literally any text might have to be generated, according to some unconstrained user input.
One such place is a string literal. Well, we all know that surrounding a string with apostrophes ("'") requires only the apostrophe and the backslash ("\") to be escaped with a backslash. To generate code containing a string literal, one might come up with a simple function like this:
```perl
sub asString {
    my( $value ) = @_;
    $value =~ s/(['\\])/\\$1/g;
    return "'" . $value . "'";
}
```

Control characters won't be a problem, although the generated text will not render them in a legible way. To improve on this, you might split your string literal into concatenated sections, with printable substrings quoted with an apostrophe and non-printable substrings quoted with quotes ('"') and the characters therein written in hexadecimal notation. (Perl performs the concatenation once, at compile time.)
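A sketch of this idea is shown below; it is a possible refinement, not the code actually used in sed:

```perl
sub asLegibleString {
    my( $value ) = @_;
    my @parts;
    # Split into alternating printable and non-printable runs; the
    # capturing parentheses keep the non-printable separators.
    for my $run ( split( /([^\x20-\x7e]+)/, $value ) ){
        next unless length( $run );
        if( $run =~ /^[\x20-\x7e]/ ){
            ( my $quoted = $run ) =~ s/(['\\])/\\$1/g;
            push( @parts, "'$quoted'" );
        } else {
            # Render control characters as "\x.." escapes.
            push( @parts, '"' . join( '',
                map( sprintf( '\x%02x', ord($_) ), split( //, $run ) ) ) . '"' );
        }
    }
    return @parts ? join( ' . ', @parts ) : "''";
}

# asLegibleString("ab\acd") yields: 'ab' . "\x07" . 'cd'
```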
Using the builtin function quotemeta with quotes as delimiters is another option, but this will also leave the control characters themselves as they are, and you may get oodles of backslashes. And don't forget that printf and sprintf have yet another metacharacter ('%') that requires special treatment.
Below is a subroutine from sed. It modifies a line (in $$codref) that contains the start of a here-document, replacing "TheEnd" with a suitable delimiter, and appends the same delimiter to the document text (in $$argref). Notice the while loop that makes sure that the delimiter does not occur within the here-document.
```perl
sub safeHere($$){
    my( $codref, $argref ) = @_;
    my $eod = 'EOD000';
    while( $$argref =~ /^$eod$/m ){
        $eod++;
    }
    $$codref =~ s/TheEnd/$eod/e;
    $$argref .= "$eod\n";
}
```
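For illustration, a hypothetical call might look like this:

```perl
my $code = 'print <<TheEnd;';      # generated line starting a here-doc
my $text = "some document text\n";
safeHere( \$code, \$text );
# $code is now "print <<EOD000;", and $text ends with "EOD000\n".
# Had the text contained a line "EOD000", Perl's magic string
# increment would have bumped the delimiter to "EOD001", and so on.
```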
While I was doing my stunt in connection with the "Perl Power Tools" project I came across a few requirements in the POSIX specifications that posed quite difficult problems. They are mainly due to the fact that Unix utilities may make use of libraries which provide just the functions you need, whereas in Perl you have to make do with the features Perl provides. This section discusses three of these problems that result from such differences.
By definition, POSIX utilities use POSIX Basic or Extended Regular Expressions; typical examples are sed, m4 and lex ([1]). Perl's regular expressions are not only much more powerful, but also differ in more or less subtle ways. This means that you can express a POSIX RE by an equivalent Perl RE, but to do so you have to parse the POSIX RE, symbol by symbol, and compile an equivalent Perl RE.
Since the Perl implementation of sed is intended to be an exact replica of the Unix implementation, there was no way of avoiding this arduous task. But would a Perl implementation of sed have been less useful if it simply had adopted Perl regular expressions?
It's even worse if you look at the GNU implementations of these utilities. They extend POSIX REs with - admittedly useful - functions, but this just complicates the issue for a Perl implementation.
An extremely subtle difference between POSIX sed and my Perl implementation is due to the way the empty pattern ("//") is interpreted in sed and the way it is implemented in Perl. Both differ from the POSIX definition.
Traditional sed implementations interpret the empty pattern as a reference to the last regular expression that was applied, both in an address and in the pattern part of the "s" command. In Perl, if the pattern evaluates to the empty string, the last successfully matched regular expression is used instead. This subtle difference makes it impossible to use Perl's empty RE as an exact replacement for sed's empty RE. A faithful duplication of sed's behaviour is astonishingly difficult to implement, and certainly not worth the effort. The traditional usage of the empty RE is in an "s" command, as in the following example where "today" is replaced by "tomorrow":
```
/today/ s//tomorrow/
```

Since the last successful match is bound to be with /today/, this match is replaced. Only rather contrived examples exhibit the difference in behaviour.
```
/append/ a \
This is appended.
/insert/ i \
This is inserted.
// a \
Triggered by empty pattern.
```

Here, /bin/sed inserts the "Triggered" line only for an input line that matches /insert/, since this is the "last match" (although not always a successful one). The Perl implementation of sed emits the "Triggered" line for a match with either pattern.
This problem arose during my Perl rewrite of lex, the Unix lexical analyzer generator. The generated code is supposed to match subsequent sections of its input stream. Regular expressions may, of course, match across any number of input lines. So how do you implement this in Perl, where pattern matching operations act on a variable? It's not possible to extend the string with some additional input data if the regex engine reaches the end of the string while still matching and hoping for more. (Not even one of the fantastic Perl 5.10 extensions for REs would do the trick - I think.)
My solution was to let the generated program read the entire input before trying to match any patterns. This has certain disadvantages, but implementing a regex engine that can pull in more input while pattern matching is in progress would require hacking Perl's regex engine or something equally demanding, and that, admittedly, scared me off.
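In Perl, reading everything up front is a one-liner; here is a minimal sketch of the resulting matching loop (the token pattern is made up):

```perl
# Slurp the complete input: undef $/ turns off line-oriented reading.
my $input = do { local $/; <STDIN> };

# Patterns may now match across line boundaries; \G with /gc resumes
# each match where the previous one stopped.
while ( $input =~ /\G(\s+|\w+|.)/gcs ) {
    my $token = $1;
    # ... dispatch on $token
}
```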
The table below compares the Perl implementation of od available on CPAN (ppt-0.12), my Perl implementation, and Linux' /usr/bin/od. (The times were obtained with /usr/bin/time, from a run processing a 3MB file with approximately 110000 lines, with output diverted to /dev/null.)
| Format | od/CPAN | od/Perl | /usr/bin/od |
|---|---|---|---|
| -ta | 9.21 | 4.27 | 1.21 |
| -tx2 | 3.28 | 2.43 | 0.48 |
| -tdS | 3.48 | 2.55 | 0.54 |
| -tf4 | 4.48 | 3.60 | 1.23 |
| -tax2dSf4 | 17.08 | 9.14 | 3.37 |
It appears that much of the speed difference between C and Perl is due to input and output. A rough estimate of the time required for formatting alone indicates that the relation between the C utility and the faster Perl implementation is only about 1:2.5.
Perl doesn't look so good in a similar comparison of Perl implementations of cut with Linux' /usr/bin/cut.
| Options | cut/CPAN | cut/Perl | /usr/bin/cut |
|---|---|---|---|
| -d: -f1,3,5,7 | 57.55 | 21.84 | 0.74 |
| -c1-10,21-30 | 13.33 | 4.69 | 0.65 |
The Perl implementation of sed (distributed with Perl 5) provides a good chance of winning bets against people who think that a Perl implementation is bound to be slower than an equivalent program written in C. The table below shows the execution times for several sed scripts.
| Script | /bin/sed | Perl's sed |
|---|---|---|
| Centering lines | 51.19 | 3.46 |
| Binary to decimal | 16.99 | 1.89 |
| Reverse characters of lines | 9.51 | 7.94 |
| Increment number | 0.71 | 2.15 |
| Reverse lines in a file | 15.51 | 17.64 |
| -e s/[a-d]/X/g | 2.93 | 2.04 |
Generating Perl code that is tailored to an algorithm, with all parameters and the decisions resulting from them already in place, is a highly recommended technique if you want to be fast.
Use the Benchmark module. Harmless-looking Perl code can be slow. Try alternatives.
Mind your loops.
Trying to implement one of the more complex Unix utilities in Perl will make you very well acquainted with that utility, and it will teach you an incredible lot about some of Perl's features. In short, it is a highly recommended experience - if you can spare the time.
[1] The Open Group Technical Standard "Commands and Utilities", Issue 4, Version 2. Available from https://www.opengroup.org