The name AWK is made up of the initials of Aho, Weinberger and Kernighan, all of them AT&T Bell Labs employees in 1977, when awk first appeared.
AWK is an amazing tool. It's a UNIX tool, but even WINDOWS addicts know it. Despite the existence of its successor Perl, it has not died out. (Maybe Perl code is a little hard to read? Maybe the attitude "We let you do it in many ways" is not so useful for source code?-)
AWK is one of the utilities occurring in almost any shell script bigger than 100 lines. Whenever the UNIX shell reaches its limits, AWK helps out. I would not say that AWK code is easy to read, but it is not as hard as Perl, and even a bit easier than shell scripts. Its syntax feels like a free form of C without pointers (what a relief!).
Warning: it is an interpreted script language and has no explicitly declared data types!
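What the absence of declared types means in practice can be seen in a minimal sketch: the same value acts as a number or as a string, depending on the operator applied to it. The function and variable names here are made up for illustration.

```shell
# typing_demo, x, sum, joined and never_set are hypothetical names.
typing_demo() {
  awk 'BEGIN {
    x = "3"            # x starts out as a string
    sum = x + 4        # arithmetic forces a numeric interpretation
    joined = x 4       # juxtaposition concatenates, so both sides become strings
    print sum
    print joined
    print (never_set == 0 && never_set == "")   # an unused variable is both 0 and ""
  }'
}
typing_demo
```

This prints 7, then 34, then 1 (true).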
Let's see how many awk variants are installed. Open a terminal prompt on your LINUX or WINDOWS + CYGWIN system, and enter the following command line (without the "$", which is the system's prompt):
$ ls -l /usr/bin/*awk
lrwxrwxrwx 1 root root 21 Jan 17 22:26 /usr/bin/awk -> /etc/alternatives/awk
-rwxr-xr-x 1 root root 538224 Jul 2 2013 /usr/bin/dgawk
-rwxr-xr-x 1 root root 441512 Jul 2 2013 /usr/bin/gawk
-rwxr-xr-x 1 root root 3188 Jul 2 2013 /usr/bin/igawk
-rwxr-xr-x 1 root root 117768 Mar 24 2014 /usr/bin/mawk
lrwxrwxrwx 1 root root 22 Jan 17 22:26 /usr/bin/nawk -> /etc/alternatives/nawk
-rwxr-xr-x 1 root root 445608 Jul 2 2013 /usr/bin/pgawk
You may need to enter ls -la /bin/*awk instead; this depends on your LINUX variant.
When you pipe the ls file-list command through an awk filter, you see this:
$ ls -l /usr/bin/*awk | awk '{print $9, $11}'
/usr/bin/awk /etc/alternatives/awk
/usr/bin/dgawk
/usr/bin/gawk
/usr/bin/igawk
/usr/bin/mawk
/usr/bin/nawk /etc/alternatives/nawk
/usr/bin/pgawk
We used AWK as a column filter to see only columns 9 and 11 (when present). Column 11 holds the target when the file node is a symbolic link.
This is really a short program: {print $9, $11}, don't you think?
What can we learn from that?
That is what most people do with AWK: feed in the lines of a file and convert the column contents into some new shape.
Put {print $9, $11} into a file named NameAndLink.awk (the extension is not obligatory) ...
{ print $9, $11 }
... and then do this:
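A small sketch of such a reshaping step: two columns go in, and a relabelled line with a computed value comes out. The input lines and the function name are made up for illustration.

```shell
# reshape_demo and its two input lines (name, size in bytes) are hypothetical.
reshape_demo() {
  printf 'gawk 441512\nmawk 117768\n' |
    awk '{ print $1 ": " int($2 / 1024) " KB" }'   # name first, size converted to KB
}
reshape_demo
```

This prints "gawk: 431 KB" and "mawk: 115 KB".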
$ ls -l /usr/bin/*awk | awk -f NameAndLink.awk
# same result as above
Within the file you do not need the 'single quotes' any more. For bigger AWK applications, a separate file for the source code is highly recommended.
Another option is to name the appropriate command interpreter in the first line of the file, so that it can be executed as a script:
NameAndLink.awk:
#!/usr/bin/awk -f
{ print $9, $11 }
Note that you now need to set execute permission on it:
$ chmod u+x NameAndLink.awk
$ ls -l /usr/bin/*awk | ./NameAndLink.awk
# same result as above
Here is a skeleton of what most AWK programs look like.
awk '
BEGIN {
print "Starting";
}
/a/ {
print "Got a";
}
/b/ || /c/ {
print "Got b or c, see yourself: " $0
}
/b/ {
print "Got b"
}
{
print "Generally I got " $0;
}
END {
print "Ending"
}
' <file.txt
Assuming we have a file.txt with content
a
b
c
d
we would see the following output:
Starting
Got a
Generally I got a
Got b or c, see yourself: b
Got b
Generally I got b
Got b or c, see yourself: c
Generally I got c
Generally I got d
Ending
What can we learn from that?
Similarities to XSLT and CSS are obvious: it is a pattern-matching language.
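Each rule is a pattern with an action, and either part may be left out. The following sketch shows a pattern without an action (matching lines are simply printed) and the standard next statement, which stops rule processing for the current line. The function names are made up for illustration.

```shell
# Same four-line input as file.txt above, produced inline with printf.
# pattern_only and first_match_wins are hypothetical names.
pattern_only() {
  printf 'a\nb\nc\nd\n' | awk '/b/ || /c/'      # no action: matching lines are printed
}
first_match_wins() {
  printf 'a\nb\nc\nd\n' | awk '
    /a/ { print "Got a"; next }     # next skips the remaining rules for this line
        { print "Other: " $0 }
  '
}
pattern_only
first_match_wins
```

The first function prints b and c; the second prints "Got a" followed by "Other: b", "Other: c" and "Other: d".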
Instead of starting "yet another AWK tutorial", I want to demonstrate the power of it in a little application I needed recently. That application should process a Maven pom.xml file and enrich it with version numbers.
<?xml version="1.0"?>
<project>
<groupId>com.mycompany.app</groupId>
<artifactId>my-module</artifactId>
<version>1.0.0-SNAPSHOT</version>
<parent>
<groupId>com.mycompany.app</groupId>
<artifactId>my-app</artifactId>
<version>1.0-SNAPSHOT</version>
<relativePath>../parent/pom.xml</relativePath>
</parent>
<dependencies>
<dependency>
<groupId>fri.example.test</groupId>
<artifactId>module-one</artifactId>
</dependency>
<dependency>
<artifactId>module-two</artifactId>
<groupId>fri.example.test</groupId>
</dependency>
<dependency>
<artifactId>module-hundred</artifactId>
<groupId>fri.example.test</groupId>
</dependency>
<dependency>
<groupId>fri.example.test</groupId>
<artifactId>module-three</artifactId>
<exclusions>
<exclusion>
<groupId>fri.example.test</groupId>
<artifactId>module-five</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<artifactId>module-four</artifactId>
<groupId>fri.example.test</groupId>
</dependency>
</dependencies>
</project>
There is a parent referenced; let's assume that it holds a dependencyManagement section where versions for modules are defined, and that we processed these into a maven-resolve.txt file by calling mvn dependency:resolve (for example by using another awk script;-).
module-four 1.4
module-two 1.2
module-three 1.3
module-one 1.1
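How such a table could be produced is sketched below. This is an assumption, not the script actually used: it presumes that mvn dependency:resolve prints each dependency as "[INFO]    groupId:artifactId:type:version:scope", which may differ between plugin versions (a classifier, for instance, adds a sixth part). The function name and the sample lines are made up.

```shell
# Assumption: each dependency line looks like
#   [INFO]    groupId:artifactId:type:version:scope
# resolve_to_table is a hypothetical name; the sample input below is invented.
resolve_to_table() {
  # keep only [INFO] lines whose second field splits into exactly five parts
  awk '$1 == "[INFO]" && split($2, gav, ":") == 5 { print gav[2], gav[4] }'
}
printf '%s\n' \
  '[INFO] The following files have been resolved:' \
  '[INFO]    fri.example.test:module-four:jar:1.4:compile' \
  '[INFO]    fri.example.test:module-two:jar:1.2:compile' |
  resolve_to_table
```

Under that assumption, the real call would be something like mvn dependency:resolve | resolve_to_table > maven-resolve.txt.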
We want the associated versions to be put into the module dependency elements like this:
<?xml version="1.0"?>
<project>
<groupId>com.mycompany.app</groupId>
<artifactId>my-module</artifactId>
<version>1.0.0-SNAPSHOT</version>
<parent>
<groupId>com.mycompany.app</groupId>
<artifactId>my-app</artifactId>
<version>1.0-SNAPSHOT</version>
<relativePath>../parent/pom.xml</relativePath>
</parent>
<dependencies>
<dependency>
<groupId>fri.example.test</groupId>
<artifactId>module-one</artifactId>
<version>1.1</version>
</dependency>
<dependency>
<artifactId>module-two</artifactId>
<groupId>fri.example.test</groupId>
<version>1.2</version>
</dependency>
<dependency>
<artifactId>module-hundred</artifactId>
<groupId>fri.example.test</groupId>
</dependency>
<dependency>
<groupId>fri.example.test</groupId>
<artifactId>module-three</artifactId>
<exclusions>
<exclusion>
<groupId>fri.example.test</groupId>
<artifactId>module-five</artifactId>
</exclusion>
</exclusions>
<version>1.3</version>
</dependency>
<dependency>
<artifactId>module-four</artifactId>
<groupId>fri.example.test</groupId>
<version>1.4</version>
</dependency>
</dependencies>
</project>
As you can see, there are some subtleties in the example pom.xml. To confuse the script, there is an exclusion tag containing an artifactId tag. Sometimes the groupId and artifactId tags are swapped. And there is a module-hundred for which we have no version.
We need to read the maven-resolve.txt file at BEGIN. Then we will read the pom.xml file and insert a version tag wherever a matching module occurs. The resulting AWK source is amazingly short. I wrote it as a shell script, to show how shell variables can be integrated into the AWK program.
#!/bin/bash
First we read maven-resolve.txt line by line using the built-in getline command. We can use an input redirection with the getline command, like in a shell script. Mind that the file name of the input redirection needs to be wrapped in "double quotes" inside the AWK program. But the shell variable needs to be outside the AWK program, so it is enclosed in single quotes, which 'splits' the AWK program and exposes $versions to the shell for substitution.
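The quoting trick can be tried in isolation. The sketch below uses a temporary file as a stand-in for the versions file and shows the quote-splitting next to the standard -v option, which hands the value over without cutting the program in two. All names and the file content are made up for illustration.

```shell
# versions, spliced and passed are hypothetical names; the file content is sample data.
versions=$(mktemp)
printf 'module-one 1.1\n' > "$versions"

# Quote-splitting: the awk program is closed right before "$versions" and reopened
# right after it, so the shell puts the file name between the double quotes awk sees.
spliced=$(awk 'BEGIN { while ((getline line < "'"$versions"'") > 0) print line }')

# Alternative: pass the value in with -v and leave the program in one piece.
passed=$(awk -v file="$versions" 'BEGIN { while ((getline line < file) > 0) print line }')

echo "$spliced"
echo "$passed"
rm -f "$versions"
```

Both variants print the same line, "module-one 1.1".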
When reading the file, AWK works as usual by splitting the line and providing $1 - $N. We use this to fill the "associative array" moduleVersions with our module/version information. In AWK you don't need to declare variables, you simply use them. AWK will create them when needed. They're all global.
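Filling such a map can be sketched as follows. The array name moduleVersions follows the text; the temporary file, its content and the use of -v to pass the file name are choices made for this illustration only.

```shell
# versions and lookup are hypothetical names; the file content is sample data.
versions=$(mktemp)
printf 'module-four 1.4\nmodule-two 1.2\n' > "$versions"

lookup=$(awk -v file="$versions" 'BEGIN {
  # getline without a variable refills $0 and re-splits it into $1..$NF
  while ((getline < file) > 0)
    moduleVersions[$1] = $2          # key: module name, value: version
  close(file)
  print moduleVersions["module-two"]
  # the in operator tests for a key without creating it
  print (("module-hundred" in moduleVersions) ? "known" : "unknown")
}')
echo "$lookup"
rm -f "$versions"
```

This prints 1.2 and then unknown.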
So after this we have a map of module/version associations. Now we process every line of pom.xml. Because we need the module name from within the artifactId tags, I decided to match them in the { common braces }, the block without a pattern that runs for every line. The built-in match() function can give us the text within the artifactId tags, because I enclosed that in ( parentheses ). The enclosed text will appear in the third parameter matchArray (this array parameter is a gawk extension). Element 0 of that array holds the whole match and the parenthesized groups are numbered from 1, so I get the name of the module from matchArray[1]. And I query the map with that module name to get its version.
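Where gawk is not available, the same extraction can be sketched with the two-argument match(), which sets the built-in variables RSTART and RLENGTH. This is a portable substitute, not the call described above; the function name is made up, and the sketch assumes that both tags sit on one line.

```shell
# Portable sketch: two-argument match() plus RSTART/RLENGTH instead of the
# third array parameter. artifact_of is a hypothetical name.
artifact_of() {
  awk '{
    if (match($0, /<artifactId>[^<]*<\/artifactId>/)) {
      # skip the 12 characters of <artifactId>, drop the 13 of </artifactId>
      print substr($0, RSTART + 12, RLENGTH - 25)
    }
  }'
}
echo '    <artifactId>module-one</artifactId>' | artifact_of
```

This prints module-one.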
When a line then appears that contains a closing dependency tag, the script checks whether a version exists for the artifactId passed before, and inserts a version tag if so. Finally, every line of the pom.xml gets printed out unchanged by print $0.
The exclusions patterns are there because artifactId tags can also appear within a Maven exclusion. And these exclusions are within dependency elements. Such an artifactId would overwrite the artifact version found just before, and thus the script uses the inExclusion state to avoid this. Without that state, module-three would miss its version.
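The walkthrough above can be condensed into the following sketch. It is reconstructed from the description and is not the original script. The assumptions: the two file names arrive as function parameters (the function name add_versions is made up), the two-argument match() with RSTART and RLENGTH stands in for the three-argument gawk form, and every tag sits on its own line.

```shell
# Sketch only, reconstructed from the prose. add_versions is a hypothetical name.
add_versions() {
  versions=$1   # assumed parameter: the module/version table
  pom=$2        # assumed parameter: the pom.xml to enrich
  awk '
  BEGIN {
    while ((getline < "'"$versions"'") > 0)    # shell substitutes the file name here
      moduleVersions[$1] = $2
  }
  /<exclusions>/   { inExclusion = 1 }
  /<\/exclusions>/ { inExclusion = 0 }
  !inExclusion && match($0, /<artifactId>[^<]*<\/artifactId>/) {
    # remember the version of the artifact just seen (empty if unknown)
    version = moduleVersions[substr($0, RSTART + 12, RLENGTH - 25)]
  }
  /<\/dependency>/ {
    if (version != "")
      print "<version>" version "</version>"
    version = ""
  }
  { print $0 }
  ' "$pom"
}
```

Called as add_versions maven-resolve.txt pom.xml, it writes the enriched pom to standard output. The inserted tag is not indented, which is enough for a demonstration but would need refinement for a nicely formatted pom.xml.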
Please keep in mind that awk does not understand XML, so e.g. an XML comment in the wrong place could break the script. This is just a quick line-reader solution!
That's it. AWK contains a lot more, so use it and enjoy the brevity. I've also seen bigger applications written in AWK, but like all scripting languages it lacks encapsulation, and thus is not suitable for big modular projects.
ɔ⃝ Fritz Ritzberger, 2015-05-27