AWK

AWK (/ɔːk/[4]) is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool.

The AWK language is a data-driven scripting language consisting of a set of actions to be taken against streams of textual data – either run directly on files or used as part of a pipeline – for purposes of extracting or transforming text, such as producing formatted reports.

AWK was created at Bell Labs in the 1970s,[6] and its name is derived from the surnames of its authors: Alfred Aho (author of egrep), Peter Weinberger (who worked on tiny relational databases), and Brian Kernighan.

The acronym is pronounced the same as the name of the bird species auk, which is illustrated on the cover of The AWK Programming Language.

According to Brian Kernighan, one of the goals of AWK was to have a tool that would easily manipulate both numbers and strings.

AWK was also inspired by Marc Rochkind's programming language that was used to search for patterns in input data, and was implemented using yacc.

GNU AWK may be the most widely deployed version[13] because it ships with GNU-based Linux distributions.

Brian Kernighan's nawk (New AWK) source was first released in 1993 without publicity and has been publicly available since the late 1990s; many BSD systems use it to avoid the GPL.

AWK and its predecessor sed share the line-oriented, data-driven paradigm and are particularly suited to writing one-liner programs, thanks to the implicit main loop and current-line variables.
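As a minimal sketch of such a one-liner, the command below prints every input line longer than a given width. The implicit main loop reads each line in turn, and $0 holds the current one. The function name long_lines and the default width of 72 are assumptions made for illustration.

```shell
# Hypothetical helper: prints input lines longer than $1 characters (default 72).
long_lines() {
  # -v passes the width into AWK; a pattern with no action prints the matching line.
  awk -v max="${1:-72}" 'length($0) > max'
}

printf 'short\nthis line is longer\n' | long_lines 10
```

No explicit loop or read statement appears in the program; AWK supplies both.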

The power and terseness of early AWK programs – notably the powerful regular expression handling and conciseness due to implicit variables, which facilitate one-liners – together with the limitations of AWK at the time, were important inspirations for the Perl language (1987).

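An AWK program is a series of pattern-action pairs, each written as a condition followed by an action in braces. The program tests each input record against each of the conditions in turn and executes the action for every condition that is true.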
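The following sketch illustrates that every condition is tested against every record, so a single record can trigger more than one action. The helper name classify_numbers and the sample input are assumptions for illustration.

```shell
# Hypothetical helper: labels each record according to the conditions it satisfies.
classify_numbers() {
  awk '
    $1 > 0      { print $1, "is positive" }   # action runs only when its condition is true
    $1 % 2 == 0 { print $1, "is even" }       # every condition is tested, not just the first match
  '
}

printf '3\n4\n-2\n' | classify_numbers
```

The record 4 satisfies both conditions and therefore produces two lines of output.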

As handy syntactic sugar, /regexp/ without using the tilde operator matches against the current record; this syntax derives from sed, which in turn inherited it from the ed editor, where / is used for searching.

This syntax of using slashes as delimiters for regular expressions was subsequently adopted by Perl and ECMAScript, and is now common.
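The shorthand described above can be checked with a small sketch: a bare /regexp/ pattern and the explicit comparison of $0 with the tilde operator select the same records. The helper names and the regular expression err are assumptions for illustration.

```shell
# Hypothetical helpers: both print the records that contain "err".
match_short() { awk '/err/'; }        # bare /regexp/ is tested against the current record
match_tilde() { awk '$0 ~ /err/'; }   # the same test written out with the tilde operator

printf 'error\nok\nstderr\n' | match_short
```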

AWK commands can include function calls, variable assignments, calculations, or any combination thereof.

Some flavors also support loading dynamically linked libraries, which can provide additional functions.
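A minimal sketch of an action that combines a function call, variable assignments, and a calculation is shown here. It uses only the built-in length function; the helper name line_lengths and the variable names are assumptions for illustration.

```shell
# Hypothetical helper: reports each line's length and a running total.
line_lengths() {
  awk '{
    n = length($0)        # function call and variable assignment
    total = total + n     # calculation on the stored values
    print n, total        # both results printed together
  }'
}

printf 'ab\ncde\n' | line_lengths
```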

The print command can also display the results of calculations and function calls. Output may be sent to a file or through a pipe.

AWK's built-in variables include the field variables $1, $2, $3, and so on ($0 represents the entire record).
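The sketch below puts these pieces together under stated assumptions: the input is taken to have the form "name quantity price" on each line, the helper name order_report is hypothetical, and the output file name is supplied by the caller. It prints a calculated value to a file and a function result through a pipe to sort.

```shell
# Hypothetical helper: input is assumed to be "name quantity price" per line.
# $1 names an output file that receives the per-item totals.
order_report() {
  awk -v out="$1" '{
    total = $2 * $3                 # $2 and $3 are the second and third fields
    print $1, total > out           # calculation result sent to a file
    print toupper($1) | "sort"      # function call result sent through a pipe
  }'
}
```

For example, `printf 'pear 2 3\napple 1 5\n' | order_report totals.txt` would write the totals to totals.txt (a file name chosen here for illustration) and print the upper-cased names in sorted order.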

A user-defined function can have local variables. The names of these are added to the end of the argument list, though values for them should be omitted when calling the function.
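As a sketch of this convention, the function below declares one real parameter followed by two local variables, and the caller passes only the parameter. The names sum_fields and sum_first are assumptions for illustration.

```shell
# Hypothetical helper: sums the fields of each line with a user-defined function.
sum_fields() {
  awk '
    # n is a parameter; i and total are local variables, set apart by extra
    # spaces by convention and left out when the function is called.
    function sum_first(n,    i, total) {
      for (i = 1; i <= n; i++)
        total += $i
      return total + 0
    }
    { print sum_first(NF) }
  '
}

printf '1 2 3\n10 20\n' | sum_fields
```

Because total is local, it starts out empty on every call, so sums do not carry over from one line to the next.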

The variable s is incremented by the numeric value of $NF, the last word on the line as defined by AWK's field separator (by default, whitespace).
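The accumulation described above can be sketched as follows. The helper name sum_last_field and the sample input are assumptions for illustration.

```shell
# Hypothetical helper: adds up the last word of every input line.
sum_last_field() {
  awk '
    { s += $NF }          # add the numeric value of the last field to s
    END { print s + 0 }   # s + 0 forces a number, so empty input prints 0
  '
}

printf 'alice 10\nbob smith 5\n' | sum_last_field
```

The two lines have different numbers of fields, yet $NF picks the last one in each case.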

In a word-frequency program built on associative arrays, the BEGIN block sets the field separator to any sequence of non-alphabetic characters.

The main action then runs for every input line: for every field on the line, we add one to the number of times that word, first converted to lowercase, appears.
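A sketch of such a word-frequency program follows. The helper name word_freq, the check that skips empty fields, and the final sort (added only to make the output order predictable) are assumptions beyond the description above.

```shell
# Hypothetical helper: counts how often each word occurs, ignoring case.
word_freq() {
  awk '
    BEGIN { FS = "[^a-zA-Z]+" }    # fields are runs of letters
    {
      for (i = 1; i <= NF; i++)
        if ($i != "")              # skip empty fields at the start or end of a line
          words[tolower($i)]++
    }
    END {
      for (w in words)             # iteration order is unspecified
        print w, words[w]
    }
  ' | sort
}

printf 'The cat, the dog.\nA cat!\n' | word_freq
```

The array words is indexed by the word itself, which is what makes it associative; no declaration or initialization is needed.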

Finally, this version is written in pure awk, with no help from a shell and no need to know much about the implementation of the awk script (as the variable assignment on the command line requires), but it is a bit lengthy. The BEGIN block is necessary not only to extract the first argument, but also to prevent it from being interpreted as a filename after the BEGIN block ends.

ARGC, the number of arguments, is guaranteed to be at least 1, as ARGV[0] is the name of the command that executed the script, most often the string "awk".

If you explicitly set ARGC to 1 so that there are no arguments, awk will simply quit because it concludes that there are no more input files.
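The argument handling described above can be sketched as follows. The shell wrapper match_first_arg is a hypothetical name used only to make the example easy to call; the AWK program itself does all the work of extracting the first argument, shifting the rest, and naming standard input as "-" when no files remain.

```shell
# Hypothetical helper: prints input lines matching the regular expression given
# as the first argument; any remaining arguments are treated as input files.
match_first_arg() {
  awk '
    BEGIN {
      pattern = ARGV[1]
      for (i = 1; i < ARGC; i++)   # shift the arguments down over the pattern
        ARGV[i] = ARGV[i + 1]
      ARGC--
      if (ARGC == 1) {             # only the pattern was given:
        ARGC = 2                   # name standard input explicitly
        ARGV[1] = "-"
      }
    }
    $0 ~ pattern { print }
  ' "$@"
}

printf 'apple\nbanana\ncherry\n' | match_first_arg 'an'
```

Without the shift in the BEGIN block, awk would try to open a file named after the pattern once the main loop starts.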

On Unix-like operating systems, self-contained AWK scripts can be constructed using the shebang syntax.
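As a minimal sketch, the helper below writes such a script to a file and marks it executable. The helper name write_hello_script is hypothetical, and the interpreter path /usr/bin/awk is an assumption that may need adjusting on a given system.

```shell
# Hypothetical helper: writes a self-contained AWK script to the path in $1.
# The interpreter path /usr/bin/awk is an assumption; adjust it for your system.
write_hello_script() {
  cat > "$1" <<'EOF'
#!/usr/bin/awk -f
BEGIN { print "Hello, world!" }
EOF
  chmod +x "$1"
}
```

After `write_hello_script ./hello.awk`, the script can be run directly as `./hello.awk`. The -f option in the shebang line tells awk to read the program from the script file itself, and the shebang line is ignored by awk because it begins with the comment character.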

The language is described in the book The AWK Programming Language, published in 1988, and its implementation was made available in releases of UNIX System V. To avoid confusion with the incompatible older version, this version was sometimes called "new awk" or nawk.

This implementation was released under a free software license in 1996 and is still maintained by Brian Kernighan.