awk
awk is a POSIX standard tool to do line-by-line manipulation of text. For example, to print the 4th column in all piped-in lines:
$ cat input | awk '{print $4}'
Pre-defined Variables
Records:
awk splits the input into records using the record separator (RS). This defaults to a newline (\n), meaning each line is a record. $0 represents the entire record.
Fields:
- Each record ($0) is split into fields using the field separator (FS).
- The field separator FS defaults to whitespace. It may be changed with awk -F $SEP or with FS="$SEP" during execution (most often in a BEGIN block). In POSIX awk, a single-character FS other than a space is used literally, and a longer FS is treated as an extended regular expression; it is RS that POSIX limits to a single character.
- $N represents the N-th field of each record, with $1 being the first field and so on.
- NF represents the number of fields. This may be used in expressions. For example, the last field is $(NF), the second-to-last is $(NF-1), and so on.
- Any time $0 is modified, the fields are re-calculated. For example, gsub(/"/, ""); operates on $0 since no third parameter is specified, removes double quotes from $0, and re-calculates the fields.
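The field variables described above can be seen in a short run. This is a minimal sketch: the shell function name show_fields and the colon-separated sample input are made up for illustration.

```shell
# Hypothetical helper: split colon-separated records and report on the fields.
show_fields() {
  awk -F: '{
    # NF is the field count; $NF is the last field, $(NF-1) the one before it.
    print NF, $1, $(NF-1), $NF;
    # Modifying $0 re-splits the record, so NF and the fields change.
    gsub(/:/, "::");
    print NF;
  }'
}

printf 'a:b:c:d\n' | show_fields
```

The first line printed is "4 a c d". After gsub doubles every colon, the record is split again and the empty fields between adjacent colons are counted, so the second line is 7.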
Basic capabilities
- Arithmetic functions
- Strings are concatenated by writing them next to each other, separated by whitespace, instead of with an operator. For example, to append a period: mystr = mystr ".";
- There is no concept of NULL, so to unset a variable, just set it to a blank string (this will evaluate to false in an if): myvar = "";
- awk assigns by value instead of by reference, so to duplicate a string, just assign it to a new variable; for example, str2 = str1;
- When doing arithmetic comparisons (e.g. { if ( $1 > 5 ) { print $0; } }), always add 0 to the value to force it to a number and avoid unexpected string comparisons (e.g. { if ( $1 + 0 > 5 ) { print $0; } }).
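To make the last two points concrete, here is a small sketch of concatenation and of the difference that adding 0 makes. The function name compare_demo and the values are illustrative only; substr() is used because it always returns a string.

```shell
# Hypothetical helper: show concatenation and string versus numeric comparison.
compare_demo() {
  awk 'BEGIN {
    s = "done";
    s = s ".";                  # concatenation by juxtaposition
    x = substr("10", 1);        # substr() returns a string, not a number
    as_string = (x > 5);        # string comparison: "10" sorts before "5"
    as_number = (x + 0 > 5);    # adding 0 forces a numeric comparison
    print s, as_string, as_number;
  }'
}

compare_demo
```

This prints "done. 0 1": the same value compares as smaller than 5 when treated as a string and larger once it has been coerced to a number.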
String functions
- String length: length(mystr)
- Split string mystr into array pieces using a split regex: n = split(mystr, pieces, / /); with n being the number of resulting elements in pieces.
  - The resulting array is 1-indexed, so the first piece is pieces[1].
- Find all instances of a regex in a string and replace: gsub(/regex/, "replace", string);
- Return a substring of string starting from a 1-based index position: newstring = substr(string, i);
  - A third parameter may be specified for the maximum substring length.
- Find the 1-based index of the first match of regex in a string, or 0 if not found: i = match(string, regex);
- Trim leading and trailing whitespace: function trimWhitespace(str) { gsub(/^[ \t\n\r]+|[ \t\n\r]+$/, "", str); return str; }
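The string functions above can be combined in one short run. This is only a sketch: string_demo and the sample string are made up, while RLENGTH is the standard variable that match() sets to the length of the match.

```shell
# Hypothetical helper: exercise split(), match(), substr(), and gsub().
string_demo() {
  awk 'BEGIN {
    s = "alpha,beta,gamma";
    n = split(s, pieces, /,/);        # n is 3; pieces[1] is "alpha"
    pos = match(s, /beta/);           # 1-based position of the match, 0 if absent
    word = substr(s, pos, RLENGTH);   # RLENGTH is the match length set by match()
    count = gsub(/a/, "A", s);        # gsub() returns the number of replacements
    print n, pieces[1], pos, word, count, length(s), s;
  }'
}

string_demo
```

The output is "3 alpha 7 beta 5 16 AlphA,betA,gAmmA".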
Arrays
- Associative arrays don't need to be initialized: mymap["key"]="value";
- Array length: function arrayLength(array,    i, l) { l = 0; for (i in array) l++; return l; } (the extra parameters i and l act as local variables)
- Loop through an array: for (key in array) { item=array[key]; }
  - If the array was created from split, looping through may not be in order, so instead do: l=arrayLength(pieces); for (i=1; i<=l; i++) { item=pieces[i]; }
- To clear an array: delete myarray;
- If an array is created from a function such as split, then "indexing into it" starts at 1: split("1,2,3", pieces, /,/); print pieces[1];
- Awk cannot return an array from a function. Instead, use a global variable (and delete the array at the beginning of the function).
- POSIX awk has limited support for multi-dimensional arrays; however, you can add some additional loops:
    function array2d_tokeys(array) {
        delete a2d2k;
        for (key in array) {
            split(key, pieces, SUBSEP);
            if (length(a2d2k[pieces[1]]) == 0) {
                a2d2k[pieces[1]] = pieces[2];
            } else {
                a2d2k[pieces[1]] = a2d2k[pieces[1]] SUBSEP pieces[2];
            }
        }
    }

    function array2d_print_bykeys(original_array, a2d2k_array) {
        for (key in a2d2k_array) {
            split(a2d2k_array[key], pieces, SUBSEP);
            for (piecesKey in pieces) {
                print key "," pieces[piecesKey] " = " original_array[key, pieces[piecesKey]];
            }
        }
    }

    function play() {
        my2darray["key1", "subkey1"] = "val1";
        my2darray["key1", "subkey2"] = "val2";
        my2darray["key2", "subkey1"] = "val3";
        my2darray["key2", "subkey2"] = "val4";
        array2d_tokeys(my2darray);
        # Pass the array that was just populated
        array2d_print_bykeys(my2darray, a2d2k);
    }
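When a flat listing of a two-key array is enough, a single loop that splits each combined index on SUBSEP may do. The sketch below assumes that simpler goal; matrix_demo, grid, combined, and parts are illustrative names.

```shell
# Hypothetical helper: build a two-key array and list it by splitting on SUBSEP.
matrix_demo() {
  awk 'BEGIN {
    grid["key1", "subkey1"] = "val1";
    grid["key1", "subkey2"] = "val2";
    grid["key2", "subkey1"] = "val3";
    for (combined in grid) {
      # Each index is stored as key SUBSEP subkey, so split() recovers both parts.
      split(combined, parts, SUBSEP);
      print parts[1] "," parts[2] " = " grid[combined];
    }
  }' | sort   # for-in order is unspecified, so sort for stable output
}

matrix_demo
```

Unlike the functions above, this does not group subkeys under each key; it only shows how the combined index is stored.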
Tips
POSIX awk gsub doesn't support backreferences other than the entire match (&). However, you can work around this with multiple statements by first inserting a unique marker and then searching for it. For example, in the string "01/01/2020:00:00:00", to replace the first colon with a space:
gsub(/\/....:/, "&@@", $0); gsub(/:@@/, " ", $0); print $0;
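Another way to get the same effect, sketched here as an alternative rather than a replacement for the marker technique, is to locate the match with match() and rebuild the record with substr(). The function name fix_timestamp and the variable colon are made up for this example.

```shell
# Hypothetical helper: replace only the colon that follows the year.
fix_timestamp() {
  awk '{
    if (match($0, /\/....:/)) {
      # The matched text ends at RSTART + RLENGTH - 1, which is the colon.
      colon = RSTART + RLENGTH - 1;
      $0 = substr($0, 1, colon - 1) " " substr($0, colon + 1);
    }
    print;
  }'
}

echo "01/01/2020:00:00:00" | fix_timestamp
```

This prints "01/01/2020 00:00:00" and avoids having to pick a marker string that cannot occur in the input.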
Get current file name (or - for stdin):
FILENAME
Get line number for the current file:
FNR
Run something at the beginning of each file:
FNR == 1 { print; }
Run something at the end of each file:
FNR == 1 {
    if (!firstFNR) {
        firstFNR = 1;
    } else {
        endOfFile(0);
    }
    fname = FILENAME;
}

END {
    endOfFile(1);
}

function endOfFile(lastFile) {
    print("Finished " fname);
    if (lastFile) {
        print("Finished all files");
    }
}
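The same end-of-file pattern can carry per-file state. The following sketch, with made-up names (per_file_summary, finish, seen, lines, first), reports the line count and first line of each file; like the snippet above, it never sees empty files because they produce no records.

```shell
# Hypothetical helper: report the line count and first line of each file.
# Usage: per_file_summary file1 file2 ...
per_file_summary() {
  awk '
    FNR == 1 {
      if (seen) finish();          # a new file started, so close out the previous one
      seen = 1; fname = FILENAME; first = $0;
    }
    { lines = FNR }                # remember the last line number seen in this file
    END { if (seen) finish(); }    # close out the final file
    function finish() { print fname ": " lines " lines, first line: " first; }
  ' "$@"
}
```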
Get line number for all processed lines so far:
NR
Print something to stderr:
print("WARNING: some warning") > "/dev/stderr";
Change the return code:
END { exit 1; }
Skip blank lines:
NF > 0
Execute a shell command based on each line:
{ system("ip route change " $0 " quickack 1"); }
Execute a shell command and read results into a string variable:
cmd = "uname"; cmd | getline os; close(cmd);
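When a command prints more than one line, getline can be called in a loop. A minimal sketch, assuming a made-up command string and the illustrative names collect_output, line, sep, and all:

```shell
# Hypothetical helper: collect every output line of a command into one string.
collect_output() {
  awk 'BEGIN {
    cmd = "echo one; echo two";
    # getline returns 1 per line read, 0 at end of output, -1 on error.
    while ((cmd | getline line) > 0) {
      sep = (n > 0) ? "," : "";
      all = all sep line;
      n++;
    }
    close(cmd);   # lets the same command string be run again later
    print n, all;
  }'
}

collect_output
```

This prints "2 one,two". Comparing the result of getline with 0 matters: a bare while (cmd | getline line) would loop forever if the command failed and getline returned -1.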
Don't process remaining patterns for a line ("The next statement shall cause all further processing of the current input record to be abandoned"):
/pattern1/ {
    # do some processing
    print;
    next; # don't run the catch-all pattern
}

{
    print;
}
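A small runnable variant shows next together with the NF > 0 pattern from above. The function name tag_comments and the sample input are illustrative.

```shell
# Hypothetical helper: tag lines starting with "#" and label everything else as data.
tag_comments() {
  awk '
    /^#/ { print "comment: " $0; next; }  # next skips the rule below
    NF > 0 { print "data: " $0; }         # blank lines match neither rule
  '
}

printf '# note\n\nvalue 1\n' | tag_comments
```

The blank line is dropped, and the comment line is printed once because next stops the second rule from seeing it.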
A common way to send a list of files to awk is with find; for example, find . -type f -print0 | xargs -0 ~/myscript.awk. However, xargs is limited in how many arguments it can pass to the program; if the limit is exceeded, xargs executes the program multiple times, so the awk script cannot store global state for all files or execute a single BEGIN or END block. An alternative is to cat everything into awk (e.g. find . -type f -print0 | xargs -0 cat | ~/myscript.awk), but then FILENAME and FNR are lost. Instead, you can drop xargs and have the awk script modify ARGV dynamically:
"The arguments in ARGV can be modified or added to; ARGC can be altered. As each input file ends, awk shall treat the next non-null element of ARGV, up to the current value of ARGC-1, inclusive, as the name of the next input file. Thus, setting an element of ARGV to null means that it shall not be treated as an input file. The name '-' indicates the standard input. If an argument matches the format of an assignment operand, this argument shall be treated as an assignment rather than a file argument."
So execution goes from:
find . -type f -print0 | xargs -0 ~/checkissues.awk
To:
find . -type f | ~/checkissues.awk
Here's the awk snippet that does the ARGV modification:
# xargs allows a limited number of arguments, but POSIX awk allows
# us to add files to process by adding to ARGV:
# https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html#tag_20_06_13_03
FILENAME == "-" {
    ARGV[ARGC] = $0;
    ARGC += 1;
    # No need to process any of the other patterns for this line:
    next;
}
Concatenate all lines into a single, space-delimited output
awk '{ printf("%s ", $0); } END { printf("\n"); }' $FILE
Parse hex string to decimal
#!/usr/bin/awk -f
# Create hex character => decimal map, e.g. hexchars["A"] = 10, etc.
BEGIN {
    for (i=0; i<16; i++) {
        hexchars[sprintf("%x", i)] = i;
        hexchars[sprintf("%X", i)] = i;
    }
}

# Remove leading and trailing whitespace from str.
function trimWhitespace(str) {
    gsub(/^[ \t\n\r]+|[ \t\n\r]+$/, "", str);
    return str;
}

# Return 1 if str is a number (unknown radix),
# 2 if hex (0x prefix or includes [a-fA-F]),
# and 0 if number not matched by regexes.
# str is trimmed of whitespace.
function isNumber(str) {
    str = trimWhitespace(str);
    if (length(str) > 0) {
        if (str ~ /^[+-]?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?$/ || str ~ /^(0[xX])?[0-9a-fA-F]+$/) {
            if (str ~ /^0[xX]/ || str ~ /[a-fA-F]/) {
                return 2;
            } else {
                return 1;
            }
        }
    }
    return 0;
}

# If str is a hexadecimal number (0x prefix optional), then return its decimal value;
# otherwise, return -1.
# str is trimmed of whitespace.
# The extra parameters (numResult, result, i) act as local variables.
function parseHex(str,    numResult, result, i) {
    numResult = isNumber(str);
    if (numResult == 1 || numResult == 2) {
        str = trimWhitespace(str);
        if (str ~ /^0[xX]/) {
            str = substr(str, 3);
        }
        # Reject signs, decimal points, and exponent signs, which are not hex digits
        if (str !~ /^[0-9a-fA-F]+$/) {
            return -1;
        }
        result = 0;
        for (i=1; i<=length(str); i++) {
            result = (result * 16) + hexchars[substr(str, i, 1)];
        }
        return result;
    }
    return -1;
}

# If str is a decimal number, then return its decimal value;
# otherwise, return -1.
# str is trimmed of whitespace.
function parseDecimal(str) {
    if (isNumber(str) == 1) {
        return trimWhitespace(str) + 0;
    }
    return -1;
}
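For quick one-off conversions, a more compact variant is possible. The sketch below is an alternative to the lookup table above, not a drop-in replacement: it finds each digit's value with index() into a digit string, and the names hex_to_dec, hex2dec, and digit are made up here.

```shell
# Hypothetical helper: convert hex strings (optional 0x prefix) to decimal, one per line.
hex_to_dec() {
  awk '
    function hex2dec(str,    i, digit, result) {
      str = tolower(str);
      sub(/^0x/, "", str);
      if (str !~ /^[0-9a-f]+$/) return -1;
      result = 0;
      for (i = 1; i <= length(str); i++) {
        # index() gives the 1-based position of the digit in the alphabet.
        digit = index("0123456789abcdef", substr(str, i, 1)) - 1;
        result = result * 16 + digit;
      }
      return result;
    }
    { print hex2dec($1); }
  '
}

printf '0xFF\n1a\nxyz\n' | hex_to_dec
```

This prints 255, 26, and -1 on separate lines.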
Calculate a Pearson correlation coefficient between two columns of numbers
#!/usr/bin/awk -f
# usage: pearson.awk file
# Calculate pearson correlation (r) between two columns (defaults to first two columns). This will skip any rows that have non-number values.
# Background:
# https://ocw.mit.edu/resources/res-6-012-introduction-to-probability-spring-2018/part-i-the-fundamentals/the-correlation-coefficient/
# https://ocw.mit.edu/resources/res-6-012-introduction-to-probability-spring-2018/part-i-the-fundamentals/interpreting-the-correlation-coefficient/
#
# Example:
# $ pearson.awk -v x_right=3 -v y_right=4 access_log
#
# Options:
# Column offset of X values starting from the left (1 is the first column, 2 is second, etc.):
# -v x_left=N
# Column offset of X values from the right (0 is the last column, 1 is second-to-last, etc.):
# -v x_right=N
# Column offset of Y values starting from the left (1 is the first column, 2 is second, etc.):
# -v y_left=N
# Column offset of Y values from the right (0 is the last column, 1 is second-to-last, etc.):
# -v y_right=N
# Suppress warnings about lines being skipped:
# -v suppress_skip_warnings=1
# For debugging, print only the X and Y values that parsed as numbers (no statistics):
# -v debug_values=1
BEGIN {
    if (ARGC == 1) {
        print("ERROR: no file specified") > "/dev/stderr";
        exit 1;
    } else if (ARGC == 2) {
        # Make sure they're not using stdin; we can't process that twice
        if (ARGV[ARGC - 1] == "-") {
            print("ERROR: a regular file must be used instead of stdin") > "/dev/stderr";
            exit 1;
        }
        # Duplicate the file name to process data twice: once to calculate the means and the second time
        # to calculate the pearson correlation.
        ARGV[ARGC] = ARGV[ARGC - 1];
        ARGC++;
    } else {
        print("ERROR: only one file supported") > "/dev/stderr";
        exit 1;
    }
    if (length(x_left) == 0 && length(x_right) == 0) {
        x_left = 1;
    } else if (length(x_left) > 0 && length(x_right) > 0) {
        print("ERROR: only one of the x_left or x_right values should be specified") > "/dev/stderr";
        exit 1;
    }
    if (length(y_left) == 0 && length(y_right) == 0) {
        y_left = 2;
    } else if (length(y_left) > 0 && length(y_right) > 0) {
        print("ERROR: only one of the y_left or y_right values should be specified") > "/dev/stderr";
        exit 1;
    }
}
FNR == 1 {
    file_number++;
}

function checkNumber(str) {
    if (str ~ /^[+-]?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?$/) {
        return str + 0;
    } else {
        return "NaN";
    }
}
# Test the options with length() rather than truthiness so that an offset
# of 0 (e.g. -v x_right=0 for the last column) is honored.
function getX() {
    if (length(x_left) > 0) {
        result = $(x_left);
    } else if (length(x_right) > 0) {
        result = $(NF - x_right);
    } else {
        print("ERROR: invalid argument specifying X column") > "/dev/stderr";
        return "NaN";
    }
    return checkNumber(result);
}

function getY() {
    if (length(y_left) > 0) {
        result = $(y_left);
    } else if (length(y_right) > 0) {
        result = $(NF - y_right);
    } else {
        print("ERROR: invalid argument specifying Y column") > "/dev/stderr";
        return "NaN";
    }
    return checkNumber(result);
}
function areNumbers(x, y) {
    return x != "NaN" && y != "NaN";
}

function skipWarn() {
    if (length($0) > 0 && !suppress_skip_warnings) {
        print("WARNING: skipping line " FNR " because both numbers not found: " $0) > "/dev/stderr";
    }
}
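# Start of the second pass: calculate the means from the first-pass sums.
# This rule comes before the NF < 2 rule so that a blank or short first line
# cannot skip it via next.
file_number == 2 && FNR == 1 {
    if (count == 0) {
        # Avoid dividing by zero when no row had two numbers
        print("ERROR: no rows with two numbers found") > "/dev/stderr";
        exit 1;
    }
    x_mean = x_sum / count;
    y_mean = y_sum / count;
}

# Skip blank lines or errant lines without at least two columns
NF < 2 {
    if (file_number == 1) {
        skipWarn();
    }
    next;
}

# First pass of the file: calculate sums
file_number == 1 {
    x = getX();
    y = getY();
    if (areNumbers(x, y)) {
        count++;
        x_sum += x;
        y_sum += y;
        if (debug_values) {
            print x, y;
        }
    } else {
        skipWarn();
    }
}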
# Second pass of the file: add to the variance/covariance sums
file_number == 2 {
    x = getX();
    y = getY();
    if (areNumbers(x, y)) {
        x_diff_from_mean = (x - x_mean);
        x_variance_sum += x_diff_from_mean * x_diff_from_mean;
        y_diff_from_mean = (y - y_mean);
        y_variance_sum += y_diff_from_mean * y_diff_from_mean;
        covariance_sum += x_diff_from_mean * y_diff_from_mean;
    }
}

# Finally, calculate everything and print
END {
    if (count > 0 && !debug_values) {
        x_variance = (x_variance_sum / count);
        x_stddev = sqrt(x_variance);
        y_variance = (y_variance_sum / count);
        y_stddev = sqrt(y_variance);
        covariance = covariance_sum / count;
        if (x_stddev == 0) {
            print("ERROR: X standard deviation is 0") > "/dev/stderr";
            exit 1;
        } else if (y_stddev == 0) {
            print("ERROR: Y standard deviation is 0") > "/dev/stderr";
            exit 1;
        }
        pearson = covariance / (x_stddev * y_stddev);
        printf("x sum = %.2f, count = %d, mean = %.2f, variance = %.2f, stddev = %.2f\n", x_sum, count, x_mean, x_variance, x_stddev);
        printf("y sum = %.2f, count = %d, mean = %.2f, variance = %.2f, stddev = %.2f\n", y_sum, count, y_mean, y_variance, y_stddev);
        printf("covariance = %.2f\n", covariance);
        printf("pearson correlation coefficient (r) = %.2f\n", pearson);
        printf("coefficient of determination (r^2) = %.2f\n", (pearson * pearson));
    }
}
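For a quick sanity check of the script's result on small inputs, the textbook single-pass form of the same coefficient can be sketched in a few lines. This is a different technique from the two-pass script above: it reads only the first two columns, does not validate that they are numbers, and can lose precision when values are large. The names pearson_single_pass, sx, sy, sxx, syy, sxy, and den are made up here.

```shell
# Hypothetical helper: single-pass Pearson r over the first two columns of stdin.
pearson_single_pass() {
  awk '
    NF >= 2 {
      n++;
      sx += $1; sy += $2;
      sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2;
    }
    END {
      # r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
      den = (n * sxx - sx * sx) * (n * syy - sy * sy);
      if (n == 0 || den <= 0) { print "undefined"; exit 1; }
      printf("%.2f\n", (n * sxy - sx * sy) / sqrt(den));
    }
  '
}

printf '1 2\n2 4\n3 6\n' | pearson_single_pass
```

Perfectly linear input such as the three rows above gives 1.00, and reversing the second column gives -1.00.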