awk

awk is a POSIX standard tool to do line-by-line manipulation of text. For example, to print the 4th column in all piped-in lines:

$ cat input | awk '{print $4}'

Pre-defined Variables

Records:

  1. awk splits the input into records using the record separator (RS). This defaults to a newline (\n), meaning each line is a record.
  2. $0 represents the entire record.

Fields:

  1. Each record ($0) is split into fields using the field separator (FS).
  2. The field separator FS defaults to whitespace. It may be changed with awk -F $SEP or with FS="$SEP" during execution (most often in a BEGIN block). In POSIX awk, a single-character FS other than a space is matched literally, and a longer FS is treated as an extended regular expression.
  3. $N represents the N-th field of the current record, starting with $1 being the first field and so on.
  4. NF represents the number of fields. This may be used in expressions. For example, the last field is $(NF), the second-to-last is $(NF-1) and so on.
  5. Any time $0 is modified, the fields are re-calculated. For example, gsub(/"/, ""); affects $0 since no third parameter is specified, removes double quotes from $0, and re-calculates the fields.
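
The field rules above can be tried on a small sample. This is only a sketch: the colon-separated input is made up for illustration, and the shell variable names are arbitrary.

```shell
# Sample input: colon-separated records, one per line (made-up data).
input='alice:30:admin
bob:25:"staff"'

# -F sets FS; NF is the field count; $NF is the last field.
fields_out=$(printf '%s\n' "$input" | awk -F: '{ print NF, $1, $NF }')
echo "$fields_out"

# Modifying $0 re-splits the fields, so $3 no longer contains the quotes.
requote_out=$(printf '%s\n' "$input" | awk -F: '{ gsub(/"/, ""); print $3 }')
echo "$requote_out"
```

The first command prints the field count, first field, and last field of each record; the second prints the third field after the quotes have been removed from $0.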

Basic capabilities

  1. Arithmetic functions such as int, sqrt, exp, log, sin, cos, atan2, rand, and srand.
  2. Strings are concatenated by placing them next to each other, separated by whitespace, rather than with an operator. For example, to append a period: mystr = mystr ".";
  3. There is no concept of NULL, so to unset a variable, set it to an empty string (which evaluates to false in an if): myvar = "";
  4. awk assigns by value instead of by reference, so to duplicate a string, just assign to a new variable; for example, str2 = str1;.
  5. When comparing numbers (e.g. { if ( $1 > 5 ) { print $0; } }), add 0 to the value to force a numeric comparison (e.g. { if ( $1 + 0 > 5 ) { print $0; } }). Otherwise awk falls back to a string comparison whenever an operand is not recognized as numeric (for example, a value assigned from a string constant), and as strings "10" sorts before "5".
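
A minimal sketch of concatenation, empty-string truthiness, and the add-0 comparison advice. The variable names are illustrative only.

```shell
basics_out=$(awk 'BEGIN {
  # Concatenation is juxtaposition: no operator between the operands.
  greeting = "Hello" ", " "world";
  greeting = greeting ".";
  print greeting;

  # A value assigned from a string constant is compared as a string,
  # so "10" sorts before "5"; adding 0 forces a numeric comparison.
  x = "10";
  print ((x > 5) ? "string: greater" : "string: not greater");
  print ((x + 0 > 5) ? "numeric: greater" : "numeric: not greater");

  # An empty string is false in a condition.
  x = "";
  if (!x) print "x is unset";
}')
echo "$basics_out"
```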

String functions

  1. String length: length(mystr)
  2. Split string mystr into array pieces using a split regex: n = split(mystr, pieces, / /); with n being the number of resulting elements in pieces
    • The resulting array is 1-indexed, so the first piece is pieces[1].
  3. Find all instances of a regex in a string and replace: gsub(/regex/, "replace", string);
  4. Return a substring of string starting from a 1-based index position: newstring = substr(string, i);
    1. A third parameter may be specified for the maximum substring length.
  5. Find 1-based index of the position of the first match of regex in a string or 0 if not found: i = match(string, regex);
  6. Trim leading and trailing whitespace:
    function trimWhitespace(str) {
      # Anchor to both ends so that interior whitespace is preserved
      gsub(/^[ \t\n\r]+|[ \t\n\r]+$/, "", str);
      return str;
    }
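
The string functions listed above can be exercised together. This is a sketch on a made-up log line; it also relies on two standard behaviors not spelled out in the list: match sets RSTART and RLENGTH, and gsub returns the number of replacements.

```shell
strings_out=$(awk 'BEGIN {
  s = "2020-01-02 error: disk full";

  # split returns the number of pieces; the array is 1-indexed.
  n = split(s, parts, / /);
  print n, parts[1];

  # substr takes a 1-based start and an optional maximum length.
  print substr(s, 1, 4);

  # match returns the 1-based position and also sets RSTART and RLENGTH.
  if (match(s, /[a-z]+:/)) {
    print RSTART, RLENGTH, substr(s, RSTART, RLENGTH);
  }

  # gsub returns how many replacements it made.
  count = gsub(/-/, "/", s);
  print count, s;
}')
echo "$strings_out"
```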

Arrays

  1. Associative arrays don't need to be initialized: mymap["key"]="value";
  2. Array length:
    function arrayLength(array,    i, l) {
      # i and l are declared as extra parameters to make them local
      l = 0;
      for (i in array) l++;
      return l;
    }
  3. Loop through an array: for (key in array) { item=array[key]; }
    1. If the array was created from split, looping through may not be in order, so instead do: l=arrayLength(pieces); for (i=1; i<=l; i++) { item=pieces[i]; }
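  4. To clear an array: delete myarray; (this form is widely supported, but older POSIX editions only specify deleting individual elements; split("", myarray); is a portable alternative)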
  5. If an array is created from a function such as split, then "indexing into it" starts at 1: split("1,2,3", pieces, /,/); print pieces[1];
  6. Awk cannot return an array from a function. Instead, use a global variable (and delete the array at the beginning of the function).
  7. POSIX awk has no true multi-dimensional arrays: array["a", "b"] is stored under the single string key "a" SUBSEP "b". To group entries by their first subscript, you can add some additional loops:
    function array2d_tokeys(array) {
      delete a2d2k;
      for (key in array) {
        split(key, pieces, SUBSEP);
        if (length(a2d2k[pieces[1]]) == 0) {
          a2d2k[pieces[1]] = pieces[2];
        } else {
          a2d2k[pieces[1]] = a2d2k[pieces[1]] SUBSEP pieces[2];
        }
      }
    }
     
    function array2d_print_bykeys(original_array, a2d2k_array) {
      for (key in a2d2k_array) {
        split(a2d2k_array[key], pieces, SUBSEP);
        for (piecesKey in pieces) {
          print key "," pieces[piecesKey] " = " original_array[key, pieces[piecesKey]];
        }
      }
    }
    
    function play() {
      my2darray["key1", "subkey1"] = "val1";
      my2darray["key1", "subkey2"] = "val2";
      my2darray["key2", "subkey1"] = "val3";
      my2darray["key2", "subkey2"] = "val4";
      array2d_tokeys(my2darray);
      array2d_print_bykeys(my2darray, a2d2k);
    }
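
As a smaller illustration of the array points above, the following sketch iterates a split result in order using the count that split returns, and shows how a two-subscript key is stored and tested. The array names (pieces, grid, subs) are arbitrary.

```shell
arrays_out=$(awk 'BEGIN {
  # Iterate in order using the count that split returns.
  n = split("c,a,b", pieces, /,/);
  for (i = 1; i <= n; i++) {
    print i "=" pieces[i];
  }

  # Multiple subscripts are joined with SUBSEP into one string key.
  grid["row1", "col2"] = "x";
  if (("row1", "col2") in grid) print "found";
  for (key in grid) {
    split(key, subs, SUBSEP);
    print subs[1], subs[2], grid[key];
  }
}')
echo "$arrays_out"
```

Using the return value of split avoids a separate counting loop when the array has just been created.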

Tips

  1. POSIX awk gsub doesn't support backreferences other than the entire match (&). However, you can get the same effect with two statements: first append a unique marker to the match, then replace the marker together with the text you want to change. For example, in the string "01/01/2020:00:00:00", to replace the first colon with a space:

     gsub(/\/....:/, "&@@", $0);
     gsub(/:@@/, " ", $0);
     print $0;
  2. Get current file name (or - for stdin): FILENAME

  3. Get line number for the current file: FNR

  4. Run something at the beginning of each file: FNR == 1 { print; }

  5. Run something at the end of each file:

    FNR == 1 {
      if (!firstFNR) {
        firstFNR = 1;
      } else {
        endOfFile(0);
      }
      fname = FILENAME;
    }
    
    END {
      endOfFile(1);
    }
    
    function endOfFile(lastFile) {
      print("Finished " fname);
      if (lastFile) {
        print("Finished all files");
      }
    }
  6. Get line number for all processed lines so far: NR

  7. Print something to stderr: print("WARNING: some warning") > "/dev/stderr";

  8. Change the return code: END { exit 1; }

  9. Skip blank lines: NF > 0

  10. Execute a shell command based on each line: { system("ip route change " $0 " quickack 1"); }

  11. Execute a shell command and read results into a string variable: cmd = "uname"; cmd | getline os; close(cmd);

  12. Don't process remaining patterns for a line ("The next statement shall cause all further processing of the current input record to be abandoned"):

    /pattern1/ {
      # do some processing
      print;
      
      next; # don't run the catch-all pattern
    }
    
    { print; }
  13. A common way to send a list of files to awk is with find; for example, find . -type f -print0 | xargs -0 ~/myscript.awk. However, xargs limits how many arguments it passes to a program; if the limit is exceeded, xargs runs the program several times, so the awk script can neither keep global state across all files nor rely on a single BEGIN or END block. An alternative is to cat everything into awk (e.g. find . -type f -print0 | xargs -0 cat | ~/myscript.awk), but then FILENAME and FNR are lost. Instead, you can drop xargs and have the awk script add to ARGV dynamically, which POSIX allows:

    The arguments in ARGV can be modified or added to; ARGC can be altered. As each input file ends, awk shall treat the next non-null element of ARGV, up to the current value of ARGC-1, inclusive, as the name of the next input file. Thus, setting an element of ARGV to null means that it shall not be treated as an input file. The name '-' indicates the standard input. If an argument matches the format of an assignment operand, this argument shall be treated as an assignment rather than a file argument.

    So execution goes from:

    find . -type f -print0 | xargs -0 ~/checkissues.awk

    To:

    find . -type f | ~/checkissues.awk

    Here's the awk snippet that does the ARGV modification:

    # xargs allows a limited number of arguments, but POSIX awk allows
    # us to add files to process by adding to ARGV:
    # https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html#tag_20_06_13_03
    FILENAME == "-" {
      ARGV[ARGC] = $0;
      ARGC += 1;
      # No need to process any of the other patterns for this line:
      next;
    }
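
The FILENAME, FNR, and NR tips above can be seen side by side with two small files. This is a sketch; the temporary directory and the file names one.txt and two.txt are made up for the example.

```shell
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/one.txt"
printf 'c\n' > "$tmpdir/two.txt"

# FNR restarts at 1 for each file, while NR keeps counting across files.
counters_out=$(cd "$tmpdir" && awk '
  FNR == 1 { print "start of " FILENAME }
  { print FILENAME, FNR, NR, $0 }
  END { print "total records: " NR }
' one.txt two.txt)
echo "$counters_out"
rm -rf "$tmpdir"
```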

Concatenate all lines into a single, space-delimited output

awk '{ printf("%s ", $0); } END { printf("\n"); }' "$FILE"
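
Note that this approach leaves a trailing space after the last line. The sketch below runs the same command on made-up input and, as one possible variant, prints the separator before every line except the first instead.

```shell
lines='one
two
three'

# Same approach as above: every line is followed by a space.
joined=$(printf '%s\n' "$lines" | awk '{ printf("%s ", $0); } END { printf("\n"); }')

# Variant: emit the separator before every line except the first.
joined_clean=$(printf '%s\n' "$lines" | awk '{ printf("%s%s", (NR > 1 ? " " : ""), $0); } END { printf("\n"); }')

echo "[$joined]"
echo "[$joined_clean]"
```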

Parse hex string to decimal

#!/usr/bin/awk -f

# Create hex character => decimal map, e.g. hexchars["A"] = 10, etc.
BEGIN {
  for (i=0; i<16; i++) {
    hexchars[sprintf("%x", i)] = i;
    hexchars[sprintf("%X", i)] = i;
  }
}

function trimWhitespace(str) {
  # Anchor to both ends so that only leading and trailing whitespace is removed
  gsub(/^[ \t\n\r]+|[ \t\n\r]+$/, "", str);
  return str;
}

# Return 1 if str is a number (unknown radix),
# 2 if hex (0x prefix or includes [a-fA-F]),
# and 0 if number not matched by regexes.
# str is trimmed of whitespace.
function isNumber(str) {
  str = trimWhitespace(str);
  if (length(str) > 0) {
    if (str ~ /^[+-]?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?$/ || str ~ /^(0[xX])?[0-9a-fA-F]+$/) {
      if (str ~ /^0[xX]/ || str ~ /[a-fA-F]/) {
        return 2;
      } else {
        return 1;
      }
    }
  }
  return 0;
}

# If str is a hexadecimal number (0x prefix optional), then return its decimal value;
# otherwise, return -1.
# str is trimmed of whitespace.
function parseHex(str,    i, result) {
  str = trimWhitespace(str);
  # Only accept hex digits here: isNumber() also accepts signs, decimal points,
  # and exponents, which have no entry in hexchars and would be counted as 0.
  if (str ~ /^(0[xX])?[0-9a-fA-F]+$/) {
    if (str ~ /^0[xX]/) {
      str = substr(str, 3);
    }
    result = 0;
    for (i=1; i<=length(str); i++) {
      result = (result * 16) + hexchars[substr(str, i, 1)];
    }
    return result;
  }
  return -1;
}

# If str is a decimal number, then return its decimal value;
# otherwise, return -1.
# str is trimmed of whitespace.
function parseDecimal(str) {
  if (isNumber(str) == 1) {
    return trimWhitespace(str) + 0;
  }
  return -1;
}
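
For quick one-off conversions, a more compact alternative can be sketched with index() instead of a lookup table. This is not the script above: hex2dec is a hypothetical helper written for this example, and it does no whitespace trimming.

```shell
hex_out=$(printf '%s\n' 0xFF 1f zz | awk '
  # Hypothetical helper for this sketch: returns the decimal value of a hex
  # string (0x prefix optional), or -1 if it contains non-hex characters.
  function hex2dec(str,    i, digit, result) {
    str = tolower(str);
    sub(/^0x/, "", str);
    if (str !~ /^[0-9a-f]+$/) return -1;
    result = 0;
    for (i = 1; i <= length(str); i++) {
      # index() is 1-based, so subtract 1 to get the digit value.
      digit = index("0123456789abcdef", substr(str, i, 1)) - 1;
      result = result * 16 + digit;
    }
    return result;
  }
  { print $0, hex2dec($0) }
')
echo "$hex_out"
```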

Calculate a Pearson correlation coefficient between two columns of numbers

#!/usr/bin/awk -f
# usage: pearson.awk file
#   Calculate pearson correlation (r) between two columns (defaults to first two columns). This will skip any rows that have non-number values.
#   Background:
#     https://ocw.mit.edu/resources/res-6-012-introduction-to-probability-spring-2018/part-i-the-fundamentals/the-correlation-coefficient/
#     https://ocw.mit.edu/resources/res-6-012-introduction-to-probability-spring-2018/part-i-the-fundamentals/interpreting-the-correlation-coefficient/
#
# Example:
#   $ pearson.awk -v x_right=3 -v y_right=4 access_log
#
# Options:
#   Column offset of X values starting from the left (1 is the first column, 2 is second, etc.):
#     -v x_left=N
#   Column offset of X values from the right (0 is the last column, 1 is second-to-last, etc.):
#     -v x_right=N
#   Column offset of Y values starting from the left (1 is the first column, 2 is second, etc.):
#     -v y_left=N
#   Column offset of Y values from the right (0 is the last column, 1 is second-to-last, etc.):
#     -v y_right=N
#   Suppress warnings about lines being skipped:
#     -v suppress_skip_warnings=1
#   For debugging, print only the X and Y values from rows where both are numbers:
#     -v debug_values=1
BEGIN {
  if (ARGC == 1) {
    print("ERROR: no file specified") > "/dev/stderr";
    exit 1;
  } else if (ARGC == 2) {
    # Make sure they're not using stdin; we can't process that twice
    if (ARGV[ARGC - 1] == "-") {
      print("ERROR: standard file must be used instead of stdin") > "/dev/stderr";
      exit 1;
    }
    # Duplicate the file name to process data twice: once to calculate the means and the second time
    # to calculate the pearson correlation.
    ARGV[ARGC] = ARGV[ARGC - 1];
    ARGC++;
  } else {
    print("ERROR: only one file supported") > "/dev/stderr";
    exit 1;
  }
  if (length(x_left) == 0 && length(x_right) == 0) {
    x_left = 1;
  } else if (length(x_left) > 0 && length(x_right) > 0) {
    print("ERROR: only one of the x_left or x_right values should be specified") > "/dev/stderr";
    exit 1;
  }
  if (length(y_left) == 0 && length(y_right) == 0) {
    y_left = 2;
  } else if (length(y_left) > 0 && length(y_right) > 0) {
    print("ERROR: only one of the y_left or y_right values should be specified") > "/dev/stderr";
    exit 1;
  }
}
FNR == 1 {
  file_number++;
}
function checkNumber(str) {
  if (str ~ /^[+-]?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?$/) {
    return str + 0;
  } else {
    return "NaN";
  }
}
function getX(    result) {
  if (x_left) {
    result = $(x_left);
  } else if (length(x_right) > 0) {
    # Test with length() rather than truthiness so that x_right=0 (the last column) works
    result = $(NF - x_right);
  } else {
    print("ERROR: invalid argument specifying X column") > "/dev/stderr";
    return "NaN";
  }
  return checkNumber(result);
}
function getY(    result) {
  if (y_left) {
    result = $(y_left);
  } else if (length(y_right) > 0) {
    # Test with length() rather than truthiness so that y_right=0 (the last column) works
    result = $(NF - y_right);
  } else {
    print("ERROR: invalid argument specifying Y column") > "/dev/stderr";
    return "NaN";
  }
  return checkNumber(result);
}
function areNumbers(x, y) {
  return x != "NaN" && y != "NaN";
}
function skipWarn() {
  if (length($0) > 0 && !suppress_skip_warnings) {
    print("WARNING: skipping line " FNR " because both numbers not found: " $0) > "/dev/stderr";
  }
}
# Skip blank lines or errant lines without at least two columns
NF < 2 {
  if (file_number == 1) {
    skipWarn();
  }
  next;
}
# First pass of the file: calculate sums
file_number == 1 {
  x = getX();
  y = getY();
  if (areNumbers(x, y)) {
    count++;
    x_sum += x;
    y_sum += y;
    if (debug_values) {
      print x, y;
    }
  } else {
    skipWarn();
  }
}
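# Second pass of the file: calculate the means before the first usable row.
# A flag is used instead of FNR == 1 because the first line of the file may
# already have been skipped by the NF < 2 rule above.
file_number == 2 && !means_calculated {
  if (count > 0) {
    x_mean = x_sum / count;
    y_mean = y_sum / count;
  }
  means_calculated = 1;
}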
# Second pass of the file: add to the variance/covariance sums
file_number == 2 {
  x = getX();
  y = getY();
  if (areNumbers(x, y)) {
    x_diff_from_mean = (x - x_mean);
    x_variance_sum += x_diff_from_mean * x_diff_from_mean;
    y_diff_from_mean = (y - y_mean);
    y_variance_sum += y_diff_from_mean * y_diff_from_mean;
    covariance_sum += x_diff_from_mean * y_diff_from_mean;
  }
}
# Finally, calculate everything and print
END {
  if (count > 0 && !debug_values) {
    x_variance = (x_variance_sum / count);
    x_stddev = sqrt(x_variance);
    y_variance = (y_variance_sum / count);
    y_stddev = sqrt(y_variance);
    covariance = covariance_sum / count;
    if (x_stddev == 0) {
      print("ERROR: X standard deviation is 0") > "/dev/stderr";
      exit 1;
    } else if (y_stddev == 0) {
      print("ERROR: Y standard deviation is 0") > "/dev/stderr";
      exit 1;
    }
    pearson = covariance / (x_stddev * y_stddev);
    printf("x sum = %.2f, count = %d, mean = %.2f, variance = %.2f, stddev = %.2f\n", x_sum, count, x_mean, x_variance, x_stddev);
    printf("y sum = %.2f, count = %d, mean = %.2f, variance = %.2f, stddev = %.2f\n", y_sum, count, y_mean, y_variance, y_stddev);
    printf("covariance = %.2f\n", covariance);
    printf("pearson correlation coefficient (r) = %.2f\n", pearson);
    printf("coefficient of determination (r^2) = %.2f\n", (pearson * pearson));
  }
}
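
To sanity-check the calculation on a tiny data set, the sketch below computes r for five made-up (x, y) pairs. Unlike the script above, it keeps all values in memory rather than reading the file twice, so it is only suitable for small inputs; the array and variable names are illustrative.

```shell
pearson_out=$(printf '%s\n' '1 2' '2 1' '3 4' '4 3' '5 5' | awk '
  { x[NR] = $1; y[NR] = $2; x_sum += $1; y_sum += $2; }
  END {
    x_mean = x_sum / NR;
    y_mean = y_sum / NR;
    for (i = 1; i <= NR; i++) {
      dx = x[i] - x_mean;
      dy = y[i] - y_mean;
      sxx += dx * dx;
      syy += dy * dy;
      sxy += dx * dy;
    }
    # The 1/count factors cancel, so r = sxy / sqrt(sxx * syy).
    printf("r = %.2f\n", sxy / sqrt(sxx * syy));
  }
')
echo "$pearson_out"
```

For this data both means are 3, the squared-deviation sums are 10 each, and the cross-product sum is 8, giving r = 0.80.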