[Toc]

# UNIX IV

## the awk command

awk is a programming language for doing common data manipulation tasks with only a few lines of program. It is a _pattern action_ language.

Awk looks a little like C, but it automatically handles input, field splitting, initialization (an unset variable starts as 0, or as the empty string), and memory management: it has built-in string and number data types, no variable declarations, and associative arrays.

![Screenshot 2024-05-30 8.56.23 PM](https://hackmd.io/_uploads/rkrKPxIVR.png)

* running an AWK program
  * like `grep`, it can take its input from a file argument
    ```
    % grep 'pattern' file
    % awk 'program' file
    ```
  * like `sed`, it can read standard input and can load its program from a script file
    ```
    % sed 'program' < file
    % awk 'program' < file
    % ./sedscript < file
    % ./awkscript < file
    ```
    _*note: remember to add the `-f` flag on the first (shebang) line of the script, like `#!/usr/bin/awk -f`._
* structure of an AWK program
  * an optional BEGIN segment, `BEGIN{action}`, for processing to execute prior to reading input
  * pattern/action pairs, `pattern{action}`
    1. processing for input data
    2. for each pattern matched, the corresponding action is taken
    3. the default _pattern_ is to match all lines
    4. the default _action_ is to print the line (or record)
    5. you can omit either the pattern or the action, but not both; actions are enclosed in `{}`
  * an optional END segment, `END{action}`, for processing after the end of input data

```
cat f | awk '1{print}'   # prints all lines
cat f | awk '1'          # prints all lines
cat f | awk '1;1;1'      # prints each line three times
cat f | awk 'print'      # syntax error (not in {})
cat f | awk '{x++}x%2'   # prints odd lines
cat f | awk '++x%2'      # prints odd lines
```

![Screenshot 2024-05-30 9.13.18 PM](https://hackmd.io/_uploads/Bkjuox8N0.png)

## AWK variables and flags

### built-in variable

* `$0` the entire line: use it when we want to reference the input line
  * to _save the line_ into a _variable_
    `awk '{x=$0}' < file`
  * to print every line
    `awk '1' < file`
    `awk '{print}' < file`
    `awk '{print $0}' < file`
* `$n` field n: similar to `cut -f`, which can parse the line into fields. for instance, `$4` indicates the 4^th^ field
  _*note: `cut -f` selects fields from each input line (separated with "__\t__")._ \
  ==EXAMPLE== print the first and the twelfth fields: \
  `awk '{print $1, $12}'` if the items to print are separated by commas, then they will be output with a single space (the value of `OFS`) between.
  `awk '{print $1""$12}'` or `awk '{print $1$12}'` or `awk '{print $1 $12}'` if the items are __not__ separated by commas, then they will be output directly next to each other.
  ```tcsh=
  yenubuntu:~/unix> cat file | awk '{print $1,$12}'
  1 12
  yenubuntu:~/unix> cat file | awk '{print $1$12}'
  112
  yenubuntu:~/unix> cat file | awk '{print $1 $12}'
  112
  ```
* `NF` number of __Fields__ in current line (or record)
* `NR` number of lines (or __Records__) read so far

![Screenshot 2024-05-31 12.51.29 AM](https://hackmd.io/_uploads/HyTqR784R.png)

* `-f <filename>` use the program in the file instead of a one-liner script. it can also go on line 1 of the file, `#!/usr/bin/awk -f`, to make the file an executable script
* `-F "<symbol>"` use the symbol(s) in the quotes as the _field separator_ \
  ==EXAMPLE==
  ```tcsh=
  yenubuntu:~/unix> echo "hello:my:name:is:Alaa" > file
  yenubuntu:~/unix> awk -F":" '{print $4}' file
  is
  ```
  ```tcsh=
  yenubuntu:~/unix> echo "a b  c#d e" | tr "#" "\t" > file
  yenubuntu:~/unix> cat file
  a b  c	d e
  yenubuntu:~/unix> awk '{print $4}' file
  d
  yenubuntu:~/unix> awk -F " " '{print $4}' file
  d
  yenubuntu:~/unix> awk -F "[ ]" '{print $4}' file
  c	d
  yenubuntu:~/unix> awk -F "[\t]" '{print $2}' file
  d e
  yenubuntu:~/unix> awk -F "[\t]" '{print $1}' file
  a b  c
  yenubuntu:~/unix>
  yenubuntu:~/unix> awk -F "[ \t]" '{print $4}' file
  c
  yenubuntu:~/unix> awk -F "[ \t]*" '{print $4}' file
  d
  yenubuntu:~/unix> awk -F "[ \t]+" '{print $4}' file
  d
  yenubuntu:~/unix> awk '{print $4}' file
  d
  yenubuntu:~/unix>
  ```
  _*note: `[]` is a character class in a regex. the input has two consecutive spaces between `b` and `c`, which is why `[ ]` and `[ \t]` see an empty field there._
* you might want to access the command-line arguments (of course, you <font color=red>CANNOT</font> use $1, $2, etc. since these are used for ++fields++)
  * the built-in variables are `ARGC` and `ARGV` (same as C)
    `awk 'your_code_here' filename`
    then, `ARGC == 2` and `ARGV[0] == "awk", ARGV[1] == filename` \
    however, these may not behave the way you expect: awk arguments are treated as file names, not generic parameters that you can define however you like.
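The `ARGC`/`ARGV` behaviour described above can be checked with a small sketch. It assumes a POSIX-like shell and any standard awk; the file names `a.txt` and `b.txt` are hypothetical and are never opened, because a program consisting only of a `BEGIN` block does not read input.

```shell
# a.txt and b.txt are hypothetical names; a BEGIN-only program never opens them.
argv_out=$(awk 'BEGIN {
    print "ARGC =", ARGC
    # ARGV[0] is the command name; the program text itself is not in ARGV.
    for (i = 0; i < ARGC; i++)
        print "ARGV[" i "] =", ARGV[i]
}' a.txt b.txt)

echo "$argv_out"
```

Note that the program text does not count as an argument, so two file names give `ARGC == 3`. If a generic parameter is needed rather than a file name, the standard `-v name=value` option is the usual route.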
just as you could specify the `-F` flag, you can also change the field separator from within the awk program (but it will only apply to future input lines/records)

* `FS` input Field Separator, default behaves like `BEGIN{FS="[ \t]+"}` (the real default is a single space, `FS=" "`, which also ignores leading and trailing blanks)
  Notice that the default consumes <font color=red>ALL</font> of the blank space between fields. So awk WON'T know HOW MANY spaces were in the input lines, unless you either:
  1. override the FS default
  2. directly inspect `$0`
* `RS` input Record Separator, default `BEGIN{RS="\n"}`
  why is the line number called NR (Number of Records) instead of NL (Number of Lines)? because
  1. you don't have to use the default; you can change it with RS
  2. and if you change it, the records won't be input lines any more. so we instead use the generic word, "record"

==EXAMPLE==
```tcsh=
yenubuntu:~/unix> echo "A:B:C" > f2
yenubuntu:~/unix> cat f2 | awk -F: '1'
A:B:C
yenubuntu:~/unix> cat f2 | awk -F: '{$1=$1}1'
A B C
yenubuntu:~/unix> cat f2 | awk 'BEGIN{FS=":"}{$1=$1}1'
A B C
yenubuntu:~/unix> cat f2 | awk '{FS=":"}{$1=$1}1'
A:B:C
yenubuntu:~/unix> cat f2 f2 | awk '{FS=":"}{$1=$1}1'
A:B:C
A B C
yenubuntu:~/unix> cat f2 f2 | awk 'BEGIN{FS=":"}NR==2{$1=$1}1'
A:B:C
A B C
yenubuntu:~/unix> cat f2 f2 | awk 'BEGIN{FS=":"}NR==1{$1=$1}1'
A B C
A:B:C
yenubuntu:~/unix> cat f2 | awk 'BEGIN{RS=":"}1'
A
B
C

yenubuntu:~/unix> cat f2 | awk '{RS=":"}1'
A:B:C
yenubuntu:~/unix> cat f2 f2 | awk '{RS=":"}1'
A:B:C
A
B
C

yenubuntu:~/unix>
```
* line 4: `{$1=$1}` looks useless; however, the outputs on ++line 3++ and ++line 5++ differ. the _input FS_ was changed, but the __output__ field separator is still the __default__ (a space). then why didn't line 3 change? because that program ++__never made__ awk recompute $0++, whereas assigning to a field (line 4) did.
* line 13: with the `NR==2` pattern, awk is ++only sometimes made++ to recompute $0 (here, only for the second record).
* line 23: yes, there is an empty line here.
:::danger
but why?
with `RS=":"` the newline is no longer a separator, so the last record is `"C\n"`; `print` then appends `ORS` (another `\n`), which shows up as the empty line.
:::
* `OFS` Output Field Separator, default `BEGIN{OFS=" "}`
  the idea is a bit tricky.
  the issue is: `$0` is not automatically rebuilt to match `OFS`, for example:
  ```tcsh=
  yenubuntu:~/unix> echo "a\tb" | awk '1;{$1=$1}1'
  a	b
  a b
  ```
  * line 2: although the default `OFS` is one space, the tab is still there, that is, __`$0` does not automatically update__.
    * the `;` is _necessary_ to indicate that the action that follows is NOT matched to the pattern that precedes it.
  * line 3: however, that doesn't mean we can't manually force a recalculation of $0, which is what `{$1=$1}` does.
* `ORS` Output Record Separator, default `BEGIN{ORS="\n"}`
  `OFS` and `ORS` can also be strings, not just single characters. (but, of course, they cannot be regexes like `FS` and `RS` can.)

![image](https://hackmd.io/_uploads/BJoF_GPEC.png)

### output formatting

* the `print` command uses a _simplistic_ format
  * put a _space_ wherever there are <font color=red>commas</font>, "`,`"
  * insert a `\n` at the end
* `printf` allows highly-formatted output, as in C
  * the usage is the same as in the C language (remember that awk is a programming language as well)
  * for example: `awk '{printf("apple is $%6.2f dollars.", $1)}'`

## operators

most of them are the same as in C
* `=` assignment
* `==` equality operator
* `!=` inequality operator
* `~`, `!~` __extended__ regex match and non-match
  ==EXAMPLE==
  `awk '$1~"[.][0-9]+E"{print $1}'`
  `awk '$0!~"(a|b)y"{print "nope"}'`
* `&&`, `||`, `!` logical operators
* `<`, `>`, `<=`, `>=` relational operators
* `+`, `-`, `*`, `/`, `%`, `^` math operators
  * `^` exponent
* `space` (implicit or explicit) string concatenation
  ==EXAMPLE==
  `awk '{print A $1}{A=$1}'`, with a space between `A` and `$1`
  `awk '{print A$1}{A=$1}'`, but the space is ++optional++: concatenation is _implicitly assumed_ when no operator is given \
  `awk '{print A B}{A=$1}'`; however, this time the space is needed, otherwise awk would read a single variable called "AB"

## good at data validation

==EXAMPLE== Suppose I had a file of all of the salaries of the employees in a company and that I want to see if any of the entries are broken: \
salaries.txt
```
Alan Turing	9.40	40
Alfred Aho	3.50	40
Peter Weinberger	7.50	400
Barbara Liskov	7.50	40
Brian Kernighan	$9.00	40
D. Knuth	8.50	35
Grace Hopper	40	9.10
Jon Von Neuman	9.80	35
L. Page	8.50	40	40
Sergey Brin	7.50	55
Tim Berners Lee	8.50	15
```
validate.awk
```tcsh!
#!/usr/bin/awk -f
NF != 3  { print $0, "number of fields not equal to 3" }
$2 < 7.5 { print $0, "rate below minimum or incorrect" }
$2 > 10  { print $0, "rate exceeds $10 per hour" }
$3 < 0   { print $0, "negative hours worked" }
$3 > 60  { print $0, "too many hours worked" }
```
result:
```=
yenubuntu:~/unix> cat salaries.txt | awk -F"[\t]" -f validate.awk
Alfred Aho	3.50	40 rate below minimum or incorrect
Peter Weinberger	7.50	400 too many hours worked
Brian Kernighan	$9.00	40 rate below minimum or incorrect
Grace Hopper	40	9.10 rate exceeds $10 per hour
L. Page	8.50	40	40 number of fields not equal to 3
```
notice ++line 4++: field 2 (`-f2`) is _broken_ (it has a `$` sign), yet the line is reported as "rate below minimum" rather than as an error. why? (`"$9.00"` is not a number, so `$2 < 7.5` falls back to a string comparison, as the next section shows.)

## how awk handles strings vs numbers?

```tcsh!=
yenubuntu:~/unix> echo|awk '{print 10+10,"10" "10"}'
20 1010
yenubuntu:~/unix> echo|awk '{print "10"+"10",10 10}'
20 1010
yenubuntu:~/unix> echo|awk '{print "55NotANum"+0,0+"NotANum"}'
55 0
yenubuntu:~/unix> echo|awk '{print("a"<"b"),("a">"A"),("a"<"ap")}'
1 1 1
yenubuntu:~/unix> echo|awk '{print("10"<"9"),(10<9),("$"<"7")}'
1 0 1
yenubuntu:~/unix> echo|awk '{print("$9"<"7.5"),("$9"<7.5)}'
1 1
yenubuntu:~/unix> echo 8@X | tr "@" "\n" | awk \
? '{print($1,$1+0,$1+"0",$1 "o",$NF-1,$(NF-1))}'
8 8 8 8o 7 8
X 0 0 Xo -1 X
```
* line 3: what happens when you use a numeric operator on string operands? what happens when you use a string operator on numeric operands? awk _converts_ the operand types so that they __work with the operator__.
* line 5: what if the operand cannot be converted to a number? ++any number at the front++ of the operand is used, and if there is no number at the front, then it __converts to 0__.
  hence, `0+"NotANum"` is evaluated as `0+0`.
* line 7: strings are compared character by character according to their __ASCII__ codes (same as C). That is, the ASCII codes of `a`, `b`, `A` are 97, 98 and 65 respectively. `"ap"` extends `"a"`, and when one string is a prefix of the other, __the longer one is bigger__.
* line 14:
  * `$NF-1`: take the last field and subtract 1 from it
  * `$(NF-1)`: take the 2^nd^-to-last field. but if `NF==1`, then it is `$0`, the full line.

## the pattern selection part of pattern/action pair

### complex cases

awk patterns are good for selecting specific lines from the input for further processing, for instance:
* selection by comparison:
  `$2 >= 8 {print}`
  `$2 * $3 > 50 {printf("%6.2f for %s\n", $2 * $3, $1)}`
* selection by logical operation:
  `$2 >= 8 || $3 >= 20`
* selection by text content:
  `$1 == "Grace"`
  `$1 ~ "Grace"`
  `/Grace/` (a regex between slashes means the same as in sed, except that awk uses ++extended++ regex; it matches every line containing the regex and is equivalent to `$0 ~ "Grace"`)

### the BEGIN and END patterns

the special pattern `BEGIN` matches before the first input line is read; `END` matches after the last input line has been read.
this allows for initial and wrap-up processing
```
BEGIN { FS=","; print "NAME RATE HOURS"; print "" }
{ print }
END { print "total number of employees is", NR }
```

## computing with AWK

* Counting is easy to do with AWK:
  `$3 > 15 { emp = emp + 1}`
  `END { print emp, "employees worked more than 15 hrs"}`
* Computing Sums and Averages is also simple:
  ```
  { pay = pay + $2 * $3 }
  END { print NR, "employees"   # NR's value is from the final line
        print "total pay is", pay
        print "average pay is", pay/NR
      }
  ```
* Printing the Last Input Line
  Although NR retains its value after the last input line has been read, $0 does not (on some systems), so save it yourself:
  ```
  { last = $0 }
  END { print NR ":", last }
  ```

## handling text

Awk variables can hold strings of characters as well as numbers, and Awk conveniently translates back and forth as needed
* The following program finds the employee who is paid the most per hour
  `$2 > maxrate { maxrate = $2; maxemp = $1 }`
  `END { print "highest hourly rate:",maxrate,"for",maxemp }`
* String Concatenation, the space operator: new strings can be created by combining old ones
  ```
  { names = names $1 ", " }
  END { print names }
  ```

---

remaining content of the course [here](https://docs.google.com/presentation/d/1pdrbL8v_4cbMk7EU7p1V31EecKI0bobt/edit?usp=share_link&ouid=109644673666755707985&rtpof=true&sd=true)
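As a worked example, the counting, summing, max-finding and concatenation idioms from the last two sections can be combined into one run. This is only a sketch: the six input records (name, hourly rate, hours) are made-up sample data, not the course's salaries.txt, and they use single-word names so the default field splitting applies.

```shell
# Made-up sample records: name, hourly rate, hours worked.
report=$(printf '%s\n' \
    'Beth 4.00 0' \
    'Dan 3.75 0' \
    'Kathy 4.00 10' \
    'Mark 5.00 20' \
    'Mary 5.50 22' \
    'Susie 4.25 18' |
awk '
$3 > 15      { emp = emp + 1 }                   # counting
             { pay = pay + $2 * $3 }             # running sum
$2 > maxrate { maxrate = $2; maxemp = $1 }       # track the maximum
             { names = names $1 ", " }           # string concatenation
END {
    print emp, "employees worked more than 15 hrs"
    printf("total pay is %.2f\n", pay)
    printf("average pay is %.2f\n", pay / NR)
    printf("highest hourly rate: %.2f for %s\n", maxrate, maxemp)
    print "names:", names
}')

echo "$report"
```

None of the variables is declared or initialized: `emp`, `pay` and `maxrate` start out acting as 0 and `names` as the empty string, which is what lets each rule stay a one-liner.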