[Toc]
# UNIX IV
## the awk command
Awk is a _pattern-action_ programming language for doing common data manipulation tasks with only a few lines of program.
Awk looks a little like C, but it automatically handles input, field splitting, initialization (the initial value is 0 or the empty string), and memory management. It has built-in string and number data types, no variable declarations, and associative arrays.

* running an AWK program
* like `grep`, it can take its input file as an argument
```
% grep 'pattern' file
% awk 'program' file
```
* like `sed`, it can read its input from stdin and load its program from a script file
```
% sed 'program' < file
% awk 'program' < file
% ./sedscript < file
% ./awkscript < file
```
_*note: remember to add the `-f` flag on the first line of the script, like `#!/usr/bin/awk -f`._
* structure of an AWK program
* an optional BEGIN segment, `BEGIN{action}`
for processing to execute prior to reading input
* pattern-action pairs, `pattern{action}`
1. processing for input data
2. for each pattern matched, the corresponding action is taken
3. the default _pattern_ is to match all lines
4. the default _action_ is to print the line (or record)
5. you can omit either the pattern or the action, but not both; actions are enclosed in `{}`
* an optional END segment, `END{action}`
processing after end of input data
```
cat f | awk '1{print}' # prints all lines
cat f | awk '1' # prints all lines
cat f | awk '1;1;1' # prints each line three times
cat f | awk 'print' # syntax error (not in {})
cat f | awk '{x++}x%2' # prints odd lines
cat f | awk '++x%2' # prints odd lines
```
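putting the three segments together, a minimal sketch (the sample input and the counter name `n` are made up for illustration):

```shell
# Hypothetical three-line input: BEGIN runs first, the middle rule runs
# once per matching line, END runs after the last line.
out=$(printf 'apple\nbanana\navocado\n' | awk '
BEGIN { print "start" }              # before any input is read
/^a/  { n++; print "match:", $0 }    # pattern selects lines, action prints them
END   { print n, "lines matched" }   # after all input is consumed
')
printf '%s\n' "$out"
```

only the two lines starting with `a` reach the action, so `n` ends up as 2.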

## AWK variables and flags
### built-in variables
* `$0` the entire line: if we want to reference the input line
* to _save the line_ into a _variable_
`awk '{x=$0}' < file`
* to print every line
`awk '1' < file`
`awk '{print}' < file`
`awk '{print $0}' < file`
* `$n` field n: similar to `cut -f`, which splits the line into fields. For instance, `$4` refers to the 4^th^ field
_*note: `-f` selects fields from each input line (separated by "__\t__")._
\
==EXAMPLE==
print the first and the twelfth fields:
\
`awk '{print $1, $12}'`
if the items to print are separated by commas, then they will be output with a single space between.
`awk '{print $1""$12}'` or `awk '{print $1$12}'` or `awk '{print $1 $12}'`
if the items are __not__ separated by commas, then they will be output directly next to each other.
```tcsh=
yenubuntu:~/unix> cat file | awk '{print $1,$12}'
1 12
yenubuntu:~/unix> cat file | awk '{print $1$12}'
112
yenubuntu:~/unix> cat file | awk '{print $1 $12}'
112
```
* `NF` number of __Fields__ in current line (or record)
* `NR` number of lines (or __Records__) read so far
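
a small sketch of `NF` and `NR` in action (the two input lines are hypothetical):

```shell
# For each record print its number, its field count, and its last field.
out=$(printf 'a b c\nd e\n' | awk '{ print NR, NF, $NF }')
printf '%s\n' "$out"
```

since `NF` holds the field count, `$NF` is always the last field of the current record.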

* `-f <filename>` use the file instead of a one-liner script
it can be put on line 1 as `#!/usr/bin/awk -f` to make the file executable
* `-F "<symbol>"` use the given symbol(s) as the _field separator_
\
==EXAMPLE==
```tcsh=
yenubuntu:~/unix> echo "hello:my:name:is:Alaa" > file
yenubuntu:~/unix> awk -F":" '{print $4}' file
is
```
```tcsh=
yenubuntu:~/unix> echo "a b  c#d e" | tr "#" "\t" > file
yenubuntu:~/unix> cat file
a b  c	d e
yenubuntu:~/unix> awk '{print $4}' file
d
yenubuntu:~/unix> awk -F " " '{print $4}' file
d
yenubuntu:~/unix> awk -F "[ ]" '{print $4}' file
c	d
yenubuntu:~/unix> awk -F "[\t]" '{print $2}' file
d e
yenubuntu:~/unix> awk -F "[\t]" '{print $1}' file
a b  c
yenubuntu:~/unix>
yenubuntu:~/unix> awk -F "[ \t]" '{print $4}' file
c
yenubuntu:~/unix> awk -F "[ \t]*" '{print $4}' file
d
yenubuntu:~/unix> awk -F "[ \t]+" '{print $4}' file
d
yenubuntu:~/unix> awk '{print $4}' file
d
yenubuntu:~/unix>
```
_*note: `[]` denotes a character class in a regex._
* you might want to access command-line arguments (of course, you <font color=red>CANNOT</font> use $1, $2, etc. since these are used for ++fields++)
* the built-in variables are `ARGC` and `ARGV` (same as C)
`awk 'your_code_here' filename`
then, `ARGC == 2` and `ARGV[0] == "awk", ARGV[1] == filename`
\
however, these may not behave the way we expect: awk arguments are treated as filenames, not as generic parameters that we can define however we like.
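
a minimal sketch to inspect them (the file names `first.txt` and `second.txt` are hypothetical and need not exist, because a program with only a `BEGIN` block never opens its input):

```shell
# List every command-line argument after the program text.
# ARGV[0] is the command name itself, so the loop starts at 1.
out=$(awk 'BEGIN {
    for (i = 1; i < ARGC; i++) print i, ARGV[i]
    print "ARGC =", ARGC
}' first.txt second.txt)
printf '%s\n' "$out"
```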
just as you could specify the `-F` flag, you can also change the field separator from within the awk program (but it will only apply to future input lines/records)
* `FS` input Field Separator, default `BEGIN{FS=" "}` (a single space, which awk treats specially: it behaves much like `[ \t]+` and also ignores leading and trailing blanks)
Notice that the default consumes <font color=red>ALL</font> of the blank space between fields, so awk WON'T know HOW MANY spaces were in the input line unless you either:
1. override the FS default
2. directly inspect `$0`
* `RS` input Record Separator, default `BEGIN{RS="\n"}`
why is the line number called NR (Number of Records) instead of NL (Number of Lines)? because
1. you don't have to use the default: you can change it with RS
2. and if you change it, they won't be input lines any more, so we instead use the generic word "record"
==EXAMPLE==
```tcsh=
yenubuntu:~/unix> echo "A:B:C" > f2
yenubuntu:~/unix> cat f2 | awk -F: '1'
A:B:C
yenubuntu:~/unix> cat f2 | awk -F: '{$1=$1}1'
A B C
yenubuntu:~/unix> cat f2 | awk 'BEGIN{FS=":"}{$1=$1}1'
A B C
yenubuntu:~/unix> cat f2 | awk '{FS=":"}{$1=$1}1'
A:B:C
yenubuntu:~/unix> cat f2 f2 | awk '{FS=":"}{$1=$1}1'
A:B:C
A B C
yenubuntu:~/unix> cat f2 f2 | awk 'BEGIN{FS=":"}NR==2{$1=$1}1'
A:B:C
A B C
yenubuntu:~/unix> cat f2 f2 | awk 'BEGIN{FS=":"}NR==1{$1=$1}1'
A B C
A:B:C
yenubuntu:~/unix> cat f2 | awk 'BEGIN{RS=":"}1'
A
B
C

yenubuntu:~/unix> cat f2 | awk '{RS=":"}1'
A:B:C
yenubuntu:~/unix> cat f2 f2 | awk '{RS=":"}1'
A:B:C
A
B
C

yenubuntu:~/unix>
```
* line 4: `{$1=$1}` looks useless; however, the output differs between ++line 3++ and ++line 5++.
the _input FS_ was changed, but the __output__ field separator is still the __default__ (a space). Then why didn't line 3 change? Because that program ++__didn't make__ awk recompute $0++, whereas the one that produced line 5 did: assigning to any field forces awk to rebuild `$0` by joining the fields with `OFS`.
* line 13: `NR==2` lets awk recompute $0 ++only for the second record++.
* line 23: yes, there is an empty line here.
:::danger
but why?
:::
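one likely explanation for the empty line, as a sketch: once `RS` is `:`, the trailing newline of the input is no longer a separator, so it stays inside the last record, and `print` then appends its own `ORS` newline after it. Bracketing each record makes the embedded newline visible:

```shell
# With RS=":" the final record is "C" plus the newline that ended the input.
out=$(printf 'A:B:C\n' | awk 'BEGIN { RS = ":" } { printf "[%s]\n", $0 }')
printf '%s\n' "$out"
```

the last record shows up as `[C` with the closing `]` on the following line.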
* `OFS` Output Field Separator, default `BEGIN{OFS=" "}`
this one is a bit tricky. the issue is: `$0` is not automatically rebuilt to match OFS, for example:
```tcsh=
yenubuntu:~/unix> echo "a\tb" | awk '1;{$1=$1}1'
a	b
a b
```
* line 2: the default `OFS` is one space, yet the tab is still printed, because __`$0` does not automatically update__.
* the `;` is _necessary_ to indicate that the action that follows is NOT matched to the pattern that precedes it.
* line 3: however, this doesn't mean we can't manually force a recalculation of $0, which is what `{$1=$1}` does.
* `ORS` Output Record Separator, default `BEGIN{ORS="\n"}`
`OFS` and `ORS` can also be strings, not just single characters (but, of course, they cannot be regexes like `FS` and, in gawk, `RS` can).
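
a short sketch with multi-character output separators (the separator strings and the input are arbitrary picks for illustration):

```shell
# {$1=$1} forces $0 to be rebuilt with OFS; the pattern 1 then prints it,
# and print terminates each record with ORS instead of a bare newline.
out=$(printf 'a b c\nd e f\n' | awk 'BEGIN { OFS = "--"; ORS = " |\n" } { $1 = $1 } 1')
printf '%s\n' "$out"
```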

### output formatting
* the `print` command uses a _simplistic_ format
* puts a _space_ wherever there is a <font color=red>comma</font>, "`,`"
* inserts a `\n` at the end
* `printf` allows highly formatted output, as in C
* the usage is same as the C language (remember that awk is a programming language as well)
* for example:
`awk '{printf("apple is $%6.2f dollars.\n", $1)}'`
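
a runnable sketch of `printf` (the item names and prices are made-up sample data). Unlike `print`, `printf` adds no newline on its own, so the format string ends with `\n`:

```shell
# %-6s left-justifies the name in 6 columns; %6.2f right-justifies the
# price in 6 columns with 2 decimals.
out=$(printf 'apple 3.5\npear 12\n' | awk '{ printf("%-6s costs $%6.2f\n", $1, $2) }')
printf '%s\n' "$out"
```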
## operators
most of them are the same as in C
* `=` assignment
* `==` equality operator
* `!=` inequality operator
* `~`, `!~` __extended__ regex match and non-match
==EXAMPLE==
`awk '$1~"[.][0-9]+E"{print $1}'`
`awk '$0!~"(a|b)y"{print "nope"}'`
* `&&`, `||`, `!` logical operators
* `<`, `>`, `<=`, `>=` relational operators
* `+`, `-`, `*`, `/`, `%`, `^` math operators
* `^` exponent
* `space` (implicit or explicit) string concatenation
==EXAMPLE==
`awk '{print A $1}{A=$1}'`, with a space between `A` and `$1`
`awk '{print A$1}{A=$1}'`, but the space is ++optional++: concatenation is _implicitly assumed_ when no operator is given
\
`awk '{print A B}{A=$1}'`; however, this time the space is needed, otherwise awk would read it as a single variable called "AB"
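
a small sketch combining the regex operators with concatenation (the three input values are hypothetical):

```shell
# First rule: $1 contains a dot, digits, then "E" (scientific notation).
# Second rule: $1 contains no digit at all.
out=$(printf '1.5E3\nabc\n.25E1\n' | awk '
$1 ~ "[.][0-9]+E" { print "sci: " $1 }
$1 !~ "[0-9]"     { print "no digits: " $1 }
')
printf '%s\n' "$out"
```

the strings are joined by plain juxtaposition, so there is no space between `"sci: "` and `$1` other than the one inside the quotes.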
## good at data validation
==EXAMPLE==
Suppose I have a file of the salaries of all the employees in a company and I want to see if any of the entries are broken:
\
salaries.txt (fields are separated by tabs)
```
Alan Turing 9.40 40
Alfred Aho 3.50 40
Peter Weinberger 7.50 400
Barbara Liskov 7.50 40
Brian Kernighan $9.00 40
D. Knuth 8.50 35
Grace Hopper 40 9.10
Jon Von Neuman 9.80 35
L. Page 8.50 40 40
Sergey Brin 7.50 55
Tim Berners Lee 8.50 15
```
validate.awk
```tcsh!
#!/usr/bin/awk -f
NF != 3 { print $0, "number of fields not equal to 3" }
$2 < 7.5{ print $0, "rate below minimum or incorrect" }
$2 > 10 { print $0, "rate exceeds $10 per hour"}
$3 < 0 { print $0, "negative hours worked" }
$3 > 60 { print $0, "too many hours worked" }
```
result:
```=
yenubuntu:~/unix> cat salaries.txt | awk -F"[\t]" -f validate.awk
Alfred Aho 3.50 40 rate below minimum or incorrect
Peter Weinberger 7.50 400 too many hours worked
Brian Kernighan $9.00 40 rate below minimum or incorrect
Grace Hopper 40 9.10 rate exceeds $10 per hour
L. Page 8.50 40 40 number of fields not equal to 3
```
notice that on ++line 4++ the second field (`-f2`) is _broken_ (it has a `$` sign), yet awk still handled the comparison instead of failing. Why?
## how awk handles strings vs numbers?
```tcsh!=
yenubuntu:~/unix> echo|awk '{print 10+10,"10" "10"}'
20 1010
yenubuntu:~/unix> echo|awk '{print "10"+"10",10 10}'
20 1010
yenubuntu:~/unix> echo|awk '{print "55NotANum"+0,0+"NotANum"}'
55 0
yenubuntu:~/unix> echo|awk '{print("a"<"b"),("a">"A"),("a"<"ap")}'
1 1 1
yenubuntu:~/unix> echo|awk '{print("10"<"9"),(10<9),("$"<"7")}'
1 0 1
yenubuntu:~/unix> echo|awk '{print("$9"<"7.5"),("$9"<7.5)}'
1 1
yenubuntu:~/unix> echo 8@X | tr "@" "\n" | awk \
? '{print($1,$1+0,$1+"0",$1 "o",$NF-1,$(NF-1))}'
8 8 8 8o 7 8
X 0 0 Xo -1 X
```
* line 3: what happens when you use a numeric operator on string operands? what happens when you use a string operator on numeric operands?
awk _converts_ the operand types so that they __work with the operator__.
* line 5: what if the operand cannot be converted to a number?
++any number at the front++ of the operand is used, and if there is no number at the front, it __converts to 0__. hence, `0+"NotANum"` equals `0+0`.
* line 7: strings are compared according to __ASCII__ order (same as C). That is, the ASCII codes of `a`, `b`, `A` are 97, 98 and 65 respectively. `"ap"` is a longer word that starts with `"a"`, so the __longer is bigger__ principle applies.
* line 14:
* `$NF-1`: take the last field and subtract 1 from it
* `$(NF-1)`: take the 2^nd^-to-last field. but if `NF==1`, then it is `$0`, the full line.
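
this also explains the `$9.00` entry in the validation example: the field does not look like a number, so `$2 < 7.5` falls back to a string comparison, and `"$"` sorts before `"7"`. A hedged sketch of a stricter check (the regex and the message texts are my own choices, not part of the original script):

```shell
# Reject rates that are not plain decimal numbers before comparing them.
# Adding 0 forces a numeric comparison for the fields that pass the check.
out=$(printf 'Brian\t$9.00\t40\nAlan\t9.40\t40\nAlfred\t3.50\t40\n' | awk -F'\t' '
$2 !~ /^[0-9]+([.][0-9]+)?$/ { print $1 ": rate is not a number (" $2 ")"; next }
$2 + 0 < 7.5                 { print $1 ": rate below minimum" }
')
printf '%s\n' "$out"
```

`next` skips the remaining rules for a record that already failed the format check.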
## the pattern selection part of pattern/action pair
### complex cases
awk patterns are good for selecting specific lines from the input for further processing, for instance below:
* selection by comparison:
`$2 >= 8 {print}`
`$2 * $3 > 50 {printf("%6.2f for %s\n", $2 * $3, $1)}`
* selection by logical operation:
`$2 >= 8 || $3 >= 20`
* selection by text content:
`$1 == "Grace"`
`$1 ~ "Grace"`
`/Grace/` (a regex match, with the same meaning as in sed; however, awk matches lines against an ++extended++ regex. It is equivalent to `$0 ~ "Grace"`)
### the BEGIN and END patterns
the special pattern `BEGIN` matches before the first input line is read; `END` matches after the last input line has been read.
this allows for initial and wrap-up processing
```
BEGIN { FS=","; print "NAME RATE HOURS"; print "" }
{ print }
END { print "total number of employees is", NR }
```
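a runnable variant of the program above, fed with two hypothetical comma-separated records; it prints the fields individually so that the effect of `FS=","` is visible:

```shell
# BEGIN sets the separator and prints a header before any input is read;
# END reports the record count once the input is exhausted.
out=$(printf 'Alan,9.40,40\nGrace,9.10,40\n' | awk '
BEGIN { FS = ","; print "NAME RATE HOURS"; print "" }
      { print $1, $2, $3 }
END   { print "total number of employees is", NR }
')
printf '%s\n' "$out"
```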
## computing with AWK
* Counting is easy to do with AWK:
`$3 > 15 { emp = emp + 1}`
`END { print emp, "employees worked more than 15 hrs"}`
* Computing Sums and Averages is also simple:
```
{ pay = pay + $2 * $3 }
END { print NR, "employees" #NR’s value is from the final line
print "total pay is", pay
print "average pay is", pay/NR }
```
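the same computation as a runnable sketch (the names and numbers are made up; the `NR > 0` guard is my addition so that empty input does not divide by zero):

```shell
# Accumulate rate * hours per record, then report totals in END.
out=$(printf 'Alan 10 40\nGrace 20 10\n' | awk '
{ pay = pay + $2 * $3 }
END {
    print NR, "employees"
    print "total pay is", pay
    if (NR > 0) print "average pay is", pay / NR   # guard against empty input
}')
printf '%s\n' "$out"
```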
* Printing the Last Input Line
Although NR retains its value after the last input line has been read, $0 does not (on some systems):
```
{ last = $0 }
END { print NR ":", last }
```
## handling text
Awk variables can hold strings of characters as well as numbers, and Awk conveniently translates back and forth as needed
* The following program finds the employee who is paid the most per hour
`$2 > maxrate { maxrate = $2; maxemp = $1 }`
`END { print "highest hourly rate:",maxrate,"for",maxemp }`
* String Concatenation: the space operator: New strings can be created by combining old ones
```
{ names = names $1 ", " }
END { print names }
```
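the program above leaves a trailing `", "` after the last name. One possible variation, as a sketch (the helper variable `sep` and the sample input are hypothetical), puts the separator in front of every name except the first:

```shell
# sep is empty for the first record and ", " afterwards,
# so no separator follows the last name.
out=$(printf 'Alan 9.40\nGrace 9.10\nBrian 9.00\n' | awk '
{ names = names sep $1; sep = ", " }
END { print names }
')
printf '%s\n' "$out"
```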
---
remaining content of the course [here](https://docs.google.com/presentation/d/1pdrbL8v_4cbMk7EU7p1V31EecKI0bobt/edit?usp=share_link&ouid=109644673666755707985&rtpof=true&sd=true)