Two-file processing

Check matching fields in two files

Given these two CSV files:

$ cat file1
1,line1
2,line2
3,line3
4,line4
$ cat file2
1,line3
2,line4
3,line5
4,line6

To print the lines in file2 whose second column also occurs in file1, we can say:

$ awk -F, 'FNR==NR {lines[$2]; next} $2 in lines' file1 file2
1,line3
2,line4

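Here, lines is an array that gets populated while reading file1: the second field of each line becomes a key. Merely referencing lines[$2] is enough to create the element, and next skips the remaining rules for that line, so nothing from file1 is printed.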

Then, the condition $2 in lines checks, for every line in file2, whether the second field exists as a key in the array. If so, the condition is true and awk performs its default action, which is to print the full line.
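
The two phases described above can also be written in a longer, commented form. This is only a sketch: it recreates the sample file1 and file2 in a scratch directory, the names tmp and result are illustrative, and the explicit { print } stands in for the default action.

```shell
# Recreate the sample files in a scratch directory (paths are illustrative).
tmp=$(mktemp -d)
printf '1,line1\n2,line2\n3,line3\n4,line4\n' > "$tmp/file1"
printf '1,line3\n2,line4\n3,line5\n4,line6\n' > "$tmp/file2"

result=$(awk -F, '
    # Phase 1: true only while reading the first file.
    # Store the second field as an array key, then skip to the next line.
    FNR == NR { lines[$2]; next }

    # Phase 2: reached only for the second file.
    # Print the line if its second field was stored in phase 1.
    $2 in lines { print }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
# prints:
# 1,line3
# 2,line4
```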

If only one field needs to be printed, the expression could be:

$ awk -F, 'FNR==NR {lines[$2]; next} $2 in lines {print $1}' file1 file2
1
2
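
The same lookup can be inverted to print the lines of file2 whose second field does not occur in file1. The sketch below assumes the same sample data, recreated in a scratch directory; tmp and result are illustrative names.

```shell
# Recreate the sample files in a scratch directory (paths are illustrative).
tmp=$(mktemp -d)
printf '1,line1\n2,line2\n3,line3\n4,line4\n' > "$tmp/file1"
printf '1,line3\n2,line4\n3,line5\n4,line6\n' > "$tmp/file2"

# Negating the membership test keeps only the lines of the second file
# whose second field was never stored while reading the first file.
result=$(awk -F, 'FNR==NR {lines[$2]; next} !($2 in lines)' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
# prints:
# 3,line5
# 4,line6
```

The parentheses around $2 in lines keep the negation applied to the whole membership test rather than to $2 alone.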

Print awk variables when reading two files

This example shows how awk's built-in variables such as NR, FNR and FILENAME change while awk is processing two files.

$ awk '{print "NR:",NR,"FNR:",FNR,"fname:",FILENAME,"Field1:",$1}' file1 file2
NR: 1 FNR: 1 fname: file1 Field1: f1d1
NR: 2 FNR: 2 fname: file1 Field1: f1d5
NR: 3 FNR: 3 fname: file1 Field1: f1d9
NR: 4 FNR: 1 fname: file2 Field1: f2d1
NR: 5 FNR: 2 fname: file2 Field1: f2d5
NR: 6 FNR: 3 fname: file2 Field1: f2d9

Where file1 and file2 each contain three lines; only the first line of each is shown here:

$ cat file1
f1d1 f1d2 f1d3 f1d4

$ cat file2
f2d1 f2d2 f2d3 f2d4

Notice how the value of NR keeps increasing across all files, while FNR resets at the start of each file. This is why the expression NR==FNR refers to the first file fed to awk: only while reading the first file can NR be equal to FNR (as long as that file is not empty).
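
A small sketch can make this visible. It uses two hypothetical two-line files (first and second, created in a scratch directory) rather than the files above, and labels each line according to whether NR==FNR holds. The second command shows the edge case of an empty first file: no records are read from it, so NR==FNR is then true for the second file instead.

```shell
# Hypothetical sample files, created only for this illustration.
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/first"
printf 'c\nd\n' > "$tmp/second"
: > "$tmp/empty"    # an empty file

# Label every line according to whether NR equals FNR.
labels=$(awk '{ print $1, (NR == FNR ? "first-file" : "later-file") }' "$tmp/first" "$tmp/second")
printf '%s\n' "$labels"
# prints:
# a first-file
# b first-file
# c later-file
# d later-file

# Edge case: the first file is empty, so NR is still 0 when the
# second file starts, and NR==FNR matches the lines of the second file.
edge=$(awk 'NR == FNR { print "treated as first:", $0 }' "$tmp/empty" "$tmp/second")
printf '%s\n' "$edge"
# prints:
# treated as first: c
# treated as first: d
```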