Awk workshop: QA follow-up

I have 1000 files. Each file is named after a MAG and contains a single column listing the KOs present in that MAG. Is it possible, with AWK, to combine them into a new file in which the first column holds the KO names and each following column corresponds to one MAG (file name), with 1 if the KO is present in that file and 0 if it is not? The set of KOs is not necessarily the same across MAGs.

Awk script

collect-KO.awk

#!/usr/bin/awk -f
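# Requires GNU awk (gawk): uses arrays of arrays and PROCINFO["sorted_in"]
# Make the output sorted by KO name (string order on the array index)
BEGIN{PROCINFO["sorted_in"]="@ind_str_asc"}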

# Take the file name from the first line of each file
# (line 1 of every input file is treated as its name, not as a KO)
FNR==1 {fname=$1; flist[++i]=fname; next}

# Collect KOs: mark each one present in the current file with 1.
# KOs never seen in a file stay unset (""), which becomes 0 when multiplied by 1 further down
{ KO[$1][fname]= 1 }

# Print collected data with sorted KOs
# and the file columns in the order given on the command line
END{
  # Print the first line (header)
  printf("\t")
  for(i=1;i<=length(flist);i++) printf("%s\t",flist[i])
  print""

  # Print KOs in sorted fashion 
  for(iKO in KO){
    printf("%s\t",iKO)
    for(i=1;i<=length(flist);i++)
      printf("%s\t",KO[iKO][flist[i]]*1)
    print""
  }
}

Example run

$ ./collect-KO.awk *.txt2
        1207509.faa.emapper.emapper.annotations_ko.txt  1208238.faa.emapper.emapper.annotations_ko.txt  1208274.faa.emapper.emapper.annotations_ko.txt  2754412846.faa.emapper.emapper.annotations_ko.txt  2754412852.faa.emapper.emapper.annotations_ko.txt  2754412860.faa.emapper.emapper.annotations_ko.txt  2754412866.faa.emapper.emapper.annotations_ko.txt  2754412870.faa.emapper.emapper.annotations_ko.txt  2754412880.faa.emapper.emapper.annotations_ko.txt  2754412884.faa.emapper.emapper.annotations_ko.txt  2754412890.faa.emapper.emapper.annotations_ko.txt  2754412892.faa.emapper.emapper.annotations_ko.txt
K00001  0       0       0       0       0       0       0       0       0       1       0       0       
K00002  0       0       0       0       1       0       0       0       0       0       0       0
...
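
For a quick check on a small scale, here is a minimal sketch rather than the workshop script itself. It assumes the input files contain only KOs, with no file name on the first line, so it takes the column names from the built-in FILENAME variable instead. It also uses a combined present[ko, file] key and a small insertion sort, so it should run in any POSIX awk and not only in GNU awk. The file names MAG_A.txt and MAG_B.txt and their contents are made up for illustration.

```shell
# Hypothetical sample data: two MAG files with one KO per line and no header line
workdir=$(mktemp -d)
printf 'K00003\nK00001\n' > "$workdir/MAG_A.txt"
printf 'K00002\nK00003\n' > "$workdir/MAG_B.txt"

matrix=$(cd "$workdir" && awk '
  # Register each file once, in command-line order, using the built-in FILENAME
  FNR == 1 { files[++nfiles] = FILENAME }

  # Skip blank lines; remember each KO and mark the (KO, file) pair as present
  NF { kos[$1] = 1; present[$1, FILENAME] = 1 }

  END {
    # Copy the KO names into an indexed array and sort them as strings
    for (ko in kos) sorted[++n] = ko
    for (a = 2; a <= n; a++) {
      key = sorted[a]
      for (b = a - 1; b >= 1 && sorted[b] > key; b--) sorted[b + 1] = sorted[b]
      sorted[b + 1] = key
    }

    # Header: empty first cell, then one column per file
    for (f = 1; f <= nfiles; f++) printf "\t%s", files[f]
    print ""

    # One row per KO; a pair that was never set yields "" and "" * 1 is 0
    for (k = 1; k <= n; k++) {
      printf "%s", sorted[k]
      for (f = 1; f <= nfiles; f++) printf "\t%d", present[sorted[k], files[f]] * 1
      print ""
    }
  }
' MAG_A.txt MAG_B.txt)

printf '%s\n' "$matrix"
rm -r "$workdir"
```

With the two sample files this prints a header line with MAG_A.txt and MAG_B.txt, followed by the tab-separated rows K00001 1 0, K00002 0 1 and K00003 1 1. Note that an empty input file never triggers FNR == 1 and would therefore get no column in this sketch.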

Comments

  • The columns appear in the order in which the files are given on the command line. If a wildcard such as *.txt2 is used, that order is determined by the shell.
  • Each file is scanned, and every KO is stored in the array KO with the file name as the second key: { KO[$1][fname]= 1 }
  • At the END all collected KOs are printed. For a KO that is missing from a file, the entry KO[iKO][fname] is "" by definition. Each entry is multiplied by 1 when printed, which in awk yields 1 for present and 0 for missing entries.
  • The script uses tab characters (\t) to separate the output columns; it should be easy to replace them with fixed-width spacing if necessary.
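
Two of the points above can be checked in isolation. The following sketch is not part of the workshop script. It first shows that an array element that was never set evaluates to 0 when multiplied by 1, and then shows one possible way to replace the tab separators with fixed-width columns using printf field widths. The widths (10 and 3 characters) and the sample values are arbitrary assumptions.

```shell
# An element that was never assigned is "" and becomes 0 in arithmetic
unset_demo=$(awk 'BEGIN { seen["K00001"] = 1; print seen["K00001"] * 1, seen["K00002"] * 1 }')
printf '%s\n' "$unset_demo"    # prints: 1 0

# Fixed-width alternative to "\t": left-justify the KO name in 10 characters,
# right-justify each value in 3 characters
fixed_demo=$(awk 'BEGIN { printf "%-10s%3d%3d\n", "K00001", 1, 0 }')
printf '[%s]\n' "$fixed_demo"  # prints: [K00001      1  0]
```

In the workshop script the same idea would mean changing the "%s\t" formats in the END block to width-based ones; the header would need a width at least as long as the longest file name to stay aligned.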

Contacts:


tags: UPPMAX, SNIC, awk