PYTHON TRAINING TEST - GS 06/2021 (hieunguyen@GS)

## Part I. Introduction to Basic BASH and Python

By BASH script, we also mean any type of Linux command or pre-built tool. Use anything that can be run in the TERMINAL.

- Q1. If a Python function doesn't have a ```return``` statement to give back a value, what is the default output when calling that function?

- Q2. In a Python dictionary, is it possible to have two identical keys that are assigned two different values?

- Q3. Using the file "sample_SAM.sam" as the input for this question, write some BASH script commands to:
  - Q3.1: Count the number of rows in the text file.
  - Q3.2: Print out the first line.
  - Q3.3: Print out the last line.
  - Q3.4: Remove all lines that start with the symbol "@" and write the rest to a new file.
  - Q3.5: Extract the 3rd, 4th and 8th columns and write them to a new ".tsv" file with the same name as the input.
  - Q3.6: Extract all rows whose 3rd column is equal to "chr1" and write them to a new ".txt" file with the same name as the input.

- Q4: DATAFRAME. Because we have to work with many delimited text files (delimiter "\t" or ","), the Python DataFrame is a useful tool for processing this data.

  This question uses the output of Q3.5 and the file named ```sample_VCF.vcf``` as inputs. Please do some quick research on how to use the ```pandas``` library in Python and complete the following tasks.

  Hint: you can use the following command to read a file into a dataframe ```df```: ``` df = pandas.read_csv(path_to_file, sep = ___, header = None)``` The delimiter of the input file should be filled in at ```sep = ___```.

  - Q4.1: Read the file from Q3.5 into a ```pandas``` dataframe and rename the columns to ```CHROMOSOME```, ```START```, ```END```.
  - Q4.2: Add a new column named ```COMBINED_INFO``` to this dataframe. The column ```COMBINED_INFO``` should contain the information from the columns ```CHROMOSOME```, ```START```, ```END``` in this form: ```CHROMOSOME:START-END```.
  - Q4.3: Add a new column named ```LENGTH``` to this dataframe. This column contains the result of ```END``` - ```START```. For example:

    | CHROMOSOME | START | END   | COMBINED_INFO    | LENGTH |
    |------------|-------|-------|------------------|--------|
    | chr1       | 10000 | 20000 | chr1:10000-20000 | 10000  |
    |            |       |       |                  |        |

  **From this question onward, use the ```sample_VCF.vcf``` file as input.**

  - Q4.4: Before reading the ```vcf``` file into a ```pandas``` dataframe, remove all lines that start with the symbol "##" but keep those that start with "#" (this part is similar to Q3.4). Read the result into a ```pandas``` dataframe.
  - Q4.5: As you may have noticed, the "INFO" column of the ```sample_VCF.vcf``` file contains many pieces of information separated by ";". Extract the value of the ```AF=``` entry and put it in a new column named "AF". Remove the column "INFO" after that. For example:

    Input:

    | #CHROM | POS      | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | 1-BCRCB42_AD500-AD003 |
    |--------|----------|----|-----|-----|------|--------|------|--------|-----------------------|
    | chr1   | 26731452 | .  | T   | C   | 42   | PASS   | SAMPLE=1-BCRCB42_AD500-AD003;TYPE=SNV;DP=92;VD=3;AF=0.0326;... | GT:DP:VD:AD:AF:RD:ALD | 0/1:92:3:89,3:0.0326:21,68:3,0 |

    Output:

    | #CHROM | POS      | ID | REF | ALT | QUAL | FILTER | FORMAT | 1-BCRCB42_AD500-AD003 | AF |
    |--------|----------|----|-----|-----|------|--------|--------|-----------------------|----|
    | chr1   | 26731452 | .  | T   | C   | 42   | PASS   | GT:DP:VD:AD:AF:RD:ALD | 0/1:92:3:89,3:0.0326:21,68:3,0 | 0.0326 |

  - Q4.6 (**BONUS**): The ```sample_VCF.vcf``` file contains all MUTATIONS called from a patient sample. Each row is a mutation. Information about each mutation, including the gene name, is in the column ```INFO```. Could you extract all mutations that are in the GENE ```FAT4```? How many of them have ```FILTER == PASS```?
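
As a warm-up for the ```INFO``` column described in Q4.5, the sketch below shows in plain Python (without ```pandas```) how a ";"-separated string of ```KEY=VALUE``` entries can be split into a dictionary. It is only an illustration of the field's structure, not the expected solution: the helper name ```parse_info``` is hypothetical, and the input string is the visible part of the example row in Q4.5, cut off where the source shows "...".

```python
# INFO string copied from the example row in Q4.5, truncated where the
# source shows "..." (the remaining entries are not known here).
info = "SAMPLE=1-BCRCB42_AD500-AD003;TYPE=SNV;DP=92;VD=3;AF=0.0326"


def parse_info(info_field):
    """Split a ';'-separated KEY=VALUE string into a dict of strings.

    Hypothetical helper for illustration only.
    """
    parsed = {}
    for item in info_field.split(";"):
        if not item:
            continue  # skip empty pieces, e.g. caused by a trailing ';'
        # partition splits at the first '=' only; an entry without '='
        # is kept as a key with an empty string as its value.
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


fields = parse_info(info)
print(fields["AF"])
print(fields["TYPE"])
```

All values stay strings here, so a numeric comparison on ```AF``` would still need an explicit ```float(...)``` conversion.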