# PYTHON TRAINING TEST - GS 06/2021 (hieunguyen@GS)
## Part I. Introduction to Basic BASH and Python.
By "BASH script" we mean any Linux command or pre-built tool; you may use anything that can be run in the terminal.
- Q1. If a Python function doesn't have a ```return``` statement to give back a value, what does a call to that function return by default?
- Q2. In a Python dictionary, is it possible to have two identical keys that map to two different values?
- Q3. Using the file "sample_SAM.sam" as the input for this question, write some BASH script commands to:
Q3.1: Count the number of rows in the text file.
Q3.2: Print out the first line.
Q3.3: Print out the last line.
Q3.4: Remove all lines that start with the symbol "@" and write the rest to a new file.
Q3.5: Extract the 3rd, 4th and 8th columns and write to a new ".tsv" file with the same name as the input.
Q3.6: Extract all rows whose 3rd column equals "chr1" and write them to a new ".txt" file with the same name as the input.
- Q4: DATAFRAME.
Because we often work with delimited text files (fields separated by "\t" or ","), the ```pandas``` DataFrame is a useful tool for processing such data.
This question uses the output of Q3.5 and the file ```sample_VCF.vcf``` as inputs. Please do some quick research on how to use the ```pandas``` library in Python and complete the following tasks.
Hint: you can use the following command to read a file into dataframe ```df```:
``` df = pandas.read_csv(path_to_file, sep = ___, header = None)```
The delimiter of the input file should be filled in ```sep = ___```.
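As a minimal sketch of how the ```sep``` argument behaves, the snippet below reads a small in-memory table whose fields are separated by ```|```. The delimiter and the toy rows are assumptions chosen only for illustration; they are not the delimiter or the contents of the files used in this test.

```python
import io

import pandas as pd

# Toy data (hypothetical): three fields per row, separated by "|".
toy_text = "a|1|2\nb|3|4\n"

# header=None tells pandas there is no header row,
# so the columns are labelled 0, 1, 2.
df = pd.read_csv(io.StringIO(toy_text), sep="|", header=None)

print(df.shape)
print(list(df.columns))
```

For a real file, a path would be passed in place of the ```io.StringIO``` object.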
Q4.1: Read the file in Q3.5 to a ```pandas``` dataframe and rename the columns to ```CHROMOSOME```, ```START```, ```END```.
Q4.2: Add a new column named ```COMBINED_INFO``` to this dataframe. The column ```COMBINED_INFO``` should combine the columns ```CHROMOSOME```, ```START```, ```END``` in this form: ```CHROMOSOME:START-END```.
Q4.3: Add a new column named ```LENGTH``` to this dataframe. This column contains the result of ```END``` - ```START```.
For example:
| CHROMOSOME | START | END | COMBINED_INFO | LENGTH |
|------------|-------|-----|------------------|--------|
| chr1 | 10000 |20000|chr1:10000-20000 |10000 |
| | | | | |
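One possible way to build such columns is sketched below on a toy dataframe, following the format of the example row in the table above (```chr1:10000-20000```). The values are hypothetical and the approach is only one of several that ```pandas``` allows.

```python
import pandas as pd

# Toy dataframe (hypothetical values) using the column names from Q4.1.
df = pd.DataFrame(
    {"CHROMOSOME": ["chr1", "chr2"], "START": [10000, 500], "END": [20000, 750]}
)

# Numeric columns are converted to strings before concatenation.
df["COMBINED_INFO"] = (
    df["CHROMOSOME"] + ":" + df["START"].astype(str) + "-" + df["END"].astype(str)
)

# Element-wise subtraction of two numeric columns.
df["LENGTH"] = df["END"] - df["START"]

print(df)
```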
**From this question onward, use the ```sample_VCF.vcf``` file as input.**
Q4.4: Before reading the ```vcf``` file into a ```pandas``` dataframe, remove all lines that start with "##" but keep those that start with a single "#" (this step is similar to Q3.4). Then read the result into a ```pandas``` dataframe.
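The filtering step can also be done in Python itself. The sketch below is one possible approach on toy lines; the helper name ```drop_meta_lines``` and the sample content are hypothetical and not taken from ```sample_VCF.vcf```.

```python
def drop_meta_lines(lines):
    """Keep every line except those that start with '##'."""
    return [line for line in lines if not line.startswith("##")]


# Toy VCF-like content (hypothetical), not taken from sample_VCF.vcf.
toy_lines = [
    "##fileformat=VCFv4.2\n",
    "##source=example\n",
    "#CHROM\tPOS\tID\n",
    "chr1\t100\t.\n",
]

kept = drop_meta_lines(toy_lines)
print("".join(kept), end="")
```

The single-"#" header line survives because the check looks for the two-character prefix "##" only.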
Q4.5: As you may have noticed, the "INFO" column of the ```sample_VCF.vcf``` file contains many fields separated by ";". Extract the value of the ```AF=``` field and put it in a new column named "AF", then remove the "INFO" column. For example:
Input:
|#CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | 1-BCRCB42_AD500-AD003|
|-------|---------|-----|-----|-----|-------|---------|-------|---------|----------------------|
|chr1 | 26731452| . |T | C |42 |PASS |SAMPLE=1-BCRCB42_AD500-AD003;TYPE=SNV;DP=92;VD=3;AF=0.0326;... | GT:DP:VD:AD:AF:RD:ALD | 0/1:92:3:89,3:0.0326:21,68:3,0
Output:
|#CHROM | POS | ID | REF | ALT | QUAL | FILTER | FORMAT | 1-BCRCB42_AD500-AD003| AF |
|-------|---------|-----|-----|-----|-------|---------|---------|----------------------|----|
|chr1 | 26731452| . |T | C |42 |PASS | GT:DP:VD:AD:AF:RD:ALD | 0/1:92:3:89,3:0.0326:21,68:3,0 | 0.0326|
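To illustrate the ";"-separated structure of the INFO value, the sketch below splits it into key/value pairs with plain Python. The helper ```parse_info``` is hypothetical, and the string contains only the visible part of the INFO value from the example row; the elided remainder is not reproduced.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict of key -> value (as strings)."""
    fields = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
        elif item:
            # Flag-style entries without "=" are stored with value None.
            fields[item] = None
    return fields


# Visible part of the INFO value from the example row above;
# the rest is elided in the example and is not reproduced here.
info = "SAMPLE=1-BCRCB42_AD500-AD003;TYPE=SNV;DP=92;VD=3;AF=0.0326"

af = float(parse_info(info)["AF"])
print(af)
```

In a dataframe, a function like this could be applied to each row with ```df["INFO"].apply(...)```, and the column removed afterwards with ```df.drop(columns=["INFO"])```.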
Q4.6 (**BONUS**): The ```sample_VCF.vcf``` file contains all mutations called from a patient sample; each row is one mutation. Information about each mutation, including the gene name, is stored in the ```INFO``` column. Can you extract all mutations that are in the gene ```FAT4```? How many of them have ```FILTER == PASS```?