# ZD125835

## Summary

The overall phenotype of the human TCR sample is similar to that seen for the mouse samples analyzed by Nur. The data show that cells with either high Ab counts or high TCR UMI counts are excluded mainly by the "No confident contigs" or "ShadowCommonContig" filters, which points to filtering caused by either doublets or high ambient RNA. Most of the remaining data has low TCR UMIs, which again could be due to high ambient mRNA. We also see a distinctive phenotype of several cells with 5 or more UMIs.

## Data

Data here: `/mnt/customer1/swops/ZD125835_PoorHumanBCR`

Tool binary by Sreenath (for future use; not used in the analysis below): `/mnt/projectdata/janeway/tools/vdj_filter_summary`

## Using LB

Are these cells T cells based on the Ab data? Judging from CD3 expression, yes, these look like valid T cells.

![](https://i.imgur.com/ch09POZ.png)

There seems to be a group of high-Ab cells that were red for 5 of the 6 Abs I checked. Are these duplicates or broken cells with nonspecific Ab signal?

If we split the cells into those with VDJ and those without:

- yesVDJ: cells where a TCR cell was called via vloupe
- notVDJ: top cells with high UMI counts but no TCR cell call
- VDJ?: the rest of the cells, called by GEx but not by TCR

From the above we see that the cluster of cells with high Ab UMI counts is not called by TCR cell calling.

![](https://i.imgur.com/HZwHPOO.png)

Made a category "Abnlevel->high" of cells with Ab > log15 (see below). It shows that most high-Ab cells are not called as TCR cells.

![](https://i.imgur.com/Tewsko7.png)

Also grabbed the top barcodes with high UMI in the VDJ data that were not called as cells. From the plot below, it does not look like the high Ab UMI counts overlap with the high-UMI VDJ barcodes that were not called.

![](https://i.imgur.com/kHSr1sl.png)

## Looking into VDJ data via the all_contig_annotations.csv file

Total contigs:

```
wc -l vdj_t/all_contig_annotations.csv
61259
```

Number of barcodes. This is similar to what we see in the barcode rank plot.
```
cut -d "," -f 1 vdj_t/all_contig_annotations.csv | sort -u | wc -l
40812
```

The number of unique barcodes is about 66% (40k/60k) of the number of contigs, which shows that most barcodes carry a single contig. This may explain why a large number of barcodes are filtered out by the CELL filter (see below).

Count the productive (full-length) contigs in each barcode and sort the barcodes by that count.

```
grep "true,true" fh/scratch/delete90/furlan_s/KW2/PBSC_Stim1/outs/multi/vdj_t/all_contig_annotations.csv | cut -d "," -f1 | sort | uniq -c | sort -k 1,1nr | head -n 30
     23 CTACATTCATAAAGGT-1
     20 CCTAAAGTCACCGGGT-1
     19 CTACGTCAGACTAGAT-1
     17 ACTTACTAGTGCGATG-1
     17 GCTTCCAAGCCAACAG-1
     17 GGAACTTGTAGGACAC-1
     17 GTTCGGGGTTACAGAA-1
     17 TACCTATGTGAAGGCT-1
     17 TCCACACTCCATTCTA-1
     17 TCTCTAACATGAGCGA-1
     16 CGCCAAGCAGCATACT-1
     16 CTGGTCTAGGACAGAA-1
     16 GAACGGATCAGTCCCT-1
     15 AACTGGTTCCTAGGGC-1
     15 AGTTGGTGTCGGGTCT-1
     15 ATCTGCCAGACTTTCG-1
     15 ATTGGACCATGCTGGC-1
     15 CAGGTGCGTAACGACG-1
     15 CATGCCTGTGGAAAGA-1
     15 GCTGCTTAGTAGGCCA-1
     15 TCTCTAATCAGGCCCA-1
     14 ACACTGAGTTTGGGCC-1
     14 ACACTGATCCCTGACT-1
     14 ATCACGATCTCTGTCG-1
     14 ATCCGAAGTTACGCGC-1
     14 ATTATCCGTACCGAGA-1
     14 CAAGAAACAAGCCCAC-1
     14 CCTATTACACATTAGC-1
     14 CGCCAAGCACTTAACG-1
     14 CTAACTTCAATCTGCA-1
```

Plot the frequency of the contig counts obtained above.

```
# get it for human sample
grep "true,true" fh/scratch/delete90/furlan_s/KW2/PBSC_Stim1/outs/multi/vdj_t/all_contig_annotations.csv | cut -d "," -f1 | sort | uniq -c | sort -k 1,1nr > prod_fl_contig_count_per_bc_PBSC_Stim1

# get the freq for mouse gp4 sample
grep "true,true" all_contig_annotations_gp4.csv | cut -d "," -f 1 | sort | uniq -c | sort -k 1,1nr > prod_fl_contig_count_per_bc_gp4

# get it for 10x public datasets
grep "true,true" /mnt/showroom/10x.files/samples/cell-vdj/5.0.0/vdj_v1_hs_nsclc_multi_5gex_t_b/vdj_t_all_contig_annotations.csv | cut -d "," -f 1 | sort | uniq -c | sort -k 1,1nr > cr5_vdj_v1_hs_nsclc_multi_5gex_t_b_tcr
grep "true,true" /mnt/showroom/10x.files/samples/cell-vdj/5.0.0/vdj_v1_hs_nsclc_multi_5gex_t_b/vdj_b_all_contig_annotations.csv | cut -d "," -f 1 | sort | uniq -c | sort -k 1,1nr > cr5_vdj_v1_hs_nsclc_multi_5gex_t_b_bcr
grep "true,true" /mnt/showroom/10x.files/samples/cell-vdj/6.1.2/5k_human_antiCMV_T_TBNK_manual_Multiplex/vdj_t_all_contig_annotations.csv | cut -d "," -f 1 | sort | uniq -c | sort -k 1,1nr > cr6p1p2_5k_human_antiCMV_T_TBNK_manual_Multiplex_tcr
grep "true,true" /mnt/showroom/10x.files/samples/cell-vdj/6.1.2/5k_human_antiCMV_T_TBNK_connect_Multiplex/vdj_t_all_contig_annotations.csv | cut -d "," -f 1 | sort | uniq -c | sort -k 1,1nr > cr6p1p2_5k_human_antiCMV_T_TBNK_connect_Multiplex_tcr
grep "true,true" cr_6p0p1_SC5v2_Melanoma_5Kcells_Connect_single_channel.csv | cut -d "," -f 1 | sort | uniq -c | sort -k 1,1nr > cr_6p0p1_SC5v2_Melanoma_5Kcells_Connect_single_channel.barcodes
grep "true,true" cr_6p0p1_SC5v2_humanPBMCs_5Kcells_Connect_single_channel.csv | cut -d "," -f 1 | sort | uniq -c | sort -k 1,1nr > cr_6p0p1_SC5v2_humanPBMCs_5Kcells_Connect_single_channel.barcodes
```

Plot in R

```
# Each file has two columns: contig count (V1) and barcode (V2)
gp4 <- read.table("~/Downloads/prod_fl_contig_count_per_bc_gp4",header=FALSE)
pbsc_stim1 <- read.table("~/Downloads/prod_fl_contig_count_per_bc_PBSC_Stim1",header=FALSE)
cr5_nsclc_tcr <- read.table("~/Downloads/cr5_vdj_v1_hs_nsclc_multi_5gex_t_b_tcr",header=FALSE)
cr5_nsclc_bcr <- read.table("~/Downloads/cr5_vdj_v1_hs_nsclc_multi_5gex_t_b_bcr",header=FALSE)
cr6_connect_tcr <- read.table("~/Downloads/cr6p1p2_5k_human_antiCMV_T_TBNK_connect_Multiplex_tcr")
cr6_manual_tcr <- read.table("~/Downloads/cr6p1p2_5k_human_antiCMV_T_TBNK_manual_Multiplex_tcr")
pbmc_6.0.1 <- read.table("~/Downloads/cr_6p0p1_SC5v2_humanPBMCs_5Kcells_Connect_single_channel.barcodes",header=FALSE)
melanoma_6.0.1 <- read.table("~/Downloads/cr_6p0p1_SC5v2_Melanoma_5Kcells_Connect_single_channel.barcodes",header=FALSE)

# Density of productive contig counts per barcode
plot(density(pbmc_6.0.1[,1]),col="blue",main="human PBMC TCR 6.0.1")
plot(density(melanoma_6.0.1[,1]),col="blue",main="Melanoma TCR 6.0.1")
```

### Customer samples

![](https://i.imgur.com/KSFmoip.png)

![](https://i.imgur.com/th0fZXD.png)

### 10x samples (unsorted cells)

![](https://i.imgur.com/ILgiU6N.png)

![](https://i.imgur.com/1EbbX6H.png)

![](https://i.imgur.com/KuXrYaw.png)

![](https://i.imgur.com/Rzz2OxG.png)

### 10x samples with sorted cells for dextramer data

![](https://i.imgur.com/kFhwWUh.png)

![](https://i.imgur.com/zdX4gYt.png)

The figures above show that the customer data has an unexpectedly high number of contigs per cell.

## Run enclone

```
enclone TCR=./vdj_t/ NALL NOPRINT SUMMARY
```

![](https://i.imgur.com/Pdxh2oL.png)

![](https://i.imgur.com/FGOMcvG.png)

This also shows that a large number of cells are failing the "CELL" filter.

![](https://i.imgur.com/Kp7LJcx.png)

What does it mean? A cell is called if there is at least 1 productive contig.

## Look in the all_contig_annotations.csv again

- Sort the csv file by UMI.
- Filter out barcodes called as CELLs. I observed that several contigs were not "high confidence".
- Then, in the csv, remove/hide contigs that are not productive.

The point of the filtering is to check whether the top contigs are filtered out mainly because most of them are not high confidence.

Of the remaining contigs, take the top few and grep for their barcodes. I chose barcodes with contig UMIs > 90.

![](https://i.imgur.com/94bNmNz.png)

In some cases it wasn't needed, but in others I looked in filter_diagnostics.json:

```
SC_MULTI_CS/SC_MULTI_CORE/MULTI_GEM_WELL_PROCESSOR/VDJ_T_GEM_WELL_PROCESSOR/SC_VDJ_CONTIG_ASSEMBLER/ASSEMBLE_VDJ/fork0/join/files/
```

Examples of what I found for each of the barcodes above:

1) This barcode is marked not high-confidence because it has 3 productive TRA contigs. It fails the second criterion of high confidence.
```
CATCAGATCTACTTAC-1,false,CATCAGATCTACTTAC-1_contig_1,false,510,TRA,TRAV12-3,,TRAJ52,TRAC,true,true,,,,,,,,,,,CAMSEFAGGTSYGKLTF,TGTGCAATGAGCGAATTTGCTGGTGGTACTAGCTATGGAAAGCTGACATTT,,,11064,46,,,
CATCAGATCTACTTAC-1,false,CATCAGATCTACTTAC-1_contig_2,false,512,TRB,TRBV20-1,,TRBJ2-1,TRBC2,true,true,,,,,,,,,,,CSASAGAEQFF,TGCAGTGCATCAGCGGGAGCTGAGCAGTTCTTC,,,34036,139,,,
CATCAGATCTACTTAC-1,false,CATCAGATCTACTTAC-1_contig_3,false,554,TRA,TRAV5,,TRAJ23,TRAC,true,true,,,,,,,,,,,CAVLWNQGGKLIF,TGTGCAGTCCTTTGGAACCAGGGAGGAAAGCTTATCTTC,,,5284,22,,,
CATCAGATCTACTTAC-1,false,CATCAGATCTACTTAC-1_contig_4,false,612,TRA,TRAV3,,TRAJ16,TRAC,true,true,,,,,,,,,,,CAVRPAGQKLLF,TGTGCTGTGAGACCCGCCGGCCAGAAGCTGCTCTTT,,,3894,20,,,
```

Checking filter_diagnostics.json shows:

```
{
  "category": "cell_calling",
  "info": {
    "barcode": "CATCAGATCTACTTAC-1",
    "filter": {
      "name": "no_confident_contig",
      "details": {}
    }
  }
},
```

2) CGAGCCAGTAATCACC-1: a search shows that this barcode failed the "common_clone_shadow" filter.

```
{
  "category": "cell_calling",
  "info": {
    "barcode": "CGAGCCAGTAATCACC-1",
    "filter": {
      "name": "common_clone_shadow",
      "details": {
        "multiplicity": 1,
        "max_multiplicity": 156,
        "param_max_kill": 3,
        "param_min_ratio_big": 50
      }
    }
  }
},
```

3) GTGTTAGTCAACGAAA-1

```
{
  "category": "cell_calling",
  "info": {
    "barcode": "GTGTTAGTCAACGAAA-1",
    "filter": {
      "name": "common_clone_shadow",
      "details": {
        "multiplicity": 1,
        "max_multiplicity": 156,
        "param_max_kill": 3,
        "param_min_ratio_big": 50
      }
    }
  }
},
```

4) TTTCCTCTCGAATCCA-1: two filters were applied to this barcode.

```
{
  "category": "cell_calling",
  "info": {
    "barcode": "TTTCCTCTCGAATCCA-1",
    "filter": {
      "name": "chimeric_contig",
      "details": {
        "cdr3_nt": "\u0003\u0002\u0001\u0002\u0001\u0001\u0000\u0002\u0001\u0000\u0002\u0001\u0001\u0000\u0000\u0002\u0003\u0000\u0002\u0003\u0000\u0002\u0002\u0002\u0002\u0000\u0002\u0000\u0001\u0001\u0001\u0000\u0002\u0003\u0000\u0001\u0003\u0003\u0001",
        "param_chimera_ratio": 100,
        "dominant_v_region_id": 236,
        "dominant_v_region_umis": 3491
      }
    }
  }
},
{
  "category": "cell_calling",
  "info": {
    "barcode": "TTTCCTCTCGAATCCA-1",
    "filter": {
      "name": "common_clone_shadow",
      "details": {
        "multiplicity": 1,
        "max_multiplicity": 55,
        "param_max_kill": 3,
        "param_min_ratio_big": 50
      }
    }
  }
},
```

5) TAAACCGTCAGTTCGA-1: this barcode is interesting because it has 3 productive contigs and 1 non-productive contig, yet all 4 contigs are high confidence.

```
grep "TAAACCGTCAGTTCGA-1" fh/scratch/delete90/furlan_s/KW2/PBSC_Stim1/outs/multi/vdj_t/all_contig_annotations.csv
TAAACCGTCAGTTCGA-1,false,TAAACCGTCAGTTCGA-1_contig_1,true,658,TRA,TRAV8-6,,TRAJ11,TRAC,true,true,,,,,,,,,,,CAVSVLNSGYSTLTF,TGTGCTGTGAGTGTCCTGAATTCAGGATACAGCACCCTCACCTTT,,,6019,23,,,
TAAACCGTCAGTTCGA-1,false,TAAACCGTCAGTTCGA-1_contig_2,true,507,TRB,TRBV12-3,,TRBJ2-1,TRBC2,true,true,,,,,,,,,,,CASSLLAGADNEQFF,TGTGCCAGCAGCTTACTAGCGGGCGCGGACAATGAGCAGTTCTTC,,,23096,95,,,
TAAACCGTCAGTTCGA-1,false,TAAACCGTCAGTTCGA-1_contig_3,true,504,TRB,TRBV12-3,TRBD1,TRBJ2-7,TRBC2,true,true,,,,,,,,,,,CASGPGTGGYEQYF,TGTGCCAGCGGACCCGGGACAGGGGGCTACGAGCAGTACTTC,,,400,6,,,
TAAACCGTCAGTTCGA-1,false,TAAACCGTCAGTTCGA-1_contig_4,true,304,None,,,TRAJ10,TRAC,false,false,,,,,,,,,,,,,,,473,3,,,
```

No record was found for this barcode in filter_diagnostics. This suggests that the barcode was filtered out after assembly, likely in enclone.

```
enclone TCR=fh/scratch/delete90/furlan_s/KW2/PBSC_Stim1/outs/multi/vdj_t BARCODE=TAAACCGTCAGTTCGA-1 PER_CELL NALL LVARSP=filter
```

Interestingly, I could not find the filter name in enclone.

![](https://i.imgur.com/SgTfRdY.png)

The only explanation that remains is that all 3 contigs were independently marked as non-confident, which is why this barcode was not called as a cell.

At this point I started trying to automate retrieving the filter names for the barcodes in the list.
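One way to sketch this automation is with a small awk helper instead of `grep -A 5`. This is only a sketch under stated assumptions: the function names `filters_per_barcode` and `tally_filters` are hypothetical, and the code assumes filter_diagnostics.json is pretty-printed with one key per line and that each `"name"` line follows the `"barcode"` line of the same record, as in the excerpts above. A `"name"` key nested inside `"details"` would be miscounted.

```shell
#!/bin/sh
# Hypothetical helper: print "barcode<TAB>filter_name" for every record in a
# pretty-printed filter_diagnostics.json (assumes one JSON key per line).
filters_per_barcode() {
    awk -F'"' '
        /"barcode":/ { bc = $4 }              # remember the most recent barcode
        /"name":/    { print bc "\t" $4 }     # pair it with the filter name
    ' "$1"
}

# Hypothetical helper: count how often each filter name occurs,
# most frequent first (same "sort | uniq -c" idiom used elsewhere in these notes).
tally_filters() {
    filters_per_barcode "$1" | cut -f 2 | sort | uniq -c | sort -k 1,1nr
}
```

To restrict the output to a barcode list, the pairs could be piped through something like `filters_per_barcode filter_diagnostics.json | grep -F -f barcode_list`, where `barcode_list` is a placeholder file with one barcode per line. The grep-based version actually used in this analysis follows.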
```
grep -f ./barcodes_of_top_umi_contigs_nothc_withacomma -A 5 ./barcodes_of_top_umi_contigs_nothc fh/scratch/delete90/furlan_s/KW2/PBSC_Stim1/SC_MULTI_CS/SC_MULTI_CORE/MULTI_GEM_WELL_PROCESSOR/VDJ_T_GEM_WELL_PROCESSOR/SC_VDJ_CONTIG_ASSEMBLER/ASSEMBLE_VDJ/fork0/files/filter_diagnostics.json | grep "barcode\|name" | cut -d ":" -f 2,3 > barcodes_of_top_umi_contigs_nothc_filters
```

![](https://i.imgur.com/LPoesje.png)

The above shows that the most common reasons are "no confident contig" and "common clone shadow".

Look for the most common filter across all barcodes:

```
cut -d "," -f 1 fh/scratch/delete90/furlan_s/KW2/PBSC_Stim1/outs/multi/vdj_t/all_contig_annotations.csv | sed 1d | sort -u > all_barcodes
grep -f all_barcodes -A 5 fh/scratch/delete90/furlan_s/KW2/PBSC_Stim1/SC_MULTI_CS/SC_MULTI_CORE/MULTI_GEM_WELL_PROCESSOR/VDJ_T_GEM_WELL_PROCESSOR/SC_VDJ_CONTIG_ASSEMBLER/ASSEMBLE_VDJ/fork0/files/filter_diagnostics.json | grep "barcode\|name" | cut -d ":" -f 2,3 > all_barcodes_filters
perl -pe 's/-1\",\n/XX/' all_barcodes_filters > all_barcodes_filters_samerow
cut -d " " -f 3 all_barcodes_filters_samerow | sort | uniq -c
     19 "chimeric_contig",
    421 "common_clone_shadow",
    326 "common_clone_shadow_single_umi",
 408149 "no_confident_contig",
 374143 "no_contig_with_v_region",
  22074 "non_dominant_junction",
  23853 "not_enough_junction_support",
 365931 "not_enough_reads_per_umi",
 404175 "not_enough_umis_tcr_or_denovo",
   1856 "weak_junction",
```

Take all barcodes with a high Ab level (see the Ablevel.csv file above) and check which filters were applied to them. The tally below shows that most of the high-Ab barcodes are excluded because they had no confident contig, likely because of too many productive contigs, which in turn is likely due to doublets or background.
```
sed 1d Ablevel.csv | cut -d "," -f 1 | sed 's/-1//' > Ablevel_bc
grep -f ./Ablevel_bc all_barcodes_filters_samerow | cut -d " " -f 3 | sort | uniq -c
      3 "common_clone_shadow",
      1 "common_clone_shadow_single_umi",
    205 "no_confident_contig",
     49 "no_contig_with_v_region",
     35 "non_dominant_junction",
      5 "not_enough_junction_support",
     42 "not_enough_reads_per_umi",
     51 "not_enough_umis_tcr_or_denovo",
```

Yet another stab at profiling the filters applied to barcodes with high TCR UMI counts, this time with a larger number of barcodes. The result below again points to high ambient RNA in the sample. Most cells are not called because they lack a confident contig or fail common_clone_shadow, which again implies contigs shared across barcodes due to ambient RNA.

```
sed 's/-1//' highumi_barcodes_uniq > highumi_barcodes_uniq_nogemwell
grep -f highumi_barcodes_uniq_nogemwell all_barcodes_filters_samerow | cut -d " " -f 3 | sort | uniq -c
      3 "chimeric_contig",
     24 "common_clone_shadow",
     78 "no_confident_contig",
      1 "no_contig_with_v_region",
      1 "not_enough_reads_per_umi",
      1 "not_enough_umis_tcr_or_denovo",
```

## Unanswered questions:

For barcode "CATCAGATCTACTTAC-1", filter_diagnostics.json states that cell calling ("category": "cell_calling") failed because there was no high-confidence contig. We know why those contigs were marked low confidence. For barcode "CGAGCCAGTAATCACC-1", however, filter_diagnostics mentions that the "CommonCloneShadow" filter was applied, which means there were 2 or more high-confidence contigs in this barcode. Yet the final contig csv file marks all of its contigs as not high confidence. Does this mean that the confidence of the contigs is lowered if they fail these assembly filters? IMO this is circular behavior and confusing.
```
grep "CGAGCCAGTAATCACC-1" outs/multi/vdj_t/all_contig_annotations.csv
CGAGCCAGTAATCACC-1,false,CGAGCCAGTAATCACC-1_contig_1,false,489,TRB,TRBV20-1,,TRBJ2-1,TRBC2,true,true,,,,,,,,,,,CSASAGAEQFF,TGCAGTGCATCAGCGGGAGCTGAGCAGTTCTTC,,,32261,128,,,
CGAGCCAGTAATCACC-1,false,CGAGCCAGTAATCACC-1_contig_2,false,510,TRA,TRAV12-3,,TRAJ52,TRAC,true,true,,,,,,,,,,,CAMSEFAGGTSYGKLTF,TGTGCAATGAGCGAATTTGCTGGTGGTACTAGCTATGGAAAGCTGACATTT,,,9025,53,,,
CGAGCCAGTAATCACC-1,false,CGAGCCAGTAATCACC-1_contig_3,false,472,TRB,TRBV28,,TRBJ2-1,TRBC2,true,true,,,,,,,,,,,CASSFLSLGHNEQFF,TGTGCCAGCAGTTTTCTTTCCCTAGGCCACAATGAGCAGTTCTTC,,,3155,14,,,
CGAGCCAGTAATCACC-1,false,CGAGCCAGTAATCACC-1_contig_4,false,311,TRB,TRBV24-1,,TRBJ2-1,TRBC2,false,false,,,,,,,,,,,CATSDPSLWDEQFF,TGTGCCACCAGTGATCCGTCGTTGTGGGATGAGCAGTTCTTC,,,1453,13,,,
CGAGCCAGTAATCACC-1,false,CGAGCCAGTAATCACC-1_contig_5,false,398,None,,,TRAJ39,,false,false,,,,,,,,,,,,,,,200,1,,,
```

## Notes from meeting with Fredhutch on 09/06

The customer's activated T cells might have a higher propensity to lyse after sorting. The cells come from a donor pheresis product stimulated in vitro, so it is possible that they are delicate or fragile. Cell hashing was not done on the human samples or the mouse samples.