--- title: "Some counts for debugging" author: "Nadine Bestard" date: "08/03/2022" output: html_document editor_options: markdown: wrap: 72 --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ```{r} library(here) #library(scRNAseq) ``` ### Reads per sample original fastq files for file in Raw_Data/19880WApool01__Clemastine_A_S4_L00* > do > zcat $file | wc -l > done 779724392 779724392 779724392 779724392 767253088 767253088 767253088 767253088 This is : 767,253,088 -\> divided by 4 -\> 191,813,272 reads 779,724,392 -\> divided by 4 -\> 194931098 reads Total is: 386,744,370 ### After alignment, reads in SAM file wc -l Data/STARsolo/19880WApool01__Clemastine_A_S4/Aligned.sortedByCoord.out.sam 302060016 and this 302,060,016 -\> 80,000,000 reads are "lost" (not aligned ### Details of alignment: % mapped to transcriptome cat /u/Datastore/CMVM/scs/groups/Williamsdata/Nina/Chimeras/Data/STARsolo/*/Solo.out/Gene/Summary.csv | grep "Reads Mapped to Transcriptome: Unique Genes,"Reads Mapped to Transcriptome: Unique Genes," Reads Mapped to Transcriptome: Unique Genes,0.522905 Reads Mapped to Transcriptome: Unique Genes,0.552624 Reads Mapped to Transcriptome: Unique Genes,0.474701 Reads Mapped to Transcriptome: Unique Genes,0.519849 Reads Mapped to Transcriptome: Unique Genes,0.483608 Reads Mapped to Transcriptome: Unique Genes,0.518918 Reads Mapped to Transcriptome: Unique Genes,0.495555 Reads Mapped to Transcriptome: Unique Genes,0.519888 Reads Mapped to Transcriptome: Unique Genes,0.515259 Reads Mapped to Transcriptome: Unique Genes,0.507141 ```{r} 0.522905*302060016 ``` Expected umi in transcriptome: reads in sam \* percentage mapped to transcriptome = 157,948,693 ? # In R When we import the object, only reads aligned to the transcriptome are counted ```{r} #sce <- readRDS(here("Processed/filter_70/sce_combined_object/sce_combined_filter_70.rds")) stats_clA <- readRDS(here("Processed/filter_70/19880WApool01__Clemastine_A_S4cell_species.rds")) ``` ```{r} head(stats_clA) sum(stats_clA$tot_umi) # from humans sum(stats_clA[stats_clA$species == "human",]$n_human_umi) #from mouse sum(stats_clA[stats_clA$species == "mouse",]$n_mouse_umi) ``` total umi is 53,895,419 human: 1,741,388 (all reads) 1,040,125 human reads in human cells mouse: 52,154,031 (all reads) 51,986,632 mouse reads in mouse cells ### Reads after filtering #### in the reads ids file Here I should be adding the "not-trasncriptome" reads, but not the "mixed cells" reads wc -l outs/filter_70/reads/19880WApool01__Clemastine_A_S4_human n_ids.txt 5651049 Clemastine_A_S4_human_ids.txt wc -l outs/filter_70/reads/19880WApool01__Clemastine_A_S4_mouse_ids.txt 5,651,049 human 145,440,477 mouse 151,091,526 total #### After filtering with seqtk ```{bash, eval=F} echo "mouse" #mouse reads for file in Data/SplitSpecies/*/fastqs/19880WApool01__Clemastine_A_S4/* do zcat $file | wc -l done echo -e "\n human" # human reads for file in Data/SplitSpecies/*/fastqs/19880WApool01__Clemastine_A_S4* do zcat $file | wc -l done ``` human 11408892 Clemastine_A_S4_L001_I1_001.fastq.gz 11408892 Clemastine_A_S4_L001_I2_001.fastq.gz 11408892 Clemastine_A_S4_L001_R1_001.fastq.gz 11408892 Clemastine_A_S4_L001_R2_001.fastq.gz 11195304 Clemastine_A_S4_L002_I1_001.fastq.gz 11195304 Clemastine_A_S4_L002_I2_001.fastq.gz 11195304 Clemastine_A_S4_L002_R1_001.fastq.gz 11195304 Clemastine_A_S4_L002_R2_001.fastq.gz mouse 293778396 Clemastine_A_S4_L001_I1_001.fastq.gz 293778396 Clemastine_A_S4_L001_I2_001.fastq.gz 293778396Clemastine_A_S4_L001_R1_001.fastq.gz 293778396Clemastine_A_S4_L001_R2_001.fastq.gz 287983512 Clemastine_A_S4_L002_I1_00.astq.gz 287983512 Clemastine_A_S4_L002_I2_001.fastq.gz 287983512 Clemastine_A_S4_L002_R1_001.fastq.gz 287983512 Clemastine_A_S4_L002_R2_001.fastq.gz human: 11408892 + 11195304 = 22,604,196 -\> /4 -\> 5,651,049 mouse: 293778396 + 287983512 = 581,761,908 total: 604,366,104 -\> /4 -\> 151,091,526 #### With seqfilter for file in $SC/Chimeras/Data/SplitSpecies/human/testseqfilter/human_19880WApool01__Clemastine_A_S4*R1*; do zcat $file | wc -l; done 9000792 8849292 human: 8849292 + 9000792 = 17,850,084 -\> /4 -\> 4,462,521 ### With BBMap Same results as seqtk, but automatically filters out duplicates.