Chimeras / Scripts / Compare_Filter_fastqs_Tools / Compare_bbmap_seqtk_seqfilter.Rmd
Compare_bbmap_seqtk_seqfilter.Rmd
Raw
---
title: "Some counts for debugging"
author: "Nadine Bestard"
date: "08/03/2022"
output: html_document
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r}
library(here)
#library(scRNAseq)
```

### Reads per sample original fastq files

        for file in Raw_Data/19880WApool01__Clemastine_A_S4_L00* 
        > do 
        > zcat $file | wc -l 
        > done
        779724392
        779724392
        779724392
        779724392
        767253088
        767253088
        767253088
        767253088

This is : 767,253,088 -\> divided by 4 -\> 191,813,272 reads

779,724,392 -\> divided by 4 -\> 194931098 reads

Total is: 386,744,370

### After alignment, reads in SAM file

    wc -l Data/STARsolo/19880WApool01__Clemastine_A_S4/Aligned.sortedByCoord.out.sam 302060016

and this 302,060,016 -\> 80,000,000 reads are "lost" (not aligned

### Details of alignment: % mapped to transcriptome

    cat /u/Datastore/CMVM/scs/groups/Williamsdata/Nina/Chimeras/Data/STARsolo/*/Solo.out/Gene/Summary.csv | grep "Reads Mapped to Transcriptome: Unique Genes,"Reads Mapped to Transcriptome: Unique Genes,"

    Reads Mapped to Transcriptome: Unique Genes,0.522905

    Reads Mapped to Transcriptome: Unique Genes,0.552624

    Reads Mapped to Transcriptome: Unique Genes,0.474701

    Reads Mapped to Transcriptome: Unique Genes,0.519849

    Reads Mapped to Transcriptome: Unique Genes,0.483608

    Reads Mapped to Transcriptome: Unique Genes,0.518918

    Reads Mapped to Transcriptome: Unique Genes,0.495555

    Reads Mapped to Transcriptome: Unique Genes,0.519888

    Reads Mapped to Transcriptome: Unique Genes,0.515259

    Reads Mapped to Transcriptome: Unique Genes,0.507141

```{r}
0.522905*302060016
```

Expected umi in transcriptome: reads in sam \* percentage mapped to
transcriptome = 157,948,693 ?

# In R

When we import the object, only reads aligned to the transcriptome are
counted

```{r}
#sce <- readRDS(here("Processed/filter_70/sce_combined_object/sce_combined_filter_70.rds"))
stats_clA <- readRDS(here("Processed/filter_70/19880WApool01__Clemastine_A_S4cell_species.rds"))

```

```{r}
head(stats_clA)
sum(stats_clA$tot_umi)
# from humans
sum(stats_clA[stats_clA$species == "human",]$n_human_umi)
#from mouse 
sum(stats_clA[stats_clA$species == "mouse",]$n_mouse_umi)
```

total umi is 53,895,419

human: 1,741,388 (all reads) 1,040,125 human reads in human cells

mouse: 52,154,031 (all reads) 51,986,632 mouse reads in mouse cells

### Reads after filtering

#### in the reads ids file

Here I should be adding the "not-trasncriptome" reads, but not the
"mixed cells" reads

    wc -l outs/filter_70/reads/19880WApool01__Clemastine_A_S4_human
    n_ids.txt
    5651049 Clemastine_A_S4_human_ids.txt

    wc -l outs/filter_70/reads/19880WApool01__Clemastine_A_S4_mouse_ids.txt

5,651,049 human

145,440,477 mouse

151,091,526 total

#### After filtering with seqtk

```{bash, eval=F}
echo "mouse"
#mouse reads
for file in Data/SplitSpecies/*/fastqs/19880WApool01__Clemastine_A_S4/*
 do
 zcat $file | wc -l 
done
echo -e "\n human"
# human reads 
for file in Data/SplitSpecies/*/fastqs/19880WApool01__Clemastine_A_S4*
 do
  zcat $file | wc -l 
done
```

    human
      
    11408892   Clemastine_A_S4_L001_I1_001.fastq.gz
    11408892   Clemastine_A_S4_L001_I2_001.fastq.gz
    11408892   Clemastine_A_S4_L001_R1_001.fastq.gz
    11408892   Clemastine_A_S4_L001_R2_001.fastq.gz
    11195304   Clemastine_A_S4_L002_I1_001.fastq.gz
    11195304   Clemastine_A_S4_L002_I2_001.fastq.gz
    11195304   Clemastine_A_S4_L002_R1_001.fastq.gz
    11195304   Clemastine_A_S4_L002_R2_001.fastq.gz

     mouse
     293778396 Clemastine_A_S4_L001_I1_001.fastq.gz
     293778396 Clemastine_A_S4_L001_I2_001.fastq.gz
     293778396Clemastine_A_S4_L001_R1_001.fastq.gz
      293778396Clemastine_A_S4_L001_R2_001.fastq.gz
    287983512 Clemastine_A_S4_L002_I1_00.astq.gz
    287983512 Clemastine_A_S4_L002_I2_001.fastq.gz
    287983512 Clemastine_A_S4_L002_R1_001.fastq.gz
    287983512  Clemastine_A_S4_L002_R2_001.fastq.gz 

human: 11408892 + 11195304 = 22,604,196 -\> /4 -\> 5,651,049

mouse: 293778396 + 287983512 = 581,761,908

total: 604,366,104 -\> /4 -\> 151,091,526

#### With seqfilter

    for file in $SC/Chimeras/Data/SplitSpecies/human/testseqfilter/human_19880WApool01__Clemastine_A_S4*R1*; do  zcat $file | wc -l; done
    9000792
    8849292

human: 8849292 + 9000792 = 17,850,084 -\> /4 -\> 4,462,521

### With BBMap

Same results as seqtk, but automatically filters out duplicates.