Scripts/Compare_Method1_Method2/Compare_Filtering_methods_metformin.Rmd · Chimeras

---
title: '[met] second method QC & ratio plots'
author: "Nina-Lydia Kazakou"
date: "08/05/2022"
output:
  html_document:
    code_folding: hide
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Summary


**Original filtering method:**

1.  Align raw fastq files to a human-moused mixed genome using [STARSolo](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md) with added CB tag (CB/UB: corrected CellBarcode/UMI. Note, that these tags require sorted BAM output)

2.  Import and separate human and mouse barcodes were in R; classify species based on [proportion of UMI](https://bustools.github.io/BUS_notebooks_R/10xv2.html), with cutoff of 70%

3.  Extract the human CB list and use awk to filter within the CB position for them

4.  Produce human read id list

5.  Filter raw fastq files for the human read ids, using [seqtk](https://github.com/lh3/seqtk)

6.  Align the filtered raw fastq files with the human genome, using [CellRanger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) 5.0.0

**Second filtering method:**

1.  Align the raw fastq files [only]{.ul} to the human genome, using CellRanger

2.  Import in R and use the previously generated barcode list to filter for human cells only

Here, for the first time, I will import the count matrices generated by CellRanger into R using Bioconductor packages. I usually use Seurat comands to import the data and the then, once I've populated the intial object with whatever I think might be necessary, I transform the object into a SingleCellExperiment, to use it for downstream QC analysis with the Scater package. CellRanger filtered matrices, include an initial cellCalling has been done by CellRanger. The algorythm filters out empty droplets. However, in this case the mixed-species data was already aligned using STARSolo, before the alignment with CellRanger, meaning that any empty droplets should have already been excluded and thus, I can use the raw matrices for downstream analayis.

CellRanger 5 was used for these data.

I have decided that for the analysis I am interesting in dissecting differences between each drug and their respective control, while I am not really interested in the comparison of the two drugs. For that reason, I will treat them as two different datasets.

Here, I will import the **metformin/ddH2O** reads.

# Import data Method 2

This includes all the cells (human and mouse) aligned to the human genome


### Load libraries

```{r message=FALSE, warning=FALSE}
library(here) # For reproducible paths
library(DropletUtils) 
library(ggplot2) #plots
library(dplyr) # maipulate df ( %>% )
library(scater) #add QC and isoutlier
library(org.Hs.eg.db) # To annotate the genenames
```

### Indicate the path to the raw matrices

```{r matrices}
mat_met_a <- here("Data", "SplitSpecies", "human", "Method2_Align_Raw_To_OneGenome", "CellRanger", "19880WApool01__Metformin_A_S1", "outs", "raw_feature_bc_matrix") 
mat_met_b <- here("Data", "SplitSpecies", "human", "Method2_Align_Raw_To_OneGenome", "CellRanger", "19880WApool01__Metformin_B_S2", "outs", "raw_feature_bc_matrix")
mat_met_c <- here("Data", "SplitSpecies", "human", "Method2_Align_Raw_To_OneGenome", "CellRanger", "19880WApool01__Metformin_C_S7", "outs", "raw_feature_bc_matrix") 
mat_ddH2O_a <- here("Data", "SplitSpecies", "human", "Method2_Align_Raw_To_OneGenome", "CellRanger", "19880WApool01__ddH2O_A_S3", "outs", "raw_feature_bc_matrix") 
mat_ddH2O_b <- here("Data", "SplitSpecies", "human", "Method2_Align_Raw_To_OneGenome", "CellRanger", "19880WApool01__ddH2O_B_S8", "outs", "raw_feature_bc_matrix")
```

### Load 10xCounts

```{r counts}
met_A <- read10xCounts(mat_met_a, version = "auto", col.names = TRUE)
met_B <-  read10xCounts(mat_met_b, version = "auto", col.names = TRUE)
met_C <-  read10xCounts(mat_met_c, version = "auto", col.names = TRUE)
ddH2O_A <-read10xCounts(mat_ddH2O_a, version = "auto", col.names = TRUE)
ddH2O_B <-read10xCounts(mat_ddH2O_b, version = "auto", col.names = TRUE)
```

# Filter data

Select only the cells that are human

### Indicate the path to the human barcode lists

```{r barcodes}
bc_met_A <- here("outs", "filter_70", "Human_Mouse_Separation","barcodes","19880WApool01__Metformin_A_S1_human_barcodes.txt")
bc_met_B <- here("outs", "filter_70", "Human_Mouse_Separation", "barcodes","19880WApool01__Metformin_B_S2_human_barcodes.txt")
bc_met_C <- here("outs", "filter_70", "Human_Mouse_Separation", "barcodes","19880WApool01__Metformin_C_S7_human_barcodes.txt")
bc_ddH2O_A <- here("outs",  "filter_70","Human_Mouse_Separation","barcodes","19880WApool01__ddH2O_A_S3_human_barcodes.txt")
bc_ddH2O_B <- here("outs", "filter_70", "Human_Mouse_Separation","barcodes","19880WApool01__ddH2O_B_S8_human_barcodes.txt")
```

```{r filter_human}
# import barcodes into a vector
barcodes_met_a_df <- read.delim(bc_met_A, header = FALSE)
barcodes_met_a_vector <- barcodes_met_a_df$V1

barcodes_met_b_df <- read.delim(bc_met_B, header = FALSE)
barcodes_met_b_vector <- barcodes_met_b_df$V1

barcodes_met_c_df <- read.delim(bc_met_C, header = FALSE)
barcodes_met_c_vector <- barcodes_met_c_df$V1

barcodes_ddH2O_a_df <- read.delim(bc_ddH2O_A, header = FALSE)
barcodes__ddH2O_a_vector <- barcodes_ddH2O_a_df$V1

barcodes_ddH2O_b_df <- read.delim(bc_ddH2O_B, header = FALSE)
barcodes__ddH2O_b_vector <- barcodes_ddH2O_b_df$V1


# add the "-1" aded in the sce to barcodes (careful if importing more than one sample)
barcodes_met_a <- paste0(barcodes_met_a_vector, "-1")
barcodes_met_b <- paste0(barcodes_met_b_vector, "-1")
barcodes_met_c <- paste0(barcodes_met_c_vector, "-1")
barcodes_ddH2O_a <- paste0(barcodes__ddH2O_a_vector, "-1")
barcodes_ddH2O_b <- paste0(barcodes__ddH2O_b_vector, "-1")
```

### Subset for human cells

```{r subset}
met_A <- met_A[, barcodes_met_a]
met_B <- met_B[, barcodes_met_b]
met_C <- met_C[, barcodes_met_c]
ddH2O_A <- ddH2O_A[, barcodes_ddH2O_a]
ddH2O_B <- ddH2O_B[, barcodes_ddH2O_b]
```

## Populate metadata

```{r initial_metadata}
met_A$Sample <- "19880WApool01__Metformin_A_S1"
met_A$Treatment <- "Metformin"
met_A$animalID <- "1936/4_1940/11"
met_A$Chip <- "chip_1"

met_B$Sample <- "19880WApool01__Metformin_B_S2"
met_B$Treatment <- "Metformin"
met_B$animalID <- "1937/1_1941/4"
met_B$Chip <- "chip_1"

met_C$Sample <- "19880WApool01__Metformin_C_S7"
met_C$Treatment <- "Metformin"
met_C$animalID <- "2073/4_2074/1"
met_C$Chip <- "chip_2"

ddH2O_A$Sample <- "19880WApool01__ddH2O_A_S3"
ddH2O_A$Treatment <- "ddH2O"
ddH2O_A$animalID <- "1935/3_1938/3"
ddH2O_A$Chip <- "chip_1"

ddH2O_B$Sample <- "19880WApool01__ddH2O_B_S8"
ddH2O_B$Treatment <- "ddH2O"
ddH2O_B$animalID <- "2064/4_2067/13"
ddH2O_B$Chip <- "chip_2"
```

```{r}
dim(met_A)
dim(met_B)
dim(met_C)
dim(ddH2O_A)
dim(ddH2O_B)

sum(dim(met_A)[2] + 
dim(met_B)[2] + 
dim(met_C)[2] + 
dim(ddH2O_A)[2] + 
dim(ddH2O_B)[2] )
```


```{r}
sce_m2 <- cbind( met_A, met_B, met_C, ddH2O_A, ddH2O_B)
dim(sce_m2)[2]
```

### Save object

```{r}
dir.create(here("Data", "Compare_Method1_Method2"))

saveRDS(sce_m2, here("Data", "Compare_Method1_Method2", "met_human_import_sce_method2_chimeras.rds"))
```


# Gene QC

We will delete any gene that is not expressed in at least one cell

```{r gene-qc}
(genes_beforeqc <- dim(sce_m2)[1])
keep_feature <- rowSums(counts(sce_m2) > 0) > 1

sum(keep_feature)

# delete the genes_beforeqc -dim(sce_m2)[1] genes and keep the sum(keep_feature) genes
sce_m2 <- sce_m2[keep_feature, ]
genes_beforeqc - dim(sce_m2)[1]

dim(sce_m2) 
```
## Add QC

```{r}
sce_m2 <- scater::addPerCellQC(sce_m2)
```


```{r eval=FALSE}
saveRDS(sce_m2, here("data", "Human_R_Analysis", "Metformin_ddH2O", "CellRanger-combined",  "met_human_sce_method2_geneQC_chimeras.rds"))
```


# Import data Method 1

### Indicate the path to the raw matrices

```{r matrices-1}
mat_met_a <- here("Data", "SplitSpecies", "human", "Method1", "CellRanger", "19880WApool01__Metformin_A_S1", "outs", "raw_feature_bc_matrix") 
mat_met_b <- here("Data", "SplitSpecies", "human", "Method1", "CellRanger", "19880WApool01__Metformin_B_S2", "outs", "raw_feature_bc_matrix")
mat_met_c <- here("Data", "SplitSpecies", "human", "Method1", "CellRanger", "19880WApool01__Metformin_C_S7", "outs", "raw_feature_bc_matrix") 
mat_ddH2O_a <- here("Data", "SplitSpecies", "human", "Method1", "CellRanger", "19880WApool01__ddH2O_A_S3", "outs", "raw_feature_bc_matrix") 
mat_ddH2O_b <- here("Data", "SplitSpecies", "human", "Method1", "CellRanger", "19880WApool01__ddH2O_B_S8", "outs", "raw_feature_bc_matrix")
```

### Load 10xCounts

```{r counts-1}
met_A <- read10xCounts(mat_met_a, version = "auto", col.names = TRUE)
met_B <-  read10xCounts(mat_met_b, version = "auto", col.names = TRUE)
met_C <-  read10xCounts(mat_met_c, version = "auto", col.names = TRUE)
ddH2O_A <-read10xCounts(mat_ddH2O_a, version = "auto", col.names = TRUE)
ddH2O_B <-read10xCounts(mat_ddH2O_b, version = "auto", col.names = TRUE)
```

## Populate metadata

```{r initial_metadata-1}
met_A$Sample <- "19880WApool01__Metformin_A_S1"
met_A$Treatment <- "Metformin"
met_A$animalID <- "1936/4_1940/11"
met_A$Chip <- "chip_1"

met_B$Sample <- "19880WApool01__Metformin_B_S2"
met_B$Treatment <- "Metformin"
met_B$animalID <- "1937/1_1941/4"
met_B$Chip <- "chip_1"

met_C$Sample <- "19880WApool01__Metformin_C_S7"
met_C$Treatment <- "Metformin"
met_C$animalID <- "2073/4_2074/1"
met_C$Chip <- "chip_2"

ddH2O_A$Sample <- "19880WApool01__ddH2O_A_S3"
ddH2O_A$Treatment <- "ddH2O"
ddH2O_A$animalID <- "1935/3_1938/3"
ddH2O_A$Chip <- "chip_1"

ddH2O_B$Sample <- "19880WApool01__ddH2O_B_S8"
ddH2O_B$Treatment <- "ddH2O"
ddH2O_B$animalID <- "2064/4_2067/13"
ddH2O_B$Chip <- "chip_2"
```

```{r dimensions-1}
dim(met_A)
dim(met_B)
dim(met_C)
dim(ddH2O_A)
dim(ddH2O_B)

sum(dim(met_A)[2] + 
dim(met_B)[2] + 
dim(met_C)[2] + 
dim(ddH2O_A)[2] + 
dim(ddH2O_B)[2] )
```


```{r sce}
sce_m1 <- cbind( met_A, met_B, met_C, ddH2O_A, ddH2O_B)
dim(sce_m1)[2]
```

### Save object

```{r save-1}
dir.create(here("Data", "Compare_Method1_Method2"))

saveRDS(sce_m1, here("Data", "Compare_Method1_Method2", "met_human_import_sce_method1_chimeras.rds"))
```



# Gene QC

We will delete any gene that is not expressed in at least one cell

```{r gene-qc-1}
(genes_beforeqc <- dim(sce_m1)[1])
keep_feature <- rowSums(counts(sce_m1) > 0) > 1

sum(keep_feature)

# delete the genes_beforeqc -dim(sce_m1)[1] genes and keep the sum(keep_feature) genes
sce_m1 <- sce_m1[keep_feature, ]
genes_beforeqc - dim(sce_m1)[1]

dim(sce_m1) 
```
## Add QC

```{r qc-1}
sce_m1 <- scater::addPerCellQC(sce_m1)
```

```{r eval=TRUE}
saveRDS(sce_m1, here("data", "Human_R_Analysis", "Metformin_ddH2O", "CellRanger-combined",  "met_human_sce_method1_geneQC_chimeras.rds"))
```



# Comparisons


## Dimensions and umis
```{r}
print("dimensions meth1, meth2")
dim(sce_m1)

dim(sce_m2)

print("sum and summary umis meth1, meth2")
sum(sce_m1$total)
sum(sce_m2$total)
summary(sce_m1$total)
summary(sce_m2$total)
print("sum and sumary genes meth1,  meth2")
sum(sce_m1$detected)
sum(sce_m2$detected)
summary(sce_m1$detected)
summary(sce_m2$detected)
```

## Compare method 1 ( not duplicates) with method 2 (subset in R)

### Different gene number

Search for missing genes in method 1
```{r}
not_meth2 <- !(rownames(sce_m1) %in% rownames(sce_m2))

not_meth1 <- !(rownames(sce_m2) %in% rownames(sce_m1))

sum(not_meth1, not_meth2)

# The 9 missing genes
missing_genes <- rownames(sce_m2)[not_meth1]
  # obtain full genenames
 mapIds(org.Hs.eg.db,
                     keys = missing_genes,
                     keytype = "ENSEMBL",
                     column = c("GENENAME"))
  mapIds(org.Hs.eg.db,
                     keys = missing_genes,
                     keytype = "ENSEMBL",
                     column = c("SYMBOL"))
```

And only keep the genes present in both methods to compare
```{r}
# intersect
both <- intersect(rownames(sce_m1), rownames(sce_m2))

sce_m1 <- sce_m1[both,]
sce_m2 <- sce_m2[both,]
```

### Cells

```{r compare-cells}
# compare cells
meth1_comp_df  <- colData(sce_m1)
second_comp_df <- colData(sce_m2)
df_compare <- merge(meth1_comp_df, second_comp_df, by = "Barcode", suffixes = c("_meth1","_meth2"))

# plot cells
plot(df_compare$total_meth1, df_compare$total_meth2, main = "Cells")
plot(df_compare$detected_meth1, df_compare$detected_meth2, main = "Cells")



```

### Genes

```{r compare, paged.print=TRUE}
# compare genes
df_compare_genes <- as.data.frame(rowSums(counts(sce_m1)))
df_compare_genes<- merge(as.data.frame(rowSums(counts(sce_m2))), df_compare_genes, by="row.names")
colnames(df_compare_genes) <- c("genes", "meth2", "meth1")
# add ratio between old and new method
df_compare_genes$ratio <- df_compare_genes$meth1/df_compare_genes$meth2

# add number cells afected
number_cells <- data.frame(num_cells_meth1 = rowSums(counts(sce_m1)>0) , num_cells_meth2 =rowSums(counts(sce_m2)>0)  )
number_cells$genes <- row.names(number_cells)
df_compare_genes <- merge(number_cells, df_compare_genes, by="genes")

# add names
df_compare_genes$gene_symbol <-   mapIds(org.Hs.eg.db,
                     keys = df_compare_genes$genes,
                     keytype = "ENSEMBL",
                     column = c("SYMBOL"))
df_compare_genes$gene_info <-   mapIds(org.Hs.eg.db,
                     keys = df_compare_genes$genes,
                     keytype = "ENSEMBL",
                     column = c("GENENAME"))

# plot genes
plot(df_compare_genes$meth2, df_compare_genes$meth1, main = "Genes") 
#exclude malat
df_compare_genes <- df_compare_genes[!(df_compare_genes$genes == "MALAT1"),]
# plot genes again
plot(df_compare_genes$meth2, df_compare_genes$meth1, main="delete the most expressed gene to see better the correlation", ylim=c(0,80000), xlim=c(0,80000))


# plot the ratio
hist(df_compare_genes$ratio, breaks=seq(0,2,0.01))
plot(df_compare_genes$ratio)
# with ggplot (to set the base if I need to modify)
df_compare_genes %>% arrange(ratio) %>% 
  mutate(genes=factor(genes, levels=genes)) %>%   # This trick update the factor levels
  ggplot(aes(y=ratio, x=genes)) + geom_point() + scale_y_continuous(breaks=seq(0,2,0.5)) + ggtitle( "gene counts meth1 / meth2")


is_outlier_gene_ratio <- isOutlier( as.numeric(df_compare_genes$ratio))
genes_outlier <- df_compare_genes[is_outlier_gene_ratio,]
# see the threshold
threshold <- attr(is_outlier_gene_ratio, "thresholds")

# how many would be more than 1.1?
sum(na.omit(df_compare_genes$ratio)>1.1)
#[1] 1
sum(na.omit(df_compare_genes$ratio)<0.9)
# 32


# top genes overeepresented in meth1
df_compare_genes <- arrange(df_compare_genes, ratio )
genes_top <- head(df_compare_genes,100)
head(df_compare_genes, 20)
tail(df_compare_genes)
```

There are a few genes underrepresented in method 1. These are genes with little information in the org.Hs database

even less genes underrepresented in method 2. These are affecting very few cells. 


<!-- ## Check whether these genes are present in the variable genes  -->
<!-- ```{r, eval=FALSE} -->
<!-- # All outliers -->
<!-- length(intersect(var_genes, genes_outlier$Genes)) # 630 -->

<!-- af_var_genes <- intersect(var_genes, genes_outlier$Genes) -->

<!-- write.csv(af_var_genes, here("outs", "filter_70", "Human_R_Analysis", "CellRanger-combined", "af_var_genes.csv"))  -->

<!-- # Outliers with ratio > 1.3 -->
<!-- top_outlier_ratio <- comp_genes_df[comp_genes_df$ratio > 1.3, ] -->

<!-- length(intersect(var_genes, top_outlier_ratio$Genes)) #562 -->

<!-- top_af_var_genes <- intersect(var_genes, top_outlier_ratio$Genes) -->

<!-- write.csv(top_af_var_genes, here("outs", "filter_70", "Human_R_Analysis", "CellRanger-combined", "top_af_var_genes.csv"))  -->

<!-- # or Just see the ratios of the variable genes  -->
<!-- length(intersect(var_genes, comp_genes_df$Genes)) # 1847 -->
<!-- ## It looks like the not all variable genes are in the comp_genes_df!  -->

<!-- var_genes_in_comp_genes_df <- intersect(var_genes, comp_genes_df$Genes) -->

<!-- var_genes_in_comp_genes <- comp_genes_df[comp_genes_df$Genes %in% var_genes_in_comp_genes_df,] -->

<!-- write.csv(var_genes_in_comp_genes, here("outs", "filter_70", "Human_R_Analysis", "CellRanger-combined", "var_genes_in_comp_genes.csv")) -->
<!-- ``` -->





<!-- ``` -->
<!-- # how many would be more than 1.1? -->
<!-- sum(df_compare_genes$ratio>1.1) -->
<!-- #[1] 1693 -->

<!-- # top genes overeepresented in old -->
<!-- df_compare_genes <- arrange(df_compare_genes, ratio ) -->
<!-- genes_top <- tail(df_compare_genes,100) -->

<!-- # save -->
<!-- write.csv(genes_outlier, "../outs/filter_70/ddH2O_A_UMI_test/genes_outlier.csv") -->
<!-- write.csv(genes_top, "../outs/filter_70/ddH2O_A_UMI_test/genes_top.csv") -->

<!-- # most expressed gene -->
<!-- tail(arrange(df_compare_genes, new),1) -->
<!-- #exclude it -->
<!-- df_compare_genes <- df_compare_genes[!(df_compare_genes$genes == "MALAT1"),] -->
<!-- # plot genes again -->
<!-- plot(df_compare_genes$new, df_compare_genes$old) -->

<!-- for(gene in genes_top$gene){ -->
<!-- old <- colSums(GetAssayData(ddH2O_old[row.names(ddH2O_old)==gene,], slot = "counts")) -->

<!-- new <- colSums(GetAssayData(ddH2O[row.names(ddH2O)==gene,], slot = "counts")) -->

<!-- ratio <- df_compare_genes$ratio[df_compare_genes$genes == gene] -->
<!-- num_cells_old <- df_compare_genes$num_cells_old[df_compare_genes$genes == gene] -->
<!-- print(plot(old, new , main = paste(gene, ", ratio =", round(ratio,2), ", num cells =", num_cells_old))) -->

<!-- } -->

<!-- ``` -->