Scripts/Human_R_Analysis/CellRangerQC_Stats_allConditions.Rmd · Chimeras

---
title: "CellRangerQC"
author: "Nina-Lydia Kazakou"
date: "01/06/2022"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# **Single-cell RNA sequencing of Chimeras**

hES-derived monolayer GFP+ oligodendroglia was injected into P2-P4 immunocompromised shiverer mice (500,000 human cells split between two sides of the CC). After 6weeks, the animal received either 300mg/kg Metformin treatment (or ddH20 vehicle) or 10mg/kg Clemastine treatment (or DMSO vehicle) for 21days. At 10weeks the animals were sacrificed through lethal injection, perfused with PBS for 2min and the brains were extacted. The CC was dissected out of the brains enzymatically processed before a live/dead sort. Live cells were used to generaye single-cell RNA sequencing libraries using the 10x v3.1 kits. Two mice we pulled for each sample. The libraries were sequenced by the edinburgh genomics and sent us .fastq files. These files were demultiplexed and aligned to a mixed-species (human & mouse) reference genome using STARSolo. A CB tag was used to identify the human read ids, based on which the .fastq files were filtered. The filtered .fastqs were demultiplexed and aligned to the human reference genome using CellRanger.

Here, I will assess the quality of the CellRanger run based on its output. All the quality measurements acquired by the CellRanger are calculated for the flteres_feature_bc_matrices. 

# Set-up

### Load libraries

```{r message=FALSE, warning=FALSE}
library(ggplot2)
library(tidyr)
library(ggsci)
library(here)
```

### Colour Palette

```{r load_palette}
mypal <- pal_npg("nrc", alpha = 0.7)(10)
mypal2 <-pal_tron("legacy", alpha = 0.7)(7)
mypal3<-pal_lancet("lanonc", alpha = 0.7)(9)
mypal4<-pal_simpsons(palette = c("springfield"), alpha = 0.7)(16)
mypal5 <- pal_rickandmorty(palette = c("schwifty"), alpha = 0.7)(6)
mypal6 <- pal_futurama(palette = c("planetexpress"), alpha = 0.7)(5)
mypal7 <- pal_startrek(palette = c("uniform"), alpha = 0.7)(5)
mycoloursP<- c(mypal, mypal2, mypal3, mypal4, mypal5, mypal6, mypal7)
```

### Load data

```{r}
data <- read.csv(here("outs", "filter_70", "Human_R_Analysis", "Cellranger_Quality_Stats_allConditions", "Cellranger_Quality_Stats.csv"))

names(data)
```

# Summary Statistics

### Mean number of cells captured

```{r}
mean(data$Estimated_Number_of_Cells)
```

### Mean number of reads per cell

```{r}
mean(data$Mean_Reads_per_Cell)
```

### Mean number of Reads

```{r}
mean(data$Number_of_Reads)
```

### Mean Reads Mapped to Intons

```{r}
mean(data$Reads_Mapped_Confidently_to_Intronic_Regions)
```

### Mean Reads Mapped to Exons

```{r}
mean(data$Reads_Mapped_Confidently_to_Exonic_Regions)
```

### Mean Reads Mapped Antisense

```{r}
mean(data$Reads_Mapped_Antisense_to_Gene)
```

### Mean Reads Mapped to Transcriptome

```{r}
mean(data$Reads_Mapped_Confidently_to_Transcriptome)
```

### Mean Fraction Reads

```{r}
mean(data$Fraction_Reads_in_Cells)
```

### Mean Median Genes per Cell

```{r}
mean(data$Median_Genes_per_Cell)
```

# Plots

```{r}
ggplot(data, aes(x = Sample, 
                  y = (Reads_Mapped_Confidently_to_Exonic_Regions * 100) / 
                      (Reads_Mapped_Confidently_to_Intronic_Regions * 100 +
                           Reads_Mapped_Confidently_to_Exonic_Regions * 100))) +
     geom_bar(stat="identity") +
     ylab("% of exonic reads") +
     xlab("Sample") +
     theme_minimal(14) +
     theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
     scale_fill_manual(values = c(mycoloursP[5], mycoloursP[4])) +
     geom_hline(
         yintercept = 0.5,
         linetype = "dashed", color = "salmon", size = 1) +
     geom_hline(
         yintercept = 0.7,
         linetype = "dashed", color = "salmon", size = 1) +
     theme(legend.position = "none")
```

```{r}
long_data <- gather(data, IntEx, percentage,
                   Reads_Mapped_Confidently_to_Intronic_Regions:Reads_Mapped_Confidently_to_Exonic_Regions,
                   factor_key = TRUE)

long_data$status <- ifelse(long_data$IntEx ==
  "Reads_Mapped_Confidently_to_Intronic_Regions",
   "intronic", "exonic") 
```

```{r, fig.width = 6, fig.height = 4, fig.fullwidth=TRUE}
ggplot(data = long_data, aes(
  x = Sample, y = percentage * 100,
  fill = status
)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = c(mycoloursP[5], mycoloursP[4])) +
  theme_minimal() +
  ylab("percentage") +
  xlab("Sample") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))+
  geom_hline(
    yintercept = mean(data$Reads_Mapped_Confidently_to_Intronic_Regions)
    * 100,
    linetype = "dashed", color = "blue", size = 1 
  ) +
  geom_hline(
    yintercept = mean(data$Reads_Mapped_Confidently_to_Exonic_Regions)
    * 100,
    linetype = "dashed", color = "salmon", size = 1
  ) +
  geom_hline(
    yintercept = mean(data$Reads_Mapped_Confidently_to_Exonic_Regions +
      data$Reads_Mapped_Confidently_to_Intronic_Regions)
    * 100,
    linetype = "dashed", color = "yellow", size = 1
  )
```

```{r, fig.width = 6, fig.height = 4, fig.fullwidth=TRUE}
ggplot(data = long_data, aes(x = Sample, y = percentage, fill = status)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_fill_manual(values = c(mycoloursP[5], mycoloursP[4])) +
  theme_minimal() +
  ylab("Proporton") +
  xlab("Sample") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  geom_hline(yintercept = 0.5, color = "black", size = 1)
```

```{r}
data$other <- 1 - (data$Reads_Mapped_Confidently_to_Exonic_Regions +
  data$Reads_Mapped_Confidently_to_Intronic_Regions)
ggplot(data = data, aes(x = Sample, y = other)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = mycoloursP[6]) +
  theme_minimal() +
  ylab("Other Reads (not intronic / exonic)") +
  xlab("Sample") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  geom_hline(yintercept = mean(data$other), color = "salmon", size = 1)
```

```{r}
ggplot(data = data, aes(x = Sample, y = Estimated_Number_of_Cells)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = mycoloursP[4:5]) +
  theme_minimal() +
  ylab("Number of Cells") +
  xlab("sample") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  geom_hline(
    yintercept = mean(data$Estimated_Number_of_Cells), color = "salmon",
    size = 1
  )
```

```{r}
ggplot(data = data, aes(x = Sample, y = Mean_Reads_per_Cell)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = mycoloursP[4:5]) +
  theme_minimal() +
  ylab("Mean reads per cell") +
  xlab("sample") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  geom_hline(yintercept = mean(data$Mean_Reads_per_Cell), color = "salmon", size = 1)
```

Metformin_B\_S2 seems to have a lower number of reads compared to the rest of the samples, even though it is the sample with most cells in it.

```{r}
ggplot(data = data, aes(x = Sample, y = Number_of_Reads)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = mycoloursP[4:5]) +
  theme_minimal() +
  ylab("Mean number of reads") +
  xlab("sample") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  geom_hline(yintercept = mean(data$Number_of_Reads), color = "salmon", size = 1)
```

```{r}
ggplot(data = data, aes(x = Sample, y = Fraction_Reads_in_Cells)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = mycoloursP[4:5]) +
  theme_minimal() +
  ylab("Fraction reads in cells") +
  xlab("SampleD") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  geom_hline(
    yintercept = mean(data$Fraction_Reads_in_Cells), color = "salmon",
    size = 1
  )
```

```{r}
ggplot(data = data, aes(x = Sample, y = Median_Genes_per_Cell)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = mycoloursP[4:5]) +
  theme_minimal() +
  ylab("Median genes per cell") +
  xlab("Sample") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  geom_hline(
    yintercept = mean(data$Median_Genes_per_Cell), color = "salmon",
    size = 1
  )
```

```{r}
sum(data$Estimated_Number_of_Cells)
```

This is a very crude first look at the quality of the samples in this study before applying any quality control. The mean number of human cells captured for each sample was 300 with a total number of 2991 human cells in the complete dataset. On average, 71,296.5 reads were captured for each human cell with a median gene count of 2,359.9 per cell.

Samples to be monitored in downstream analysis:

-   Metformin_B\_S2 seems to have a lower number of reads per cell compared to the rest of the samples, even though it is the sample with most cells in it. The mean number of reads though does look fine, so I will need to keep an eye on this sample in downstream analysis.

-   The mean number of reads for all three Clemastine samples and the ddH2O_B\_S8 seem to be lower than the rest of the samples, but this can be potential correlated with the fact that they contain fewer cells in total.

<details>
  <summary>Click to expand </summary>
```{r}
sessionInfo()
```

</details>