Scripts/Human_R_Analysis/Clemastine_DMSO/clem_Normalisation.Rmd · Chimeras

---
title: "clem_Normalisation"
author: "Nina-Lydia Kazakou"
date: '2022-07-15'
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Set-up

### Load libraries

```{r message=FALSE, warning=FALSE}
library(here) # For reproducible paths
library(SingleCellExperiment)
library(scater) # For QC and visualization
library(scran) # For Normalization
library(Matrix) # For log-transformation of the raw data
library(ggplot2) # For plots
library(ggsci)
```

# **Normalization by Deconvolution**

Normalization of single-cell RNA sequencing data is necessary to eliminate cell-specific biases prior to downstream analyses. Normalization corrects for cell-to-cell differences in capture efficiency, sequencing depth, and other technical confounders, ensuring that downstream comparisons of relative expression between cells are valid (Lun, Bach & Marioni, 2016). Correcting for systematic differences in sequencing coverage includes dividing all counts for each cell by a cell-specific scaling factor, called "size factor". The size factor (sj) represents the coverage, or sampling depth, of library j, and we will use the term common scale for quantities, such as qi, ρ(j), that are adjusted for coverage by dividing by sj (Anders & Huber 2010). This approach assumes that any cell specific bias affects all genes equally via scaling of the expected mean count for that cell. The size factor, which is calculated for each cells, represents the estimate of the relative bias in each cell. Division of cell counts by the size factor will exclude the bias per cell.

Here, I will use the scran package for normalization through deconvolution. The aim of the deconvolution strategy is to normalize on summed expression values from pools of cells. Summation across cells results in fewer zeroes, which means that normalization is less susceptible to errors. While normalization accuracy is improved, the estimated size factors are only relevant to the pools of cells. To obtain relevant estimates, the size factor for each pool is deconvolved into the size factors for its constituent cells. This ensures that cell-specific biases can be properly normalized (Lun, Bach & Marioni, 2016).

The deconvolution method consists of several key steps:

-   Defining a pool of cells

-   Summing expression values across all cells in the pool

-   Normalizing the cell pool against an average reference, using the summed expression values

-   Repeating this for many different pools of cells to construct a linear system

-   Deconvolving the pool-based size factors to their cell-based counterparts

Here, I will also log-transform the data, since differences in the log-values represent log-fold changes in expression. Log-transformation focuses on promotes contributions from genes with strong relative differences.

```{r matlog2}
# Adapted function from VISION to log tranform sparse matrix
matLog2 <- function(spmat, scale = FALSE, scaleFactor = 1e6) {
    if (scale == TRUE) {
        spmat <- t( t(spmat) / colSums(spmat)) * scaleFactor
    }
    if (is(spmat, "sparseMatrix")) {
        matsum <- summary(spmat)
        logx <- log2(matsum$x + 1)
        logmat <- sparseMatrix(i = matsum$i, j = matsum$j,
                               x = logx, dims = dim(spmat),
                               dimnames = dimnames(spmat))
    } else {
        logmat <- log2(spmat + 1)
    }
    return(logmat)
}
```

It is the first dataset I will apply log-transformation directly after normalization.

```{r Normalization}
if (!(file.exists(here("Data", "Human_R_Analysis", "Clemastine_DMSO", "clem_human_sce_norm.rds")))) {
  sce <- readRDS(here("Data", "Human_R_Analysis", "Clemastine_DMSO", "clem_human_sce_QC.rds"))
  # For reproducibility
  set.seed(100)
  # Quick clustering to pool samples together and deal with 0 counts
  quick_clusters <- quickCluster(sce)
  table(quick_clusters)
  # Calculate size factors
  sce <- computeSumFactors(sce, cluster = quick_clusters, min.mean = 0.1)
  # Check that there are not negative size factors
  summary(sizeFactors(sce))
  # Apply size factors and log transform them
  sce <- logNormCounts(sce)
  # Also log normalise the raw counts
  assay(sce, "logcounts_raw") <- matLog2(counts(sce))
  saveRDS(sce, here("Data", "Human_R_Analysis", "Clemastine_DMSO", "clem_human_sce_norm.rds"))
} else{
  sce <- here("Data", "Human_R_Analysis", "Clemastine_DMSO", "clem_human_sce_norm.rds")
}
```

# Variance: assessment of confounding factors and their impact

Variable-level metrics can be computed by the **getVarianceExplained()** function, which calculates the percentage of variance in the expression of each gene. Thus, the experimental factors that contribute to the expression variance can be determined.

### Before normalization

At this stage, most of the variance should be attributed to the sequencing depth of the experiment, i.e. the total number of umis and the total number of genes.

```{r var_before_norm}
# use the raw_counts
if (!(file.exists(here("Data", "Human_R_Analysis", "Clemastine_DMSO", "clem_human_variance_before_norm.rds")))) {
  # Calculate the matrix (time consuming step)
  var <- getVarianceExplained(
    sce,
    exprs_values = "logcounts_raw",
    variables = c(
      "Chip",
      "Treatment",
      "animalID",
      "subsets_mt_percent",
      "detected",
      "total"
    )
  )
  saveRDS(var, here("Data", "Human_R_Analysis", "Clemastine_DMSO", "clem_human_variance_before_norm.rds"))
  #If not just load created object
} else {
  var <- readRDS(here("Data", "Human_R_Analysis", "Clemastine_DMSO", "clem_human_variance_before_norm.rds"))
}
plotExplanatoryVariables(var)
```

The Density measure of the y axis, represents the density of the R-squared values across all genes.

### After normalization

```{r var_after_norm}
# Use the normalized_data
if (!(file.exists(here("Data", "Human_R_Analysis", "Clemastine_DMSO", "clem_human_variance_after_norm.rds"))))
  {
  var_norm <- getVarianceExplained(
    sce,
    variables = c(
      "Chip",
      "Treatment",
      "animalID",
      "subsets_mt_percent",
      "detected",
      "total"
    )
  )
  saveRDS(var_norm, here("Data", "Human_R_Analysis", "Clemastine_DMSO", "clem_human_variance_after_norm.rds"))
} else{
  var_norm <- readRDS(here("Data", "Human_R_Analysis", "Clemastine_DMSO", "clem_human_variance_after_norm.rds"))
}
plotExplanatoryVariables(var_norm)
```

In principal, this graph shows reduced variance, for instance in the metric "total" (total number of UMIs) or the metric "detected" (total number of genes).

# Dimensional Reduction

A very quick clustering, using only the most variable genes can be used to reduce the technical noise of the dataset.

PCA conscious another way to assess variance, demonstrating how most of the observed variance can be attributed to the sequencing depth (sum). Alternative non-linear reductions can also be used, like the UMAP or tSNE repressentations.

```{r pca}
raw <- runPCA(sce, exprs_values = "logcounts_raw")
plotPCA(raw, colour_by= "Chip", size_by="sum", point_size=1.5) + ggtitle("Before_Normalization")

sce <- runPCA(sce)
plotPCA(sce, colour_by= "Chip", size_by="sum", point_size=1.5) + ggtitle("After_Normalization")
```

```{r umap}
sce <- runUMAP(sce,  dimred="PCA")
```

```{r}
plotReducedDim(sce, colour_by= "Chip", point_size=1.5 , dimred = "UMAP") + ggtitle("UMAP")

plotReducedDim(sce, colour_by= "Treatment", point_size=1.5, dimred = "UMAP") + ggtitle("UMAP")

plotReducedDim(sce, colour_by= "Treatment", point_size=1.5, dimred = "UMAP", other_fields = "Treatment") +  ggtitle("UMAP")+ facet_wrap(~Treatment)
```

```{r tsne}
sce <- runTSNE(sce,  dimred="PCA")
```

```{r}
plotReducedDim(sce, colour_by= "Chip", point_size=1.5, dimred = "TSNE") + ggtitle("TSNE")

plotReducedDim(sce, colour_by= "Treatment", point_size=1.5, dimred = "TSNE") +  ggtitle("TSNE")

plotReducedDim(sce, colour_by= "Treatment", point_size=1.5, dimred = "TSNE", other_fields = "Treatment") + ggtitle("TSNE") + facet_wrap(~Treatment)
```

##### SessionInfo

<details>

<summary>

Click to expand

</summary>

```{r}
sessionInfo()
```

</details>