# Append Editing (ADPr-TAE) sequencing data analysis
This repository contains scripts for analysis and visualization of specific edits observed using Nanopore and Illumina (NGS) sequencing technologies. They were used in the following paper:

[**Targeted DNA ADP-ribosylation drives distinct editing outcomes in bacteria and eukaryotes (2024)**](https://www.helmholtz-hiri.de/en/research/organisation/teams/team/rna-synthetic-biology/)

[Constantinos Patinios, Darshana Gupta, Harris V. Bassett, Scott P. Collins, Charlotte Kamm, Anuja Kibe, Christophe Toussaint, Katie Vollen, Chengsong Zhao, Yanyan Wang, Thuan Nguyen, Alessandro Del Re, Irene Calvin, Tatjana Achmedov, Kathryn Polkoff, Angela Migur, Emmanuel Saliba, Nathan Crook, Anna Stepanova, Jose M. Alonso, Chase L. Beisel](https://www.helmholtz-hiri.de/en/research/organisation/teams/team/rna-synthetic-biology/)

## Data Accessibility
NGS and Nanopore raw sequencing data used in the paper are available at [SRA](https://www.ncbi.nlm.nih.gov/sra).

Examples of processed data for running each script in this repository are available in the **data** folder.

## Repository Structure
**data:** Directory containing example data.

**analysis:** Directory containing the R scripts.

**outputs:** Directory created when the scripts are run. It holds the processed data and the resulting tables and plots.

## Code Execution
#### 1- Download repository
Option 1: Manually download the repository as a ZIP archive and extract it on your computer

Option 2: Clone the repository
```shell
git clone https://github.com/saliba-lab/MBE_analysis.git
cd MBE_analysis/analysis
```


#### 2- Install R dependencies 
See the Dependencies section.


#### 3- Description of R scripts in the **analysis** directory
Make sure to set the **analysis** directory as the working directory when running the scripts.

**Scripts 1 to 3** analyze data obtained by Illumina sequencing. They take as input the Allele_frequency_table_around_sgRNA_ files (in .txt format) generated by [CRISPResso2](https://github.com/pinellolab/CRISPResso2).
They also require sample-specific metadata provided in sample sheets (also located in the **analysis** directory).
The metadata can specify a minimum read count for including a sample in the analysis (Threshold_read_counts), the position of interest in the sequencing read at which to look for mutations (Mutation_Position), or replicate information (Replicate_group).
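
As an illustration of how a Threshold_read_counts value could gate a sample, here is a minimal shell sketch. It is not taken from the R scripts: it assumes the read count of a sample is the sum of the #Reads column, that #Reads is the second-to-last column of the tab-separated table, and it uses a made-up table and a hypothetical threshold of 500.

```shell
# Tiny, made-up allele table (tab-separated), for illustration only.
# Assumed layout: second-to-last column = #Reads, last column = %Reads.
table=$(mktemp)
printf 'Aligned_Sequence\tReference_Sequence\t#Reads\t%%Reads\n' > "$table"
printf 'ACGTA\tACGTA\t800\t80.0\n' >> "$table"
printf 'ACTTA\tACGTA\t200\t20.0\n' >> "$table"

# Hypothetical Threshold_read_counts value taken from a sample sheet.
threshold=500

# Sum the #Reads column over all alleles, skipping the header line.
total_reads=$(awk -F'\t' 'NR > 1 { sum += $(NF - 1) } END { print sum + 0 }' "$table")

# Keep the sample only if it reaches the minimum read count.
if [ "$total_reads" -ge "$threshold" ]; then
  sample_status="keep"
else
  sample_status="skip"
fi
echo "total reads: $total_reads -> $sample_status"

rm -f "$table"
```

The actual scripts may count reads differently, so treat this only as a way to reason about which samples a given threshold would exclude.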

1. Script1: This script sums the percentage of reads (%Reads) containing a nucleotide that differs from the reference at a specified position.

2. Script2: This script sums the percentage of reads (%Reads) containing a specific nucleotide at each position along the length of the read.

3. Script3: This script sums the percentage of reads (%Reads) containing nucleotides that differ from the reference at two positions specified in the sample sheet, and produces plots of the results.
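
The summaries described for Script1 and Script2 can be sketched with a few lines of shell and awk. This is an illustrative approximation, not the repository's R code. It assumes a tab-separated table with the aligned read sequence in column 1, the reference sequence in column 2 and %Reads in the last column, and it treats positions as 1-based. The table contents, `pos` and `nt` are made up for the example.

```shell
# Tiny, made-up allele table (tab-separated), for illustration only.
# Assumed layout: column 1 = aligned read sequence, column 2 = reference
# sequence, last column = %Reads. Check this against your own files.
table=$(mktemp)
printf 'Aligned_Sequence\tReference_Sequence\t#Reads\t%%Reads\n' > "$table"
printf 'ACGTA\tACGTA\t80\t80.0\n' >> "$table"
printf 'ACTTA\tACGTA\t15\t15.0\n' >> "$table"
printf 'ACATA\tACGTA\t5\t5.0\n' >> "$table"

# Script1-style summary: total %Reads whose base at position pos
# (1-based; a hypothetical Mutation_Position value) differs from the reference.
pos=3
pct_non_ref=$(awk -F'\t' -v pos="$pos" '
  NR > 1 && substr($1, pos, 1) != substr($2, pos, 1) { sum += $NF }
  END { printf "%.1f", sum }
' "$table")
echo "position $pos, non-reference: $pct_non_ref%"

# Script2-style summary: %Reads carrying nucleotide nt at each position.
nt=T
pct_by_position=$(awk -F'\t' -v nt="$nt" '
  NR > 1 {
    n = length($1)
    if (n > max) max = n
    for (i = 1; i <= n; i++) if (substr($1, i, 1) == nt) pct[i] += $NF
  }
  END { for (i = 1; i <= max; i++) printf "%d\t%.1f\n", i, pct[i] }
' "$table")
echo "$pct_by_position"

rm -f "$table"
```

With this toy table, 20.0% of reads differ from the reference at position 3, and T is found at position 3 in 15.0% of reads and at position 4 in 100.0% of reads. The Script3 summary could be approximated by running the first awk command once per position of interest.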

**Scripts 4 to 6** are related to data obtained by Nanopore sequencing.

4. Script4: Related to Figures 1h and S3. This script reads BAM files, then takes two actions. First, it calculates and plots the fraction of unedited, edited, and ambiguous reads in a region. Second, it calculates and plots SNVs at a specific position in an otherwise unedited region, as a percentage of all reads.

5. Script5: Related to Figure 1i. This script reads a CSV file, then calculates and plots the frequency of SNVs that exceed the filtering criteria.

6. Script6: Related to Figure S5. This script reads a CSV file, then takes two actions. First, it plots individual growth curves. Second, it plots the final absorbance values at 600 nm.

## Dependencies
R version and R packages used to run the scripts. GenomicAlignments is distributed through Bioconductor; the other packages are available on CRAN.

- R                 4.0.3
- dplyr             1.0.10
- openxlsx          4.2.3
- ggplot2           3.4.0
- cowplot           1.1.1
- tidyverse         1.3.0
- GenomicAlignments 1.40.0