UC-MULTI: Integration of multi-omics data of microbial species

This page contains guidelines for the processing, analysis, and integration of genomics, transcriptomics, and proteomics data, as outlined in the overview of the NFDI4Microbiota use cases (https://nfdi4microbiota.de/latest/usecase). The aim is for these guidelines to give a comprehensive overview of multi-omics data integration for data derived from a variety of experiments. Feedback from the wider NFDI4Microbiota community and from any other interested parties who wish to contribute is therefore important. Please find the contribution guidelines here: https://github.com/NFDI4Microbiota/nfdi4microbiota-knowledge-base/blob/main/docs/_Getting-Started/02-contributing.md.

1 Genomics

1.1 Genome circularization

When comparing bacterial strains or species, it is important to work with genomes that are as complete as possible. We therefore recommend using state-of-the-art tools to circularize isolate genomes (if needed and if applicable). Unicycler (https://github.com/rrwick/Unicycler) is a tool that can assemble and circularize genomes from short reads (e.g. Illumina sequencing), from long reads (e.g. Oxford Nanopore), or as a hybrid assembly using both short and long reads.

1.2 Genome annotation

Genome annotation is at the center of multi-omics data integration, as we can only compare and draw information from identified genes and proteins. Tools such as Prokka (https://github.com/tseemann/prokka) are easy to install and use for gene annotation of the circular/complete genome.

2 Transcriptomics

To date, the most common method to investigate transcriptomics is RNA-Seq (https://en.wikipedia.org/wiki/RNA-Seq).

2.1 RNA-Seq read library preparation

2.1.1 Ribosomal RNA removal

The most abundant type of cellular RNA is ribosomal RNA (rRNA). To explore metabolic or regulatory differences, it is important to remove all rRNA sequences before continuing with the analysis. We recommend performing this step even when the sequencing company has already removed the rRNA before sequencing. One tool that separates rRNA from other RNA types is SortMeRNA (https://github.com/sortmerna/sortmerna). The input is the RNA-Seq read libraries from the sequencing company. Because SortMeRNA aims to isolate the rRNA sequences, the desired sequences in our case are the "rejected" sequences, that is, the reads that do not match the rRNA reference database.
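As a minimal sketch of this step, the helper below builds a SortMeRNA call that writes the rRNA reads and the "rejected" (non-rRNA) reads to separate outputs. The flags used (`--ref`, `--reads`, `--workdir`, `--fastx`, `--aligned`, `--other`) are those documented for SortMeRNA 4.x, which is an assumption about the installed version; all file names and the function names are hypothetical and should be adapted and checked against `sortmerna -h`.

```python
import subprocess


def sortmerna_command(reads, ref_db, workdir, out_prefix):
    """Build a SortMeRNA call (assumed 4.x flags) for one read library.

    reads      -- FASTQ file with the RNA-Seq reads
    ref_db     -- FASTA file with the rRNA reference sequences
    workdir    -- directory SortMeRNA uses for its index and key-value store
    out_prefix -- prefix for the two output files
    """
    return [
        "sortmerna",
        "--ref", ref_db,
        "--reads", reads,
        "--workdir", workdir,
        "--fastx",                               # write reads in FASTA/FASTQ format
        "--aligned", f"{out_prefix}_rRNA",       # reads matching the rRNA database
        "--other", f"{out_prefix}_non_rRNA",     # "rejected" reads, kept for analysis
    ]


def run_sortmerna(reads, ref_db, workdir, out_prefix):
    """Run SortMeRNA and raise an error if it exits unsuccessfully."""
    subprocess.run(sortmerna_command(reads, ref_db, workdir, out_prefix), check=True)


# Hypothetical file names, printed only to show the resulting command line.
print(" ".join(sortmerna_command("lib1.fastq", "rRNA_db.fasta", "smr_work", "out/lib1")))
```

The file written under the `--other` prefix is the one to carry forward into the following steps.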

2.1.2 Adjusting RNA-Seq read library sizes

To compare RNA-Seq libraries, it is important that the libraries are of similar sizes, that is, that they contain a similar number of reads. When some libraries are much larger than others, this may skew the normalization during the alignment process. This can be adjusted by reducing the larger libraries to a size closer to that of the smaller libraries. We therefore recommend counting the reads of each library (https://www.biostars.org/p/139006/) and inspecting the library sizes, for example with a histogram. If there is a large difference in library sizes, the larger libraries can be reduced to the desired size, for example the average number of reads in the smaller libraries. The Biostars user "rtlim" explains how to use Seqtk (https://github.com/lh3/seqtk) to subsample FASTQ files (https://www.biostars.org/p/142705/#142707).
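The counting and downsizing described above can also be sketched in plain Python. This is an illustrative alternative to Seqtk, not a replacement for it: it assumes well-formed FASTQ files with exactly four lines per read, keeps the sampled reads in memory, and uses function names and a default seed that are hypothetical.

```python
import gzip
import random
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fastq(handle):
    """Yield each read as a tuple of its four FASTQ lines."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0].strip():  # end of file (or trailing blank line)
            return
        yield tuple(record)


def count_reads(path):
    """Return the number of reads in a FASTQ file."""
    with open_text(path) as handle:
        return sum(1 for _ in read_fastq(handle))


def subsample_fastq(in_path, out_path, n_reads, seed=11):
    """Write a uniform random sample of n_reads reads (reservoir sampling).

    Returns the number of reads written, which is smaller than n_reads
    only if the input library contains fewer reads.
    """
    rng = random.Random(seed)
    reservoir = []
    with open_text(in_path) as handle:
        for i, record in enumerate(read_fastq(handle)):
            if i < n_reads:
                reservoir.append(record)
            else:
                j = rng.randint(0, i)  # inclusive bounds
                if j < n_reads:
                    reservoir[j] = record
    with open(out_path, "w") as out:
        for record in reservoir:
            out.writelines(record)
    return len(reservoir)
```

For paired-end libraries, calling `subsample_fastq` on both mate files with the same `seed` and `n_reads` keeps the same read pairs, provided both files list the reads in the same order. The sampled reads are not written in their original order.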

2.2 RNA-Seq read alignment and normalization

Several tools exist for transcriptomics analysis, but we recommend using READemption (https://reademption.readthedocs.io/en/latest/), as this tool is easy to install and use. The input is the rRNA-removed (2.1.1), size-adjusted (2.1.2) RNA-Seq libraries, the complete/circular genome (1.1), and the gene annotations (1.2).

3 Proteomics

4 Integration

4.1 Genome-scale metabolic reconstruction

Genome-scale metabolic reconstructions can be created from the complete/circular genome sequences or genome annotations, for example using the tool CarveMe (https://github.com/cdanielmachado/carveme), which creates a metabolic network from the amino acid sequences of the annotated genome. We can then combine this network with transcriptomics data, proteomics data, or both, using the well-established methods from the COBRA Toolbox (https://github.com/opencobra, https://github.com/opencobra/COBRA.tutorials/tree/1da9dd3ffcdcdf3b5de2cd25a83b4b80ba65a65e/dataIntegration). The resulting context-specific models can highlight differences in the metabolic potential of the different bacterial strains.

4.2 KEGG Orthology

If metabolic differences are not the target, then it is also possible to investigate cellular differences using KEGG (https://www.genome.jp/kegg/). Genome and protein amino acid sequences (1.2) can be uploaded to GhostKOALA (https://www.kegg.jp/ghostkoala/), which attempts to assign the corresponding KO (KEGG Orthology) identifiers. The resulting KO identifiers can then be entered into the KEGG API (https://www.kegg.jp/kegg/rest/keggapi.html) or KEGG Mapper (https://www.genome.jp/kegg/mapper/) for further inspection of differences within KEGG modules or pathways.
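A minimal sketch of the last step: the code below collects the KO identifiers of two strains, finds the KOs unique to one of them, and looks up the associated pathways through the `link` operation of the KEGG REST API. It assumes that the GhostKOALA result has been saved as a tab-separated file with the gene identifier in the first column and the KO identifier (if any) in the second; the function names and example identifiers are hypothetical.

```python
import urllib.request

KEGG_REST = "https://rest.kegg.jp"


def parse_ko_table(lines):
    """Collect KO identifiers from tab-separated 'gene<TAB>KO' lines.

    Genes without a KO assignment (no second column) are skipped.
    """
    ko_ids = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1]:
            ko_ids.add(fields[1])
    return ko_ids


def parse_link_response(text):
    """Parse the tab-separated output of a KEGG 'link' query into a dict."""
    links = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        links.setdefault(fields[0], set()).add(fields[1])
    return links


def fetch_pathways(ko_ids):
    """Look up the KEGG pathways linked to the given KO identifiers.

    Requires network access; for long lists it may be sensible to send
    the identifiers in smaller batches.
    """
    query = "+".join(f"ko:{ko}" for ko in sorted(ko_ids))
    url = f"{KEGG_REST}/link/pathway/{query}"
    with urllib.request.urlopen(url) as response:
        return parse_link_response(response.read().decode("utf-8"))


# Toy input standing in for the GhostKOALA results of two strains.
strain_a = parse_ko_table(["geneA1\tK00001\n", "geneA2\n", "geneA3\tK00002\n"])
strain_b = parse_ko_table(["geneB1\tK00001\n"])
only_in_a = strain_a - strain_b  # KOs present in strain A but not in strain B
```

Passing `only_in_a` to `fetch_pathways` returns, for each KO, the set of linked pathway identifiers, which can then be inspected further in KEGG Mapper.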