This page contains guidelines for the processing, analysis, and integration of genomics, transcriptomics, and proteomics data, as explained in the overview of the NFDI4Microbiota use cases (https://nfdi4microbiota.de/latest/usecase). The aim is for these guidelines to provide a comprehensive overview of multi-omics data integration for data derived from a variety of experiments. It is therefore important to receive feedback from the larger NFDI4Microbiota community and from any other interested parties who wish to contribute. Please find the contribution guidelines here: https://github.com/NFDI4Microbiota/nfdi4microbiota-knowledge-base/blob/main/docs/_Getting-Started/02-contributing.md.
When comparing bacterial strains or species, it is important to work with genomes that are as complete as possible. We therefore recommend using state-of-the-art tools to circularize isolate genomes (if needed and if applicable). Unicycler (https://github.com/rrwick/Unicycler) is a tool that can circularize genomes using a short-read assembly (e.g. Illumina sequencing), a long-read assembly (e.g. Oxford Nanopore), or a hybrid assembly that uses both short and long reads.
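The three assembly modes can be sketched as a small Python helper that builds the Unicycler command line. This is a minimal sketch, not an official wrapper: the function name and all file names are hypothetical, and only the `-1`, `-2`, `-l`, and `-o` options from the Unicycler README are used.

```python
from typing import List, Optional


def unicycler_command(
    out_dir: str,
    short_1: Optional[str] = None,
    short_2: Optional[str] = None,
    long_reads: Optional[str] = None,
) -> List[str]:
    """Build an argument list for a short-read, long-read, or hybrid assembly."""
    # Paired-end short reads only make sense when both mates are given.
    if (short_1 is None) != (short_2 is None):
        raise ValueError("paired short reads need both files")
    if short_1 is None and long_reads is None:
        raise ValueError("provide short reads, long reads, or both")

    cmd = ["unicycler"]
    if short_1 is not None:
        cmd += ["-1", short_1, "-2", short_2]  # Illumina read pair
    if long_reads is not None:
        cmd += ["-l", long_reads]  # e.g. Oxford Nanopore reads
    cmd += ["-o", out_dir]
    return cmd


# Hypothetical file names, for illustration only.
hybrid = unicycler_command(
    "assembly_out", "isolate_R1.fastq.gz", "isolate_R2.fastq.gz", "isolate_ont.fastq.gz"
)
print(" ".join(hybrid))
```

Once Unicycler is installed, the returned list can be passed to `subprocess.run(cmd, check=True)`. Whether a replicon was circularized should still be checked in the Unicycler log and output files.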
Genome annotation is at the center of multi-omics data integration, as we can only compare and draw information from identified genes and proteins. Tools such as Prokka (https://github.com/tseemann/prokka) are easy to install and use for gene annotation of the circular/complete genome.
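Because all downstream comparisons depend on the annotated features, a quick sanity check of the annotation can be useful. The sketch below counts feature types in a GFF3 file such as the `.gff` file written by Prokka. The function name and the example lines are made up for illustration; the parsing relies only on the GFF3 conventions of nine tab-separated columns, `#` comment lines, and a `##FASTA` directive before embedded sequences.

```python
from collections import Counter
from typing import Iterable


def count_feature_types(gff_lines: Iterable[str]) -> Counter:
    """Count GFF3 features by type (column 3), ignoring comments and sequences."""
    counts: Counter = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence data
        if not line or line.startswith("#"):
            continue  # blank lines, comments, and other directives
        fields = line.split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


# Made-up annotation lines, for illustration only.
example_gff = [
    "##gff-version 3",
    "contig_1\tProdigal\tCDS\t100\t400\t.\t+\t0\tID=gene_0001",
    "contig_1\tProdigal\tCDS\t500\t900\t.\t-\t0\tID=gene_0002",
    "contig_1\tAragorn\ttRNA\t1000\t1075\t.\t+\t.\tID=gene_0003",
    "##FASTA",
    ">contig_1",
    "ATGC",
]
print(count_feature_types(example_gff))
```

For a real file, the lines can be read with `open("annotation.gff")` (a hypothetical path). Strains with very different CDS counts may deserve a closer look at assembly completeness before integration.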
To date, the most common method to investigate transcriptomics is RNA-Seq (https://en.wikipedia.org/wiki/RNA-Seq).
The most abundant type of cellular RNA is ribosomal RNA (rRNA). If one wishes to explore metabolic or regulatory differences, it is important to remove all rRNA sequences before continuing with the analysis. We recommend performing this step even when the sequencing company has already removed rRNA before sequencing. One tool that separates rRNA from other RNA types is SortMeRNA (https://github.com/sortmerna/sortmerna). The input is the RNA-Seq read libraries from the sequencing company. Because SortMeRNA aims to isolate the rRNA sequences, the desired sequences in our case are the "rejected" sequences, i.e. the reads that do not match the rRNA reference database.
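The rRNA filtering step can be sketched as a command builder for SortMeRNA. This is a sketch under assumptions: the function name, database file, and read file names are hypothetical, and the options (`--ref`, `--reads`, `--workdir`, `--fastx`, `--aligned`, `--other`) follow the SortMeRNA 4.x command line, so they should be checked against `sortmerna -h` for the installed version.

```python
from typing import List, Sequence


def sortmerna_command(
    reads: Sequence[str],
    ref_db: str,
    workdir: str,
    rrna_prefix: str,
    non_rrna_prefix: str,
) -> List[str]:
    """Build a SortMeRNA call that also writes the non-rRNA ("rejected") reads."""
    if len(reads) not in (1, 2):
        raise ValueError("expected one single-end file or two paired-end files")

    cmd = ["sortmerna", "--ref", ref_db]
    for path in reads:
        cmd += ["--reads", path]
    cmd += [
        "--workdir", workdir,
        "--fastx",                   # write reads as FASTA/FASTQ, not only reports
        "--aligned", rrna_prefix,    # reads matching the rRNA database
        "--other", non_rrna_prefix,  # reads NOT matching: the ones kept here
    ]
    return cmd


# Hypothetical file names, for illustration only.
cmd = sortmerna_command(
    ["sample1.fastq.gz"], "rrna_db.fasta", "sortmerna_work", "sample1_rRNA", "sample1_non_rRNA"
)
print(" ".join(cmd))
```

The file written under the `--other` prefix is the one to carry forward into the later steps; the `--aligned` output is only useful for checking how much rRNA remained in the library.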
To compare RNA-Seq libraries, it is important that the libraries are of similar sizes. When some libraries are much larger than others, this may skew the normalization during the alignment process. This can be adjusted by reducing the larger libraries to a size closer to that of the smaller libraries. We therefore recommend counting the reads of each library (https://www.biostars.org/p/139006/) and inspecting the library sizes, for example with a histogram. If there is a large difference in library sizes, the larger libraries can be reduced to the desired size, for example the average read count of the smaller libraries. In this Biostars answer, the user "rtlim" explains how to use Seqtk (https://github.com/lh3/seqtk) to reduce fastq files (https://www.biostars.org/p/142705/#142707).
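The counting and downsampling steps can be sketched in Python as follows. Several parts are assumptions made for illustration: the function names, the rule that a library counts as "large" when it exceeds `factor` times the median size, and the library names and sizes in the example. The read counter assumes standard four-line FASTQ records, and the Seqtk call uses the `seqtk sample -s<seed> <file> <n>` form from the Seqtk README.

```python
import gzip
from statistics import mean, median
from typing import Dict, List


def count_fastq_reads(path: str) -> int:
    """Count reads in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def subsampling_targets(sizes: Dict[str, int], factor: float = 2.0) -> Dict[str, int]:
    """Map each oversized library to a target read count.

    Assumed rule: a library is "large" if it has more than `factor` times
    the median read count; the target is the mean size of the other libraries.
    """
    threshold = factor * median(sizes.values())
    small = [n for n in sizes.values() if n <= threshold]
    target = round(mean(small))
    return {name: target for name, n in sizes.items() if n > threshold}


def seqtk_sample_command(path: str, n_reads: int, seed: int = 100) -> List[str]:
    """Build a Seqtk call; Seqtk writes the sampled reads to standard output."""
    return ["seqtk", "sample", f"-s{seed}", path, str(n_reads)]


# Made-up library sizes, for illustration only.
sizes = {"ctrl_1": 1_000_000, "ctrl_2": 1_200_000, "treat_1": 5_000_000}
targets = subsampling_targets(sizes)
print(targets)
for name, n in targets.items():
    print(" ".join(seqtk_sample_command(name + ".fastq", n)))
```

Because Seqtk writes to standard output, the result needs to be redirected into a new file. For paired-end libraries, using the same seed for both mate files keeps the read pairs in sync.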
Several tools exist for transcriptomics analysis, but we recommend using READemption (https://reademption.readthedocs.io/en/latest/), as this tool is easy to install and use. The inputs are the rRNA-removed (2.1.1), size-adjusted (2.1.2) RNA-Seq libraries, the complete/circular genome (1.1), and the gene annotations (1.2).
Genome-scale metabolic reconstructions can be created from the complete/circular genome sequences or genome annotations, for example using the tool CarveMe (https://github.com/cdanielmachado/carveme), which creates a metabolic network from the amino acid sequences of the annotated genome. We can then combine this network with transcriptomics data, proteomics data, or both, using the well-established methods from the COBRA Toolbox (https://github.com/opencobra, https://github.com/opencobra/COBRA.tutorials/tree/1da9dd3ffcdcdf3b5de2cd25a83b4b80ba65a65e/dataIntegration). The resulting context-specific models can highlight differences in the metabolic potential of the different bacterial strains.
If metabolic differences are not the target, then it is also possible to investigate cellular differences using KEGG (https://www.genome.jp/kegg/). Genome and protein amino acid sequences (1.2) can be uploaded to GhostKOALA (https://www.kegg.jp/ghostkoala/), which will aim to assign the corresponding KEGG Orthology (KO) identifiers to the genes. The resulting KO identifiers can then be entered into the KEGG API (https://www.kegg.jp/kegg/rest/keggapi.html) or KEGG Mapper (https://www.genome.jp/kegg/mapper/) for further inspection of differences within KEGG modules or pathways.
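Comparing the KO assignments of two strains can be sketched as a set comparison. The sketch assumes that the GhostKOALA result is available as a two-column, tab-separated file (gene identifier, then KO identifier, with the second column missing for unannotated genes); this layout, the function names, and the example identifiers are assumptions for illustration. The URL helper uses the `link` operation of the KEGG REST API and only builds the URL without sending a request.

```python
from typing import Dict, Iterable, Set


def read_ko_assignments(lines: Iterable[str]) -> Dict[str, str]:
    """Map gene identifiers to KO identifiers, skipping genes without a KO."""
    assignments: Dict[str, str] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1]:
            assignments[fields[0]] = fields[1]
    return assignments


def compare_ko_sets(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, Set[str]]:
    """Split the KO identifiers of two strains into shared and strain-specific sets."""
    kos_a, kos_b = set(a.values()), set(b.values())
    return {"shared": kos_a & kos_b, "only_a": kos_a - kos_b, "only_b": kos_b - kos_a}


def kegg_link_url(kos: Iterable[str], target: str = "pathway") -> str:
    """Build a KEGG REST 'link' URL that maps KO identifiers to pathways or modules."""
    entries = "+".join("ko:" + ko for ko in sorted(kos))
    return "https://rest.kegg.jp/link/" + target + "/" + entries


# Made-up assignments for two strains, for illustration only.
strain_a = read_ko_assignments(["geneA1\tK00001", "geneA2", "geneA3\tK00844"])
strain_b = read_ko_assignments(["geneB1\tK00844", "geneB2\tK01810"])
result = compare_ko_sets(strain_a, strain_b)
print(result)
print(kegg_link_url(result["only_a"]))
```

The strain-specific KO sets can also be pasted into KEGG Mapper for a visual comparison of the affected pathways and modules.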