# Introduction

This iSoMAs package contains the proposed “iSoMAs” (iSoform expression and somatic Mutation Association) algorithm (function `iSoMAs`), an efficient computational pipeline based on principal component analysis (PCA) for exploring the association between somatic single nucleotide variants (SNVs) and gene isoform expression through both cis- and trans-regulatory mechanisms. We have shown that iSoMAs is more efficient and versatile than existing methods in identifying SNV-isoform expression associations. Because iSoMAs implements differential analysis along a small number of PC coordinates (always ≤50) rather than directly along thousands of original transcript axes, it dramatically outperforms existing models in efficiency. Furthermore, iSoMAs searches for SNV-associated isoform expression over the whole transcriptome simultaneously, which makes it more versatile than previous approaches. Most importantly, iSoMAs overcomes the limitation of the low mutation frequencies of most cancer genes by examining the association between an SNV and the meta-isoform expression quantified as the PC score, which substantially reduces the false positive rate incurred by the traditional gene-by-gene association study.

iSoMAs integrates sample-matched RNA-seq and DNA-seq data to study the association between gene somatic mutations and gene isoform expression. The iSoMAs workflow consists of two steps.

In the first step, the high-dimensional gene isoform expression matrix (`d=59,866`, derived from the `15,448` multi-isoform genes) for each cancer type is trimmed by the `mean.var.plot` method (built into the Seurat toolkit) into a more informative expression matrix, which keeps only the most variable isoforms but is still high-dimensional (`d~3,500`). Principal component analysis (PCA) is then performed to further reduce the informative expression matrix to a much lower-dimensional PC score matrix (`d<=50`) by calculating a PC loading matrix.
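
The dimension reduction in the first step can be illustrated on simulated data. The following is a minimal sketch, not the package's implementation: it uses base R `prcomp` instead of Seurat, skips the variable-isoform selection, and the matrix sizes, the number of PCs and all variable names (`expr`, `pc_scores`, `pc_loadings`) are assumptions chosen for illustration.

```r
# Toy stand-in for the informative isoform expression matrix:
# 40 samples (rows) x 200 variable isoforms (columns). Simulated values only.
set.seed(1)
n_samples  <- 40
n_isoforms <- 200
expr <- matrix(rnorm(n_samples * n_isoforms), nrow = n_samples,
               dimnames = list(paste0("sample", seq_len(n_samples)),
                               paste0("iso", seq_len(n_isoforms))))

# PCA with base R, keeping only the top n_pcs components
# (iSoMAs keeps at most 50; 10 is an arbitrary choice for this toy example).
n_pcs <- 10
pca <- prcomp(expr, center = TRUE, scale. = TRUE, rank. = n_pcs)

pc_loadings <- pca$rotation  # isoforms x PCs (the PC loading matrix)
pc_scores   <- pca$x         # samples  x PCs (the PC score matrix)

# Each PC score is the loading-weighted sum of the centred, scaled isoforms.
recomputed <- scale(expr) %*% pc_loadings
stopifnot(isTRUE(all.equal(unname(recomputed), unname(pc_scores))))

dim(pc_scores)
```

The final check confirms that multiplying the standardized expression matrix by the loading matrix reproduces the PC scores, which is the linear-combination view of PCA used throughout this tutorial.
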
Each column of the PC loading matrix defines a particular linear combination of the top variable isoforms, called a meta-isoform, and stores the corresponding combination coefficients. Together, the meta-isoforms form the coordinates of the new low-dimensional space.

In the second step, a differential PC score analysis is conducted along each of the 50 PC coordinates, based on the mutation status of the studied gene, using the Wilcoxon rank-sum test. Following the differential PC score analysis, a gene is called significant (termed an iSoMAs gene) if the minimum of its 50 p-values, defined as `minP = min{P1,P2,...,P50}`, is smaller than a predefined threshold (i.e., `minP<1e-3`).

```{r pressure, echo=FALSE, fig.cap="", fig.align="center", out.width = '95%'}
knitr::include_graphics("iSoMAs_pipeline.png")
```

In this tutorial, we guide users step by step through executing iSoMAs and visualizing and interpreting its results. This tutorial was implemented on R version 4.3.1 (2023-06-16) on macOS Monterey 12.6.5. The latest development version of iSoMAs can be downloaded from GitHub and installed from source with `devtools::install_github('elnitskilab/iSoMAs')`. After installation, we load the package as below:

```{r setup}
library(iSoMAs)
```

# Run iSoMAs

Four datasets are required as iSoMAs inputs: an isoform expression matrix (`data.iso`), a gene somatic mutation table in MAF (mutation annotation format) (`data.maf`), and two mapping files between gene names and their associated isoforms (`gene_to_iso`, `iso_to_gene`). iSoMAs automatically determines a qualified list of genes to test (`genes_to_test`) if none is designated. In this demo, we test only the first 100 genes in `genes_to_test`, so the results may differ slightly from those in the manuscript. An example of the four datasets, collected from the TCGA-LUAD project, has been incorporated into this package and can be used directly.
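
Before running the packaged function, it may help to see the differential PC score analysis (the second step of the workflow) on simulated data. The following is a minimal sketch under stated assumptions, not the package's implementation: the PC scores are random numbers, the mutation status is hypothetical, a shift is deliberately planted along `PC_3`, and all variable names are chosen for illustration.

```r
set.seed(42)
n_samples <- 60
n_pcs     <- 50

# Simulated PC score matrix (samples x PCs); not real data.
pc_scores <- matrix(rnorm(n_samples * n_pcs), nrow = n_samples,
                    dimnames = list(NULL, paste0("PC_", seq_len(n_pcs))))

# Hypothetical mutation status of one gene: the first 20 samples are mutated.
is_mutated <- rep(c(TRUE, FALSE), times = c(20, 40))

# Plant a shift along PC_3 in the mutated samples so there is a signal to find.
pc_scores[is_mutated, "PC_3"] <- pc_scores[is_mutated, "PC_3"] + 3

# Wilcoxon rank-sum test of mutated vs. wild-type scores along each PC.
pvals <- apply(pc_scores, 2, function(score) {
  wilcox.test(score[is_mutated], score[!is_mutated])$p.value
})

# minP = min{P1, P2, ..., P50}; the gene is called if minP < 1e-3.
minP      <- min(pvals)
best_pc   <- names(pvals)[which.min(pvals)]
is_isomas <- minP < 1e-3
```

In this toy setting `best_pc` identifies the coordinate carrying the planted shift, and `is_isomas` records whether the gene would pass the `minP<1e-3` threshold described in the Introduction.
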
Users can also download the four datasets from the following sites and load them into the workspace manually:

- data.iso [data.frame: 73,599 × 576]: https://drive.google.com/file/d/1pE4L7mkuUy_Ry6-9xMP7acx4ElSLibcD/view?usp=share_link
- data.maf [data.frame: 208,180 × 120]: https://drive.google.com/file/d/14inFBJhHItABaA0-1gs5z11gO7NeflY_/view?usp=share_link
- gene_to_iso [list of list: 29,181 elements], iso_to_gene [list: 73,599 elements]: https://drive.google.com/file/d/1_Ua0tCoLbtgIgn3Lfemz3OfEzz2hl4Z9/view?usp=share_link

```{r warning=FALSE}
res_isomas = iSoMAs(data.iso,data.maf,gene_to_iso,iso_to_gene,
                    genes_test=NULL, stat.method="wilcox", ntop.gene=100,
                    minP.sorted = TRUE, filename = "res_iSoMAs_TCGA-LUAD.RData")
```

The iSoMAs output, `res_isomas`, is a 21-entry list containing the input arguments together with the PCA and differential PC score analysis results:

```{r}
str(res_isomas,max.level = 1)
```

The first entry of the iSoMAs output, `pca`, is a SeuratObject generated by the `Seurat` package, in which `cell.embeddings` refers to the PC score matrix with samples in rows and PC coordinates in columns, and `feature.loadings` refers to the PC loading matrix with features (isoforms) in rows and PC coordinates in columns. PC coordinates are denoted as {PC_1, PC_2, ..., PC_50}.

```{r}
str(res_isomas$pca,max.level = 2)
print(res_isomas$pca@cell.embeddings[1:5,1:3])
print(res_isomas$pca@feature.loadings[1:5,1:3])
```

The `pvals_all_sorted` entry lists the tested genes ranked by `minP`.

```{r warning=TRUE}
nc = ncol(res_isomas$pvals_all_sorted)
print(head(res_isomas$pvals_all_sorted[,c(1:4,(nc-1):nc)]))
```

# Visualize and interpret iSoMAs output

In the following sections, we use simple examples to show how to interpret the iSoMAs output. Since TP53 was detected as an iSoMAs gene in most cancer types (18 out of 33), we use TP53 as the example wherever applicable.
First, we extract the p-value table and determine the PC coordinates along which TP53 was significant in the differential PC score analysis, for subsequent use:

```{r warning=TRUE}
myGene.mut = "TP53"
project = res_isomas$project
pvals_all = res_isomas$pvals_all
pvals_sig = get_pvals_sig(pvals_all,myGene = myGene.mut)
PCs_sig = colnames(pvals_sig)
```

## Plot the top iSoMAs genes detected in a specific cancer type

For each gene tested, the `minP` along all 50 PC coordinates is first obtained. Then, all tested genes are ranked based on their `minP`. Only genes with `minP