## AML Data Processing Notebook

This notebook processes the AML dataset obtained from [GEO (GSE116256)](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE116256) and saves it as a single count matrix.

### Data Sources
- The count matrix and annotations were downloaded from [GEO (GSE116256)](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE116256).
- The compressed files are stored in:  
  `/scMEDAL_for_scRNAseq/Experiments/data/AML_data/zip_files`
- The processed count matrix is saved in:  
  `/scMEDAL_for_scRNAseq/Experiments/data/AML_data/adata_merged`


Environment: `preprocess_and_plot_umaps_env`



In [1]:
import os
import sys

# Add the parent directory to the Python path so that paths_config can be imported
sys.path.append("../")
from paths_config import data_base_path

from scMEDAL.utils.preprocessing_utils import AML_data_reader
from scMEDAL.utils.utils import save_adata

data_base_path: /endosome/archive/bioinformatics/DLLab/src/AixaAndrade/gitfront/scMEDAL_for_scRNAseq/Experiments/AML/../data/AML_data
outputs_path: /endosome/archive/bioinformatics/DLLab/src/AixaAndrade/gitfront/scMEDAL_for_scRNAseq/Experiments/AML/../outputs/AML_outputs


In [3]:

# The count matrices and annotations were downloaded from GEO:
# https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE116256
# The compressed files are stored under /scMEDAL_for_scRNAseq/Experiments/data/AML_data/zip_files
# Path to the directory containing the compressed files
parent_path = os.path.join(data_base_path, "zip_files")

# 1. Read the data
AML_reader = AML_data_reader(parent_path)
# Build a DataFrame with the matrix/annotation paths and metadata of each sample
df_paths = AML_reader.get_df_paths()
df_paths


         id Patient_group  counts
0   AML1012           AML       1
1   AML210A           AML       1
2    AML314           AML       2
3    AML328           AML       4
4    AML329           AML       3
5    AML371           AML       2
6   AML419A           AML       1
7   AML420B           AML       3
8    AML475           AML       2
9    AML556           AML       3
10  AML707B           AML       5
11  AML722B           AML       2
12   AML870           AML       2
13   AML916           AML       1
14  AML921A           AML       1
15   AML997           AML       2
16      BM1       control       1
17      BM2       control       1
18      BM3       control       1
19      BM4       control       1
20      BM5       control       2
21    MUTZ3      cellline       1
22      OCI      cellline       1
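
The table above looks like a count of `df_paths` rows per `id` and `Patient_group`; its `counts` column sums to 43, which matches the number of AnnData objects created later in the notebook. The notebook does not show how the table is produced, so the following is only a minimal sketch of one way to compute such a summary with pandas. It uses the ten `df_paths` rows previewed in this cell's output; `df_head` and `counts_per_id` are illustrative names, not part of `AML_data_reader`.

```python
import pandas as pd

# The ten rows previewed in the df_paths output (only the columns needed here).
df_head = pd.DataFrame(
    {
        "id": ["AML328", "AML420B", "AML314", "AML556", "AML314",
               "AML371", "AML210A", "AML707B", "AML475", "AML707B"],
        "Day": ["D0", "D14", "D0", "D15", "D31",
                "D34", "D0", "D41", "D0", "D113"],
        "Patient_group": ["AML"] * 10,
    }
)

# Count how many samples (rows) each id contributes.
counts_per_id = (
    df_head.groupby(["id", "Patient_group"])
    .size()
    .reset_index(name="counts")
)
print(counts_per_id)
```

Because only the preview rows are used, the counts here are lower than in the full table (for example, AML328 appears once in the preview but four times overall).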


| | matrix_path | id | file_note | accession_matrix_num | anno_path | accession_anno_num | Day | unique_id | Patient_group |
|---|---|---|---|---|---|---|---|---|---|
| 0 | /endosome/archive/bioinformatics/DLLab/src/Aix... | AML328 | D0 | GSM3587931 | /endosome/archive/bioinformatics/DLLab/src/Aix... | GSM3587932 | D0 | AML328_D0 | AML |
| 1 | /endosome/archive/bioinformatics/DLLab/src/Aix... | AML420B | D14 | GSM3587955 | /endosome/archive/bioinformatics/DLLab/src/Aix... | GSM3587956 | D14 | AML420B_D14 | AML |
| 2 | /endosome/archive/bioinformatics/DLLab/src/Aix... | AML314 | D0 | GSM3587927 | /endosome/archive/bioinformatics/DLLab/src/Aix... | GSM3587928 | D0 | AML314_D0 | AML |
| 3 | /endosome/archive/bioinformatics/DLLab/src/Aix... | AML556 | D15 | GSM3587965 | /endosome/archive/bioinformatics/DLLab/src/Aix... | GSM3587966 | D15 | AML556_D15 | AML |
| 4 | /endosome/archive/bioinformatics/DLLab/src/Aix... | AML314 | D31 | GSM3587929 | /endosome/archive/bioinformatics/DLLab/src/Aix... | GSM3587930 | D31 | AML314_D31 | AML |
| 5 | /endosome/archive/bioinformatics/DLLab/src/Aix... | AML371 | D34 | GSM3587948 | /endosome/archive/bioinformatics/DLLab/src/Aix... | GSM3587949 | D34 | AML371_D34 | AML |
| 6 | /endosome/archive/bioinformatics/DLLab/src/Aix... | AML210A | D0 | GSM3587925 | /endosome/archive/bioinformatics/DLLab/src/Aix... | GSM3587926 | D0 | AML210A_D0 | AML |
| 7 | /endosome/archive/bioinformatics/DLLab/src/Aix... | AML707B | D41 | GSM3587975 | /endosome/archive/bioinformatics/DLLab/src/Aix... | GSM3587976 | D41 | AML707B_D41 | AML |
| 8 | /endosome/archive/bioinformatics/DLLab/src/Aix... | AML475 | D0 | GSM3587959 | /endosome/archive/bioinformatics/DLLab/src/Aix... | GSM3587960 | D0 | AML475_D0 | AML |
| 9 | /endosome/archive/bioinformatics/DLLab/src/Aix... | AML707B | D113 | GSM3587971 | /endosome/archive/bioinformatics/DLLab/src/Aix... | GSM3587972 | D113 | AML707B_D113 | AML |
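
Each `df_paths` row pairs a count matrix with an annotation file, each with its own GSM accession, and `unique_id` joins `id` and `Day` with an underscore. How `AML_data_reader.get_df_paths` derives these fields is not shown here. The sketch below illustrates one possible approach, assuming the GEO file names follow the layout `<GSM accession>_<id>-<day>.<dem|anno>.txt.gz`; that layout, the regular expression, and the helper `parse_geo_filename` are assumptions for illustration only.

```python
import re

# Assumed file-name layout: <GSM accession>_<sample id>-<day>.<dem|anno>.txt.gz
# "dem" is taken to mark the count matrix and "anno" the cell annotations.
FILENAME_PATTERN = re.compile(
    r"^(?P<accession>GSM\d+)_(?P<id>[^-]+)-(?P<day>[^.]+)\.(?P<kind>dem|anno)\.txt\.gz$"
)


def parse_geo_filename(name):
    """Return the parsed fields as a dict, or None if the name does not match."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    info = match.groupdict()
    # unique_id combines the sample id and the day, as in the df_paths table.
    info["unique_id"] = f"{info['id']}_{info['day']}"
    return info


# Hypothetical file names built from the accessions shown for AML328 at D0.
example_files = [
    "GSM3587931_AML328-D0.dem.txt.gz",
    "GSM3587932_AML328-D0.anno.txt.gz",
]
records = [parse_geo_filename(name) for name in example_files]
for record in records:
    print(record)
```

File names that do not fit this layout (for example, a sample without a day suffix) return `None` and would need their own handling.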


In [4]:
# Create a dict of adata objects
adata_dict = AML_reader.create_adata_dict(df_paths)
print(f"Created {len(adata_dict)} AnnData objects.")




Created 43 AnnData objects.


In [5]:
merged_adata = AML_reader.merge_adata_objects(adata_dict)



In [6]:
merged_adata

AnnData object with n_obs × n_vars = 41090 × 27899
    obs: 'Cell', 'NumberOfReads', 'AlignedToGenome', 'AlignedToTranscriptome', 'TranscriptomeUMIs', 'NumberOfGenes', 'CyclingScore', 'CyclingBinary', 'MutTranscripts', 'WtTranscripts', 'PredictionRF2', 'PredictionRefined', 'CellType', 'Score_HSC', 'Score_Prog', 'Score_GMP', 'Score_ProMono', 'Score_Mono', 'Score_cDC', 'Score_pDC', 'Score_earlyEry', 'Score_lateEry', 'Score_ProB', 'Score_B', 'Score_Plasma', 'Score_T', 'Score_CTL', 'Score_NK', 'NanoporeTranscripts', 'id', 'Day', 'unique_id', 'Patient_group'
    var: 'Gene'

In [7]:
merged_adata.obs

| | Cell | NumberOfReads | AlignedToGenome | AlignedToTranscriptome | TranscriptomeUMIs | NumberOfGenes | CyclingScore | CyclingBinary | MutTranscripts | WtTranscripts | ... | Score_B | Score_Plasma | Score_T | Score_CTL | Score_NK | NanoporeTranscripts | id | Day | unique_id | Patient_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AML328-D0_AAAAACAGAAGT | 24994 | 15391 | 7477 | 1236 | 581 | -0.351 | no | | | ... | 0.042 | 0.009 | 0.132 | 0.184 | 0.447 | | AML328 | D0 | AML328_D0 | AML |
| 1 | AML328-D0_AAAACCGCTACT | 55122 | 34633 | 17252 | 3394 | 1238 | -0.409 | no | | | ... | 0.071 | 0.020 | 0.070 | 0.052 | 0.037 | | AML328 | D0 | AML328_D0 | AML |
| 2 | AML328-D0_AAAACCGGCTTT | 43393 | 26813 | 16148 | 2649 | 1243 | -0.401 | no | | | ... | 0.062 | 0.052 | 0.046 | 0.032 | 0.034 | | AML328 | D0 | AML328_D0 | AML |
| 3 | AML328-D0_AAAAGCTTATCA | 25085 | 15404 | 9483 | 1582 | 633 | -0.378 | no | | | ... | 0.060 | 0.009 | 0.491 | 0.167 | 0.080 | | AML328 | D0 | AML328_D0 | AML |
| 4 | AML328-D0_AAAAGTCCCCGT | 54911 | 33226 | 20545 | 3280 | 1376 | -0.629 | no | | | ... | 0.012 | 0.003 | 0.006 | 0.010 | 0.009 | | AML328 | D0 | AML328_D0 | AML |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 41085 | AML328-D171_CCATCATCCACC | 26414 | 19298 | 13674 | 2516 | 939 | -0.257 | no | | | ... | 0.058 | 0.016 | 0.464 | 0.135 | 0.109 | | AML328 | D171 | AML328_D171 | AML |
| 41086 | AML328-D171_TTTTATCATTCT | 27460 | 20073 | 9433 | 1651 | 878 | -0.524 | no | | | ... | 0.066 | 0.023 | 0.281 | 0.226 | 0.213 | | AML328 | D171 | AML328_D171 | AML |
| 41087 | AML328-D171_AAGATGTAGCGT | 12394 | 8787 | 6543 | 1331 | 504 | -0.348 | no | | | ... | 0.088 | 0.019 | 0.136 | 0.074 | 0.041 | | AML328 | D171 | AML328_D171 | AML |
| 41088 | AML328-D171_CTGTAGCTCCTA | 19172 | 13904 | 10351 | 1792 | 745 | -0.394 | no | | | ... | 0.107 | 0.015 | 0.121 | 0.045 | 0.052 | | AML328 | D171 | AML328_D171 | AML |
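
In the rows shown above, the prefix of each `Cell` barcode (for example `AML328-D0`) corresponds to the `unique_id` column (`AML328_D0`) once the hyphen is replaced by an underscore. A quick consistency check along these lines can catch annotation mix-ups after merging. The sketch below runs on four rows copied from the preview; `obs_sample` is an illustrative stand-in for `merged_adata.obs`, and whether the rule holds for every sample in the dataset is an assumption that has not been verified here.

```python
import pandas as pd

# Four rows copied from the obs preview (only the columns needed for the check).
obs_sample = pd.DataFrame(
    {
        "Cell": [
            "AML328-D0_AAAAACAGAAGT",
            "AML328-D0_AAAACCGCTACT",
            "AML328-D171_CCATCATCCACC",
            "AML328-D171_TTTTATCATTCT",
        ],
        "unique_id": ["AML328_D0", "AML328_D0", "AML328_D171", "AML328_D171"],
    }
)

# Take the part of the barcode before the first underscore, e.g. "AML328-D0".
prefix = obs_sample["Cell"].str.split("_", n=1).str[0]
# Convert it to the unique_id convention, e.g. "AML328_D0".
derived_id = prefix.str.replace("-", "_", regex=False)

# Rows whose barcode prefix disagrees with the recorded unique_id.
mismatches = obs_sample[derived_id != obs_sample["unique_id"]]
# Number of cells contributed by each sample.
cells_per_sample = obs_sample["unique_id"].value_counts()

print(f"Mismatching rows: {len(mismatches)}")
print(cells_per_sample)
```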


In [14]:
# Set to True to save the merged AnnData object
# The merged count matrix is saved under /scMEDAL_for_scRNAseq/Experiments/data/AML_data/adata_merged
save_data = False
if save_data:
    # Save the merged AnnData object
    save_adata(merged_adata, output_path=os.path.join(data_base_path, "adata_merged"))

Created folder: /archive/bioinformatics/DLLab/AixaAndrade/data/Genomic_data/VanGallen_2019/adata_merged
