HackAI-Dell-Nvidia-Challenge / final_notebook.ipynb
import json
import os

from Bio import Entrez
from datasets import load_dataset
from llama_index.core import SimpleDirectoryReader, TreeIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

## load llm
llm = Ollama(model="llama3.2", request_timeout=180.0)

## load embedding model
embed_model = OllamaEmbedding(model_name="llama3.2")



## get list of species

import random

Entrez.email = "posimreddy.anishkumar@example.com"  # Always tell NCBI who you are

def fetch_species_list(taxon_id, num_species=25):
    """Return scientific names of randomly sampled species under taxon_id."""
    # Search the taxonomy database for nodes in the subtree (capped at 10,000 IDs)
    handle = Entrez.esearch(db="taxonomy", term=f"txid{taxon_id}[Subtree]", retmax=10000)
    record = Entrez.read(handle)
    handle.close()
    species_ids = record["IdList"]

    # Randomly sample species if there are more than requested
    if len(species_ids) > num_species:
        species_ids = random.sample(species_ids, num_species)

    species_list = []
    for taxid in species_ids:
        handle = Entrez.efetch(db="taxonomy", id=taxid, retmode="xml")
        records = Entrez.read(handle)
        handle.close()
        # Keep only species-rank records. Sampled IDs of other ranks are dropped,
        # so fewer than num_species names may be returned.
        if records[0]['Rank'] == 'species':
            species_list.append(records[0]['ScientificName'])

    return species_list

# List of major taxonomic groups with their NCBI Taxonomy IDs
taxonomic_groups = {
    "Mammals": 40674,
    "Birds": 8782,
    "Reptiles": 8504,
    "Amphibians": 8292,
    "Fish": 7898,
    "Insects": 50557,
    "Plants": 33090,
    "Fungi": 4751,
    "Bacteria": 2,
    "Viruses": 10239
}

all_species = []

for group, taxid in taxonomic_groups.items():
    print(f"Fetching species from {group}...")
    species = fetch_species_list(taxid, num_species=25)
    all_species.extend(species)
    print(f"Fetched {len(species)} species from {group}")

print(f"Total species fetched: {len(all_species)}")

# Save the species list to a file
with open("species_list.txt", "w") as f:
    for species in all_species:
        f.write(f"{species}\n")

print("Species list saved to species_list.txt")

Fetching species from Mammals...
Fetched 17 species from Mammals
Fetching species from Birds...
Fetched 15 species from Birds
Fetching species from Reptiles...
Fetched 22 species from Reptiles
Fetching species from Amphibians...
Fetched 25 species from Amphibians
Fetching species from Fish...
Fetched 25 species from Fish
Fetching species from Insects...
Fetched 24 species from Insects
Fetching species from Plants...
Fetched 23 species from Plants
Fetching species from Fungi...
Fetched 23 species from Fungi
Fetching species from Bacteria...
Fetched 23 species from Bacteria
Fetching species from Viruses...
Fetched 25 species from Viruses
Total species fetched: 222
Species list saved to species_list.txt
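Several groups above return fewer than 25 names because `fetch_species_list` samples 25 IDs first and only afterwards discards records whose rank is not `species`. One possible way around this is sketched below: shuffle the IDs and keep drawing until enough species-rank records are found. The `get_rank_and_name` callable is a hypothetical stand-in for the `Entrez.efetch` lookup, injected so the logic can be checked offline; `fake_lookup` is toy data, not an NCBI response.

```python
import random


def sample_species(ids, get_rank_and_name, num_species=25, seed=None):
    """Return up to num_species names whose rank is 'species'.

    get_rank_and_name is a hypothetical callable: taxid -> (rank, name).
    In the notebook it would wrap Entrez.efetch; here it is injected.
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)

    names = []
    for taxid in shuffled:
        if len(names) >= num_species:
            break  # enough species collected, stop querying
        rank, name = get_rank_and_name(taxid)
        if rank == "species":
            names.append(name)
    return names


# Toy lookup standing in for NCBI: even IDs are species, odd IDs are genera.
def fake_lookup(taxid):
    return ("species" if taxid % 2 == 0 else "genus", f"taxon_{taxid}")


picked = sample_species(range(100), fake_lookup, num_species=25, seed=0)
print(len(picked))
```

This still issues one lookup per candidate ID, so against the real service it would make more requests than the original whenever non-species IDs are drawn.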
with open('species_list.txt', 'r') as file:
    # Read the file contents
    data = file.read()
    # Split the contents into a list
    species_list = data.splitlines()


for k in species_list:
    print(k)
Oligoryzomys aff. microtis MN76206
Microcebus sp. Antanosy
Wiedomys cerradensis
Saccostomus sp. 13822
Rhinolophus cf. siamensis sensu Tu et al. 2017
Niviventer sp. ABBM166-05
Typhlomys chapensis
Crocidura cf. neglecta EBD31634M
Scotozous dormeri
Rhinolophus cf. macrotis Phia Oac VTT-2017
Hypsugo bemainty
Niviventer sacer
Neophocaena sp. ZSIWGRC_3626
Surdisorex polulus
Alticola strelzovi
Pipistrellus sp. Be_2137_9
Hipposideros cf. abae ZMMU S-189528
Charadrius leschenaultii
Phylloscartes kronei
Spinus santaecrucis
Urosticte ruficrissa
Diglossa duidae
Platysteira peltata
Anabazenops fuscus
Scytalopus chocoensis
Horornis fortipes
Euphonia luteicapilla
Gymnobucco peli
Picumnus squamulatus
Threnetes niger
Philemon kisserensis
Platyrinchus flavigularis
Leptophis cupreus
Hydrophis parviceps
Tympanocryptis cf. lineata LPS-2008
Phelsuma comorensis
Phelsuma punctulata
Phelsuma pasteuri
Hemidactylus foudaii
Hemidactylus whitakeri
Macrocalamus tweediei
Phyllodactylus tuberculosus
Brookesia decaryi
Craspedocephalus macrolepis
Anolis sp. AB-2015
Tropidurus azurduyae
Hemiphyllodactylus sp. IN 5
Cyrtodactylus loriae
Demansia psammophis
Anolis cf. alocomyos GK-2015
Philothamnus macrops
Grandidierina rubrocaudata
Phrynocephalus ahvazicus
Amphisbaena gonavensis
Pelophylax cf. bedriagae 'Cilician West'
Polypedates cf. mutus KUHE 32448
Pristimantis sp. MZUTI 3764
Mixophyes fleayi
Leucostethus sp. CZPD UV 5280
Aphantophryne cf. pansa 3 CJF-2021
Pristimantis sp. 3 aff. cruentus AB419
Cornufer trossulus
Ptychadena cf. mossambica BMNH 2018.5754
Pristimantis aff. malkini 1 AR-2023
Megophrys sp. ENS 16749
Astylosternus perreti
Occidozyga cf. laevis KU 328890
Dendropsophus garagoensis
Physalaemus sp. ZUEC 22695
Cornufer exedrus
Microhyla aurantiventris
Centrolene huilense
Leptopelis parkeri
Raorchestes theuerkaufi
Adenomera diptyx group sp. ABGD cluster 12
Boana cf. geographica AF-2016
Atelopus exiguus
Occidozyga cf. laevis KU 323828
Pristimantis samaniegoi
Planiliza sp. BIF1343
Gobiidae sp. HB-250417-38A
Sebastes sp. THL_58_10
Coregonus sp. Fish748
Hypseleotris sp. e MM-2022
Lithoxus aff. planquettei GF06-504
Papyrocranus cf. afer IHB049
Spratelloides cf. delicatulus G1B-050616-4
Chaunax sp. NMMB-P25201
Halichoeres scapularis complex sp. k62
Oxycheilinus sp. S0027_003
Gerres cf. oyena G1A-250417-5A
Aphyosemion sp. JFA85
Danio aff. dangila RC0561
Labeo sp.
Petulanos sp. USNM FISH 449161
Leptojulis urostigma
Coregonus sp. Fish683
Plectrogenium kamoharai
Acipenser sp. YT-2021
Pimelodella aff. cristata GF06-436
homodiploid Cyprinus carpio x Megalobrama amblycephala F5
Rhadinoloricaria cf. condei ALB-2022
Champsodon sp. CBM:ZF:21217
Pelasgus cf. epiroticus Ps_Alc133
Coarctotermes sp. E MLW-2023a
Naupactus sp.
Automeris bahamata
Euryparyphes bolivari
Stephostethus sp.
Seirotrana sp.
Phyllocnistis sp. 7 YL-2023a
Acrotelsella obscura
Zophophilus sp.
Cicadellinae sp.
Miresa sagitovae
Anomophysis coxalis
Plecoptera recta
Monochamus millegranus
Macrotoma sp.
Persis foveatis
Sciobia lusitanica
Anastrepha fraterculus complex sp. Brazil-2
Metriogryllacris amitarum
Retipenna sp. 1 JYW-2024a
Orthaga aenescens
Psychomyia extensa
Ameletus sp. 2021-044
Paectes sp. CR1
Radlkoferotoma berroi
Palisota repens
Ziziphus elegans
Astragalus perianus
Astragalus kuschkensis
Monnina padifolia
Microsorum siamense
Anisocycla sp.
Periploca floribunda
Eulophia bicallosa
Distimake flagellaris
Puccinellia subspicata
Astragalus iskanderi
Sphagnum sp. 39762
Poupartia silvatica
Rosa pisiformis
Lupinus holosericeus
Astragalus rubtzovii
Carteria sp. XW-7-4
Acacia granitica
Cassia hintonii
Astragalus gobicus
Aeschynomene gazensis
Battarreoides diguetii
Melanogaster sp. 'CA04'
Mycopan sp. 'OH01'
Sarocladiaceae sp. ARC_SEC3
Hypotarzetta sp.
Polyporales sp. 'PNW01'
Elaphomyces sp. El01
Parahypoxylon ruwenzoriense
Calonectria exiguispora
Aptrootia sp. VP-2023a
Morchella sp. Mel-47
Trichoderma sp. 'CA01'
Clavaria sp. 'californica-01'
Mesophelliaceae sp. ND-2024c
Russula shigatseensis
Rhizopogon argillaceus
Xanthoconium sp. 'aff. maculosus'
Plectania sp. JL-2024a
Entomocorticium macrovesiculatum
Adustoporia sp. 'sinuosa-CA01'
Trichoderma sp. PKRF1
Pseudocercospora sp. 'taxon JZG-2024-042'
Hysterangium sp. 'CA03'
Streptomyces sp. NPDC096079
Nostocales cyanobacterium Esc15.1
Nostoc sp. P8395
Gammaproteobacteria bacterium MCH_2_109
Aphanizomenonaceae cyanobacterium CYA1
Shigella sp. PNUSAE172874
Erwinia sp. P6884
Nostoc sp. P8769
Streptomyces sp. NPDC048669
Fodinicurvata sp. EGI_FJ10296
Actinokineospora sp. NPDC004072
Nonomuraea sp. NPDC001684
uncultured Syntrophales bacterium
Roseibium sp. SCPC14
Anaeromyxobacter sp. 1620B
Streptomyces sp. NPDC006147
Klebsiella sp. JB_Kp009
Nostoc sp. P8846
Nostoc sp. P8309
Streptomyces sp. NPDC048275
Streptomyces sp. NPDC004051
Nostoc sp. P9090
Shigella sp. FJ201001
Drosophila Sunshine bunyavirus
Pseudomonas phage vB_PaWP1
Pseudomonas phage vB_PaeP-F1Pa
Whitefly negevirus 1
Sopalaj virus
Taphozous bat picornavirus 3
Pacmanvirus lupus
Pekapeka alphacoronavirus 1
Streptomyces phage Enygma
Microcystis phage Mae-JY09
Agrobacterium phage Alfirin
Yersinia phage vB_YpM_MHG38
Lactococcus phage D6867
Mycobacterium phage Discoknowium
Klebsiella phage phi1_146037
Parvovirus ficedula5006
Faxonius propinquus nudivirus
Sopanyl virus
Micrococcus phage Olihed
torque teno Delphinidae virus 44
Ripabris virus
Ripadyr virus
Ophiocordyceps sinensis narnavirus 4
Coleura bat reovirus
Chrocamilt virus

## fetch species taxonomy

Entrez.email = "posimreddy.anishkumar@example.com"  # Always tell NCBI who you are

def fetch_and_save_taxonomy(species_list, output_dir):
    """Write one text file per species containing its NCBI taxonomy record."""
    os.makedirs(output_dir, exist_ok=True)

    for species in species_list:
        filename = f"{species.replace(' ', '_')}.txt"
        filepath = os.path.join(output_dir, filename)

        # Look up the taxonomy ID by name
        handle = Entrez.esearch(db="taxonomy", term=species)
        record = Entrez.read(handle)
        handle.close()
        if not record["IdList"]:
            print(f"No taxonomy found for {species}")
            continue
        taxid = record["IdList"][0]

        # Fetch the full record for that ID
        handle = Entrez.efetch(db="taxonomy", id=taxid, retmode="xml")
        records = Entrez.read(handle)
        handle.close()

        with open(filepath, 'w') as f:
            f.write(f"Species: {species}\n")
            f.write(f"Taxonomy ID: {records[0]['TaxId']}\n")
            f.write(f"Rank: {records[0]['Rank']}\n")
            f.write(f"Lineage: {records[0]['Lineage']}\n")
            
            f.write("\nClassification:\n")
            for item in records[0]['LineageEx']:
                f.write(f"{item['Rank']}: {item['ScientificName']}\n")
            
            f.write("\nOther Names:\n")
            if 'OtherNames' in records[0]:
                for name in records[0]['OtherNames']:
                    if isinstance(name, dict) and 'Name' in name:
                        f.write(f"{name.get('NameClass', 'Unknown')}: {name['Name']}\n")

        print(f"Saved information for {species} to {filepath}")

# Example usage
# species_list = ["Homo sapiens", "Escherichia coli", "Drosophila melanogaster"]
output_dir = "species_data"
fetch_and_save_taxonomy(species_list, output_dir)

Saved information for Miniopterus bat coronavirus 1 to species_data/Miniopterus_bat_coronavirus_1.txt
Saved information for Miniopterus bat coronavirus HKU8 to species_data/Miniopterus_bat_coronavirus_HKU8.txt
Saved information for Rhinolophus bat coronavirus HKU2 to species_data/Rhinolophus_bat_coronavirus_HKU2.txt
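Some names in the species list contain characters such as `:` and `'` (for example `Champsodon sp. CBM:ZF:21217`), and `species.replace(' ', '_')` leaves them in the file name. A sketch of a stricter sanitizer follows; `safe_filename` is a hypothetical helper, not part of the notebook.

```python
import re


def safe_filename(species, ext=".txt"):
    """Build a portable file name from a species name (hypothetical helper)."""
    # Replace every run of characters outside [A-Za-z0-9._-] with one underscore,
    # then trim underscores left at either end.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", species).strip("_")
    return f"{stem}{ext}"


print(safe_filename("Champsodon sp. CBM:ZF:21217"))
print(safe_filename("Melanogaster sp. 'CA04'"))
```

Two distinct names can map to the same file name under this scheme, so a collision check may be needed if it is applied to a larger list.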
## load all species
documents = SimpleDirectoryReader(
    "species_data"
).load_data()
## initialize Tree Index
index = TreeIndex.from_documents(
    documents,
    llm=llm,
    embed_model=embed_model,
    show_progress=True,  # enables the progress bars shown below
)

Parsing nodes:   0%|          | 0/224 [00:00<?, ?it/s]



Generating summaries:   0%|          | 0/23 [00:00<?, ?it/s]



Generating summaries:   0%|          | 0/3 [00:00<?, ?it/s]
## Save Index (run once after building the index, then leave commented out)
# index.storage_context.persist(persist_dir="tree_index")

## Initialize Query Engine
from llama_index.core import StorageContext, load_index_from_storage

# Rebuild the storage context from the persisted directory;
# this requires that "tree_index" already exists on disk
storage_context = StorageContext.from_defaults(persist_dir="tree_index")

# Load the index
index = load_index_from_storage(storage_context, llm=llm)

query_engine = index.as_query_engine(llm=llm)
#q1
response = query_engine.query("What is the lowest common ancestor of Whitefly negevirus 1 and Sopalaj virus")
print(response)

It appears that both viruses belong to a group called Leviviricetes. Within this group, there's another classification called unclassified Leviviricetes. This suggests that the lowest common ancestor might be within this unclassified subgroup.
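Because each file in `species_data` stores the NCBI lineage on a `Lineage:` line, the lowest common ancestor can also be computed directly from two files, which gives a deterministic cross-check on the model's answer. A minimal sketch, assuming the file layout written by `fetch_and_save_taxonomy` and lineage entries separated by `;` and ordered root first. The two sample texts are toy placeholders, not real records.

```python
def parse_lineage(file_text):
    """Extract the lineage as a root-first list from one saved file's text."""
    for line in file_text.splitlines():
        if line.startswith("Lineage:"):
            value = line[len("Lineage:"):].strip()
            return [part.strip() for part in value.split(";") if part.strip()]
    return []


def lowest_common_ancestor(lineage_a, lineage_b):
    """Return the deepest entry shared by both lineages, or None."""
    lca = None
    for a, b in zip(lineage_a, lineage_b):
        if a != b:
            break  # lineages diverge here
        lca = a
    return lca


# Toy placeholder records in the notebook's file layout (not real NCBI data)
text_a = "Species: X\nTaxonomy ID: 1\nRank: species\nLineage: Root; Clade1; Clade2; GenusA\n"
text_b = "Species: Y\nTaxonomy ID: 2\nRank: species\nLineage: Root; Clade1; Clade3\n"

print(lowest_common_ancestor(parse_lineage(text_a), parse_lineage(text_b)))
```

In the notebook, `text_a` and `text_b` would be the contents of the two species files read from `species_data`.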
#q2
response = query_engine.query("What is the most diverse order or family within Fungi")
print(response)

The Boletales order is notable for having a high level of diversity in terms of its classification and distribution among fungi. This order comprises numerous families with distinct characteristics and habitats, resulting in a broad range of species within its ranks. Some families, such as the Suillineae suborder, exhibit specialized traits that set them apart from other fungal groups.
#q3
response = query_engine.query("Which taxonomic groups have the highest proportion of monotypic genera (genera with only one species)?")
print(response)

Among the taxonomic groups listed, Dipnotetrapodomorpha has a notable number of monotypic genera. However, it is not clear if this group has the highest proportion of monotypic genera compared to others.

Dipnotetrapodomorpha, in particular, contains the genus Boana, which consists only of one species, Boana geographica AF-2016.
#q4
response = query_engine.query("How does the taxonomic structure differ between major groups (e.g., plants vs. animals)?")
print(response)

The taxonomic structure of living organisms differs significantly between major groups. In general, both plants and animals follow a hierarchical system based on their classification, but with some notable differences.

For plants, which include the Periploca floribunda in question, the hierarchy typically starts with the individual organism and progresses to increasingly broader categories, such as species, genera, families, orders, classes, phyla, divisions, kingdoms, and domains. In contrast, the hierarchy for animals tends to start with a more general classification, often using trichotomies (three-level systems) or dichotomies (two-level systems).

In plants, each rank in the taxonomy represents an increasingly broader group of organisms that share common characteristics. For example, a species is a group of related organisms that can interbreed and produce fertile offspring, while a genus includes multiple species that share similar characteristics.

On the other hand, animals are classified using a hierarchical system that starts with phyla (divisions) and then proceeds to classes, orders, families, genera, and species. This structure allows for a more detailed classification of organisms based on their evolutionary relationships and shared characteristics.

In summary, while both plants and animals follow a hierarchical taxonomic structure, the specific ranks and grouping schemes differ between these two groups, reflecting their unique biological characteristics and evolutionary histories.
#q5
response = query_engine.query("What is the depth of the taxonomic tree for different branches of Bacteria, and what might this indicate about evolutionary history or taxonomic effort?")
print(response)

The classification hierarchy suggests that each branch represents a distinct level in the taxonomy. The number of ranks at each branch indicates the level of specificity.

Starting from the top, the superkingdom "Bacteria" has no rank below it, suggesting it is an overarching category that encompasses all bacteria. 

Moving down, the phylum "Pseudomonadota" appears to be a distinct grouping within the bacterial kingdom, with its own class and order implied but not present in this classification.

The class "Gammaproteobacteria" then further sub-divides into two branches: one with no rank below it (i.e., unclassified Gammaproteobacteria), suggesting that within this larger group, there may be a mix of closely related and distantly related species.

This hierarchical structure might suggest that the evolutionary history of these bacteria is complex and characterized by many branching events. The lack of a specific order or division within the class could indicate that taxonomic effort has focused on grouping species based on shared genetic characteristics rather than their evolutionary relationships.
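The depth asked about in this query can also be measured from the saved files, since each one lists a `rank: name` line per ancestor under `Classification:`. The sketch below counts those lines and how many carry the rank `no rank`. It assumes the layout written by `fetch_and_save_taxonomy`; the sample text is a hand-written illustration in that layout, not a fetched record, and `depth_summary` is a hypothetical helper.

```python
from collections import Counter


def classification_ranks(file_text):
    """Return the ranks listed between 'Classification:' and 'Other Names:'."""
    ranks = []
    in_section = False
    for line in file_text.splitlines():
        stripped = line.strip()
        if stripped == "Classification:":
            in_section = True
            continue
        if stripped == "Other Names:":
            break
        if in_section and ": " in line:
            # Each line is "<rank>: <scientific name>"; keep the rank only
            ranks.append(line.split(": ", 1)[0])
    return ranks


def depth_summary(file_text):
    """Summarize lineage depth and the number of unranked ancestors."""
    ranks = classification_ranks(file_text)
    return {"depth": len(ranks), "unranked": Counter(ranks)["no rank"]}


# Hand-written illustration of the file layout (not a fetched record)
sample = (
    "Species: Example bacterium\n"
    "Lineage: Bacteria; Pseudomonadota; Gammaproteobacteria\n"
    "\nClassification:\n"
    "superkingdom: Bacteria\n"
    "phylum: Pseudomonadota\n"
    "class: Gammaproteobacteria\n"
    "no rank: unclassified Gammaproteobacteria\n"
    "\nOther Names:\n"
)

print(depth_summary(sample))
```

Applying `depth_summary` to every bacterial file and comparing the results would show which branches are shallow or dominated by unranked nodes.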
#q6
response = query_engine.query("Are there patterns in the distribution of species across different taxonomic ranks that might indicate areas needing further research?")
print(response)

Research has shown that certain taxonomic groups exhibit uneven distributions across ranks, often revealing areas where knowledge gaps exist. For instance, the presence of viruses at lower ranks like "no rank" suggests uncharted territories that could benefit from exploration. Similarly, the distribution of families and subfamilies within a higher-order group may indicate regions requiring more in-depth analysis to fully understand their relationships and characteristics.
#q7
response = query_engine.query("Which group has the most unresolved taxonomic disputes at the genus level?")
print(response)

The Orthoptera order has a notable number of genera with disputed taxonomy.
#q8
response = query_engine.query("How does the taxonomic structure correlate with genetic or phylogenetic data?")
print(response)

The relationship between taxonomic classification and genetic or phylogenetic data is one of convergence and validation. Taxonomy is based on the evolutionary history of an organism, which can be reflected in its genetic makeup. The higher the level of taxonomic classification, the more ancient and fundamental characteristics are being described.

Phylogenetic analysis often supports the relationships established through taxonomy, particularly at lower taxonomic levels such as species or genus. However, there may be discrepancies at higher levels due to factors like incomplete lineage sorting, where a trait present in one group is lost in another, or gene duplication events that create new lineages.

The correlation between taxonomic structure and genetic data can also depend on the specific characteristics being examined. For example, morphological traits are often more directly linked to genetic patterns than behavioral or developmental traits. Nevertheless, advances in molecular biology have significantly enhanced our ability to understand the evolutionary history of organisms through comparative genomics and phylogenetic analysis.
#q9
response = query_engine.query("Which group has the most complex taxonomic structure (i.e., the most levels between kingdom and species)?")
print(response)

The group with the most complex taxonomic structure is the Eukaryota.
#q10
response = query_engine.query("How does the proportion of aquatic vs. terrestrial species compare across relevant groups")
print(response)

The distribution of aquatic and terrestrial species appears to be relatively even at certain levels of classification. Some groups exhibit a clear bias towards aquatic environments, while others show a stronger affinity for land-dwelling habitats.

For example, the class Mammalia, which includes Surdisorex polulus, seems to have a mix of both aquatic and terrestrial species. However, when looking at more specific clades like order Eulipotyphla or family Soricidae, there is still no clear indication that one environment is more represented than the other.

It's worth noting that certain groups, such as those within the class Mammalia, have evolved adaptations for both aquatic and terrestrial lifestyles, making it challenging to draw a strict distinction between aquatic and terrestrial species.