# SkillGraph: Full Skill Catalog

78 bioinformatics skills, 483 pipeline transitions (254 literature-backed, 229 type-inferred, 127 confirmed in ground truth).
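
One plausible reading of a type-inferred transition is that skill A can feed skill B whenever one of A's **Outputs** types appears among B's **Inputs**. The sketch below applies that rule to three entries copied from this catalog and searches for the shortest chain between two data types. The matching rule, the function names `infer_transitions` and `shortest_chain`, and the breadth-first search are illustrative assumptions, not the catalog's actual inference logic or its `find_path` implementation.

```python
from collections import deque

# Inputs/Outputs copied from three entries in this catalog.
SKILLS = {
    "rnaseq-alignment": {
        "inputs": {"FASTA", "FASTQ"},
        "outputs": {"BAM", "COUNT_MATRIX", "SAM"},
    },
    "differential-expression": {
        "inputs": {"COUNT_MATRIX"},
        "outputs": {"DE_RESULTS", "EXPRESSION_MATRIX", "GENE_LIST", "PLOT"},
    },
    "pathway-enrichment": {
        "inputs": {"DE_RESULTS", "GENE_LIST"},
        "outputs": {"ENRICHMENT_RESULTS", "PLOT"},
    },
}


def infer_transitions(skills):
    """Return sorted (source, target) pairs where source emits a type target accepts.

    Assumed rule: a transition exists when outputs and inputs share a type.
    """
    return sorted(
        (a, b)
        for a, spec_a in skills.items()
        for b, spec_b in skills.items()
        if a != b and spec_a["outputs"] & spec_b["inputs"]
    )


def shortest_chain(skills, start_type, goal_type):
    """Breadth-first search for the shortest skill chain from start_type to goal_type."""
    edges = infer_transitions(skills)
    # Start from every skill that accepts the starting data type.
    queue = deque([name] for name, spec in sorted(skills.items())
                  if start_type in spec["inputs"])
    seen = {path[0] for path in queue}
    while queue:
        path = queue.popleft()
        last = path[-1]
        if goal_type in skills[last]["outputs"]:
            return path
        for source, target in edges:
            if source == last and target not in seen:
                seen.add(target)
                queue.append(path + [target])
    return None  # no chain of skills connects the two types


print(infer_transitions(SKILLS))
print(shortest_chain(SKILLS, "FASTQ", "ENRICHMENT_RESULTS"))
```

With these three entries, the search chains alignment, differential expression, and enrichment. For the real graph, the `get_transitions` and `find_path` tools are the authoritative source.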

Built and maintained by Pipette.bio (https://pipette.bio) — agentic bioinformatics for wet-lab biologists.

## How to use programmatically

- **MCP server (recommended for AI agents):** https://skillgraph.pipette.bio/mcp
- **Web explorer:** https://skillgraph.pipette.bio
- **GitHub:** https://github.com/variomeanalytics/bioinformatics-agent-skills

MCP tools available: `get_skill`, `list_skills`, `search_skills`, `get_transitions`, `find_path`, `get_graph_stats`.
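
MCP is built on JSON-RPC 2.0, and tool invocations use the `tools/call` method with a tool name and an arguments object. Below is a minimal sketch of such a request body for `get_skill`. The argument key `skill_id` is a guess, since this catalog does not document tool parameters; a real client should read the schema returned by `tools/list` and would normally go through an MCP client library, which handles the initialization handshake and transport.

```python
import json

# Endpoint listed above; the sketch only builds the request and sends nothing.
MCP_URL = "https://skillgraph.pipette.bio/mcp"


def build_tool_call(tool, arguments, request_id=1):
    """Build a JSON-RPC 2.0 body for an MCP `tools/call` request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# `skill_id` is a hypothetical parameter name, not confirmed by this catalog.
request = build_tool_call("get_skill", {"skill_id": "differential-expression"})
print(json.dumps(request, indent=2))
```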

## Skills
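
Each entry below follows the same layout: a `### id — Name` heading, an optional description line, then optional `**Tools:**`, `**Inputs:**`, `**Outputs:**`, and `**Triggers:**` lines. For offline use, that layout can be parsed into records with a few regular expressions. This is a rough sketch: `parse_catalog` and the record keys are hypothetical names, and stripping the "Use when user asks for:" prefix from triggers is a convenience choice rather than documented behavior.

```python
import re

# \u2014 is the em dash that separates the skill id from its display name.
HEADING = re.compile(r"^### (?P<id>\S+) \u2014 (?P<name>.+)$")
FIELD = re.compile(r"^\*\*(?P<key>Tools|Inputs|Outputs|Triggers):\*\* (?P<value>.*)$")
PREFIX = "Use when user asks for:"


def parse_catalog(text):
    """Parse catalog markdown into a list of skill dictionaries."""
    skills, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        heading = HEADING.match(line)
        if heading:
            current = {"id": heading["id"], "name": heading["name"],
                       "description": "", "tools": [], "inputs": [],
                       "outputs": [], "triggers": []}
            skills.append(current)
        elif line.startswith("#"):
            current = None  # any other heading ends the current entry
        elif current is not None and line:
            field = FIELD.match(line)
            if field:
                value = field["value"]
                if value.startswith(PREFIX):
                    value = value[len(PREFIX):]
                items = [v.strip() for v in value.split(",") if v.strip()]
                current[field["key"].lower()] = items
            else:
                # Anything that is not a field line is treated as description.
                current["description"] = (current["description"] + " " + line).strip()
    return skills


# Two entries copied from this catalog.
SAMPLE = """\
### signac \u2014 Signac
Perform single-cell ATAC-seq analysis using Signac (R): Seurat extension for chromatin accessibility analysis.
**Tools:** Signac, signac
**Inputs:** FRAGMENT_FILE, H5AD
**Outputs:** GENE_LIST, H5AD, PLOT
**Triggers:** Use when user asks for: scATAC-seq with Signac, ATAC with Seurat, chromatin accessibility Seurat, multiome ATAC

### scanpy \u2014 Scanpy
**Tools:** Scanpy, scanpy
**Inputs:** COUNT_MATRIX, H5AD
**Outputs:** CLUSTER_DATA, GENE_LIST, H5AD, PLOT
**Triggers:** single-cell, scRNA-seq, scanpy, cell clustering, UMAP, h5ad, celltypist, cell type annotation
"""

parsed = parse_catalog(SAMPLE)
for skill in parsed:
    print(skill["id"], skill["inputs"], "->", skill["outputs"])
```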

### admet-prediction — ADMET Prediction
Predict ADMET properties and drug-likeness for compounds using ADMET-AI (119 endpoints) and RDKit descriptors. Includes Lipinski Rule of 5, PAINS filtering, and comprehensive pharmacokinetic profiling.
**Tools:** ADMET, Lipinski, SwissADME, admetSAR, pkCSM
**Inputs:** SDF, SMILES
**Outputs:** ADMET_RESULTS, SCORES
**Triggers:** ADMET, drug-likeness, Lipinski, pharmacokinetics, toxicity prediction, absorption, distribution, metabolism, excretion, drug properties, drugability, ADMET-AI, drug screening, compound filtering, PAINS

### alphafold-query — AlphaFold Query
Query AlphaFold Database for AI-predicted protein structures. AlphaFold provides high-confidence 3D structure predictions for over 200 million proteins, covering most known protein sequences.
**Tools:** AlphaFold
**Inputs:** GENE_NAME, UNIPROT_ID
**Outputs:** CONFIDENCE_DATA, PDB
**Triggers:** alphafold, protein structure, 3d structure, structure prediction, pdb, protein folding, structural biology, protein model

### archr — ArchR
Perform single-cell ATAC-seq analysis using ArchR (R): from fragments to peaks, clusters, and motif enrichment.
**Tools:** ArchR, archr
**Inputs:** BAM, FRAGMENT_FILE
**Outputs:** ARROW, GENE_LIST, PLOT
**Triggers:** Use when user asks for: scATAC-seq with ArchR, single-cell ATAC analysis, chromatin accessibility, ATAC clustering, peak calling single-cell

### atacseq — ATAC-seq
**Tools:** ATAC-seq, chromVAR
**Inputs:** BAM, FASTQ
**Outputs:** BED, COUNT_MATRIX, FRAGMENT_FILE, GENE_LIST, NARROWPEAK
**Triggers:** Use when user asks for: ATAC-seq analysis, chromatin accessibility, open chromatin, transposase-accessible chromatin, Tn5, bulk ATAC

### bacterial-pangenome — Bacterial Pangenome
Analyze pan-genome across multiple bacterial isolates to identify core, accessory, and unique genes.
**Tools:** PIRATE, Panaroo, Roary, panaroo, roary
**Inputs:** FASTA, GFF
**Outputs:** GENE_LIST, MATRIX, NEWICK
**Triggers:** Use when user asks for: pan-genome, pangenome, core genome, accessory genome, compare isolates, roary, gene presence absence, multiple isolates comparison

### batch-correction — Batch Correction
**Tools:** ComBat, ComBat-seq, Harmony, LIGER, MNN, Scanorama, combat, fastMNN, harmony, scanorama
**Inputs:** COUNT_MATRIX, H5AD, RDS
**Outputs:** COUNT_MATRIX, H5AD, RDS
**Triggers:** batch correction, batch effect, Harmony, fastMNN, scVI, ComBat, integration, remove batch, technical variation, limma removeBatchEffect

### binding-site-detection — Binding Site Detection
Identify and rank druggable binding pockets on protein surfaces using fpocket. Returns pocket locations with druggability scores, volumes, and center coordinates suitable for docking box definition.
**Tools:** DoGSiteScorer, SiteMap, fpocket
**Inputs:** PDB
**Outputs:** BINDING_SITE, COORDINATES, PDB
**Triggers:** binding site, binding pocket, druggable pocket, active site, fpocket, pocket detection, docking site, find binding site, druggability, where to dock

### bisulfite-seq — Bisulfite-seq
**Tools:** Bismark, RRBS, WGBS, methylKit
**Inputs:** FASTQ
**Outputs:** BED, COUNT_MATRIX, GENE_LIST, METHYLATION_DATA

### cbioportal-query — cBioPortal Query
Query cBioPortal for cancer genomics data. cBioPortal provides access to large-scale cancer genomics datasets, including TCGA, with mutation, copy number, and expression data.
**Tools:** cBioPortal
**Inputs:** GENE_NAME
**Outputs:** CLINICAL_DATA, EXPRESSION_MATRIX, MUTATION_DATA
**Triggers:** cbioportal, cancer genomics, somatic mutations, cancer mutations, tumor mutations, TCGA, cancer study, oncology data

### cell-type-annotation — Cell Type Annotation
Automatically annotate cell types in single-cell RNA-seq data using reference-based methods (SingleR, CellTypist) or marker-based approaches. Works with Seurat or Scanpy objects.
**Tools:** CellTypist, SingleR, scType, singler
**Inputs:** CLUSTER_DATA, H5AD, RDS
**Outputs:** CELL_LABELS, H5AD, PLOT, RDS
**Triggers:** Use when user asks for: cell type annotation, annotate clusters, identify cell types, SingleR, CellTypist, celldex, label transfer, cell identity, what are these cells, annotate cells

### cellchat — CellChat
**Tools:** CellChat, NicheNet
**Inputs:** CELL_LABELS, CLUSTER_DATA, H5AD, RDS
**Outputs:** GENE_LIST, INTERACTION_DATA, PLOT
**Triggers:** cellchat, cell-cell communication, ligand-receptor, cell communication, signaling network, intercellular communication, CellChat

### cellphonedb — CellPhoneDB
**Tools:** CellPhoneDB
**Inputs:** CELL_LABELS, COUNT_MATRIX, H5AD
**Outputs:** GENE_LIST, INTERACTION_DATA, PLOT
**Triggers:** cellphonedb, cell-cell communication python, ligand receptor python, CellPhoneDB, cell communication python

### chembl-query — ChEMBL Query
Search the ChEMBL database for bioactivity data, known inhibitors, approved drugs, and target information using the ChEMBL REST API. No API key required.
**Tools:** ChEMBL
**Inputs:** COMPOUND_ID, GENE_NAME
**Outputs:** BIOACTIVITY_DATA, SDF, SMILES
**Triggers:** ChEMBL, bioactivity, IC50, EC50, Ki, known inhibitor, compound activity, target bioactivity, assay data, known drugs for target, active compounds, potency data

### chipseq — ChIP-seq
Perform ChIP-seq analysis: from FASTQ to peaks, annotations, motifs, and visualizations.
**Tools:** HOMER, MACS2, MACS3, SICER, homer, macs2, macs3
**Inputs:** BAM
**Outputs:** BED, COUNT_MATRIX, GENE_LIST, NARROWPEAK
**Triggers:** Use when user asks for: ChIP-seq analysis, peak calling, MACS2, transcription factor binding, histone ChIP, ChIP-seq QC

### citeseq — CITE-seq
CITE-seq combines scRNA-seq with surface protein detection via antibody-derived tags (ADTs). Seurat handles this as a multimodal object with RNA + ADT assays. Clusters are automatically labeled with cell type names using SingleR.
**Tools:** CITE-seq, TotalVI
**Inputs:** COUNT_MATRIX, H5AD
**Outputs:** CLUSTER_DATA, H5AD
**Triggers:** CITE-seq, ADT, antibody-derived tags, protein + RNA, multimodal single-cell, surface markers, WNN, weighted nearest neighbor

### clinical-variant — Clinical Variant
Deterministic clinical variant analysis for rare disease diagnosis. This skill uses a Python script that ensures reproducible, consistent results.
**Tools:** ClinGen, Franklin, InterVar, VarSome
**Inputs:** VARIANT_INFO, VCF
**Outputs:** CLINICAL_DATA, REPORT
**Triggers:** Use when user asks for: clinical variant analysis, variant prioritization, ACMG classification, rare disease diagnosis, exome/genome interpretation, pathogenic variant identification, clinical significance assessment

### clinicaltrials-query — ClinicalTrials.gov Query
Query ClinicalTrials.gov for clinical study information. The database contains 400,000+ studies from 220 countries, including interventional trials, observational studies, and expanded access programs.
**Tools:** ClinicalTrials.gov
**Inputs:** DISEASE, DRUG_DATA, GENE_NAME
**Outputs:** TRIAL_DATA
**Triggers:** clinical trial, clinicaltrials, nct, clinical study, drug trial, treatment trial, intervention study, recruiting trials

### clinvar-query — ClinVar Query
Query NCBI ClinVar database for clinical interpretations of genetic variants. ClinVar aggregates variant-disease relationships with evidence from clinical laboratories.
**Tools:** ClinVar
**Inputs:** GENE_NAME, VARIANT_ID, VCF
**Outputs:** CLINICAL_DATA, VARIANT_INFO
**Triggers:** clinvar, variant pathogenicity, clinical significance, germline variants, pathogenic variants, ACMG classification, variant interpretation

### cnv-analysis — CNV Analysis
Detect copy number variations (amplifications, deletions) from WGS, WES, or targeted sequencing data. Supports both germline CNV detection and somatic copy number alterations (SCNAs) in cancer samples.
**Tools:** CNVkit, Control-FREEC, GISTIC2, Sequenza
**Inputs:** BAM, VCF
**Outputs:** CNV_RESULTS, GENE_LIST, PLOT
**Triggers:** Use when user asks for: CNV analysis, copy number variation, copy number calling, amplification, deletion, CNV detection, SCNA, copy number alterations, ploidy, segmentation, CNVkit, GATK CNV

### compound-analysis — Compound Analysis
Analyze and compare small molecules using RDKit: compute molecular descriptors, generate fingerprints, run similarity/substructure searches, cluster compounds, and apply medicinal chemistry filters.
**Tools:** DeepChem, Mordred
**Inputs:** SDF, SMILES
**Outputs:** DESCRIPTORS, PLOT, SDF
**Triggers:** compound analysis, molecular descriptors, fingerprint, Tanimoto similarity, substructure search, molecular weight, LogP, compound comparison, chemical similarity, SMILES analysis, structural alerts, drug-likeness filter, compound profiling

### cosmic-query — COSMIC Query
Query the COSMIC (Catalogue Of Somatic Mutations In Cancer) database for somatic mutation data. COSMIC is the world's largest source of expert-curated somatic mutation data in human cancer.
**Tools:** COSMIC database
**Inputs:** GENE_NAME, VARIANT_ID
**Outputs:** MUTATION_DATA, VARIANT_INFO
**Triggers:** COSMIC, somatic mutations, cancer mutations, mutation frequency, cancer census genes, tumour mutations, COSV, COSM

### crispr-screen — CRISPR Screen
**Tools:** BAGEL2, CRISPResso2, MAGeCK
**Inputs:** COUNT_MATRIX, FASTQ
**Outputs:** DE_RESULTS, GENE_LIST
**Triggers:** CRISPR screen, CRISPR knockout, CRISPRko, CRISPRi, CRISPRa, gene essentiality, sgRNA, guide RNA, MAGeCK, BAGEL2, screen hits, dropout screen, enrichment screen, fitness screen, genetic screen

### data-inspection — Data Inspection
**Tools:** FastQC, IGV, MultiQC, samtools
**Inputs:** BAM, COUNT_MATRIX, FASTQ, H5AD, VCF
**Outputs:** METADATA, QC_REPORT
**Triggers:** inspect data, read files, describe data, preview data, understand data, file structure, check data, explore data, data summary, what's in this file, show me the data, examine files, look at the data, read the files, describe the files, check the files

### dbsnp-query — dbSNP Query
Query NCBI dbSNP database for SNP and small variant information. dbSNP is the primary repository for short genetic variations, including SNPs, indels, and microsatellites.
**Tools:** dbSNP
**Triggers:** dbsnp, rsid, rs number, snp, single nucleotide polymorphism, variant lookup, rs id, reference SNP

### denovo-assembly — De Novo Assembly
Assemble a single bacterial/archaeal isolate genome from paired-end reads with polishing and quality assessment.
**Tools:** Canu, Flye, Hifiasm, MEGAHIT, SPAdes, Trinity, canu, flye, megahit, spades, trinity, wtdbg2
**Inputs:** FASTA, FASTQ
**Outputs:** ASSEMBLY, FASTA
**Triggers:** Use when user asks for: de novo assembly, genome assembly, assemble isolate, single genome assembly, polish assembly, BUSCO, assembly QC

### differential-expression — Differential Expression
**Tools:** DESeq2, deseq2, edgeR, edger, limma, limma-voom
**Inputs:** COUNT_MATRIX
**Outputs:** DE_RESULTS, EXPRESSION_MATRIX, GENE_LIST, PLOT
**Triggers:** differential expression, DESeq2, limma, DE analysis, fold change, log2FC, padj, FDR, volcano plot, MA plot, differentially expressed genes, DEG

### dtu-analysis — DTU Analysis
Detect changes in transcript/isoform proportions between conditions. Unlike differential gene expression (DGE), DTU identifies when the relative usage of isoforms changes even if total gene expression remains constant.
**Tools:** DEXSeq, DRIMSeq, IsoformSwitchAnalyzeR, dexseq, drimseq
**Inputs:** COUNT_MATRIX, TPM_MATRIX
**Outputs:** DE_RESULTS, GENE_LIST
**Triggers:** Use when user asks for: differential transcript usage, DTU, isoform switching, exon usage, DEXSeq, DRIMSeq, alternative splicing analysis, transcript proportions, isoform ratios

### ensembl-query — Ensembl Query
Query Ensembl REST API for gene annotations, coordinates, sequences, and variant information. Ensembl provides comprehensive genome annotations for vertebrates and model organisms.
**Tools:** Ensembl
**Triggers:** ensembl, gene annotation, gene coordinates, transcript, exon, gene lookup, genome annotation, ensembl id, ENSG, ENST

### flux-analysis — Flux Analysis
**Tools:** COBRA Toolbox, COBRApy, flux balance analysis
**Inputs:** MODEL, SBML
**Outputs:** FLUX_DATA, GENE_LIST
**Triggers:** flux balance analysis, FBA, FVA, metabolic model, COBRA, constraint-based, metabolic network, reaction matrix, flux analysis, metabolic flux, genome-scale model, GEM, metabolic reconstruction, INIT, iMAT, subsystem analysis, pathway usage, reaction prevalence, escher, metabolic map

### fusion-detection — Fusion Detection
Detect gene fusions from RNA-seq data using STAR-Fusion, Arriba, or other fusion callers. Identifies oncogenic fusions like BCR-ABL, EML4-ALK, and novel fusion events in cancer samples.
**Tools:** Arriba, FusionCatcher, STAR-Fusion, arriba, fusioncatcher, star-fusion
**Inputs:** BAM, FASTQ
**Outputs:** FUSION_LIST, GENE_LIST
**Triggers:** Use when user asks for: fusion detection, gene fusion, fusion genes, STAR-Fusion, Arriba, chimeric transcripts, fusion transcripts, BCR-ABL, EML4-ALK, oncogenic fusions, translocation detection

### geo-query — GEO Query
Query NCBI Gene Expression Omnibus (GEO) for gene expression datasets. GEO is a public repository for high-throughput gene expression, genomics, and functional genomics data.
**Tools:** GEO
**Triggers:** geo, gene expression omnibus, GSE, GSM, expression data, microarray, RNA-seq datasets, transcriptomics data, download expression

### gnomad-query — gnomAD Query
Query the Genome Aggregation Database (gnomAD) for population allele frequencies. gnomAD v4 contains data from 807,162 individuals across diverse populations, essential for variant filtering and interpretation.
**Tools:** gnomAD
**Inputs:** GENE_NAME, VARIANT_ID
**Outputs:** FREQUENCY_DATA, VARIANT_INFO
**Triggers:** gnomad, allele frequency, population frequency, minor allele frequency, MAF, population genetics, variant frequency, rare variants

### golden-gate-assembly — Golden Gate Assembly
Design and simulate Golden Gate or Gibson DNA assemblies. Handles restriction enzyme selection, overhang design, assembly simulation, and junction primer design using pydna and Biopython.
**Tools:** Gibson Assembly, Golden Gate
**Inputs:** PRIMER_LIST, SEQUENCE
**Outputs:** CONSTRUCT, SEQUENCE
**Triggers:** Use when user asks for: Golden Gate assembly, Golden Gate cloning, BsaI, BbsI, BpiI, Type IIS restriction, modular cloning, MoClo, overhang design, assembly simulation, pydna, Gibson assembly, DNA assembly, construct design, scarless cloning

### gwas-analysis — GWAS Analysis
Perform a genome-wide association study from VCF genotype data and a phenotype file. Includes QC filtering, population stratification correction via PCA, association testing, and visualization (Manhattan/QQ plots).
**Tools:** BOLT-LMM, GCTA, LocusZoom, PLINK, REGENIE, SAIGE, plink
**Inputs:** PHENOTYPE, PLINK_BED, VCF
**Outputs:** GENE_LIST, GWAS_RESULTS, PLOT
**Triggers:** GWAS analysis, genome-wide association, association study, VCF GWAS, PLINK association, SNP phenotype association, case-control GWAS, quantitative trait GWAS

### gwas-query — GWAS Query
Query the NHGRI-EBI GWAS Catalog for genome-wide association study results. The catalog contains all published GWAS with SNP-trait associations reaching genome-wide significance (p < 5×10⁻⁸).
**Triggers:** gwas, genome-wide association, gwas catalog, snp association, trait association, genetic association, phenotype genetics

### isoform-analysis — Isoform Analysis
Quantify transcript-level expression using Salmon or kallisto, and perform transcript-level differential expression analysis. Enables detection of isoform switching and alternative splicing events.
**Tools:** RSEM, Salmon, kallisto, rsem, salmon
**Inputs:** BAM, FASTQ
**Outputs:** COUNT_MATRIX, TPM_MATRIX
**Triggers:** Use when user asks for: isoform analysis, transcript quantification, Salmon, kallisto, transcript-level expression, alternative splicing, isoform switching, TPM, transcript abundance

### kegg-query — KEGG Query
Query KEGG (Kyoto Encyclopedia of Genes and Genomes) and Reactome pathway databases for pathway information, diagrams, and gene-pathway mappings. Both APIs are free and require no authentication.
**Triggers:** KEGG, Reactome, pathway lookup, pathway diagram, pathway map, metabolic pathway, signaling pathway, pathway search, gene-to-pathway

### lead-optimization — Lead Optimization
Generate and evaluate structural analogs of a lead compound using RDKit: R-group enumeration, bioisosteric replacement, matched molecular pairs, and property-guided optimization. Produces analog libraries for docking and ADMET profiling.
**Tools:** FEP, Free Energy Perturbation, Scaffold hopping
**Inputs:** ADMET_RESULTS, SDF, SMILES
**Outputs:** SCORES, SDF, SMILES
**Triggers:** lead optimization, SAR, structure-activity relationship, modify compound, improve potency, improve selectivity, analog generation, R-group enumeration, scaffold hopping, bioisostere, compound optimization, medicinal chemistry

### ligand-preparation — Ligand Preparation
Convert SMILES to docking-ready 3D structures using RDKit, Gypsum-DL (tautomer/protomer enumeration), and Meeko (PDBQT for Vina). Handles protonation states, conformer generation, and format conversion.
**Tools:** Meeko, Open Babel, RDKit
**Inputs:** MOL2, SDF, SMILES
**Outputs:** MOL2, PDBQT, SDF
**Triggers:** ligand preparation, prepare ligand, SMILES to 3D, SMILES to PDBQT, generate conformer, protonate ligand, tautomer enumeration, prepare for docking, convert SMILES to PDB, 3D structure from SMILES, Gypsum-DL, meeko

### md-simulation — MD Simulation
Run short molecular dynamics simulations using GROMACS or OpenMM. Includes system setup (solvation, ions), energy minimization, equilibration, and production MD with trajectory analysis (RMSD, RMSF, radius of gyration).
**Tools:** AMBER, CHARMM, GROMACS, NAMD, OpenMM, namd
**Inputs:** GRO, PDB, PRMTOP
**Outputs:** ENERGY_DATA, PDB, TRAJECTORY
**Triggers:** molecular dynamics, MD simulation, GROMACS, OpenMM, protein dynamics, equilibration, production run, NPT, NVT, RMSD, RMSF, trajectory analysis, protein flexibility, solvate protein, energy minimization

### metagenome-binning — Metagenome Binning
Assemble metagenomic reads and bin contigs into metagenome-assembled genomes (MAGs).
**Tools:** CONCOCT, DAS Tool, MaxBin2, MetaBAT2, maxbin2, metabat2
**Inputs:** ASSEMBLY, BAM, FASTA
**Outputs:** BIN_DATA, FASTA
**Triggers:** Use when user asks for: metagenome binning, MAGs, metagenome-assembled genomes, metabat, checkm, bin contigs, recover genomes from metagenome, metagenomic assembly

### microbiome-profiling — Microbiome Profiling
Identify and characterize viral sequences from metagenomic data. Supports both viral species identification (clinical/diagnostic) and novel virus discovery (virome profiling).
**Tools:** Bracken, DADA2, Kraken2, LEfSe, MetaPhlAn, PICRUSt2, QIIME2, dada2, kraken2, metaphlan, mothur, phyloseq, qiime
**Inputs:** FASTA, FASTQ
**Outputs:** BIOM, FASTA, GENE_LIST, OTU_TABLE, TAXONOMY
**Triggers:** Use when user asks for: viral metagenomics, virome, virsorter, checkv, find viruses, bacteriophage, phage discovery, viral contigs, vOTUs, viral agents, viral species identification

### molecular-docking — Molecular Docking
Dock small molecules into protein binding sites using AutoDock Vina or SMINA. Produces ranked binding poses with affinity scores (kcal/mol).
**Tools:** AutoDock, AutoDock Vina, GOLD docking, Glide, MOE, SMINA, autodock_vina
**Inputs:** PDB, PDBQT, SDF, SMILES
**Outputs:** DOCKING_POSES, PDB, PDBQT, SCORES, SDF
**Triggers:** molecular docking, dock ligand, dock compound, protein-ligand docking, AutoDock Vina, SMINA, docking score, binding affinity, pose prediction, dock drug, docking study

### multi-omics-integration — Multi-Omics Integration
**Tools:** DIABLO, MOFA, WNN, iCluster, mixOmics
**Inputs:** COUNT_MATRIX, EXPRESSION_MATRIX, H5AD
**Outputs:** FACTOR_DATA, GENE_LIST, PLOT
**Triggers:** multi-omics, multimodal, RNA+ATAC, RNA+protein, CITE-seq, WNN, MOFA2, DIABLO, totalVI, data integration, joint analysis

### nanopore-analysis — Nanopore Analysis
QC, filter, align, assemble, and polish long-read nanopore sequencing data. Handles ONT reads from MinION, GridION, and PromethION.
**Tools:** Guppy, Medaka, NanoFilt, NanoPlot, guppy, medaka
**Inputs:** FAST5, FASTQ, POD5
**Outputs:** BAM, FASTQ
**Triggers:** Use when user asks for: nanopore, ONT, Oxford Nanopore, long-read sequencing, MinION, PromethION, FAST5, POD5, Flye, Medaka, NanoPlot, NanoFilt, chopper, long-read alignment, long-read assembly, nanopore QC, pycoQC

### oligo-design — Oligo Design
Design primers, oligo pools, and probes using primer3-py with thermodynamic validation. Includes secondary structure prediction with ViennaRNA and off-target screening.
**Tools:** Primer3, PrimerBLAST
**Inputs:** FASTA, SEQUENCE
**Outputs:** PRIMER_LIST, SEQUENCE
**Triggers:** Use when user asks for: oligo design, oligo pool, primer design, primer3, oligonucleotide, probe design, Tm calculation, melting temperature, hairpin check, dimer check, oligo synthesis, mutagenic primers, degenerate primers, ViennaRNA, secondary structure oligo

### openfda-query — OpenFDA Query
Query OpenFDA for drug safety and regulatory data. OpenFDA provides access to FDA databases including adverse event reports (FAERS), drug labeling, recalls, and device information.
**Triggers:** openfda, fda, adverse events, drug safety, faers, drug reactions, side effects, drug recalls, medication errors

### opentargets-query — Open Targets Query
Query Open Targets Platform for drug target-disease associations. Open Targets integrates evidence from genetics, genomics, transcriptomics, drugs, and literature to prioritize therapeutic targets.
**Tools:** Open Targets
**Inputs:** DISEASE, GENE_NAME
**Outputs:** DRUG_DATA, GENE_LIST
**Triggers:** opentargets, drug targets, target identification, drug discovery, disease associations, therapeutic targets, druggability

### pathway-enrichment — Pathway Enrichment
**Tools:** DAVID, Enrichr, GOseq, GSEA, KEGG, Reactome, ReactomePA, clusterProfiler, cluster_profiler, clusterprofiler, david, enrichr, fgsea, g:Profiler, gsea, topGO, topgo
**Inputs:** DE_RESULTS, GENE_LIST
**Outputs:** ENRICHMENT_RESULTS, PLOT
**Triggers:** pathway enrichment, GO enrichment, KEGG, gene ontology, functional enrichment, ORA, GSEA

### pdb-query — PDB Query
Search and download protein structures from the RCSB Protein Data Bank using the rcsb-api Python client. Query by protein name, UniProt ID, organism, resolution, ligand, or PDB ID.
**Tools:** PDB
**Inputs:** GENE_NAME, PDB_ID
**Outputs:** PDB, STRUCTURE_DATA
**Triggers:** PDB, protein data bank, crystal structure, protein structure download, fetch structure, RCSB, X-ray structure, cryo-EM structure, get PDB file, structure search

### phylogenetic-tree — Phylogenetic Tree
Build phylogenetic trees from DNA or protein sequences: alignment, trimming, tree inference, and visualization.
**Tools:** BEAST2, FastTree, IQ-TREE, MrBayes, PhyML, RAxML, iq-tree
**Inputs:** ALIGNMENT, ASSEMBLY, FASTA
**Outputs:** NEWICK, PLOT, TREE
**Triggers:** Use when user asks for: phylogenetic tree, phylogeny, evolutionary tree, build tree, mafft, iqtree, raxml, fasttree, sequence alignment, newick, tree visualization

### protein-ligand-interaction — Protein-Ligand Interaction
Analyze non-covalent interactions between a protein and ligand using PLIP (from PDB/docking structures) and ProLIF (interaction fingerprints across poses). Identifies hydrogen bonds, hydrophobic contacts, salt bridges, pi-stacking, and water bridges.
**Tools:** LigPlot, PLIP, ProLIF
**Inputs:** DOCKING_POSES, PDB, SDF
**Outputs:** INTERACTION_DATA, PDB, PLOT, SDF, SMILES
**Triggers:** protein-ligand interaction, binding interaction, hydrogen bonds, hydrophobic contacts, interaction fingerprint, PLIP, ProLIF, binding analysis, contact analysis, interaction diagram, pi-stacking

### proteomics — Proteomics
Analyze mass spectrometry-based proteomics data. Perform quality control, normalization, differential protein abundance analysis, and visualization. Works with output from MaxQuant, MSFragger, Spectronaut, or other quantification tools.
**Tools:** DIA-NN, MSFragger, MSstats, MaxQuant, Perseus, Proteome Discoverer, Spectronaut, maxquant, msfragger
**Inputs:** MZML, RAW_MS
**Outputs:** GENE_LIST, PLOT, PROTEIN_LIST
**Triggers:** Use when user asks for: proteomics analysis, differential protein expression, protein abundance, MaxQuant analysis, mass spectrometry, MS data, protein quantification, TMT, label-free quantification, LFQ, MSstats, DEP

### pubchem-query — PubChem Query
Search PubChem for compounds by name, SMILES, CID, or formula using PubChemPy. Retrieve compound structures, properties, synonyms, and safety data.
**Tools:** PubChem
**Inputs:** COMPOUND_ID, SMILES
**Outputs:** PROPERTY_DATA, SDF, SMILES
**Triggers:** PubChem, compound lookup, CID, get SMILES, compound name to SMILES, drug structure, chemical properties, compound search, find compound, InChI, molecular formula lookup

### pubmed-query — PubMed Query
**Triggers:** pubmed, literature, papers, publications, research articles, citations, pmid, scientific literature, journal articles

### rna-velocity — RNA Velocity
**Tools:** scVelo, velocyto
**Inputs:** H5AD, LOOM
**Outputs:** H5AD, PLOT, VELOCITY_DATA
**Triggers:** RNA velocity, scVelo, velocity, spliced unspliced, cell dynamics, velocity vectors, cell state transitions, dynamical modeling

### rnaseq-alignment — RNA-seq Alignment
**Tools:** HISAT2, HTSeq, STAR, StringTie, Subread, TopHat2, featureCounts, featurecounts, hisat2, htseq, star, stringtie, tophat
**Inputs:** FASTA, FASTQ
**Outputs:** BAM, COUNT_MATRIX, SAM
**Triggers:** RNA-seq, Salmon, HISAT2, STAR, FASTQ to counts, gene counts, transcript quantification, trim reads, adapter trimming, BAM alignment

### scanpy — Scanpy
**Tools:** Scanpy, scanpy
**Inputs:** COUNT_MATRIX, H5AD
**Outputs:** CLUSTER_DATA, GENE_LIST, H5AD, PLOT
**Triggers:** single-cell, scRNA-seq, scanpy, cell clustering, UMAP, h5ad, celltypist, cell type annotation

### scenic — SCENIC
**Tools:** SCENIC, pySCENIC
**Inputs:** EXPRESSION_MATRIX, H5AD, RDS
**Outputs:** GENE_LIST, REGULON_DATA
**Triggers:** scenic, pyscenic, single cell gene regulatory network, GRN, transcription factor activity, regulon, AUCell, GRNBoost2, cisTarget, TF activity

### seurat — Seurat
**Tools:** Seurat, seurat
**Inputs:** COUNT_MATRIX, H5AD
**Outputs:** CLUSTER_DATA, GENE_LIST, PLOT, RDS
**Triggers:** seurat, R single-cell, scRNA-seq R, RDS, SeuratObject, FindClusters, DimPlot, FindMarkers, RunUMAP

### shared — Shared

### signac — Signac
Perform single-cell ATAC-seq analysis using Signac (R): Seurat extension for chromatin accessibility analysis.
**Tools:** Signac, signac
**Inputs:** FRAGMENT_FILE, H5AD
**Outputs:** GENE_LIST, H5AD, PLOT
**Triggers:** Use when user asks for: scATAC-seq with Signac, ATAC with Seurat, chromatin accessibility Seurat, multiome ATAC

### somatic-variants — Somatic Variants
Call somatic (cancer) mutations from tumor-normal paired samples or tumor-only samples using GATK Mutect2 or Strelka2. Identifies SNVs, indels, and generates filtered high-confidence somatic variants.
**Tools:** MuSE, Mutect2, SomaticSniper, Strelka2, VarScan2, mutect2, strelka2, varscan2
**Inputs:** BAM, CRAM, VCF
**Outputs:** BCF, GENE_LIST, MAF, VCF
**Triggers:** Use when user asks for: somatic variants, tumor mutations, cancer variants, Mutect2, Strelka, tumor-normal, somatic calling, cancer genomics, tumor variants, somatic mutations, oncology variants

### spatial-atacseq — Spatial ATAC-seq
Analyze spatial ATAC-seq data (e.g., DBiT-seq) combining chromatin accessibility with spatial coordinates. Uses ArchR for ATAC analysis and Seurat for spatial visualization.
**Triggers:** spatial ATAC-seq, spatial chromatin accessibility, DBiT-seq, spatial epigenomics, tixel ATAC, tissue ATAC-seq

### spatial-transcriptomics — Spatial Transcriptomics
Analyze spatial transcriptomics data using Seurat: from raw counts to clusters, spatially variable genes, and cell type deconvolution.
**Tools:** MERFISH, STdeconvolve, Slide-seq, SpatialDE, Visium, squidpy
**Inputs:** COUNT_MATRIX, H5AD
**Outputs:** H5AD, PLOT, SPATIAL_DATA
**Triggers:** spatial transcriptomics, Visium, spatial RNA-seq, Slide-seq, Xenium, CosMx, MERFISH, spatial gene expression, tissue transcriptomics, 10X spatial

### starsolo — STARsolo
**Tools:** Cell Ranger, STARsolo, cell_ranger, starsolo
**Inputs:** FASTQ
**Outputs:** COUNT_MATRIX, H5AD
**Triggers:** STARsolo, single-cell FASTQ, scRNA-seq alignment, 10x Chromium FASTQ, Drop-seq FASTQ, Smart-seq2 FASTQ, cell ranger alternative, barcode demultiplex, single-cell count matrix from FASTQ, UMI counting

### string-query — STRING Query
Query the STRING database (Search Tool for the Retrieval of Interacting Genes/Proteins) for protein-protein interaction networks and functional associations. STRING integrates experimental, computational, and text-mining evidence for protein interactions.
**Tools:** STRING
**Inputs:** GENE_LIST, PROTEIN_LIST
**Outputs:** GENE_LIST, NETWORK_DATA, PLOT
**Triggers:** STRING, protein-protein interaction, PPI, interaction network, protein network, functional association, protein partners, interactome

### structural-variants — Structural Variants
**Tools:** Delly2, GRIDSS, LUMPY, Manta, delly
**Inputs:** BAM, CRAM
**Outputs:** BEDPE, VCF
**Triggers:** structural variant, SV calling, deletion, duplication, inversion, translocation, breakpoint, delly, CNV large, structural rearrangement, SV detection, genome rearrangement

### survival-analysis — Survival Analysis
**Tools:** Cox proportional hazards regression, Kaplan-Meier, lifelines, survminer
**Inputs:** CLINICAL_DATA, EXPRESSION_MATRIX, GENE_LIST
**Outputs:** PLOT, SURVIVAL_RESULTS
**Triggers:** survival analysis, Kaplan-Meier, KM curve, Cox regression, hazard ratio, log-rank test, time to event, overall survival, progression-free survival, OS, PFS, survival curve, clinical outcomes

### trajectory-analysis — Trajectory Analysis
**Tools:** Monocle, PAGA, Palantir, Slingshot, monocle, monocle3, monocle_3, slingshot
**Inputs:** CLUSTER_DATA, H5AD, RDS
**Outputs:** H5AD, PLOT, PSEUDOTIME, RDS
**Triggers:** trajectory, pseudotime ordering, monocle3, slingshot, developmental trajectory, cell fate, differentiation path, lineage tracing, branching trajectory

### ucsc-query — UCSC Query
Query UCSC Genome Browser for genomic sequences, annotations, and track data. UCSC provides comprehensive genome assemblies for multiple species with rich annotation tracks.
**Triggers:** ucsc, genome browser, genomic sequence, genome assembly, track data, genome coordinates, chromosome sequence, liftover

### uniprot-query — UniProt Query
Query UniProt database for comprehensive protein sequence and functional information. UniProt is the central hub for protein data, including sequence, function, domains, and cross-references.
**Tools:** UniProt
**Triggers:** uniprot, protein, protein sequence, protein function, protein structure, swissprot, trembl, protein database, protein annotation

### variant-annotation — Variant Annotation
Annotate VCF files with functional consequences, population frequencies, and clinical significance using SnpEff. Adds gene names, impact predictions, allele frequencies, and pathogenicity scores.
**Tools:** Ensembl VEP, ANNOVAR, SnpEff, annovar, snpeff
**Inputs:** BCF, VCF
**Outputs:** ANNOTATED_VCF, GENE_LIST, TSV, VARIANT_INFO, VCF
**Triggers:** Use when user asks for: variant annotation, annotate VCF, VEP, SnpEff, functional annotation, variant effect, consequence, add gene names, clinical annotation, pathogenicity, gnomAD annotation

### variant-calling — Variant Calling
Call germline variants from aligned BAM files using GATK or FreeBayes, producing filtered and normalized VCF files.
**Tools:** DeepVariant, FreeBayes, GATK CNV, HaplotypeCaller, bcftools, deepvariant, freebayes, gatk, haplotypecaller
**Inputs:** BAM, CRAM
**Outputs:** BAM, BCF, VCF
**Triggers:** variant calling, SNP calling, call variants, BAM to VCF, GATK HaplotypeCaller, freebayes, germline variants, SNPs and indels

### viral-metagenomics — Viral Metagenomics
Identify, quality-assess, and quantify viral sequences from metagenomic data.
**Tools:** CheckV, VIBRANT, VirSorter
**Inputs:** ASSEMBLY, FASTA
**Outputs:** FASTA, TAXONOMY
**Triggers:** Use when user asks for: viral metagenomics, virome, virsorter, checkv, find viruses, bacteriophage, phage discovery, viral contigs, vOTUs

### virtual-screening — Virtual Screening
Screen a library of compounds against a protein target using AutoDock Vina. Includes library preparation (SMILES to 3D), batch docking, scoring, ranking, and filtering.
**Tools:** ChemDiv, ZINC, virtual screening
**Inputs:** PDB, SDF, SMILES
**Outputs:** RANKED_LIST, SCORES, SDF
**Triggers:** virtual screening, screen compound library, screen compounds, high-throughput docking, library screening, drug screening, hit identification, compound ranking, screen molecules against target

### wgcna — Wgcna
**Tools:** WGCNA, wgcna
**Inputs:** COUNT_MATRIX, EXPRESSION_MATRIX
**Outputs:** GENE_LIST, MODULE_DATA, PLOT
**Triggers:** WGCNA, gene co-expression network, module detection, hub genes, eigengene, soft threshold, module-trait, gene network modules, weighted correlation network

### wgs-alignment — Wgs Alignment
**Tools:** BWA-MEM2, Bowtie2, bowtie, bowtie2, bwa, bwa-mem, bwa-mem2, minimap2
**Inputs:** FASTA, FASTQ
**Outputs:** BAM, CRAM, SAM
**Triggers:** WGS, WES, whole genome sequencing, whole exome sequencing, BWA, BWA-MEM2, genomic alignment, DNA-seq, exome sequencing, DNA alignment, FASTQ to BAM

## Top pipeline transitions (by paper count)

- **wgs-alignment** → **variant-calling** (BAM, CRAM) — 765 papers, literature+type ✓
- **denovo-assembly** → **wgs-alignment** (FASTA) — 624 papers, literature+type
- **differential-expression** → **pathway-enrichment** (DE_RESULTS, GENE_LIST) — 590 papers, literature+type ✓
- **rnaseq-alignment** → **differential-expression** (COUNT_MATRIX) — 457 papers, literature+type ✓
- **differential-expression** → **wgcna** (EXPRESSION_MATRIX) — 409 papers, literature+type ✓
- **wgcna** → **pathway-enrichment** (GENE_LIST) — 397 papers, literature+type ✓
- **nanopore-analysis** → **wgs-alignment** (FASTQ) — 332 papers, literature+type
- **wgs-alignment** → **chipseq** (BAM) — 288 papers, literature+type
- **variant-calling** → **variant-annotation** (BCF, VCF) — 260 papers, literature+type ✓
- **denovo-assembly** → **phylogenetic-tree** (ASSEMBLY, FASTA) — 192 papers, literature+type ✓
- **starsolo** → **seurat** (COUNT_MATRIX, H5AD) — 188 papers, literature+type ✓
- **nanopore-analysis** → **denovo-assembly** (FASTQ) — 187 papers, literature+type
- **denovo-assembly** → **rnaseq-alignment** (FASTA) — 185 papers, literature+type
- **microbiome-profiling** → **wgs-alignment** (FASTA) — 180 papers, literature+type
- **rnaseq-alignment** → **isoform-analysis** (BAM) — 172 papers, literature+type ✓
- **isoform-analysis** → **differential-expression** (COUNT_MATRIX) — 171 papers, literature+type
- **wgs-alignment** → **atacseq** (BAM) — 148 papers, literature+type
- **wgs-alignment** → **isoform-analysis** (BAM) — 132 papers, literature+type
- **microbiome-profiling** → **denovo-assembly** (FASTA) — 130 papers, literature+type
- **wgs-alignment** → **somatic-variants** (BAM, CRAM) — 126 papers, literature+type
- **variant-calling** → **somatic-variants** (BAM, VCF) — 125 papers, literature+type ✓
- **seurat** → **pathway-enrichment** (GENE_LIST) — 122 papers, literature+type ✓
- **denovo-assembly** → **microbiome-profiling** (FASTA) — 122 papers, literature+type
- **batch-correction** → **differential-expression** (COUNT_MATRIX) — 112 papers, literature+type
- **seurat** → **batch-correction** (RDS) — 111 papers, literature+type ✓
- **chipseq** → **differential-expression** (COUNT_MATRIX) — 111 papers, literature+type ✓
- **variant-annotation** → **clinvar-query** (VCF) — 109 papers, literature+type
- **rnaseq-alignment** → **variant-calling** (BAM) — 108 papers, literature+type
- **string-query** → **pathway-enrichment** (GENE_LIST) — 106 papers, literature+type ✓
- **wgcna** → **string-query** (GENE_LIST) — 103 papers, literature+type ✓
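
The transition bullets above follow a regular shape (source skill, target skill, shared data types, paper count, evidence label, optional ✓), so they can be read into a small graph for offline use. The following is a minimal Python sketch, not the MCP server's `find_path` implementation: the regular expression, the function names (`parse_transition`, `build_graph`, `shortest_path`), and the reading of the trailing ✓ as "confirmed in ground truth" are all assumptions made for illustration.

```python
import re
from collections import defaultdict, deque

# Assumed line shape, copied from the list above:
# - **src** → **dst** (TYPE, TYPE) — N papers, evidence [✓]
LINE_RE = re.compile(
    r"^- \*\*(?P<src>[\w-]+)\*\* → \*\*(?P<dst>[\w-]+)\*\* "
    r"\((?P<types>[^)]*)\) — (?P<papers>\d+) papers, (?P<evidence>[\w+]+)"
    r"(?P<confirmed> ✓)?$"
)


def parse_transition(line):
    """Parse one transition bullet into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "src": m["src"],
        "dst": m["dst"],
        "types": [t.strip() for t in m["types"].split(",")],
        "papers": int(m["papers"]),
        "evidence": m["evidence"],
        # Assumption: the trailing check mark flags a ground-truth-confirmed transition.
        "confirmed": m["confirmed"] is not None,
    }


def build_graph(lines):
    """Group parsed transitions by source skill (adjacency list)."""
    graph = defaultdict(list)
    for line in lines:
        transition = parse_transition(line)
        if transition is not None:
            graph[transition["src"]].append(transition)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search for the path with the fewest transitions."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for transition in graph.get(path[-1], []):
            if transition["dst"] not in seen:
                seen.add(transition["dst"])
                queue.append(path + [transition["dst"]])
    return None


# Three bullets taken verbatim from the list above.
SAMPLE = [
    "- **wgs-alignment** → **variant-calling** (BAM, CRAM) — 765 papers, literature+type ✓",
    "- **variant-calling** → **variant-annotation** (BCF, VCF) — 260 papers, literature+type ✓",
    "- **variant-annotation** → **clinvar-query** (VCF) — 109 papers, literature+type",
]

graph = build_graph(SAMPLE)
print(shortest_path(graph, "wgs-alignment", "clinvar-query"))
# ['wgs-alignment', 'variant-calling', 'variant-annotation', 'clinvar-query']
```

Breadth-first search treats every transition as equal cost. Ranking candidate paths by paper count or by the confirmed flag would need a weighted search instead, which this sketch leaves out.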