pleiadian53/agentic-spliceai

Agentic-SpliceAI

Context-Aware Novel Isoform Discovery for Drug Target Identification

Agentic-SpliceAI is an agentic AI system with hierarchical multi-task prediction for discovering novel RNA isoforms: disease-specific, variant-induced, and tissue-specific splice variants that go beyond canonical annotations. Originally refactored from Meta-SpliceAI, it has evolved into a self-sustained compound AI system with extensible foundation model predictors, multimodal evidence fusion, agentic AI validation, and a meta-learning framework (M1-M4) targeting progressively harder splice prediction problems.

The system combines three key architectural ideas:

  • Multi-task learning: Shared multimodal representation (10 modalities, 116 features) with task-specific model heads (M1-M4)
  • Hierarchical prediction: M1 (canonical) → M2 (alternative) → M3 (novel discovery) → M4 (perturbation-induced); each level tackles a harder problem with a different label regime
  • Agentic validation: LLM-powered agents for literature mining, expression evidence, clinical interpretation, and recursive self-improvement

Workflow: Prediction to Discovery

Conceptual workflow schematic (generated by Google Nano Banana 2)

Precise workflow diagram (Mermaid)
graph TD
    %% Color scheme matching the 5-layer workflow bands
    classDef input fill:#e1f5fe,stroke:#0288d1,stroke-width:2px,color:#1a1a1a
    classDef base fill:#f3e5f5,stroke:#8e24aa,stroke-width:2px,color:#1a1a1a
    classDef meta fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#1a1a1a
    classDef agents fill:#e8f5e9,stroke:#388e3c,stroke-width:2px,color:#1a1a1a
    classDef output fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#1a1a1a

    %% Layer 1: Input & Preparation
    subgraph L1 ["① INPUT & PREPARATION"]
        direction LR
        A1[(Genomic Data<br/>GTF, FASTA, MANE)]:::input
        A2[(Variant Data<br/>VCF)]:::input
        B["Data Preparation Pipeline (CLI)"]:::input
        A1 --> B
        A2 --> B
    end

    %% Layer 2: Base Prediction Layer
    subgraph L2 ["② BASE PREDICTION LAYER"]
        direction LR
        C[Canonical Predictions<br/>MANE Baseline]:::base
        D[Splice-Site Prediction Engines<br/>Pluggable Architecture]:::base
        D1[SpliceAI /<br/>OpenSpliceAI]:::base
        D2[Foundation Models<br/>e.g. Evo2 Fine-Tuning]:::base
        D3[Future<br/>Predictors]:::base
        C --> D
        D --> D1
        D --> D2
        D --> D3
    end

    %% Layer 3: Meta Layer Integration
    subgraph L3 ["③ META LAYER INTEGRATION"]
        direction LR
        E1[Variant Context]:::meta
        E2[Disease Context]:::meta
        E3[Tissue Context]:::meta
        F[Multimodal Deep Learning<br/>Meta-Models]:::meta
        G[Context-Aware Adaptive Predictions<br/>& Novel Site Detection]:::meta
        E1 --> F
        E2 --> F
        E3 --> F
        F --> G
    end

    %% Layer 4: Agentic Workflow Layer
    subgraph L4 ["④ AGENTIC WORKFLOW LAYER"]
        direction LR
        H{Autonomous AI<br/>Agent Orchestrator}:::agents
        I[Literature Agent<br/>PubMed, arXiv]:::agents
        J[Expression Agent<br/>GTEx, TCGA]:::agents
        K[Clinical Agent<br/>ClinVar, COSMIC]:::agents
        L[Conservation Agent<br/>PhyloP]:::agents
        M[Structural Agent<br/>AlphaFold, Foldseek]:::agents
        H -->|Evidence Mining| I
        H -->|RNA-seq Validation| J
        H -->|Disease Mapping| K
        H -->|Cross-species| L
        H -->|Structure Prediction| M
        I --> S
        J --> S
        K --> S
        L --> S
        M --> S
        S[Nexus Research Agent<br/>Comprehensive Reports]:::agents
    end

    %% Layer 5: Outcomes & Discovery
    subgraph L5 ["⑤ OUTCOMES & DISCOVERY"]
        direction LR
        N1[Novel Isoform Discovery<br/>Drug Targets]:::output
        N2[Clinical-Grade Variant<br/>Interpretation - VUS]:::output
        N3[Tissue-Specific<br/>Biomarkers]:::output
    end

    %% Inter-layer flow
    B --> D
    D1 --> F
    D2 --> F
    D3 --> F
    G --> H

    %% Self-improvement feedback loop
    S -.->|Self-Improvement Feedback| F

    %% Workflow to outcomes
    S --> N1
    S --> N2
    S --> N3

🎯 Vision: From Splice Prediction to Drug Discovery

The Ultimate Goal: Novel Isoform Discovery

The Challenge: Current gene annotations (MANE, RefSeq) capture only ~10% of biologically active splice sites. The remaining ~90% includes:

  • 🦠 Disease-specific isoforms (cancer, neurological, cardiac)
  • 🧬 Variant-induced splicing (pathogenic mutations, VUS)
  • 🧪 Tissue-specific isoforms (brain, immune, developmental)
  • 💊 Druggable novel targets (oncogenes, splice modulators)

Our Solution: Context-aware adaptive prediction through multimodal meta-learning discovers novel isoforms:

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#1e3a8a','primaryTextColor':'#fff','lineColor':'#3b82f6','fontSize':'15px','fontFamily':'ui-sans-serif, system-ui, sans-serif'}}}%%

graph LR
    A["<b>📊 Annotations</b><br/>MANE/RefSeq<br/><i>~10% of sites</i>"]:::canonical
    B["<b>🧬 Base Layer</b><br/>SpliceAI • OpenSpliceAI<br/>Evo2 • SpliceBERT"]:::foundation
    C["<b>🎯 Meta Layer</b><br/>10-Modality Fusion<br/>M1-M4 Models"]:::metalayer
    D["<b>🔍 Discovery</b><br/>Delta Scoring<br/>Isoform Assembly"]:::discovery
    E["<b>🤖 Agentic Layer</b><br/>Literature • RNA-seq<br/>Clinical Validation"]:::agentic
    F["<b>💊 Novel Isoform<br/>Catalog</b><br/>Drug Targets<br/>Precision Medicine"]:::output

    A --> B --> C --> D --> E --> F

    classDef canonical fill:#1e3a8a,stroke:#1e40af,stroke-width:2px,color:#ffffff
    classDef foundation fill:#0891b2,stroke:#0e7490,stroke-width:2px,color:#ffffff
    classDef metalayer fill:#7c3aed,stroke:#6d28d9,stroke-width:2px,color:#ffffff
    classDef discovery fill:#059669,stroke:#047857,stroke-width:2px,color:#ffffff
    classDef agentic fill:#dc2626,stroke:#b91c1c,stroke-width:2px,color:#ffffff
    classDef output fill:#d97706,stroke:#b45309,stroke-width:3px,color:#ffffff

Key Innovation: The Foundation-Adaptor Framework uses multimodal deep learning to refine foundation model predictions with context, with the aim of discovering the ~90% of splice sites that lie beyond canonical annotations.

Why This Matters for Drug Discovery

Traditional Approach:

  • Target canonical proteins
  • Miss disease-specific isoforms
  • Limited therapeutic options

Agentic-SpliceAI Approach:

  • Discover disease-specific isoforms
  • Identify druggable splice variants
  • Enable isoform-selective therapeutics
  • Potentially expand the druggable genome by 10-100x

🚀 The Agentic-SpliceAI Advantage

1. Pluggable Base Layer

Model-agnostic splice prediction: any predictor can plug into the base layer, whether classical (SpliceAI, OpenSpliceAI), foundation-model-derived (SpliceBERT + trained classifier head), or a future model, as long as its output satisfies the per-nucleotide 3-class scoring protocol (neither / acceptor / donor).

  • BasePredictor protocol + plugin registry: decorator for built-ins, YAML manifest for external/foundation-model-derived predictors
  • Same CLI (agentic-spliceai-base), same downstream consumers; swapping a predictor is a registration concern, not an integration one
  • Three predictors are currently registered; the set is open-ended. Adding a new predictor requires no changes to the meta layer, variant analysis, or any other downstream application
  • See src/agentic_spliceai/applications/base_layer/protocol.py for the protocol and examples/foundation_models/ for prototype trainers (SpliceBERT, Evo2)

2. Adaptive Meta-Learning (Foundation-Adaptor Framework)

Multimodal deep learning: Refine predictions using context-aware meta-models

  • Foundation: Base model predictions (canonical knowledge)
  • Adaptor: Multimodal feature fusion (base scores, conservation, epigenetic marks, chromatin accessibility, RNA-seq junction evidence, RBP eCLIP binding, DNA sequence, genomic context, gene annotations); see Feature Catalog and examples/features/
  • Context embedding: Patient variants, disease state, tissue type
  • Self-improvement: Learn from validation feedback continuously
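The Key Features table describes the sequence-level meta-model as using a "logit-space residual blend" with a "per-class learned temperature". One plausible reading of that phrase, offered as an assumption and not as the project's implementation, is: convert the base model's class probabilities to logits, add a learned correction, divide each class by its temperature, and renormalize. The function names and example numbers below are illustrative only.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a small vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def residual_blend(
    base_probs: Sequence[float],
    residual_logits: Sequence[float],
    temperatures: Sequence[float],
    eps: float = 1e-6,
) -> List[float]:
    """Refine base (neither, acceptor, donor) probabilities in logit space."""
    base_logits = [math.log(max(p, eps)) for p in base_probs]
    blended = [
        (b + r) / t
        for b, r, t in zip(base_logits, residual_logits, temperatures)
    ]
    return softmax(blended)


base = [0.70, 0.20, 0.10]  # illustrative base-model output at one position

# A zero residual with unit temperatures leaves the base prediction untouched.
unchanged = residual_blend(base, [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])

# A positive donor residual shifts probability mass toward the donor class.
boosted = residual_blend(base, [0.0, 0.0, 1.5], [1.0, 1.0, 1.0])
```

The appeal of a residual form is that the adaptor starts from the base prediction and only has to learn corrections, so a zero residual falls back to the foundation model.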

3. Context-Aware Prediction

Beyond static annotations: Discover isoforms specific to:

  • Patient genetic backgrounds (variant-induced splicing)
  • Disease states (cancer, neurological, cardiac)
  • Tissue/cell types (brain, immune, developmental)
  • Environmental conditions (stress, treatment response)

4. Autonomous Validation

Agentic AI workflows:

  • 🔬 Validate with literature, RNA-seq, and clinical databases
  • 📚 Research biological context and functional impact
  • 🧠 Synthesize evidence from multiple sources
  • 🔄 Iterate through multi-agent pipelines


🎯 Key Features

🧬 Base Layer β€” Pluggable Splice Prediction

| Component | Description |
| --- | --- |
| BasePredictor Protocol | Single contract that makes the base layer model-agnostic: per-nucleotide 3-class scores (neither / acceptor / donor) aligned to genomic positions. Any predictor satisfying this protocol can be registered and served. |
| Plugin Registry | Decorator-based in-process registration for built-ins + YAML manifest (configs/predictors.yaml) for foundation-model-derived checkpoints and external models. Adding a predictor requires no downstream code changes. |
| Classical Models | SpliceAI (TF, GRCh37/Ensembl) and OpenSpliceAI (PyTorch, GRCh38/MANE), wrapped as thin adapters over the existing BaseModelRunner. |
| Foundation-Model-Derived Predictors | SpliceBERT + dilated-CNN classifier head (trained via examples/foundation_models/07a) registered in the same catalog as classical models. Frozen-head and end-to-end fine-tuning pipelines for SpliceBERT, Evo2, HyenaDNA under foundation_models/. |
| 10-Modality Feature Fusion | 116 features across base scores, conservation, epigenetics, chromatin accessibility (ATAC-seq + DNase-seq), junction reads, RBP binding, DNA sequence, genomic context, annotations, and optional foundation model embeddings (Evo2, SpliceBERT); see Feature Catalog |
| YAML-Driven Configs | 4 profiles (default, full_stack, isoform_discovery, meta_m3_novel); add or drop modalities per modeling objective |

🧠 Meta Layer β€” Context-Aware Adaptive Prediction

| Component | Description |
| --- | --- |
| M1-M4 Model Variants | Four progressively harder tasks: canonical (M1-S), alternative (M2-S), novel discovery (M3-S), perturbation-induced (M4-S); see Model Variants and Naming Convention |
| Position-Level (M*-P) | XGBoost baseline with Tree SHAP (M1-P: 99.74% accuracy, PR-AUC 0.999, FN -62% / FP -68% vs base-only) |
| Sequence-Level (M*-S) | 2-stream dilated CNN (367K params) with logit-space residual blend and per-class learned temperature (M1-S: 99.99% accuracy, PR-AUC 0.9954, FPs -15.5% vs base); same I/O protocol as base models |
| M2 Series | Alternative splice site detection: on Eval-Ensembl-Alt, M2-S achieves PR-AUC 0.965 on alternative sites (base: 0.749); see M2 Formulations and Naming Convention |
| Variant Effect (M4) | Per-variant delta scoring with splice consequence prediction, validated on 13 disease-gene variants; cryptic site positions match RNA-seq within 2 bp; see examples/variant_analysis/ |
| Smart Checkpointing | Per-chromosome parquet saves, disk-backed gene cache, HDF5 shard packing, --resume support |
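Per-variant delta scoring can be sketched as a comparison of reference and alternate score tracks, in the spirit of the acceptor/donor gain and loss scores that SpliceAI reports. The function below is an illustrative assumption, not the project's scorer: it handles only same-length inputs (substitutions), and the score tracks are made-up numbers showing a weakened donor next to a newly created cryptic donor.

```python
from typing import Dict, List, Tuple

# One (neither, acceptor, donor) triple per position in the scored window.
Scores = List[Tuple[float, float, float]]


def delta_scores(ref: Scores, alt: Scores) -> Dict[str, Tuple[int, float]]:
    """Return, per class, the position and size of the largest gain and loss."""
    if len(ref) != len(alt):
        raise ValueError("ref and alt windows must have the same length")
    result: Dict[str, Tuple[int, float]] = {}
    for label, column in (("acceptor", 1), ("donor", 2)):
        deltas = [a[column] - r[column] for r, a in zip(ref, alt)]
        gain_pos = max(range(len(deltas)), key=deltas.__getitem__)
        loss_pos = min(range(len(deltas)), key=deltas.__getitem__)
        result[f"{label}_gain"] = (gain_pos, deltas[gain_pos])
        result[f"{label}_loss"] = (loss_pos, deltas[loss_pos])
    return result


# Illustrative tracks: the annotated donor at index 1 weakens,
# and a cryptic donor appears at index 3.
ref_track: Scores = [(1.0, 0.0, 0.0), (0.1, 0.0, 0.9), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
alt_track: Scores = [(1.0, 0.0, 0.0), (0.8, 0.0, 0.2), (1.0, 0.0, 0.0), (0.3, 0.0, 0.7)]

deltas = delta_scores(ref_track, alt_track)
```

Reporting the position alongside each delta is what allows a predicted cryptic site to be compared against RNA-seq junction coordinates.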

🤖 NEW: Agentic Workflow Enhancements

| Feature | Description |
| --- | --- |
| Literature Validation Agent | Cross-reference predictions with PubMed, arXiv, and splicing databases |
| Expression Evidence Agent | Query GTEx, ENCODE, and tissue-specific expression data |
| Clinical Annotation Agent | Check ClinVar, SpliceVarDB, and disease associations |
| Research Report Generator | Comprehensive PDF reports with citations and biological context |
| Self-Improving Pipeline | Learn from validation feedback to refine predictions |

📊 Splice Analysis Tools

  • 🧬 Domain-Specific Analysis - Predefined templates for common splice site analyses
  • 🤖 AI-Powered Insights - LLM-generated visualizations with biological context
  • 📊 Publication-Ready Charts - High-quality plots using matplotlib/seaborn
  • 🔬 Exploratory Research - Ask custom questions about your splice site data
  • 🚀 REST API - FastAPI service for integration with other tools

📚 Nexus Research Agent

  • Literature Search - Automated research on splicing mechanisms
  • Research Reports - Comprehensive reports with LaTeX equations and citations
  • Multi-Source Integration - arXiv, PubMed, Europe PMC, Wikipedia
  • Publication-Quality Output - PDF generation with proper formatting
  • Iterative Refinement - Multi-agent pipeline (Planner → Researcher → Writer → Editor)
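The Planner → Researcher → Writer → Editor flow can be pictured as a chain of stages that each read and extend a shared state. The sketch below uses plain functions as placeholders; in the real agent each stage would presumably call an LLM or a literature source. All function names and state keys are hypothetical and do not reflect the Nexus codebase.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Stage = Callable[[State], State]


def planner(state: State) -> State:
    """Turn the topic into a section outline."""
    topic = state["topic"]
    outline = [f"Background on {topic}", "Key findings", "Open questions"]
    return {**state, "outline": outline}


def researcher(state: State) -> State:
    """Attach placeholder notes to every outline section."""
    notes = {s: f"[notes gathered for: {s}]" for s in state["outline"]}
    return {**state, "notes": notes}


def writer(state: State) -> State:
    """Assemble a draft from the outline and notes."""
    paragraphs = [f"{s}\n{state['notes'][s]}" for s in state["outline"]]
    return {**state, "draft": "\n\n".join(paragraphs)}


def editor(state: State) -> State:
    """Apply a final clean-up pass to the draft."""
    return {**state, "final": state["draft"].strip() + "\n"}


def run_pipeline(topic: str, stages: List[Stage]) -> State:
    """Run the stages in order, threading the state through each one."""
    state: State = {"topic": topic}
    for stage in stages:
        state = stage(state)
    return state


report = run_pipeline(
    "U1 snRNP splice site recognition",
    [planner, researcher, writer, editor],
)
```

Keeping each stage a pure function of the state makes it straightforward to rerun a single stage, which is one way an iterative refinement loop could be built on top.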

🏗️ Architecture

| Layer | Purpose | Output | Status |
| --- | --- | --- | --- |
| Base Layer | Canonical splice prediction (MANE) | Baseline scores for ~10% of sites | ✅ Complete |
| Feature Engineering | Multimodal evidence fusion | 116 feature columns (10 modalities) | ✅ Complete |
| Foundation Models | Evo2/SpliceBERT classification | Per-nucleotide embeddings | 🔬 Experimental |
| Meta Layer | Context-aware prediction (M1-M4) | Novel sites (90% beyond MANE) | 🔄 Active |
| Agentic Layer | Multi-source validation + reports | Validated isoforms + drug targets | 📋 Planned |

See: Architecture: Multi-Layer Pipeline for the full diagram, directory structure, and delta score analysis


🚀 Quick Start

Bioinformatics Lab UI

Interactive web tools for splice site analysis:

# Start the Bioinformatics Lab (port 8005)
mamba run -n agentic-spliceai python -m server.bio.app
# Browse: http://localhost:8005/

Pages: Gene Browser (/) | Genome View (/genome/{gene}) | Metrics Dashboard (/metrics)

Splice Analysis

Option 1: REST API Service (Recommended)

Start the service:

# Splice prediction API (port 8004)
agentic-spliceai-server

# Or run directly:
mamba run -n agentic-spliceai python -m server.splice_service.splice_service

Example API call:

curl -X POST http://localhost:8004/analyze/template \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_path": "data/splice_sites_enhanced.tsv",
    "analysis_type": "high_alternative_splicing",
    "model": "gpt-4o-mini"
  }'
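The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl call above; it assumes the service is running locally on port 8004 and that it answers with a JSON body, which the source does not confirm. `build_template_request` and `send` are illustrative helper names.

```python
import json
import urllib.request


def build_template_request(base_url: str = "http://localhost:8004") -> urllib.request.Request:
    """Build the POST request for /analyze/template without sending it."""
    payload = {
        "dataset_path": "data/splice_sites_enhanced.tsv",
        "analysis_type": "high_alternative_splicing",
        "model": "gpt-4o-mini",
    }
    return urllib.request.Request(
        url=f"{base_url}/analyze/template",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request; assumes the response body is JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_template_request()
# With the service running: result = send(request)
```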

Option 2: Python Library

from agentic_spliceai import create_dataset
from agentic_spliceai.splice_analysis import generate_analysis_insight
from openai import OpenAI

# Load dataset
dataset = create_dataset("data/splice_sites_enhanced.tsv")

# Generate analysis
client = OpenAI()
result = generate_analysis_insight(
    dataset=dataset,
    analysis_type="high_alternative_splicing",
    client=client,
    model="gpt-4o-mini"
)

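# Save the generated code so it can be reviewed before it runs
with open("analysis.py", "w") as f:
    f.write(result["chart_code"])

# Execute to generate the chart.
# Caution: chart_code is LLM-generated; exec() runs it with full privileges in this process.
exec(result["chart_code"])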
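Since the chart code is produced by an LLM, one option is to run it in a separate interpreter process instead of calling exec() in the current one. The helper below is a sketch of that idea using only the standard library; `run_generated_code` is a hypothetical name, and a child process gives isolation from the caller's variables and a timeout, not a security sandbox.

```python
import ast
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(code: str, timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Syntax-check generated code, then run it in a child Python process."""
    ast.parse(code)  # raises SyntaxError before anything is executed
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "analysis.py"
        script.write_text(code, encoding="utf-8")
        # Run inside the temporary directory; any files the script writes
        # there are removed when the context manager exits.
        return subprocess.run(
            [sys.executable, str(script)],
            cwd=tmp,
            capture_output=True,
            text=True,
            timeout=timeout,
        )


completed = run_generated_code("print(2 + 3)")
```

In practice the working directory would need to be one where the chart output should persist, so `cwd` is a choice to adapt.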

Nexus Research Agent

Generate comprehensive research reports on splicing topics:

# Generate research report on splicing mechanisms
nexus "Alternative Splicing Mechanisms in Cancer" --pdf

# Research specific splicing topics
nexus "SpliceAI Deep Learning Architecture" \
  --model openai:gpt-4o \
  --length comprehensive

# Quick literature review
nexus "Recent advances in splice site prediction" \
  --model openai:gpt-4o-mini \
  --length brief

Python API:

from nexus.agents.research import ResearchAgent
from nexus.core.config import Config

# Initialize research agent
config = Config()
agent = ResearchAgent(config)

# Generate research report
result = agent.research(
    topic="Splice Site Recognition by U1 snRNP",
    length="standard",
    generate_pdf=True
)

print(f"Report saved to: {result['output_path']}")

Use Cases:

  • Research latest splicing mechanisms before analysis
  • Generate literature reviews for grant proposals
  • Stay updated on splice prediction methods
  • Validate analysis approaches with current research
  • Generate comprehensive background sections

Foundation Model Experiments

Explore foundation model embeddings for splice site prediction (experimental sub-project):

# Check hardware feasibility
python examples/foundation_models/01_resource_check.py

# Run full pipeline with synthetic data (no GPU needed, <30s)
python examples/foundation_models/02_synthetic_training_pipeline.py

# Orchestrate real pipeline (dry-run first)
python examples/foundation_models/05_run_pipeline.py --dry-run
python examples/foundation_models/05_run_pipeline.py --local-only  # synthetic data
python examples/foundation_models/05_run_pipeline.py --execute      # real GPU (costs $)

Cloud deployment (SkyPilot + RunPod):

# Extract Evo2 embeddings on A40 GPU
sky launch foundation_models/configs/skypilot/extract_embeddings_a40.yaml

# Train exon classifier
sky launch foundation_models/configs/skypilot/train_classifier_a40.yaml

Hardware requirements:

| Task | M1 Mac (16GB) | A40 (48GB) | A100 (80GB) |
| --- | --- | --- | --- |
| Evo2 7B embeddings | ~100 bp/s (INT8) | ~10K bp/s | ~10K bp/s |
| Classifier training | CPU only | Full precision | Full precision |
| Evo2 40B | Not feasible | Tight | Comfortable |
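To put the throughput figures in perspective, a back-of-the-envelope estimate for a single chromosome helps. The calculation assumes every base is embedded exactly once with no window overlap, and uses roughly 46.7 Mbp as the length of chromosome 21 in GRCh38; both are simplifications, and `embedding_hours` is an illustrative helper.

```python
def embedding_hours(n_bases: int, bases_per_second: float) -> float:
    """Wall-clock hours needed to embed n_bases at a constant throughput."""
    return n_bases / bases_per_second / 3600


CHR21_BP = 46_700_000  # approximate GRCh38 chr21 length (assumption for the estimate)

mac_hours = embedding_hours(CHR21_BP, 100)      # ~100 bp/s on an M1 Mac (INT8)
a40_hours = embedding_hours(CHR21_BP, 10_000)   # ~10K bp/s on an A40

print(f"M1 Mac: {mac_hours:.0f} h ({mac_hours / 24:.1f} days)")  # 130 h (5.4 days)
print(f"A40:    {a40_hours:.1f} h")                               # 1.3 h
```

The hundredfold gap is why the local path is suited to small regions and synthetic data, while chromosome-scale extraction is better left to a GPU.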

See: foundation_models/README.md for detailed setup

Splice Site Prediction

Predict splice sites using state-of-the-art models:

# CLI: Predict for genes
agentic-spliceai-predict --genes BRCA1 TP53 UNC13A

# CLI: Predict for chromosome
agentic-spliceai-predict --chromosomes 21 --base-model openspliceai

Python API:

import polars as pl

from agentic_spliceai.splice_engine import predict_splice_sites

# Simple prediction
results = predict_splice_sites(genes=["BRCA1", "TP53"])
positions = results["positions"]

# High-confidence donor sites (donor score above 0.9)
high_conf = positions.filter(pl.col("donor_score") > 0.9)

Use Cases:

  • Predict splice sites for genes of interest
  • Genome-wide splice site analysis
  • Validate predictions against annotations
  • Generate training data for meta-models

See: Splice Prediction Guide for complete documentation

📦 Installation

Prerequisites

# Python 3.12 (requires >= 3.11)
python --version

# Create environment for agentic-spliceai
mamba env create -f environment.yml
mamba activate agentic-spliceai

Install Dependencies

# Install package in development mode
cd agentic-spliceai
pip install -e ".[dev]"

Note: This project is designed to run independently with its own environment and dependencies.

Set Up Environment

# Copy environment template
cp .env.example .env

# Edit .env and add your OpenAI API key
# OPENAI_API_KEY=sk-...

See: Architecture for full directory structure; API Endpoints for REST API reference, configuration, and data format

🎓 Learning Resources

Examples (Progressive Learning Paths)

Base Layer (examples/base_layer/), 5 scripts:

  1. Single gene prediction → 2. Chromosome prediction → 3. Evaluation → 4. Chunked workflows → 5. Genome precomputation

Feature Engineering (examples/features/), 4 scripts:

  1. Base score features (43 columns) → 2. Multi-modal (annotation + genomic) → 3. Configurable modalities → 4. Genome-scale workflow

Foundation Models (examples/foundation_models/), 5 scripts:

  1. Hardware feasibility check → 2. Synthetic pipeline (no GPU) → 3. Evo2 embedding extraction → 4. Classifier training → 5. End-to-end orchestrator

Data Preparation (examples/data_preparation/): ground truth generation, data validation

Related Projects

  • Meta-SpliceAI - Original research implementation with base and meta layers
  • Agentic AI Lab - Nexus Research Agent and agentic workflows

🀝 Contributing

Splice Agent is designed to be extensible. Contributions welcome!

Add new analysis templates:

  1. Add template to splice_analysis.py::ANALYSIS_TEMPLATES
  2. Include SQL query, chart prompt, and biological context
  3. Test with sample data
  4. Submit PR
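As a rough picture of what a template entry might contain, the sketch below bundles the three pieces named in the steps above: a SQL query, a chart prompt, and biological context. The dictionary keys, the table and column names in the query, and the `missing_fields` helper are all hypothetical; check ANALYSIS_TEMPLATES in splice_analysis.py for the actual schema before copying anything.

```python
from typing import Dict, List

# Hypothetical field names; the real template schema may differ.
REQUIRED_FIELDS = ("sql_query", "chart_prompt", "biological_context")

sites_per_gene_template: Dict[str, str] = {
    # Table and column names are placeholders for illustration.
    "sql_query": (
        "SELECT gene_name, COUNT(*) AS n_sites "
        "FROM splice_sites GROUP BY gene_name "
        "ORDER BY n_sites DESC LIMIT 20"
    ),
    "chart_prompt": "Horizontal bar chart of the 20 genes with the most splice sites.",
    "biological_context": (
        "Genes with many annotated splice sites are candidates for "
        "extensive alternative splicing."
    ),
}


def missing_fields(template: Dict[str, str]) -> List[str]:
    """List required fields that are absent or blank in a template."""
    return [f for f in REQUIRED_FIELDS if not str(template.get(f, "")).strip()]


problems = missing_fields(sites_per_gene_template)
```

A small check like `missing_fields` is a convenient thing to run in step 3 before opening the PR.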

Add new data sources:

  1. Implement ChartDataset interface in data_access.py
  2. Add format detection logic
  3. Test with real data
  4. Submit PR

📄 License

MIT License - see LICENSE file for details

🚀 Roadmap

| Phase | Description | Status |
| --- | --- | --- |
| 1-3 | Base Layer + Data Prep + Workflows | ✅ Complete |
| 2.5 | Bioinformatics Lab UI | ✅ Complete |
| 4 | Feature Engineering (10 modalities, 116 columns) | ✅ Complete |
| 5 | Foundation Models (Evo2, SpliceBERT) | 🔬 Experimental |
| 6 | Meta Layer: M1-S (PR-AUC 0.9954), M2-S trained (PR-AUC 0.965 alt sites) | 🔄 Active |
| 7 | Agentic Validation Layer | 📋 Planned |
| 8 | Variant Analysis: Phase 1A+1B done, ClinVar + saturation scan next | 🔄 Active |
| 9 | Isoform Discovery | 🎯 Ultimate Goal |

See:

  • Full Roadmap: detailed phase breakdowns, deliverables, success metrics
  • Application Ledger: maturity-tracked view of what currently runs (complements the phase-level roadmap)

Ready to analyze splice sites? Start with the Quick Start guide or explore the documentation!
