Context-Aware Novel Isoform Discovery for Drug Target Identification
Agentic-SpliceAI is an agentic AI system with hierarchical multi-task prediction for discovering novel RNA isoforms: disease-specific, variant-induced, and tissue-specific splice variants that go beyond canonical annotations. Originally refactored from Meta-SpliceAI, it has evolved into a self-contained compound AI system with extensible foundation model predictors, multimodal evidence fusion, agentic AI validation, and a meta-learning framework (M1-M4) targeting progressively harder splice prediction problems.
The system combines three key architectural ideas:
- Multi-task learning: Shared multimodal representation (10 modalities, 116 features) with task-specific model heads (M1-M4)
- Hierarchical prediction: M1 (canonical) → M2 (alternative) → M3 (novel discovery) → M4 (perturbation-induced); each level tackles a harder problem with different label regimes
- Agentic validation: LLM-powered agents for literature mining, expression evidence, clinical interpretation, and recursive self-improvement
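The first idea, a shared representation feeding task-specific heads, can be sketched in a few lines. This is an illustrative toy rather than the project's implementation: the class `SharedTrunkModel`, the hidden size, and the random weights are assumptions. Only the 116-feature input, the M1-M4 task names, and the three output classes come from the description above.

```python
import math
import random

TASKS = ("M1", "M2", "M3", "M4")            # task names from the text
CLASSES = ("neither", "acceptor", "donor")  # per-nucleotide output classes


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class SharedTrunkModel:
    """Hypothetical toy: one shared encoder, one small head per task."""

    def __init__(self, n_features=116, hidden=8, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.trunk = (init(hidden, n_features), [0.0] * hidden)
        self.heads = {task: (init(len(CLASSES), hidden), [0.0] * len(CLASSES))
                      for task in TASKS}

    def encode(self, features):
        """Shared representation (ReLU of a dense layer), reused by every head."""
        weights, bias = self.trunk
        return [max(0.0, h) for h in linear(weights, bias, features)]

    def predict(self, features, task):
        """Class probabilities from the head that belongs to `task`."""
        weights, bias = self.heads[task]
        probs = softmax(linear(weights, bias, self.encode(features)))
        return dict(zip(CLASSES, probs))


model = SharedTrunkModel()
features = [0.5] * 116
predictions = {task: model.predict(features, task) for task in TASKS}
```

Every head reads the same encoded vector, so adding a task in this sketch means adding one entry to `heads` without touching the encoder.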
Conceptual workflow schematic (generated by Google Nano Banana 2)
Precise workflow diagram (Mermaid)
```mermaid
graph TD
    %% Color scheme matching the 5-layer workflow bands
    classDef input fill:#e1f5fe,stroke:#0288d1,stroke-width:2px,color:#1a1a1a
    classDef base fill:#f3e5f5,stroke:#8e24aa,stroke-width:2px,color:#1a1a1a
    classDef meta fill:#fff3e0,stroke:#f57c00,stroke-width:2px,color:#1a1a1a
    classDef agents fill:#e8f5e9,stroke:#388e3c,stroke-width:2px,color:#1a1a1a
    classDef output fill:#ffebee,stroke:#d32f2f,stroke-width:2px,color:#1a1a1a

    %% Layer 1: Input & Preparation
    subgraph L1 ["① INPUT & PREPARATION"]
        direction LR
        A1[(Genomic Data<br/>GTF, FASTA, MANE)]:::input
        A2[(Variant Data<br/>VCF)]:::input
        B["Data Preparation Pipeline (CLI)"]:::input
        A1 --> B
        A2 --> B
    end

    %% Layer 2: Base Prediction Layer
    subgraph L2 ["② BASE PREDICTION LAYER"]
        direction LR
        C[Canonical Predictions<br/>MANE Baseline]:::base
        D[Splice-Site Prediction Engines<br/>Pluggable Architecture]:::base
        D1[SpliceAI /<br/>OpenSpliceAI]:::base
        D2[Foundation Models<br/>e.g. Evo2 Fine-Tuning]:::base
        D3[Future<br/>Predictors]:::base
        C --> D
        D --> D1
        D --> D2
        D --> D3
    end

    %% Layer 3: Meta Layer Integration
    subgraph L3 ["③ META LAYER INTEGRATION"]
        direction LR
        E1[Variant Context]:::meta
        E2[Disease Context]:::meta
        E3[Tissue Context]:::meta
        F[Multimodal Deep Learning<br/>Meta-Models]:::meta
        G[Context-Aware Adaptive Predictions<br/>& Novel Site Detection]:::meta
        E1 --> F
        E2 --> F
        E3 --> F
        F --> G
    end

    %% Layer 4: Agentic Workflow Layer
    subgraph L4 ["④ AGENTIC WORKFLOW LAYER"]
        direction LR
        H{Autonomous AI<br/>Agent Orchestrator}:::agents
        I[Literature Agent<br/>PubMed, arXiv]:::agents
        J[Expression Agent<br/>GTEx, TCGA]:::agents
        K[Clinical Agent<br/>ClinVar, COSMIC]:::agents
        L[Conservation Agent<br/>PhyloP]:::agents
        M[Structural Agent<br/>AlphaFold, Foldseek]:::agents
        H -->|Evidence Mining| I
        H -->|RNA-seq Validation| J
        H -->|Disease Mapping| K
        H -->|Cross-species| L
        H -->|Structure Prediction| M
        I --> S
        J --> S
        K --> S
        L --> S
        M --> S
        S[Nexus Research Agent<br/>Comprehensive Reports]:::agents
    end

    %% Layer 5: Outcomes & Discovery
    subgraph L5 ["⑤ OUTCOMES & DISCOVERY"]
        direction LR
        N1[Novel Isoform Discovery<br/>Drug Targets]:::output
        N2["Clinical-Grade Variant<br/>Interpretation (VUS)"]:::output
        N3[Tissue-Specific<br/>Biomarkers]:::output
    end

    %% Inter-layer flow
    B --> D
    D1 --> F
    D2 --> F
    D3 --> F
    G --> H

    %% Self-improvement feedback loop
    S -.->|Self-Improvement Feedback| F

    %% Workflow to outcomes
    S --> N1
    S --> N2
    S --> N3
```
The Challenge: Current gene annotations (MANE, RefSeq) capture only ~10% of biologically active splice sites. The remaining ~90% includes:
- Disease-specific isoforms (cancer, neurological, cardiac)
- Variant-induced splicing (pathogenic mutations, VUS)
- Tissue-specific isoforms (brain, immune, developmental)
- Druggable novel targets (oncogenes, splice modulators)
Our Solution: Context-aware adaptive prediction through multimodal meta-learning discovers novel isoforms:
```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#1e3a8a','primaryTextColor':'#fff','lineColor':'#3b82f6','fontSize':'15px','fontFamily':'ui-sans-serif, system-ui, sans-serif'}}}%%
graph LR
    A["<b>Annotations</b><br/>MANE/RefSeq<br/><i>~10% of sites</i>"]:::canonical
    B["<b>Base Layer</b><br/>SpliceAI • OpenSpliceAI<br/>Evo2 • SpliceBERT"]:::foundation
    C["<b>Meta Layer</b><br/>10-Modality Fusion<br/>M1-M4 Models"]:::metalayer
    D["<b>Discovery</b><br/>Delta Scoring<br/>Isoform Assembly"]:::discovery
    E["<b>Agentic Layer</b><br/>Literature • RNA-seq<br/>Clinical Validation"]:::agentic
    F["<b>Novel Isoform<br/>Catalog</b><br/>Drug Targets<br/>Precision Medicine"]:::output
    A --> B --> C --> D --> E --> F
    classDef canonical fill:#1e3a8a,stroke:#1e40af,stroke-width:2px,color:#ffffff
    classDef foundation fill:#0891b2,stroke:#0e7490,stroke-width:2px,color:#ffffff
    classDef metalayer fill:#7c3aed,stroke:#6d28d9,stroke-width:2px,color:#ffffff
    classDef discovery fill:#059669,stroke:#047857,stroke-width:2px,color:#ffffff
    classDef agentic fill:#dc2626,stroke:#b91c1c,stroke-width:2px,color:#ffffff
    classDef output fill:#d97706,stroke:#b45309,stroke-width:3px,color:#ffffff
```
Key Innovation: The Foundation-Adaptor Framework uses multimodal deep learning to refine foundation model predictions with context, targeting the ~90% of splice sites that lie beyond canonical annotations.
Traditional Approach:
- Target canonical proteins
- Miss disease-specific isoforms
- Limited therapeutic options
Agentic-SpliceAI Approach:
- Discover disease-specific isoforms
- Identify druggable splice variants
- Enable isoform-selective therapeutics
- Aim to expand the druggable genome by 10-100x
Model-agnostic splice prediction: any predictor can plug into the base layer, whether classical (SpliceAI, OpenSpliceAI), foundation-model-derived (SpliceBERT + trained classifier head), or a future model, as long as its output satisfies the per-nucleotide 3-class scoring protocol (neither / acceptor / donor).
- `BasePredictor` protocol + plugin registry: decorator for built-ins, YAML manifest for external/foundation-model-derived predictors
- Same CLI (`agentic-spliceai-base`), same downstream consumers: swapping a predictor is a registration concern, not an integration one
- Three predictors currently registered; the set is open-ended. Adding a new predictor requires no changes to meta layer, variant analysis, or any other downstream application
- See `src/agentic_spliceai/applications/base_layer/protocol.py` for the protocol and `examples/foundation_models/` for prototype trainers (SpliceBERT, Evo2)
Multimodal deep learning: Refine predictions using context-aware meta-models
- Foundation: Base model predictions (canonical knowledge)
- Adaptor: Multimodal feature fusion (base scores, conservation, epigenetic marks, chromatin accessibility, RNA-seq junction evidence, RBP eCLIP binding, DNA sequence, genomic context, gene annotations); see Feature Catalog and `examples/features/`
- Context embedding: Patient variants, disease state, tissue type
- Self-improvement: Learn from validation feedback continuously
Beyond static annotations: Discover isoforms specific to:
- Patient genetic backgrounds (variant-induced splicing)
- Disease states (cancer, neurological, cardiac)
- Tissue/cell types (brain, immune, developmental)
- Environmental conditions (stress, treatment response)
Agentic AI workflows:
- Validate with literature, RNA-seq, and clinical databases
- Research biological context and functional impact
- Synthesize evidence from multiple sources
- Iterate through multi-agent pipelines
See:
- Applications: public ledger of matured application bundles (maturity dashboard, driving examples, evaluation)
- Use Cases (From Discovery to Therapeutics): translational pathway, clinical scenarios, drug discovery impact
| Component | Description |
|---|---|
| `BasePredictor` Protocol | Single contract that makes the base layer model-agnostic: per-nucleotide 3-class scores (neither / acceptor / donor) aligned to genomic positions. Any predictor satisfying this protocol can be registered and served. |
| Plugin Registry | Decorator-based in-process registration for built-ins + YAML manifest (`configs/predictors.yaml`) for foundation-model-derived checkpoints and external models. Adding a predictor requires no downstream code changes. |
| Classical Models | SpliceAI (TF, GRCh37/Ensembl), OpenSpliceAI (PyTorch, GRCh38/MANE), wrapped as thin adapters over the existing `BaseModelRunner`. |
| Foundation-Model-Derived Predictors | SpliceBERT + dilated-CNN classifier head (trained via `examples/foundation_models/07a`) registered in the same catalog as classical models. Frozen-head and end-to-end fine-tuning pipelines for SpliceBERT, Evo2, HyenaDNA under `foundation_models/`. |
| 10-Modality Feature Fusion | 116 features across base scores, conservation, epigenetics, chromatin accessibility (ATAC-seq + DNase-seq), junction reads, RBP binding, DNA sequence, genomic context, annotations, and optional foundation model embeddings (Evo2, SpliceBERT); see Feature Catalog |
| YAML-Driven Configs | 4 profiles (`default`, `full_stack`, `isoform_discovery`, `meta_m3_novel`): add/drop modalities per modeling objective |
| Component | Description |
|---|---|
| M1-M4 Model Variants | Four progressively harder tasks: canonical (M1-S), alternative (M2-S), novel discovery (M3-S), perturbation-induced (M4-S); see Model Variants and Naming Convention |
| Position-Level (M\*-P) | XGBoost baseline with Tree SHAP (M1-P: 99.74% accuracy, PR-AUC 0.999, FN -62% / FP -68% vs base-only) |
| Sequence-Level (M\*-S) | 2-stream dilated CNN (367K params) with logit-space residual blend and per-class learned temperature (M1-S: 99.99% accuracy, PR-AUC 0.9954, FPs -15.5% vs base); same I/O protocol as base models |
| M2 Series | Alternative splice site detection: Eval-Ensembl-Alt shows M2-S achieves PR-AUC 0.965 on alternative sites (base: 0.749); see M2 Formulations and Naming Convention |
| Variant Effect (M4) | Per-variant delta scoring with splice consequence prediction, validated on 13 disease-gene variants; cryptic site positions match RNA-seq within 2 bp; see `examples/variant_analysis/` |
| Smart Checkpointing | Per-chromosome parquet saves, disk-backed gene cache, HDF5 shard packing, --resume support |
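
The "logit-space residual blend and per-class learned temperature" named in the sequence-level row can be read in more than one way. The sketch below shows one plausible form, stated as an assumption: the meta-model's correction is added to the log of the base probabilities, each class logit is divided by its own temperature, and a softmax renormalizes. The function name `residual_blend` and the exact formula are illustrative and not taken from the project's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def residual_blend(base_probs, residual_logits, temperatures, eps=1e-8):
    """Add a correction to the base model's logits, then rescale per class.

    base_probs:      (neither, acceptor, donor) probabilities from the base model
    residual_logits: correction predicted by the meta-model (assumed form)
    temperatures:    one positive temperature per class (assumed form)
    """
    base_logits = [math.log(max(p, eps)) for p in base_probs]
    blended = [(b + r) / t
               for b, r, t in zip(base_logits, residual_logits, temperatures)]
    return softmax(blended)


base = [0.90, 0.07, 0.03]

# Zero residual with unit temperatures reproduces the base prediction.
unchanged = residual_blend(base, [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])

# A positive donor residual shifts probability mass toward the donor class.
boosted = residual_blend(base, [0.0, 0.0, 2.0], [1.0, 1.0, 1.0])
```

A useful property of this form is that an untrained correction (all zeros) leaves the base model's output untouched, so the meta-model starts from the base prediction rather than from scratch.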
| Feature | Description |
|---|---|
| Literature Validation Agent | Cross-reference predictions with PubMed, arXiv, and splicing databases |
| Expression Evidence Agent | Query GTEx, ENCODE, and tissue-specific expression data |
| Clinical Annotation Agent | Check ClinVar, SpliceVarDB, and disease associations |
| Research Report Generator | Comprehensive PDF reports with citations and biological context |
| Self-Improving Pipeline | Learn from validation feedback to refine predictions |
- Domain-Specific Analysis - Predefined templates for common splice site analyses
- AI-Powered Insights - LLM-generated visualizations with biological context
- Publication-Ready Charts - High-quality plots using matplotlib/seaborn
- Exploratory Research - Ask custom questions about your splice site data
- REST API - FastAPI service for integration with other tools
- Literature Search - Automated research on splicing mechanisms
- Research Reports - Comprehensive reports with LaTeX equations and citations
- Multi-Source Integration - arXiv, PubMed, Europe PMC, Wikipedia
- Publication-Quality Output - PDF generation with proper formatting
- Iterative Refinement - Multi-agent pipeline (Planner → Researcher → Writer → Editor)
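
The four-stage pipeline in the last bullet can be pictured as a chain of functions that each enrich a shared state. The sketch below is a stubbed illustration only: the stage functions, the `State` dictionary, and its keys are hypothetical, and no LLM or literature API is called.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Stage = Callable[[State], State]


def planner(state: State) -> State:
    """Break the topic into sub-questions (stubbed)."""
    topic = state["topic"]
    plan = [f"Background on {topic}", f"Open questions in {topic}"]
    return {**state, "plan": plan}


def researcher(state: State) -> State:
    """Collect notes per planned item; a real agent would query literature here."""
    notes = {item: f"[notes for: {item}]" for item in state["plan"]}
    return {**state, "notes": notes}


def writer(state: State) -> State:
    """Turn the notes into a draft, one section per planned item."""
    sections = [f"{item}\n{note}" for item, note in state["notes"].items()]
    return {**state, "draft": "\n\n".join(sections)}


def editor(state: State) -> State:
    """Final cleanup pass over the draft."""
    return {**state, "report": state["draft"].strip() + "\n"}


PIPELINE: List[Stage] = [planner, researcher, writer, editor]


def run_pipeline(topic: str) -> State:
    """Run every stage in order, passing the accumulated state along."""
    state: State = {"topic": topic}
    for stage in PIPELINE:
        state = stage(state)
    return state


result = run_pipeline("splice site recognition")
```

Because each stage returns a new state instead of mutating the old one, intermediate results such as the plan and notes stay available for inspection after the run.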
| Layer | Purpose | Output | Status |
|---|---|---|---|
| Base Layer | Canonical splice prediction (MANE) | Baseline scores for ~10% of sites | Complete |
| Feature Engineering | Multimodal evidence fusion | 116 feature columns (10 modalities) | Complete |
| Foundation Models | Evo2/SpliceBERT classification | Per-nucleotide embeddings | Experimental |
| Meta Layer | Context-aware prediction (M1-M4) | Novel sites (~90% beyond MANE) | Active |
| Agentic Layer | Multi-source validation + reports | Validated isoforms + drug targets | Planned |
See: Architecture (Multi-Layer Pipeline) for the full diagram, directory structure, and delta score analysis
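
Delta scoring, as used for variant effects, compares splice-site scores with and without a variant. The sketch below follows the widely used gain/loss convention (largest increase and decrease for acceptor and donor within a window) and is an assumption about the general technique, not a copy of the project's scorer. The function name `delta_scores` and the toy numbers are illustrative.

```python
def delta_scores(ref_scores, alt_scores):
    """Largest gain and loss per splice-site type between two aligned score tracks.

    ref_scores / alt_scores: lists of (acceptor, donor) probabilities, one pair
    per position in the same window (so this sketch covers substitutions only).
    Returns {event: (delta, window_index)}; the index is None when nothing changed.
    """
    if len(ref_scores) != len(alt_scores):
        raise ValueError("reference and alternate windows must be the same length")

    best = {"acceptor_gain": (0.0, None), "acceptor_loss": (0.0, None),
            "donor_gain": (0.0, None), "donor_loss": (0.0, None)}

    pairs = zip(ref_scores, alt_scores)
    for i, ((ref_acc, ref_don), (alt_acc, alt_don)) in enumerate(pairs):
        candidates = {
            "acceptor_gain": alt_acc - ref_acc,
            "acceptor_loss": ref_acc - alt_acc,
            "donor_gain": alt_don - ref_don,
            "donor_loss": ref_don - alt_don,
        }
        for event, delta in candidates.items():
            if delta > best[event][0]:
                best[event] = (delta, i)
    return best


# Toy window: the variant weakens the donor at index 1 and creates one at index 2.
ref = [(0.0, 0.0), (0.0, 0.95), (0.0, 0.0)]
alt = [(0.0, 0.0), (0.0, 0.10), (0.0, 0.80)]
result = delta_scores(ref, alt)
```

In the toy window the donor loss sits at index 1 and the donor gain at index 2, the pattern expected when a variant shifts a donor site by one base to a cryptic position.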
Interactive web tools for splice site analysis:
```bash
# Start the Bioinformatics Lab (port 8005)
mamba run -n agentic-spliceai python -m server.bio.app
# Browse: http://localhost:8005/
```

Pages: Gene Browser (`/`) | Genome View (`/genome/{gene}`) | Metrics Dashboard (`/metrics`)
Start the service:
```bash
# Splice prediction API (port 8004)
agentic-spliceai-server
# Or run directly:
mamba run -n agentic-spliceai python -m server.splice_service.splice_service
```

Access the API:
- Swagger UI: http://localhost:8004/docs
- API Root: http://localhost:8004
Example API call:
```bash
curl -X POST http://localhost:8004/analyze/template \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_path": "data/splice_sites_enhanced.tsv",
    "analysis_type": "high_alternative_splicing",
    "model": "gpt-4o-mini"
  }'
```

```python
from agentic_spliceai import create_dataset
from agentic_spliceai.splice_analysis import generate_analysis_insight
from openai import OpenAI

# Load dataset
dataset = create_dataset("data/splice_sites_enhanced.tsv")

# Generate analysis
client = OpenAI()
result = generate_analysis_insight(
    dataset=dataset,
    analysis_type="high_alternative_splicing",
    client=client,
    model="gpt-4o-mini",
)

# Save the generated code so it can be reviewed
with open("analysis.py", "w") as f:
    f.write(result["chart_code"])

# Execute to generate chart (this runs LLM-generated code; review analysis.py first)
exec(result["chart_code"])
```

Generate comprehensive research reports on splicing topics:
```bash
# Generate research report on splicing mechanisms
nexus "Alternative Splicing Mechanisms in Cancer" --pdf

# Research specific splicing topics
nexus "SpliceAI Deep Learning Architecture" \
  --model openai:gpt-4o \
  --length comprehensive

# Quick literature review
nexus "Recent advances in splice site prediction" \
  --model openai:gpt-4o-mini \
  --length brief
```

Python API:

```python
from nexus.agents.research import ResearchAgent
from nexus.core.config import Config

# Initialize research agent
config = Config()
agent = ResearchAgent(config)

# Generate research report
result = agent.research(
    topic="Splice Site Recognition by U1 snRNP",
    length="standard",
    generate_pdf=True,
)
print(f"Report saved to: {result['output_path']}")
```

Use Cases:
- Research latest splicing mechanisms before analysis
- Generate literature reviews for grant proposals
- Stay updated on splice prediction methods
- Validate analysis approaches with current research
- Generate comprehensive background sections
Explore foundation model embeddings for splice site prediction (experimental sub-project):
```bash
# Check hardware feasibility
python examples/foundation_models/01_resource_check.py

# Run full pipeline with synthetic data (no GPU needed, <30s)
python examples/foundation_models/02_synthetic_training_pipeline.py

# Orchestrate real pipeline (dry-run first)
python examples/foundation_models/05_run_pipeline.py --dry-run
python examples/foundation_models/05_run_pipeline.py --local-only   # synthetic data
python examples/foundation_models/05_run_pipeline.py --execute      # real GPU (costs $)
```

Cloud deployment (SkyPilot + RunPod):

```bash
# Extract Evo2 embeddings on A40 GPU
sky launch foundation_models/configs/skypilot/extract_embeddings_a40.yaml

# Train exon classifier
sky launch foundation_models/configs/skypilot/train_classifier_a40.yaml
```

Hardware requirements:
| Task | M1 Mac (16GB) | A40 (48GB) | A100 (80GB) |
|---|---|---|---|
| Evo2 7B embeddings | ~100 bp/s (INT8) | ~10K bp/s | ~10K bp/s |
| Classifier training | CPU only | Full precision | Full precision |
| Evo2 40B | Not feasible | Tight | Comfortable |
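
To put the throughput column in perspective, the following back-of-the-envelope calculation converts the approximate rates in the table into wall-clock time for one pass over chromosome 21. The chromosome length is the GRCh38 value; the estimate assumes a single strand with no window overlap or batching overhead, so real runs will differ.

```python
CHR21_LENGTH_BP = 46_709_983  # GRCh38 chromosome 21

# Approximate Evo2 7B embedding rates from the table above (bp per second).
THROUGHPUT_BP_PER_S = {
    "M1 Mac (INT8)": 100,
    "A40 / A100": 10_000,
}


def hours_to_embed(length_bp, bp_per_second):
    """Wall-clock hours for one pass at a constant rate."""
    return length_bp / bp_per_second / 3600


estimates = {device: hours_to_embed(CHR21_LENGTH_BP, rate)
             for device, rate in THROUGHPUT_BP_PER_S.items()}

for device, hours in estimates.items():
    print(f"{device}: {hours:.1f} h")
```

At these rates the laptop needs roughly 130 hours (over five days) for chromosome 21 alone, against about 1.3 hours on a data-center GPU, which is why embedding extraction is pushed to the cloud configs.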
See: foundation_models/README.md for detailed setup
Predict splice sites using state-of-the-art models:
```bash
# CLI: Predict for genes
agentic-spliceai-predict --genes BRCA1 TP53 UNC13A

# CLI: Predict for chromosome
agentic-spliceai-predict --chromosomes 21 --base-model openspliceai
```

Python API:

```python
import polars as pl

from agentic_spliceai.splice_engine import predict_splice_sites

# Simple prediction
results = predict_splice_sites(genes=["BRCA1", "TP53"])
positions = results["positions"]

# High-confidence predictions
high_conf = positions.filter(pl.col("donor_score") > 0.9)
```

Use Cases:
- Predict splice sites for genes of interest
- Genome-wide splice site analysis
- Validate predictions against annotations
- Generate training data for meta-models
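
For the "validate predictions against annotations" use case, a minimal sketch of the comparison is shown below. It works on plain Python sets of positions and is independent of the package's own evaluation scripts; the function `evaluate_sites`, the `tolerance` parameter, and the toy coordinates are assumptions made for illustration.

```python
def evaluate_sites(predicted, annotated, tolerance=0):
    """Precision and recall of predicted positions against annotated ones.

    A prediction counts as a hit when an unmatched annotated site lies within
    `tolerance` bases of it; each annotated site can be matched only once.
    """
    unmatched = set(annotated)
    true_positives = 0
    for pos in sorted(predicted):
        window = range(pos - tolerance, pos + tolerance + 1)
        match = next((site for site in window if site in unmatched), None)
        if match is not None:
            unmatched.discard(match)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return {"precision": precision, "recall": recall}


# Toy data: (position, donor_score) pairs and annotated donor positions.
scored = [(100, 0.97), (250, 0.40), (400, 0.93), (731, 0.95)]
annotated_donors = {100, 400, 900}

# Apply the same 0.9 cutoff as the high-confidence filter shown earlier.
called = {pos for pos, score in scored if score > 0.9}
metrics = evaluate_sites(called, annotated_donors)
```

Here two of the three called sites are annotated and two of the three annotated sites are recovered, so precision and recall are both 2/3; unannotated calls such as position 731 are the candidates a discovery workflow would follow up.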
See: Splice Prediction Guide for complete documentation
```bash
# Python 3.12 (requires >= 3.11)
python --version

# Create environment for agentic-spliceai
mamba env create -f environment.yml
mamba activate agentic-spliceai
```

```bash
# Install package in development mode
cd agentic-spliceai
pip install -e ".[dev]"
```

Note: This project is designed to run independently with its own environment and dependencies.

```bash
# Copy environment template
cp .env.example .env

# Edit .env and add your OpenAI API key
# OPENAI_API_KEY=sk-...
```

See: Architecture for full directory structure; API Endpoints for REST API reference, configuration, and data format
- Splice Prediction Guide - Complete prediction walkthrough
- Meta Layer Methods - Model variants (M1-M4), label hierarchy, annotation-driven prediction
- Base Layer Architecture - Architecture, coordinates, data preparation
- System Design - Architectural design documents
Base Layer (`examples/base_layer/`), 5 scripts:
1. Single gene prediction
2. Chromosome prediction
3. Evaluation
4. Chunked workflows
5. Genome precomputation

Feature Engineering (`examples/features/`), 4 scripts:
1. Base score features (43 columns)
2. Multi-modal (annotation + genomic)
3. Configurable modalities
4. Genome-scale workflow

Foundation Models (`examples/foundation_models/`), 5 scripts:
1. Hardware feasibility check
2. Synthetic pipeline (no GPU)
3. Evo2 embedding extraction
4. Classifier training
5. End-to-end orchestrator

Data Preparation (`examples/data_preparation/`): ground truth generation, data validation
- Meta-SpliceAI - Original research implementation with base and meta layers
- Agentic AI Lab - Nexus Research Agent and agentic workflows
Agentic-SpliceAI is designed to be extensible. Contributions welcome!
Add new analysis templates:
- Add template to `splice_analysis.py::ANALYSIS_TEMPLATES`
- Include SQL query, chart prompt, and biological context
- Test with sample data
- Submit PR
Add new data sources:
- Implement `ChartDataset` interface in `data_access.py`
- Add format detection logic
- Test with real data
- Submit PR
MIT License - see LICENSE file for details
- Originally refactored from Meta-SpliceAI
- Nexus Research Agent from Agentic AI Lab
- Foundation models: Evo2 (Arc Institute), SpliceAI (Illumina), OpenSpliceAI
- LLM-powered workflows via OpenAI and Anthropic APIs
- Inspired by genomics research community
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: barnettchiu@gmail.com
| Phase | Description | Status |
|---|---|---|
| 1-3 | Base Layer + Data Prep + Workflows | Complete |
| 2.5 | Bioinformatics Lab UI | Complete |
| 4 | Feature Engineering (10 modalities, 116 columns) | Complete |
| 5 | Foundation Models (Evo2, SpliceBERT) | Experimental |
| 6 | Meta Layer: M1-S (PR-AUC 0.9954), M2-S trained (PR-AUC 0.965 alt sites) | Active |
| 7 | Agentic Validation Layer | Planned |
| 8 | Variant Analysis: Phase 1A+1B done, ClinVar + saturation scan next | Active |
| 9 | Isoform Discovery | Ultimate Goal |
See:
- Full Roadmap: detailed phase breakdowns, deliverables, success metrics
- Application Ledger: maturity-tracked view of what currently runs (complements the phase-level roadmap)
Ready to analyze splice sites? Start with the Quick Start guide or explore the documentation!
