This document contains the project development outline and assignments, along with the associated timeline and project roadmap. It is intended as a tool for organizing project development and understanding what we need to meet our timeline.
This Gantt chart is a starting point for understanding our development timeline and serves as a roadmap for our development stages. It is a high-level chart showing the different parts of the project and how their development time can overlap. The chart is not yet finalized and is intended only as a starting point for discussing the relative timelines of tasks.
---
displayMode: compact
---
gantt
title Data Model-Based Ingestion Pipeline Roadmap
dateFormat YYYY-MM-DD
section Data Access
- :done, 2025-07-15, 0d
section 1
Preferred Data Access Deadline :crit, milestone, 2025-02-28, 1d
Critical Data Access Deadline :crit, milestone, 2025-03-15, 1d
Absolute Latest Data Access Deadline :crit, milestone, 2025-05-01, 1d
section Joint Working Group
Leads - Patrick, Corey :done, 2025-01-10, 0d
section 2
Development Setup & Documentation :done, 2025-01-01, 2025-04-15
Build & Deployment Tools (Makefile) :done, 2025-02-10, 2025-06-15
section Data Sets
Leads - Corey, Patrick, Madan :done, 2025-01-10, 0d
section 3
Initial Toy Data :done, 2025-01-01, 2025-03-01
Robust Toy Data for Build & Test :done, 2025-02-15, 2025-04-01
Initial Synthetic Data :done, 2025-02-10, 2025-04-01
Synthetic Data (Prioritized Variables) :done, 2025-02-20, 2025-06-01
section Schema Generation
Leads - Trish, Corey :done, 2025-01-10, 0d
section 4
Schema Toy Data :done, 2025-02-01, 2025-03-08
Verify Schema Automator on Toy Data :done, 2025-02-10, 2025-04-01
Synthetic Data for Schema Automator :done, 2025-02-20, 2025-05-15
Schema Automator on Synthetic Data :done, 2025-03-15, 2025-06-01
Use Schema Automator on Real Data :done, 2025-04-01, 2025-07-01
Close Schema Automator Gaps :done, 2025-04-15, 2025-07-01
Schema Sheets & Toy/Synthetic Data :done, 2025-03-01, 2025-05-01
Annotation of Data :active, 2025-04-01, 2025-06-01
section Schema Validator
Lead - Madan :done, 2025-01-10, 0d
section 5
Schema Validator Toy Data :done, 2025-02-10, 2025-03-30
Add Schema Validator & Write Tests :done, 2025-02-10, 2025-04-01
Synthetic Data for Schema Validator :done, 2025-02-20, 2025-05-15
Schema Validator on Real Data :done, 2025-04-01, 2025-07-01
section Schema Data Map
Lead - BDC Data Model Team, Corey :active, 2025-01-10, 0d
section 6
Manually Curated Data Map :2025-03-15, 2025-06-01
Attempt Automatic Map Generation :2025-05-01, 2025-07-01
section LinkML Map
Lead - Corey :done, 2025-01-10, 0d
section 7
LinkML Map Toy Data :active, 2025-02-01, 2025-03-08
Add LinkML Map :active, 2025-02-10, 2025-04-01
Synthetic Data for LinkML Map :active, 2025-03-15, 2025-06-01
Real Data with LinkML Map :2025-04-01, 2025-07-01
Remediate Mapping Issues :2025-05-01, 2025-07-01
section Preprocessing
Leads - Patrick, TBD :done, 2025-01-10, 0d
section 8
Simple Data Cleaning Scripts :done, 2025-02-01, 2025-04-15
Scripts for Complex Mappings [postponed, canceled] :done, 2025-04-15, 2025-07-01
section Exec & Deploy
Leads - Patrick, Stephen, Pierette :done, 2025-01-10, 0d
section 9
Containerized Deployment (AWS, GCP) :2025-03-15, 2025-06-15
Local Execution with Automation :2025-04-01, 2025-07-01
section Final Pipeline Run
Leads - Corey, Stephen, Pierette :done, 2025-01-10, 0d
section 10
Full Pipeline with Synthetic Data :2025-04-01, 2025-06-01
Single Real Data Set Through Pipeline :2025-04-15, 2025-06-15
Fully Automated Pipeline for Cloud :2025-04-15, 2025-07-01
Fully Automated Pipeline for Local :2025-04-01, 2025-06-15
Final Run & Harmonization :2025-05-15, 2025-07-01
axisFormat %B
tickInterval 1month
This outline captures the main features shown in the project roadmap above. It is more detailed than the roadmap and contains some project steps that are less useful at the top level. We should plan to refine how we use these tools over time, so that the roadmap captures the parts of the project that belong in a top-level plan and the outline provides a more detailed overview of everything we are working on.
- ✔️Joint Working Repository - Lead -- Patrick Golden
- Overall Project Documentation - Ongoing
- Automatic Project Documentation Deployment - Ongoing
- Project Testing Suite - Initial - Completed, Needs Sub-tasks
- Automated Pre-Commit Project Testing - Completed
- Automated Build and Deployment Toolset - Makefile - Ongoing
- ✔️Ingest-Wide Toy Data Set - Lead -- Corey Cox & Patrick Golden
- Single Toy Data Set for Testing and Build Environment Validation Across Full Pipeline - Initial - Completed
- Integration of Toy Data Set with Automated Build/Test Harness - ?
- ✔️Ingest-Wide Synthetic Data Set - Assigned -- Corey Cox, Patrick Golden, Madan
- Initial Synthetic Data Set Based on Data Available (BDC Synthetic) - In Progress
- Initial Synthetic Data Set Generated from BDC Model identified variables - Not Started
- NOTE: The initial toy data set replaced the need for a synthetic data set, as Corey had to build the data set from scratch to accommodate what would have been needed from the synthetic data set.
- ✔️Schema Automator - Lead -- Trish Whetzel, helping -- Corey Cox
- Add Schema-Automator to Project and verify it works - Completed
- Had to Downgrade Python Version from 3.13 to 3.12 for now
- Used BDC Synthetic Data and produced Schema, no true testing or validation
- Add Schema-Automator usage and installation process to documentation
- Create Toy Data Set to verify functionality and start testing harness
- Verify Schema-Automator works on Toy Data Set and Write Tests
- Create Synthetic Data Set for advanced Schema-Automator functionality
- Verify Schema-Automator works on Synthetic Data Set and Write Tests
- Use Schema-Automator on a real data set and evaluate gaps
- Close Schema Automator Gaps
- Add richer data to synthetic or toy data set to represent gaps
- Development on Schema Automator to add functionality
- Development on Upstream Complex Mapper for hard to add functionality
- ✔️Schema Sheets for Data Dictionary - Lead -- Trish Whetzel
- Add Schema Sheets as additional tool to create data models
- ▶️Annotation of the Data - Lead -- Trish Whetzel
- Move to the repo - Completed
- Make the tool generalizable and configurable so that it can prioritize certain resources over others depending on the need and use case.
- Add instructions on how to add ontologies by use case
- ✔️Schema Validator - Lead -- Madan
- Add Schema Validator to project and verify it works
- Create Toy Data Set to verify functionality and start testing harness
- Toy Data Set resembling output of Schema Automator
- Toy Data Set of data input to Schema Automator to validate to Schema
- Valid and invalid data sets for testing
- Verify Schema Validator works for Toy Data Set and Write Tests
- Create Synthetic Data Set for Schema Validator and Write Tests
- Use Schema Validator on real data with Schema Automator generated schema
- ▶️Schema Data Map - YAML file that describes the map from one data model to another
- Manually Curated data map from BDC Data Model Team
- Attempt Automatic generation of map from LLM working group
- ▶️LinkML Mapper - Doing the Transformation - Lead -- Corey Cox
- Add LinkML Mapper to Project and verify it works
- Create Toy Data for LinkML Mapper functionality
- Toy Data Set used for Schema Automator Schema generation
- Schema Automator derived implicit data model from toy data set
- Toy Subset of the BDC Model appropriate for Toy Data set
- Toy Schema Data Map appropriate for all of the above Toy Data set items
- Verify LinkML Mapper works on toy data set
- Create Synthetic Data set for LinkML Mapper
- Synthetic Data set covering reasonable subset of real data
- Schema Automator derived data model for above synthetic data
- Appropriate Subset (or full set) of BDC Data Model
- Data map appropriate to the above synthetic data items
- Use LinkML Mapper on real data set using upstream ingest pipeline tool artifacts
- Remediate mappings that can't be performed by LinkML Mapper
- Identify variables for which LinkML Mapper is unable to perform harmonization and mapping
- Create Issues in LinkML Mapper and assess feasibility of adding functionality to LinkML Mapper
- Add functionality to LinkML Mapper or add scripts to Complex Data Mapper on a per-variable basis
- Perform full ingest data transform with LinkML Mapper on real data
- ✔️Simple Data Cleaner - Lead -- Patrick Golden
- Simple scripts necessary for Data cleaning outside of ingest pipeline
- Poorly formatted Enums (Male, male, M, 1 - all meaning male)
- Bad missing data representation (e.g., 9 for no data)
- Empty columns
- Other bad data practices we can’t expect our ingest to handle
- ❎Complex Data Mapper - postponed, canceled as not planned
- One-off scripts on a per-dataset basis to map data that is too complex for the tools as they exist
- Create these as-needed for each variable that cannot be cleanly mapped with LinkML Mapper
- Execution and Deployment pipeline - Lead -- BDC -- Patrick and Stephen, INCLUDE -- Pierette Lo
- Wrapping tools and steps into containers for deployment to cloud environments (BDC Catalyst through AWS, Google Cloud)
- Local system execution of the pipeline in a fully automated way if possible, or with checkpoints and human-in-the-loop review.
- Running all data through pipeline to produce a harmonized whole - Lead -- BDC -- Corey Cox with Stephen Hwang, INCLUDE -- Pierette Lo
- With the final robust synthetic data set, run the full pipeline with both Cloud and Local architectures
- Run a single data set all the way through the pipeline either on Cloud or Local architecture
- Create Fully automated pipelines for real-world data set
- With an initial working real-world data set, create a fully automated pipeline for Cloud architecture
- Create fully automated pipeline for Local architecture
- Test over all appropriate systems
- Run the full pipeline on all of the target data sets with the appropriate architecture
- Identify failures and gaps for each of the data sets
- Fill identified gaps with issues and development in the appropriate tool or Complex Data Mapper
- Final run for each data set to the harmonized model
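
The Simple Data Cleaner items in the outline above (inconsistent enum spellings, sentinel codes for missing data, and empty columns) could be handled with a small script along the following lines. This is only a minimal sketch, not the project's actual cleaning code: the function names, column names, and lookup tables are hypothetical, and real code lists would need to come from each data set's data dictionary.

```python
from typing import Dict, List, Optional, Set

Row = Dict[str, Optional[str]]


def clean_rows(
    rows: List[Row],
    enum_maps: Dict[str, Dict[str, str]],
    missing_codes: Dict[str, Set[str]],
) -> List[Row]:
    """Normalize enum spellings and replace missing-data sentinels with None.

    enum_maps:     column -> {lowercased raw value: canonical label} (hypothetical)
    missing_codes: column -> set of sentinel strings meaning "no data" (hypothetical)
    """
    cleaned = []
    for row in rows:
        new_row: Row = {}
        for column, raw in row.items():
            value = None if raw is None else str(raw).strip()
            # Treat blanks and per-column sentinel codes (e.g. "9") as missing.
            if value is None or value == "" or value in missing_codes.get(column, set()):
                new_row[column] = None
                continue
            # Collapse spellings such as "Male", "M", "1" into one label.
            synonyms = enum_maps.get(column)
            if synonyms is not None:
                value = synonyms.get(value.lower(), value)
            new_row[column] = value
        cleaned.append(new_row)
    return cleaned


def drop_empty_columns(rows: List[Row]) -> List[Row]:
    """Remove columns that have no value in any row."""
    if not rows:
        return []
    keep = [c for c in rows[0] if any(r.get(c) is not None for r in rows)]
    return [{c: r.get(c) for c in keep} for r in rows]
```

Sentinel codes are configured per column because the same value can be a legitimate code in one variable and a missing-data marker in another (for example, `1` meaning male in one column while `9` means no data in a different one). Unrecognized enum values are passed through unchanged so they can be flagged downstream rather than silently dropped.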