Ingest observations from NNJA and convert to DART observation sequence format

## Ingest observations from NNJA (via Brightband) and convert to DART observation sequence format

**Summary**  
This project will ingest observation data from the Brightband **nnja-ai** API (the AI-ready NOAA-NASA Joint Archive, [NNJA](https://psl.noaa.gov/data/nnja_obs/)) and convert it into the DART observation sequence (`obs_seq`) format. The goal is to enable direct assimilation of NNJA observations in DART-based workflows, bridging modern observational archives and existing data assimilation systems.

**Motivation**  
- The **nnja-ai** dataset provides a modern, well-structured, cloud-native observational archive (in Parquet / tabular format) for a wide range of sensors (satellites, radiosondes, surface stations, etc.).  
- DART requires observations in its `obs_seq` structure (with associated metadata, error specifications, and observation types) to perform assimilation.  
- By building a conversion pipeline, we unlock the potential of NNJA observations for assimilation experiments, operational workflows, and hybrid AI–DA systems.  
- This also helps users avoid manual, ad-hoc conversions and ensures consistency, traceability, and robustness in data handling.

**Goals**  
- Develop a conversion tool that queries or ingests NNJA data from the Brightband nnja-ai API.  
- Map NNJA variables, sensor identifiers, timestamps, locations, and metadata to DART observation definitions (`obs_def`).  
- Generate valid DART `obs_seq` files from the ingested data.  
- Validate output by testing small examples with the DART `obs_sequence_tool`.  
- Provide documentation and example notebooks demonstrating conversion workflows.  
- (Optional) Automate periodic ingestion / updates so new NNJA observations can be converted on demand.

**Approach / Methodology**  
* **Familiarize yourself with the nnja-ai API / SDK**  
   - Use the Brightband `nnja-ai` SDK or API to query or download observations in a programmatic way. 
   - Explore the data schemas, partitioning (e.g. date, sensor type), and how to filter for desired subsets.  
* **Define mapping between NNJA observation schema and DART observation definitions**  
   - Determine how NNJA field names (e.g. sensor, variable, quality flags, geolocation) map to DART’s `obs_type`, `obs_error`, `obs_kind`, etc.  
   - Handle sensor-specific nuances (e.g. satellite radiances vs in-situ data).  
* **Build conversion routines**  
   - Read NNJA data into a notebook.  
   - Apply filters, quality control, and coordinate/time transformations (if needed).  
   - Create DART-compatible data structures and metadata.  
   - Write out `obs_seq` files.  
* **Testing & validation**  
   - Use small subsets of NNJA data to test conversions.  
   - Run the DART `obs_sequence_tool` to confirm that DART can read the resulting observation sequences.  
   - Compare statistics (observation count, error distributions) before and after conversion.  
* **Documentation and automation**  
   - Create a [Sphinx Gallery example](https://ncar.github.io/pyDARTdiags/devel/index.html#creating-examples) to showcase the nnja-to-dart converter on the [pyDARTdiags gallery of notebooks](https://ncar.github.io/pyDARTdiags/examples/index.html). 
   - Optionally wrap the converter into a CLI or automated pipeline for ongoing usage.  
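
The mapping and validation steps above can be sketched in plain Python. This is a minimal illustration, not the converter itself: the column names (`temperature_k`, `u_wind_ms`, `qc_flag`, `pressure_pa`), the QC convention, and the error variances are hypothetical placeholders, not the real nnja-ai schema or recommended DART values. Only the DART observation type names (`RADIOSONDE_TEMPERATURE`, `RADIOSONDE_U_WIND_COMPONENT`) are taken from DART. Writing the actual `obs_seq` file is left to a later step.

```python
from dataclasses import dataclass

# Hypothetical NNJA column -> (DART observation type, placeholder error variance).
# Column names and variances are illustrative assumptions only.
NNJA_TO_DART = {
    "temperature_k": ("RADIOSONDE_TEMPERATURE", 1.0),
    "u_wind_ms": ("RADIOSONDE_U_WIND_COMPONENT", 4.0),
}


@dataclass
class MappedObs:
    """Intermediate record holding the fields a DART observation needs."""
    obs_type: str
    value: float
    error_variance: float
    lat: float
    lon: float
    pressure_pa: float
    time: str


def map_rows(rows, max_qc=0):
    """Map NNJA-like rows (dicts) to MappedObs records and report counts."""
    mapped = []
    summary = {"rows_in": 0, "rows_rejected": 0, "values_missing": 0}
    for row in rows:
        summary["rows_in"] += 1
        # Assumed convention: a larger qc_flag means lower quality.
        if row.get("qc_flag", 0) > max_qc:
            summary["rows_rejected"] += 1
            continue
        for column, (obs_type, variance) in NNJA_TO_DART.items():
            value = row.get(column)
            if value is None:
                summary["values_missing"] += 1
                continue
            mapped.append(MappedObs(obs_type, float(value), variance,
                                    row["lat"], row["lon"],
                                    row["pressure_pa"], row["time"]))
    summary["obs_out"] = len(mapped)
    return mapped, summary


# Two made-up rows: the second is rejected by the QC filter.
rows = [
    {"temperature_k": 250.0, "u_wind_ms": None, "qc_flag": 0,
     "lat": 40.0, "lon": -105.0, "pressure_pa": 50000.0,
     "time": "2021-06-15T12:00:00Z"},
    {"temperature_k": 251.0, "u_wind_ms": 12.0, "qc_flag": 2,
     "lat": 41.0, "lon": -104.0, "pressure_pa": 50000.0,
     "time": "2021-06-15T12:00:00Z"},
]
mapped, summary = map_rows(rows)
print(summary)
```

Returning a count summary alongside the records makes the "before and after" comparison explicit: every input row is either rejected, missing a value, or present in the output.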

**Skills needed or to be gained**  
- Python programming (file I/O, data processing)  
- Experience with data handling libraries (Pandas, PyArrow, Dask, xarray)  
- Familiarity with Parquet, columnar data formats, and large-volume data reading  
- Understanding of DART observation sequence format, `obs_def`, `obs_seq` conventions  
- Some knowledge of remote sensing / satellite observation metadata if working with radiance data  
- Comfort with time coordinate systems, geospatial transforms, and quality flags  

**Possible Challenges & Open Questions**  
- Some observations may lack full metadata (e.g. sensor angles, calibration) needed by DART.  
- Time zone, time reference, or timestamp precision mismatches between NNJA and DART.  
- Ensuring that coordinate systems align (e.g. lat/lon grids, altitude levels).  
- Performance issues when converting large volumes of data (memory, I/O).  
- Consistency with DART observation error and quality control expectations.  
- Handling edge cases — missing data, sensor blacklisting, quality flags, or observation duplicates.  
- Maintaining compatibility as the nnja-ai schema evolves or updates (versioning).  
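
The time and coordinate questions above can be made concrete with a small sketch. It assumes, and this should be confirmed against the DART documentation, that DART's Gregorian calendar counts days and seconds from 1601-01-01 00:00 UTC, and that `obs_seq` files store longitude in the range 0 to 360 degrees and both longitude and latitude in radians. The helper names are hypothetical, and treating timezone-naive timestamps as UTC is an assumption about the NNJA data, not a documented fact.

```python
import math
from datetime import datetime, timedelta, timezone

# Assumed DART Gregorian calendar reference time.
DART_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def to_dart_time(ts):
    """Return (days, seconds) since the DART epoch for a datetime."""
    if ts.tzinfo is None:
        # Assumption: naive timestamps are already in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    delta = ts.astimezone(timezone.utc) - DART_EPOCH
    # Sub-second precision is dropped here; microseconds are truncated.
    return delta.days, delta.seconds


def from_dart_time(days, seconds):
    """Inverse of to_dart_time, useful for round-trip checks."""
    return DART_EPOCH + timedelta(days=days, seconds=seconds)


def to_dart_lonlat(lon_deg, lat_deg):
    """Wrap longitude into [0, 360) and convert both angles to radians."""
    if not -90.0 <= lat_deg <= 90.0:
        raise ValueError(f"latitude out of range: {lat_deg}")
    return math.radians(lon_deg % 360.0), math.radians(lat_deg)


ts = datetime(2021, 6, 15, 12, 30, 45, tzinfo=timezone.utc)
days, seconds = to_dart_time(ts)
assert from_dart_time(days, seconds) == ts  # round trip is lossless at 1 s
lon_rad, lat_rad = to_dart_lonlat(-105.0, 40.0)
```

A round-trip check like the one at the end is a cheap way to catch timezone or precision mismatches before any file is written.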

**References**  
- Brightband nnja-ai project on [GitHub](https://github.com/brightbandtech/nnja-ai), the API / SDK for NNJA observations
- Brightband NNJA-AI [example notebook](https://github.com/brightbandtech/nnja-ai/blob/main/example_notebooks/basic_dataset_example.ipynb)
- [pyDARTdiags documentation](https://ncar.github.io/pyDARTdiags/)
- DART observation sequence and obs_def [documentation](https://docs.dart.ucar.edu/en/latest/guide/detailed-structure-obs-seq.html) 


