BANYAN-ingest is a Python module that focuses on extracting information-rich features from scientific and technical documents. Rather than reinvent the wheel, BANYAN-ingest is a toolbox of state-of-the-art document processing tools and techniques. The goal of BANYAN-ingest is to provide an easy-to-use interface and standardized outputs.

sandialabs/banyan-ingest

Banyan Ingest

banyan_ingest is a Python module that prepares documents for use in GenAI and LLM applications.

Rather than re-invent the wheel, banyan_ingest aims to utilize state-of-the-art tools to provide this capability.

Installation

In a Python environment (conda, venv, etc.), run the following:

cd PATH_TO_REPO/
pip install .

Poppler must also be installed on your system.
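
One way to confirm that Poppler is reachable is to look for its command-line utilities on your PATH. The sketch below is not part of banyan_ingest. It assumes that finding pdftoppm and pdfinfo (two utilities that ship with Poppler) is a good enough signal, and the helper names find_poppler_tools and poppler_available are made up for illustration.

```python
import shutil

# Two command-line utilities that ship with Poppler. Checking for these
# is an assumption of this sketch, not a requirement stated by banyan_ingest.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def find_poppler_tools(tools=POPPLER_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in tools}


def poppler_available(tools=POPPLER_TOOLS):
    """Return True only if every listed tool was found on PATH."""
    return all(path is not None for path in find_poppler_tools(tools).values())


if __name__ == "__main__":
    for name, path in find_poppler_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

If either tool is reported as NOT FOUND, install Poppler with your system package manager and re-run the check.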

Supported Tools and File Formats

Currently, we support marker (link here) and NVIDIA's nemotron-parse models (link here). To install the dependencies for these tools, use pip install .[marker] or pip install .[nemotronparse], respectively.

Note: please ensure you follow the usage guidelines and licenses of these tools.

Examples

The example_XXX.py scripts show basic PDF document processing, each using a different OCR tool under the hood.
