This Python script implements a key step in building a Retrieval-Augmented Generation (RAG) system: transforming raw text data linked to a Knowledge Graph (KG) into discrete, semantically modeled text chunks. It processes semi-structured data (TSV) and outputs the results as an RDF graph in Turtle format, ready for import into GraphDB. The same chunks can also be used for vector similarity search over large documents via OpenSearch or Elasticsearch, using the GraphDB connectors.
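For illustration, the input TSV pairs each entity URI and property with its full text. The sketch below reads a hypothetical sample with the standard-library `csv` module; the column headers (`doc`, `prop`, `text`) are illustrative defaults, since the actual names are configurable (see the configuration table below).

```python
import csv
import io

# Hypothetical sample input: one row per (document URI, property URI, text).
# Column headers here are illustrative; the real names come from
# COLUMN_NAME_URI / COLUMN_NAME_PROP / COLUMN_NAME_TEXT in CONFIG.
sample_tsv = (
    "doc\tprop\ttext\n"
    "http://example.org/resource/Berlin\thttp://schema.org/description\t"
    "Berlin is the capital and largest city of Germany.\n"
)

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
print(rows[0]["doc"])   # entity URI the chunks will link back to
print(rows[0]["text"])  # full text to be split into chunks
```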
The project requires Python 3.x and the dependencies listed in requirements.txt.
Install the required packages using pip:
```shell
pip install -r requirements.txt
```
The pipeline's behavior is controlled by the CONFIG dictionary within the tsv_processor.py file.
| Parameter | Description |
|---|---|
| FILENAME_INPUT | The input TSV file path. |
| FILENAME_OUTPUT | The output RDF file path. |
| CHUNK_SIZE | The maximum size (in characters) for each text chunk. |
| CHUNK_OVERLAP | The character overlap between consecutive chunks to maintain context. |
| SPLITTING_STRATEGY | Specifies the LangChain splitter class to use. |
| COLUMN_NAME_URI | The header name for the original document URI column (e.g., ?doc). |
| COLUMN_NAME_PROP | The header name for the property URI column that holds the text (e.g., ?prop). |
| COLUMN_NAME_TEXT | The header name for the full text content column (e.g., ?text). |
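Taken together, a minimal CONFIG might look like the sketch below. The values are purely illustrative examples; the authoritative dictionary lives in tsv_processor.py.

```python
# Illustrative CONFIG sketch -- all values are examples, not the shipped defaults.
CONFIG = {
    "FILENAME_INPUT": "input.tsv",
    "FILENAME_OUTPUT": "chunks.ttl",
    "CHUNK_SIZE": 1000,         # max characters per chunk
    "CHUNK_OVERLAP": 100,       # characters shared between consecutive chunks
    "SPLITTING_STRATEGY": "RecursiveCharacterTextSplitter",
    "COLUMN_NAME_URI": "doc",   # column holding the original document URI
    "COLUMN_NAME_PROP": "prop", # column holding the source property URI
    "COLUMN_NAME_TEXT": "text", # column holding the full text content
}

# The overlap must be smaller than the chunk size, or splitting cannot advance.
assert CONFIG["CHUNK_OVERLAP"] < CONFIG["CHUNK_SIZE"]
```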
The RDF output uses a clean, Schema.org-based model to reify each text chunk as a resource, maintaining traceability back to the original entity.
| Triple Component | Schema.org Property/Class | Value Type | Purpose |
|---|---|---|---|
| Chunk Node | rdf:type | schema:Text | Classifies the resource as a text segment. |
| Chunk Node | schema:isPartOf | URIRef | Links the chunk back to the original KG entity (e.g., dbpedia:Article). |
| Chunk Node | schema:additionalProperty | URIRef | Records the original RDF property (e.g., schema:description) from which the text was derived. |
| Chunk Node | schema:position | xsd:integer | The sequential index of the chunk within the original full text. |
| Chunk Node | schema:text | Literal | The actual string content of the chunk. |
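Applying this model, a single chunk in the output Turtle might look like the following (the chunk URI and text are illustrative):

```turtle
@prefix schema:  <http://schema.org/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .

<http://example.org/chunk/berlin-0> a schema:Text ;
    schema:isPartOf dbpedia:Berlin ;
    schema:additionalProperty schema:description ;
    schema:position "0"^^xsd:integer ;
    schema:text "Berlin is the capital and largest city of Germany..." .
```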
To execute the data processing pipeline, ensure your input TSV file is in the expected format and run the main script:
```shell
python chunker.py
```
The script will handle reading the data, cleaning URIs, splitting the text, transforming the data to RDF, and writing the final Turtle (.ttl) file to the path specified in FILENAME_OUTPUT.
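The splitting step delegates to the configured LangChain splitter; the simplified stand-in below only demonstrates what CHUNK_SIZE and CHUNK_OVERLAP mean (a fixed-width window with shared characters between neighbors), not the real splitter's sentence-aware behavior.

```python
# Simplified illustration of CHUNK_SIZE / CHUNK_OVERLAP semantics.
# Assumes chunk_overlap < chunk_size; the real pipeline uses LangChain.
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Each chunk shares its last two characters with the start of the next, which is the context-preserving overlap described above.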
Note: Due to an rdflib serialization quirk, you may need to fix the schema prefix in the output .ttl file, as it is sometimes emitted as schema1: or ns1: