feat(metadata-extraction): add LLM based metadata extraction #18
juliehinge wants to merge 6 commits into main from
Conversation
Force-pushed 0be77b2 to 592ac93
app/activities/extract_metadata.py
Outdated
| abstract: str | None = Field(default=None, description="Document abstract")
| authors: list[str] = Field(
|     default_factory=list, description="List of document authors"
| )
minor: we should also add publication_date and doi fields to cover the minimal metadata for the deposit form.
we should also add examples=[...], especially for the dates and DOI, so that the format is correct
We could add a validator for these, with a feedback loop that re-prompts until the value is in the correct format (with like 1-2 retries).
mairasalazar left a comment
just made a few suggestions, minor stuff
| @@ -0,0 +1,6 @@
| """Activities for the airdec-workflows application."""
I suggest we already start changing the name to Orcha :)
| @activity.defn
| async def create(request: ExtractPdfContentRequest) -> ExtractPdfContentResponse:
| async def text_extraction(
minor: if you agree with my previous comment, then maybe here it could be parse_content() or extract_raw_content(), something like that
| text_extraction,  # Activity for extraction all text from the PDF
| metadata_extraction,  # Activity for extraction metadata from PDF text
very minor suggestion:
| text_extraction,  # Activity for extraction all text from the PDF
| metadata_extraction,  # Activity for extraction metadata from PDF text
| text_extraction,  # Activity for extracting all content from file
| metadata_extraction,  # Activity for extracting metadata from file content
Mostly for the future, since we also want to handle other file types, so feel free to leave it as it is for now
app/activities/extract_metadata.py
Outdated
| """Request to extract metadata from document text."""
|
| text: str = Field(description="Document text to analyze")
| model: str = Field(default="groq/qwen/qwen3-32b", description="Model to use")
minor/comment: I like the idea that we can easily select different models like so. my only concern is in the case where we start having different configs/parameters that we pass for each model (e.g. temperature, flags for thinking effort, etc.) in which case we'll probably need some sort of preconfigured presets... not an issue right now since we can evolve this signature as we go
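One way the "preconfigured presets" idea could look, sketched with the stdlib only: a small registry keyed by preset name that still accepts a raw model string, so the current `model: str` field keeps working. The preset names and parameters here are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelPreset:
    model: str
    temperature: float = 0.0
    extra: dict = field(default_factory=dict)  # provider-specific flags


# Hypothetical presets; only "groq/qwen/qwen3-32b" comes from the PR.
PRESETS: dict[str, ModelPreset] = {
    "default": ModelPreset(model="groq/qwen/qwen3-32b", temperature=0.0),
    "creative": ModelPreset(model="groq/qwen/qwen3-32b", temperature=0.7),
}


def resolve(preset_or_model: str) -> ModelPreset:
    """Accept either a preset name or a raw model string."""
    return PRESETS.get(preset_or_model, ModelPreset(model=preset_or_model))
```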
| INSTRUCTIONS = """\
| Extract structured metadata from this document text.
| Focus on finding the title, abstract/summary, and authors.
| For authors, extract individual names as separate list items.
| Only include information that is clearly stated in the text.
| """
I know that adding this info in the system prompt makes it a bit more "explicit", but there's a bit of repetition also compared to what the Pydantic schema passes as context... I guess we need to prompt-engineer this (the future is here! 🙃) so that we have a balanced approach on giving "purpose" to the task, but details coming from the schema.
From my experience models are usually very happy with something like this:
Give a very specific goal, something like:
Extract bibliographic metadata from the document text.
Then what it should focus on, in a very clear verbal text (I have experienced models thinking "/" means "find both"):
Focus on identifying the main document title, the abstract or summary if present and the list of authors.
And then some guidelines! Very important when trying to avoid hallucinations (Even though they from my experience still sometimes think they are the new researchers of our time and invent new stuff 😆 )
Guidelines:
- Prefer information from bla bla bla.
- Ignore page numbers, bla bla bla.
- Only extract information explicitly stated in the text.
- If information is missing, leave the corresponding field empty.
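Putting that structure together, the constant could look something like this; the specific guideline wording ("first pages", "running headers") is only a placeholder to fill in the "bla bla bla":

```python
INSTRUCTIONS = """\
Extract bibliographic metadata from the document text.

Focus on identifying the main document title, the abstract or summary \
if present, and the list of authors.

Guidelines:
- Prefer information from the first pages of the document.
- Ignore page numbers, running headers, and footers.
- Only extract information explicitly stated in the text.
- If information is missing, leave the corresponding field empty.
"""
```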
And we might also need to add a "Thank you!" for when AI takes over the world 🤣
My concern is that in the future, we might focus on different fields for extraction depending on the resource type and the community to which the record is submitted. For example:
- for journal articles, we'll also need journal title, ISSN, issue, volume, etc.
- for EU-funded projects we'll also check for funding (though for papers this usually comes later in the PDF under "Acknowledgments", but for reports/deliverables it's usually in the footer of every page).
The other use-case is "field-specific" refinement/extraction, where the user gives a prompt for a specific field. Providing the entire context would be confusing, but if e.g. we could pass a partial Pydantic model schema, we could focus only on the specific information we need.
So I would like as much as possible that the Pydantic models/schemas drive a big part of this (also because in the future we might be dynamically generating these from a user-provided ruleset in communities).
For now let's stick with a prompt that "just works", since we're just targeting the required fields of the deposit form. And we shelve a bigger task on how to design this better.
And for some encouragement, I would go with:
MAKE NO MISTAKES! Or we switch to {competitor's model}!
That makes sense, especially when we expect the set of extracted fields to evolve depending on the resource type or community rules.
My thinking with the prompt was mostly pragmatic. In my experience models behave more reliably if they get a very clear goal and a few extraction guidelines (prefer first page, ignore headers/footers, don’t hallucinate, etc.). But I agree that we shouldn’t encode too much field-specific logic in the prompt and let the schema drive that.
| # Activity 2: Extract metadata using LLM
| metadata = await workflow.execute_activity(
|     metadata_extraction,
|     ExtractMetadataRequest(text=content.text),
We should probably think about chunking the content, so we don't exceed the token limit
You mean the max context configured for the model on the provider? If we end up with a document with too much content in the first pages, I'm not sure if chunking would help 😅
I don't know if the best approach is to have a "context budget" and basically just quickly fail documents that surpass it... I don't think it's the case with the sample dataset we have, but definitely something to look into for reliability.
Yeah the context window of the model. Right now it's possible to upload a document with millions of lines of lorem ipsum font size 1, and then the workflow and model would choke
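The "context budget" fail-fast idea could be as simple as the sketch below. The 4-characters-per-token ratio and the budget are rough assumptions for illustration, not measured values for the model in use.

```python
MAX_INPUT_TOKENS = 30_000  # assumed budget, below the model's context window
CHARS_PER_TOKEN = 4  # crude heuristic for English text


def check_context_budget(text: str) -> None:
    """Fail fast before calling the model instead of choking mid-workflow."""
    approx_tokens = len(text) // CHARS_PER_TOKEN
    if approx_tokens > MAX_INPUT_TOKENS:
        raise ValueError(
            f"Document too large for extraction: ~{approx_tokens} tokens "
            f"(budget is {MAX_INPUT_TOKENS})"
        )
```

Raising here (before Activity 2 runs) would also let Temporal surface a clean failure for the lorem-ipsum case instead of a provider-side context error.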
| # f"Please extract the structured metadata from this document."
| # )
| # Activity 2: Extract metadata using LLM
| metadata = await workflow.execute_activity(
We should also add a Temporal RetryPolicy
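For reference, a configuration sketch of what that could look like at this call site, based on the Temporal Python SDK's `RetryPolicy`; the timeout, intervals, attempt count, and non-retryable error type are placeholders, and this fragment assumes it sits inside the workflow's run method (it is not runnable standalone).

```python
from datetime import timedelta

from temporalio.common import RetryPolicy

# Inside the workflow's run method; all values below are placeholders.
metadata = await workflow.execute_activity(
    metadata_extraction,
    ExtractMetadataRequest(text=content.text),
    start_to_close_timeout=timedelta(minutes=2),
    retry_policy=RetryPolicy(
        initial_interval=timedelta(seconds=2),
        backoff_coefficient=2.0,
        maximum_attempts=3,
        # don't retry errors that won't go away, e.g. schema validation
        non_retryable_error_types=["ValidationError"],
    ),
)
```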
Force-pushed 592ac93 to d480dab
Force-pushed d480dab to da14ab8
Add LLM-powered metadata extraction workflow
This PR adds PDF metadata extraction using LLMs as a second activity in the workflow.
It takes the output from Activity 1 (the full text extracted from the PDF using an extraction tool), uses an LLM to extract the title, authors, and abstract from that text, and returns them.