
feat(metadata-extraction): add LLM based metadata extraction#18

Open
juliehinge wants to merge 6 commits into main from llm-extraction

Conversation

@juliehinge (Collaborator):

Add LLM-powered metadata extraction workflow

This PR adds PDF metadata extraction using LLMs as a second activity in the workflow.

It takes the output from Activity 1 (the full text extracted from the PDF using an extraction tool), uses an LLM to extract the title, authors, and abstract from that text, and returns them.

abstract: str | None = Field(default=None, description="Document abstract")
authors: list[str] = Field(
default_factory=list, description="List of document authors"
)
Member:

minor: we should also add publication_date and doi fields to cover the minimal metadata for the deposit form.

we should also add examples=[...], especially for the dates and DOI, so that the format is correct

@yashlamba (Member), Mar 16, 2026:

We could add a validator for these, with a feedback loop until the output is in the correct format (with 1-2 retries).

@mairasalazar (Collaborator) left a comment:

Just made a few suggestions, minor stuff.

@@ -0,0 +1,6 @@
"""Activities for the airdec-workflows application."""
Collaborator:

I suggest we already start changing the name to Orcha :)


@activity.defn
async def create(request: ExtractPdfContentRequest) -> ExtractPdfContentResponse:
async def text_extraction(
Collaborator:

minor: if you agree with my previous comment, then maybe here it could be parse_content() or extract_raw_content(), something like that

Comment on lines +27 to +28
text_extraction, # Activity for extraction all text from the PDF
metadata_extraction, # Activity for extraction metadata from PDF text
Collaborator:

very minor suggestion:

Suggested change
text_extraction, # Activity for extraction all text from the PDF
metadata_extraction, # Activity for extraction metadata from PDF text
text_extraction, # Activity for extracting all content from file
metadata_extraction, # Activity for extracting metadata from file content

Mostly for the future, since we also want to handle other file types, so feel free to leave it as it is for now

"""Request to extract metadata from document text."""

text: str = Field(description="Document text to analyze")
model: str = Field(default="groq/qwen/qwen3-32b", description="Model to use")
Member:

minor/comment: I like the idea that we can easily select different models like this. My only concern is the case where we start having different configs/parameters that we pass for each model (e.g. temperature, flags for thinking effort, etc.), in which case we'll probably need some sort of preconfigured presets... not an issue right now, since we can evolve this signature as we go.
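One possible direction for those presets, sketched with made-up names and values (nothing here is settled):

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelPreset:
    """Hypothetical bundle of a model id plus its generation parameters."""

    model: str
    temperature: float = 0.0
    extra: dict = field(default_factory=dict)  # e.g. thinking-effort flags


PRESETS = {
    "default": ModelPreset(model="groq/qwen/qwen3-32b"),
    "creative": ModelPreset(model="groq/qwen/qwen3-32b", temperature=0.7),
}
```

The request field could then become `preset: str = "default"`, keeping the wire format a simple string while the per-model parameters live in one place.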

Comment on lines +29 to +34
INSTRUCTIONS = """\
Extract structured metadata from this document text.
Focus on finding the title, abstract/summary, and authors.
For authors, extract individual names as separate list items.
Only include information that is clearly stated in the text.
"""
Member:

I know that adding this info in the system prompt makes it a bit more "explicit", but there's a bit of repetition also compared to what the Pydantic schema passes as context... I guess we need to prompt-engineer this (the future is here! 🙃) so that we have a balanced approach on giving "purpose" to the task, but details coming from the schema.

Collaborator:

From my experience models are usually very happy with something like this:

Give a very specific goal, something like:

Extract bibliographic metadata from the document text.

Then what it should focus on, in a very clear verbal text (I have experienced models thinking "/" means "find both"):

Focus on identifying the main document title, the abstract or summary if present and the list of authors.

And then some guidelines! Very important when trying to avoid hallucinations (even though, from my experience, models still sometimes think they are the great new researchers of our time and invent new stuff 😆).

Guidelines:
- Prefer information from bla bla bla.
- Ignore page numbers, bla bla bla.
- Only extract information explicitly stated in the text.
- If information is missing, leave the corresponding field empty.
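Putting the structure above together, the prompt could look something like this (the concrete guideline wording is my own illustration, not a settled prompt):

```python
INSTRUCTIONS = """\
Extract bibliographic metadata from the document text.

Focus on identifying the main document title, the abstract or summary
if present, and the list of authors.

Guidelines:
- Prefer information from the first pages of the document.
- Ignore page numbers, running headers, and footers.
- Only extract information explicitly stated in the text.
- If information is missing, leave the corresponding field empty.
"""
```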

Collaborator:

And we might also need to add a "Thank you!" for when AI takes over the world 🤣

Member:

My concern is that in the future, we might focus on different fields for extraction depending on the resource type and the community to which the record is submitted. For example:

  • for journal articles, we'll also need the journal title, ISSN, issue, volume, etc.
  • for EU-funded projects we'll also check for funding (though for papers this usually comes later in the PDF under "Acknowledgments", but for reports/deliverables it's usually in the footer of every page).

The other use-case is "field-specific" refinement/extraction, where the user gives a prompt for a specific field. Providing the entire context would be confusing, but if e.g. we could pass a partial Pydantic model schema, we focus only on the specific information we need.

So I would like the Pydantic models/schemas to drive as much of this as possible (also because in the future we might be dynamically generating these from a user-provided ruleset in communities).


For now let's stick with a prompt that "just works", since we're just targeting the required fields of the deposit form. And we shelve a bigger task on how to design this better.

Member:

And for some encouragement, I would go with:

MAKE NO MISTAKES! Or we switch to {competitor's model}!

Collaborator:

That makes sense, especially when we expect the set of extracted fields to evolve depending on the resource type or community rules.

My thinking with the prompt was mostly pragmatic. In my experience models behave more reliably if they get a very clear goal and a few extraction guidelines (prefer first page, ignore headers/footers, don’t hallucinate, etc.). But I agree that we shouldn’t encode too much field-specific logic in the prompt and let the schema drive that.

# Activity 2: Extract metadata using LLM
metadata = await workflow.execute_activity(
metadata_extraction,
ExtractMetadataRequest(text=content.text),
Collaborator:

We should probably think about chunking the content, so we don't exceed the token limit

Member:

You mean the max context configured for the model on the provider? If we end up with a document with too much content in the first pages, I'm not sure if chunking would help 😅

I don't know if the best approach is to have a "context budget" and basically just quickly fail documents that surpass it... I don't think it's the case with the sample dataset we have, but definitely something to look into for reliability.

Collaborator:

Yeah, the context window of the model. Right now it's possible to upload a document with millions of lines of lorem ipsum at font size 1, and then the workflow and model would choke.
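A fail-fast context budget could be as simple as this (the limit is a placeholder; a real one should be derived from the provider's actual context size):

```python
MAX_INPUT_CHARS = 48_000  # placeholder: roughly a 12k-token budget at ~4 chars/token


def enforce_context_budget(text: str) -> str:
    """Reject documents whose extracted text would blow past the model context."""
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(
            f"document text is {len(text)} chars, over the {MAX_INPUT_CHARS} budget"
        )
    return text
```

The check would run before the LLM activity, so oversized documents fail with a clear error instead of a provider-side truncation or timeout.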

# f"Please extract the structured metadata from this document."
# )
# Activity 2: Extract metadata using LLM
metadata = await workflow.execute_activity(
Collaborator:

We should also add a Temporal RetryPolicy
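For reference, attaching a policy to the activity call could look roughly like this; the timeout and retry values are placeholders, and this is a fragment of the workflow body rather than standalone code:

```python
from datetime import timedelta

from temporalio.common import RetryPolicy

metadata = await workflow.execute_activity(
    metadata_extraction,
    ExtractMetadataRequest(text=content.text),
    start_to_close_timeout=timedelta(minutes=2),  # placeholder timeout
    retry_policy=RetryPolicy(
        initial_interval=timedelta(seconds=1),
        backoff_coefficient=2.0,
        maximum_attempts=3,  # placeholder; tune for LLM flakiness
    ),
)
```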


@yashlamba yashlamba changed the title Llm extraction feat(metadata-extraction): add LLM based metadata extraction Mar 16, 2026