feat(metadata-extraction): add LLM based metadata extraction #18
juliehinge wants to merge 6 commits into main from
Conversation
Force-pushed 0be77b2 to 592ac93
app/activities/extract_metadata.py
Outdated
| abstract: str | None = Field(default=None, description="Document abstract")
| authors: list[str] = Field(
|     default_factory=list, description="List of document authors"
| )
minor: we should also add publication_date and doi fields to cover the minimal metadata for the deposit form.
we should also add examples=[...], especially for the dates and DOI, so that the format is correct
We could add a validator for these, with a feedback loop that re-prompts until the value is in the correct format (with like 1-2 retries).
mairasalazar left a comment
just made a few suggestions, minor stuff
| @@ -0,0 +1,6 @@
| """Activities for the airdec-workflows application."""
I suggest we already start changing the name to Orcha :)
| @activity.defn
| async def create(request: ExtractPdfContentRequest) -> ExtractPdfContentResponse:
| async def text_extraction(
minor: if you agree with my previous comment, then maybe here it could be parse_content() or extract_raw_content(), something like that
| text_extraction,  # Activity for extraction all text from the PDF
| metadata_extraction,  # Activity for extraction metadata from PDF text
very minor suggestion:
| text_extraction,  # Activity for extraction all text from the PDF
| metadata_extraction,  # Activity for extraction metadata from PDF text
| text_extraction,  # Activity for extracting all content from file
| metadata_extraction,  # Activity for extracting metadata from file content
Mostly for the future, since we also want to handle other file types, so feel free to leave it as it is for now
app/activities/extract_metadata.py
Outdated
| """Request to extract metadata from document text."""
|
| text: str = Field(description="Document text to analyze")
| model: str = Field(default="groq/qwen/qwen3-32b", description="Model to use")
minor/comment: I like the idea that we can easily select different models like so. my only concern is in the case where we start having different configs/parameters that we pass for each model (e.g. temperature, flags for thinking effort, etc.) in which case we'll probably need some sort of preconfigured presets... not an issue right now since we can evolve this signature as we go
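One way the "preconfigured presets" idea could look, sketched with the stdlib only: a small registry keyed by preset name that still accepts a raw model string, so the current `model: str` field keeps working. The preset names and parameters here are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelPreset:
    model: str
    temperature: float = 0.0
    extra: dict = field(default_factory=dict)  # provider-specific flags


# Hypothetical presets; only "groq/qwen/qwen3-32b" comes from the PR.
PRESETS: dict[str, ModelPreset] = {
    "default": ModelPreset(model="groq/qwen/qwen3-32b", temperature=0.0),
    "creative": ModelPreset(model="groq/qwen/qwen3-32b", temperature=0.7),
}


def resolve(preset_or_model: str) -> ModelPreset:
    """Accept either a preset name or a raw model string."""
    return PRESETS.get(preset_or_model, ModelPreset(model=preset_or_model))
```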
| INSTRUCTIONS = """\
| Extract structured metadata from this document text.
| Focus on finding the title, abstract/summary, and authors.
| For authors, extract individual names as separate list items.
| Only include information that is clearly stated in the text.
| """
I know that adding this info in the system prompt makes it a bit more "explicit", but there's a bit of repetition also compared to what the Pydantic schema passes as context... I guess we need to prompt-engineer this (the future is here! 🙃) so that we have a balanced approach on giving "purpose" to the task, but details coming from the schema.
From my experience models are usually very happy with something like this:
Give a very specific goal, something like:
Extract bibliographic metadata from the document text.
Then what it should focus on, in a very clear verbal text (I have experienced models thinking "/" means "find both"):
Focus on identifying the main document title, the abstract or summary if present and the list of authors.
And then some guidelines! Very important when trying to avoid hallucinations (Even though they from my experience still sometimes think they are the new researchers of our time and invent new stuff 😆 )
Guidelines:
- Prefer information from bla bla bla.
- Ignore page numbers, bla bla bla.
- Only extract information explicitly stated in the text.
- If information is missing, leave the corresponding field empty.
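Putting that structure together, the constant could look something like this; the specific guideline wording ("first pages", "running headers") is only a placeholder to fill in the "bla bla bla":

```python
INSTRUCTIONS = """\
Extract bibliographic metadata from the document text.

Focus on identifying the main document title, the abstract or summary \
if present, and the list of authors.

Guidelines:
- Prefer information from the first pages of the document.
- Ignore page numbers, running headers, and footers.
- Only extract information explicitly stated in the text.
- If information is missing, leave the corresponding field empty.
"""
```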
And we might also need to add a "Thank you!" for when AI takes over the world 🤣
My concern is that in the future, we might focus on different fields for extraction depending on the resource type and the community to which the record is submitted. For example:
- for journal articles, we'll also need journal title, ISSN, issue, volume, etc.
- for EU-funded projects we'll also check for funding (though for papers this usually comes later in the PDF under "Acknowledgments", but for reports/deliverables it's usually in the footer of every page).
The other use-case is "field-specific" refinement/extraction, where the user gives a prompt for a specific field. Providing the entire context would be confusing, but if e.g. we could pass a partial Pydantic model schema, we could focus only on the specific information we need.
So I would like as much as possible that the Pydantic models/schemas drive a big part of this (also because in the future we might be dynamically generating these from a user-provided ruleset in communities).
For now let's stick with a prompt that "just works", since we're just targeting the required fields of the deposit form. And we shelve a bigger task on how to design this better.
And for some encouragement, I would go with:
MAKE NO MISTAKES! Or we switch to {competitor's model}!
That makes sense, especially when we expect the set of extracted fields to evolve depending on the resource type or community rules.
My thinking with the prompt was mostly pragmatic. In my experience models behave more reliably if they get a very clear goal and a few extraction guidelines (prefer first page, ignore headers/footers, don’t hallucinate, etc.). But I agree that we shouldn’t encode too much field-specific logic in the prompt and let the schema drive that.
| # Activity 2: Extract metadata using LLM
| metadata = await workflow.execute_activity(
|     metadata_extraction,
|     ExtractMetadataRequest(text=content.text),
We should probably think about chunking the content, so we don't exceed the token limit
You mean the max context configured for the model on the provider? If we end up with a document with too much content in the first pages, I'm not sure if chunking would help 😅
I don't know if the best approach is to have a "context budget" and basically just quickly fail documents that surpass it... I don't think it's the case with the sample dataset we have, but definitely something to look into for reliability.
Yeah the context window of the model. Right now it's possible to upload a document with millions of lines of lorem ipsum font size 1, and then the workflow and model would choke
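The "context budget" fail-fast idea could be as simple as the sketch below. The 4-characters-per-token ratio and the budget are rough assumptions for illustration, not measured values for the model in use.

```python
MAX_INPUT_TOKENS = 30_000  # assumed budget, below the model's context window
CHARS_PER_TOKEN = 4  # crude heuristic for English text


def check_context_budget(text: str) -> None:
    """Fail fast before calling the model instead of choking mid-workflow."""
    approx_tokens = len(text) // CHARS_PER_TOKEN
    if approx_tokens > MAX_INPUT_TOKENS:
        raise ValueError(
            f"Document too large for extraction: ~{approx_tokens} tokens "
            f"(budget is {MAX_INPUT_TOKENS})"
        )
```

Raising here (before Activity 2 runs) would also let Temporal surface a clean failure for the lorem-ipsum case instead of a provider-side context error.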
| # f"Please extract the structured metadata from this document."
| # )
| # Activity 2: Extract metadata using LLM
| metadata = await workflow.execute_activity(
We should also add a Temporal RetryPolicy
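For reference, a configuration sketch of what that could look like at this call site, based on the Temporal Python SDK's `RetryPolicy`; the timeout, intervals, attempt count, and non-retryable error type are placeholders, and this fragment assumes it sits inside the workflow's run method (it is not runnable standalone).

```python
from datetime import timedelta

from temporalio.common import RetryPolicy

# Inside the workflow's run method; all values below are placeholders.
metadata = await workflow.execute_activity(
    metadata_extraction,
    ExtractMetadataRequest(text=content.text),
    start_to_close_timeout=timedelta(minutes=2),
    retry_policy=RetryPolicy(
        initial_interval=timedelta(seconds=2),
        backoff_coefficient=2.0,
        maximum_attempts=3,
        # don't retry errors that won't go away, e.g. schema validation
        non_retryable_error_types=["ValidationError"],
    ),
)
```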
Force-pushed 592ac93 to d480dab
Force-pushed d480dab to da14ab8
Add LLM-powered metadata extraction workflow
This PR adds PDF metadata extraction using LLMs as a second activity in the workflow.
It takes the output from Activity 1 (the full text extracted from the PDF using an extraction tool), uses an LLM to extract the title, authors, and abstract from that text, and returns them.