PATENTS queries: EMA initialization and CPC hierarchy-level definitions are under-specified #45

@sahrizvi

Summary

While running an agent against query_PATENTS, three places in the question text appear to be under-specified relative to what the validators check. This is a question for the maintainers — is the ambiguity intentional (e.g., to test agents' ability to handle under-specified data tasks), or worth tightening to match the convention already established in db_description_withhint.txt files for other datasets?

Observations

1. CPC hierarchy "level" naming is benchmark-internal

The questions in query_PATENTS reference CPC hierarchy "levels" by number, but the level numbering doesn't match any standard CPC reference (USPTO, EPO, Wikipedia) we could find.

Comparing query1 and query2 ground truths:

| Query | Question text | Ground-truth code shape | Standard CPC term |
| --- | --- | --- | --- |
| q1 | "…CPC group codes at level 5 whose best year is 2022." | 4-char codes (e.g. A22B, A23J, A41G) | subclass |
| q2 | "…the best year for each CPC group at level 4." | 3-char codes (e.g. A21, A61, B23) | class |

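So "level 5" in q1 means subclass and "level 4" in q2 means class. Counting down the standard CPC hierarchy (section, class, subclass, main group, subgroup), class is the second tier and subclass the third, so neither number follows from the standard ordering, and the word "group" in both questions points at a finer granularity than the ground truth uses. An agent that interprets these levels via standard CPC documentation lands at a different granularity than the validator expects.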
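
To make the inferred convention concrete, here is a minimal sketch of the truncation the ground truths seem to imply. The mapping `LEVEL_TO_PREFIX_LEN` and the helper `truncate_cpc` are hypothetical names, derived only from the q1/q2 code shapes above; the benchmark's actual definition of the levels is not known, and the sample symbol is purely illustrative.

```python
# Hypothetical mapping inferred from the q1/q2 ground-truth code shapes only:
# "level 5" -> 4-character subclass, "level 4" -> 3-character class.
LEVEL_TO_PREFIX_LEN = {4: 3, 5: 4}


def truncate_cpc(code: str, level: int) -> str:
    """Cut a full CPC symbol down to the prefix a given 'level' appears to mean."""
    compact = code.replace(" ", "")  # tolerate "A22B 3/00" as well as "A22B3/00"
    return compact[: LEVEL_TO_PREFIX_LEN[level]]


print(truncate_cpc("A22B 3/00", 5))  # subclass-shaped prefix
print(truncate_cpc("A22B 3/00", 4))  # class-shaped prefix
```

Other level numbers are deliberately left out of the mapping, since nothing in q1 or q2 shows what they would correspond to.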

2. Exponential moving average initialization is unspecified

Both q1 and q2 ask for the "highest exponential moving average of patent filings each year." The recurrence

EMA[t] = α · x[t] + (1 − α) · EMA[t−1]

requires a seed for EMA[0], and there are at least three legitimate conventions:

  • Seed with first observation: EMA[0] = x[0]
  • Seed with zero: EMA[0] = 0
  • Simple-average warmup: EMA[N−1] = mean(x[0..N−1]) for a chosen warmup length N, with no EMA defined before index N−1

Each convention produces a different EMA series and can therefore select a different "best year" for a given CPC code, which in turn changes which codes pass the q1 (best_year = 2022) filter. The question doesn't say which convention to use.
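
A small sketch of how the seed alone can flip the selected year. Everything here is illustrative: the function names, the toy filing counts, α = 0.5, and the warmup length of 3 are assumptions made for the example, not values taken from the benchmark or its validators.

```python
from statistics import mean


def ema(xs, alpha, seed="first", warmup=3):
    """EMA aligned with xs under one of the three seeding conventions.

    Entries are None where the EMA is not yet defined (warmup only).
    """
    if seed == "first":      # EMA[0] = x[0]
        out, start = [float(xs[0])], 1
    elif seed == "zero":     # EMA[0] = 0, taken literally as listed above
        out, start = [0.0], 1
    elif seed == "sma":      # EMA[N-1] = mean(x[0..N-1])
        out = [None] * (warmup - 1) + [float(mean(xs[:warmup]))]
        start = warmup
    else:
        raise ValueError(f"unknown seed: {seed}")
    for x in xs[start:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out


def best_year(years, xs, alpha, **kwargs):
    """Year with the highest defined EMA value (ties go to the later year)."""
    values = ema(xs, alpha, **kwargs)
    return max((v, y) for v, y in zip(values, years) if v is not None)[1]


# Toy data: one early spike followed by a late rebound.
years = [2018, 2019, 2020, 2021, 2022]
filings = [100, 10, 10, 10, 60]

for seed in ("first", "zero", "sma"):
    print(seed, best_year(years, filings, alpha=0.5, seed=seed))
```

On this toy series the first-observation seed picks 2018, while the zero seed and the simple-average warmup both pick 2022, so the same code would pass or fail the q1 filter depending only on the initialization.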

3. The cardinality of "highest" is unspecified

The question asks for the CPC areas with the "highest" exponential moving average but doesn't say how many. The q1 ground truth has exactly 50 entries and the q2 ground truth has 23. An agent reading the question alone has no way to derive these specific cutoffs.

Suggested resolution (optional)

If the ambiguity is unintentional, the most lightweight fix would mirror what other datasets in DAB already do — add a paragraph to query_PATENTS/db_description_withhint.txt pinning down the conventions, e.g.:

"EMA initialization: seed with the first observation (EMA[0] = x[0]). 'Level 5' refers to the 4-character CPC subclass code (e.g., A22B); 'level 4' refers to the 3-character class code (e.g., A22). For 'highest' rankings, return the top 50 (q1) or top 23 (q2) entries."

This stays in the spirit of the existing hint-file convention (e.g., the explicit term-code definitions in query_stockmarket/db_description_withhint.txt) and doesn't change the questions themselves. Happy to send a PR if the maintainers think this direction is welcome.

If the under-specification is intentional — testing agents on ambiguous real-world specs — feel free to close the issue; this is just to make sure the convention isn't an unintended gap.
