PATENTS queries: EMA initialization and CPC hierarchy-level definitions are under-specified

## Summary

While running an agent against `query_PATENTS`, three places in the question text appear to be under-specified relative to what the validators check. This is a question for the maintainers: is the ambiguity intentional (e.g., to test agents' ability to handle under-specified data tasks), or is it worth tightening to match the convention already established in the `db_description_withhint.txt` files for other datasets?

## Observations

### 1. CPC hierarchy "level" naming is benchmark-internal

The questions in `query_PATENTS` reference CPC hierarchy "levels" by number, but the level numbering doesn't match any standard CPC reference (USPTO, EPO, Wikipedia) we could find.

Comparing `query1` and `query2` ground truths:

| Query | Question text | Ground-truth code shape | Standard CPC term |
|---|---|---|---|
| q1 | *"…CPC group codes at **level 5** whose best year is 2022."* | 4-char codes (e.g. `A22B`, `A23J`, `A41G`) | **subclass** |
| q2 | *"…the best year for each CPC group at **level 4**."* | 3-char codes (e.g. `A21`, `A61`, `B23`) | **class** |

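So "level 5" in q1 means subclass and "level 4" in q2 means class. Counting the standard CPC hierarchy from the top (section, class, subclass, main group, subgroup), class would be level 2 and subclass level 3, so an agent that interprets these levels via standard CPC documentation lands at a different granularity than the validator expects.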
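To make the inferred mapping concrete, here is a minimal Python sketch of how an agent would have to truncate a CPC code to match the ground-truth shapes. The mapping table and the helper name `cpc_at_level` are hypothetical: they are inferred only from the q1 and q2 ground truths above, not from any benchmark documentation, and the full code `A22B5/00` is just an illustrative input.

```python
# Hypothetical mapping from the benchmark's "level" numbers to a prefix length.
# Inferred only from the q1/q2 ground truths; other levels are unknown.
LEVEL_TO_PREFIX_LEN = {
    4: 3,  # "level 4" -> 3-char class code, e.g. "A22"
    5: 4,  # "level 5" -> 4-char subclass code, e.g. "A22B"
}


def cpc_at_level(code: str, level: int) -> str:
    """Truncate a full CPC code to the prefix the ground truth appears to use."""
    if level not in LEVEL_TO_PREFIX_LEN:
        raise ValueError(f"no known prefix length for level {level}")
    return code.strip()[: LEVEL_TO_PREFIX_LEN[level]]


print(cpc_at_level("A22B5/00", 5))  # A22B
print(cpc_at_level("A22B5/00", 4))  # A22
```

Levels other than 4 and 5 raise an error on purpose, since nothing in the question text or ground truths says what they would mean.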

### 2. Exponential moving average initialization is unspecified

Both q1 and q2 ask for the "highest exponential moving average of patent filings each year." The recurrence

```
EMA[t] = α · x[t] + (1 − α) · EMA[t−1]
```

requires a seed for `EMA[0]`, and there are at least three legitimate conventions:

- **Seed with first observation**: `EMA[0] = x[0]`
- **Seed with zero**: `EMA[0] = 0`
- **Simple-average warmup**: `EMA[N−1] = mean(x[0..N−1])` for a warmup window of `N` observations, with the recurrence applied from `t = N` onward

Each convention produces a different `EMA` series and can therefore select a different "best year" per CPC code, which in turn changes which codes pass the q1 filter (`best_year = 2022`). The question doesn't say which convention to use.
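The effect is easy to reproduce on a toy series. The sketch below implements the three conventions listed above and picks the best year under each; the filing counts, `alpha = 0.5`, the warmup length of 3, and the names `ema_series` and `best_year` are all made up for illustration and are not taken from the benchmark. The zero seed is implemented literally as `EMA[0] = 0`, which means `x[0]` does not contribute.

```python
from statistics import mean


def ema_series(xs, alpha, seed="first", warmup=3):
    """EMA of xs under one of three seeding conventions.

    seed="first": EMA[0] = x[0]
    seed="zero":  EMA[0] = 0 (taken literally, so x[0] is ignored)
    seed="sma":   EMA[warmup-1] = mean(x[0..warmup-1]); earlier entries are None
    """
    xs = list(xs)
    if not xs:
        return []
    if seed == "first":
        out, start = [float(xs[0])], 1
    elif seed == "zero":
        out, start = [0.0], 1
    elif seed == "sma":
        if len(xs) < warmup:
            return [None] * len(xs)
        out = [None] * (warmup - 1) + [float(mean(xs[:warmup]))]
        start = warmup
    else:
        raise ValueError(f"unknown seed convention: {seed}")
    for x in xs[start:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out


def best_year(years, series):
    """Year with the highest defined EMA value (earliest year wins ties)."""
    defined = [(value, year) for year, value in zip(years, series) if value is not None]
    return max(defined, key=lambda pair: pair[0])[1]


# Toy data, not from the PATENTS dataset.
years = [2019, 2020, 2021, 2022]
filings = [100, 10, 10, 60]

for seed in ("first", "zero", "sma"):
    print(seed, best_year(years, ema_series(filings, alpha=0.5, seed=seed)))
# first 2019
# zero 2022
# sma 2022
```

On this toy series the same code would pass the q1 filter under two of the conventions and fail it under the third.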

### 3. The cardinality of "highest" is unspecified

The question asks for the CPC areas with the "highest" exponential moving average but doesn't say how many. The q1 ground truth has exactly 50 entries and the q2 ground truth has 23. An agent reading the question alone has no way to derive these specific cutoffs.

## Suggested resolution (optional)

If the ambiguity is unintentional, the most lightweight fix would mirror what other datasets in DAB already do — add a paragraph to `query_PATENTS/db_description_withhint.txt` pinning down the conventions, e.g.:

> "EMA initialization: seed with the first observation (`EMA[0] = x[0]`). 'Level 5' refers to the 4-character CPC subclass code (e.g., `A22B`); 'level 4' refers to the 3-character class code (e.g., `A22`). For 'highest' rankings, return the top 50 (q1) or top 23 (q2) entries."

This stays in the spirit of the existing hint-file convention (e.g., the explicit term-code definitions in `query_stockmarket/db_description_withhint.txt`) and doesn't change the questions themselves. Happy to send a PR if the maintainers think this direction is welcome.

If the under-specification is intentional — testing agents on ambiguous real-world specs — feel free to close the issue; this is just to make sure the convention isn't an unintended gap.

