fix: numberMatched returns null in heavy search requests#610

Open
YuriZmytrakov wants to merge 3 commits into main from CAT-1718

Conversation

@YuriZmytrakov
Collaborator

@YuriZmytrakov YuriZmytrakov commented Feb 23, 2026

Description:

Under heavy search load, the execute_search function returns numberMatched = null in a small fraction of requests because the search task does not wait for the count task to complete before returning the results.

[screenshot: search response showing numberMatched = null]

This issue occurs under heavy load in approximately 5–10% of requests. It was tested using a script that executed a batch of requests. To resolve this bug, the search task must wait for the count task to complete before returning results. This can be achieved by using asyncio.TaskGroup, which ensures that all tasks finish execution. After introducing asyncio.TaskGroup, the numberMatched = null issue no longer occurs.

PR Checklist:

  • Code is formatted and linted (run pre-commit run --all-files)
  • Tests pass (run make test)
  • Documentation has been updated to reflect changes, if applicable
  • Changes are added to the changelog


@ZSzCF ZSzCF left a comment


The asyncio tasks were just moved into a TaskGroup.

Collaborator

@jonhealy1 jonhealy1 left a comment


@YuriZmytrakov Thank you for looking into this and putting this PR together!

While I definitely understand the desire to guarantee the numberMatched field is populated, I think we need to balance this with the performance implications of forcing the API to wait for slow count queries. In large Elasticsearch/OpenSearch clusters, count queries can be significantly slower than standard searches.

Additionally, we cannot use asyncio.TaskGroup here for a few critical technical reasons:

  1. Exception Handling: In the original code, if the count task fails, we safely catch it when calling count_task.result(), log the error, and still return the search hits. If a task inside a TaskGroup fails, it raises an ExceptionGroup when the context manager exits. This bypasses our downstream error handling and will crash the API with a 500 error instead of failing gracefully.
  2. Aggressive Cancellation: If the count query fails, the TaskGroup will immediately cancel the search task, destroying the user's search request entirely.

Proposed Solution: A Bounded Timeout Toggle
Instead of forcing all deployments to wait infinitely (which risks hanging the API worker threads if the database stalls), the safest middle ground may be to introduce a configurable timeout variable (e.g., COUNT_TIMEOUT).

If we set the default to a fast 0.5 seconds, the API gives the count task a reasonable grace period to finish, but strictly aborts the wait if the database is struggling. Deployments that need guaranteed counts can increase this timeout, while those that prioritize raw speed can set it to 0.

Could we update this PR to implement this bounded approach? The logic would revert the TaskGroup back to create_task and look something like this:

# 1. Create the tasks independently
search_task = asyncio.create_task(
    self.client.search(
        index=index_param,
        # ... (keep existing search args)
    )
)

count_task = asyncio.create_task(
    self.client.count(
        index=index_param,
        ignore_unavailable=ignore_unavailable,
        body=count_query,
    )
)

# 2. Await the search task as the absolute priority
try:
    es_response = await search_task
except exceptions.NotFoundError:
    raise NotFoundError(f"Collections '{collection_ids}' do not exist")

# 3. Explicitly wait for count with a strict boundary
import os  # normally placed at module top

count_timeout = float(os.getenv("COUNT_TIMEOUT", "0.5"))

if count_timeout > 0 and not count_task.done():
    await asyncio.wait([count_task], timeout=count_timeout)

# 4. Safely extract the count if it finished
matched = None
if count_task.done():
    try:
        matched = count_task.result().get("count")
    except Exception as e:
        logger.error(f"Count task failed: {e}")

This ensures we don't break our error handling, protects the API's overall response times, and provides a safe, bounded way to vastly reduce numberMatched=null responses. Let me know what you think of this approach.

@YuriZmytrakov YuriZmytrakov force-pushed the CAT-1718 branch 4 times, most recently from f277b66 to 33a7aa0 Compare February 24, 2026 12:53
- Use COUNT_TIMEOUT environment variable (default 0.5s)
- Return search results without count if task exceeds timeout
- Log warning when count task times out to help debugging
- Add unit test simulating a delayed count task handled by the timeout
@YuriZmytrakov YuriZmytrakov force-pushed the CAT-1718 branch 3 times, most recently from d9458bb to f73f8ae Compare February 24, 2026 13:04
@YuriZmytrakov
Collaborator Author


Thank you, @jonhealy1! This is a very good suggestion. Although it does not fully resolve the count task issue, it significantly reduces the number of null values returned in /search requests. I tested this solution and observed a dramatic reduction in nulls. Adding this functionality will allow us to better calibrate the count timeout, preventing /search requests from slowing down while also minimizing null values in numberMatched. Additionally, I updated the documentation and added a unit test that simulates a delayed count task and verifies how it is handled by asyncio.wait.
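A sketch of how such a test can work, assuming a deliberately slow count coroutine and an illustrative COUNT_TIMEOUT constant (not the actual test from this PR):

```python
import asyncio

COUNT_TIMEOUT = 0.05  # stand-in for the COUNT_TIMEOUT env var

async def slow_count():
    # Simulate a count query that far exceeds the timeout
    await asyncio.sleep(10)
    return {"count": 42}

async def search_with_bounded_count():
    count_task = asyncio.create_task(slow_count())
    # asyncio.wait returns once the timeout elapses; it neither cancels
    # the task nor raises, so the response can still be built without it
    await asyncio.wait([count_task], timeout=COUNT_TIMEOUT)
    matched = count_task.result().get("count") if count_task.done() else None
    count_task.cancel()  # tidy up the still-pending task
    return matched

# The delayed count misses the deadline, so numberMatched falls back to None
matched = asyncio.run(search_with_bounded_count())
print(matched)  # None
```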

Collaborator

@jonhealy1 jonhealy1 left a comment


Looks great, thank you! A couple of minor things:


if count_timeout > 0 and not count_task.done():
    try:
        print("Waiting for count task to complete...")
Collaborator


This should be logger.debug not print.

CHANGELOG.md Outdated

### Fixed

- Fixed `numberMatched=null` responses by using `asyncio.TaskGroup` to wait for both `search` and `count` tasks to complete in `execute_search`, preventing premature returns when count task hadn't finished.[#610](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/610)
Collaborator


Changelog entry needs to be updated.

Collaborator

@jonhealy1 jonhealy1 left a comment

