fix: numberMatched returns null in heavy search requests#610

Open
YuriZmytrakov wants to merge 3 commits into main from CAT-1718

Conversation

@YuriZmytrakov
Collaborator

@YuriZmytrakov YuriZmytrakov commented Feb 23, 2026

Description:

Under heavy search load, the execute_search function returns numberMatched = null in a small fraction of requests because the search task does not wait for the count task to complete before returning the results.

[screenshot: search response showing numberMatched = null]

This issue occurs under heavy load in approximately 5–10% of requests. It was tested using a script that executed a batch of requests. To resolve this bug, the search task must wait for the count task to complete before returning results. This can be achieved by using asyncio.TaskGroup, which ensures that all tasks finish execution. After introducing asyncio.TaskGroup, the numberMatched = null issue no longer occurs.

PR Checklist:

  • Code is formatted and linted (run pre-commit run --all-files)
  • Tests pass (run make test)
  • Documentation has been updated to reflect changes, if applicable
  • Changes are added to the changelog


@ZSzCF ZSzCF left a comment


The asyncio tasks were just moved into a TaskGroup.

Collaborator

@jonhealy1 jonhealy1 left a comment


@YuriZmytrakov Thank you for looking into this and putting this PR together!

While I definitely understand the desire to guarantee the numberMatched field is populated, I think we need to balance this with the performance implications of forcing the API to wait for slow count queries. In large Elasticsearch/OpenSearch clusters, count queries can be significantly slower than standard searches.

Additionally, we cannot use asyncio.TaskGroup here for a few critical technical reasons:

  1. Exception Handling: In the original code, if the count task fails, we safely catch it when calling count_task.result(), log the error, and still return the search hits. If a task inside a TaskGroup fails, it raises an ExceptionGroup when the context manager exits. This bypasses our downstream error handling and will crash the API with a 500 error instead of failing gracefully.
  2. Aggressive Cancellation: If the count query fails, the TaskGroup will immediately cancel the search task, destroying the user's search request entirely.

Proposed Solution: A Bounded Timeout Toggle
Instead of forcing all deployments to wait infinitely (which risks hanging the API worker threads if the database stalls), the safest middle ground may be to introduce a configurable timeout variable (e.g., COUNT_TIMEOUT).

If we set the default to a fast 0.5 seconds, the API gives the count task a reasonable grace period to finish, but strictly aborts the wait if the database is struggling. Deployments that need guaranteed counts can increase this timeout, while those that prioritize raw speed can set it to 0.

Could we update this PR to implement this bounded approach? The logic would revert the TaskGroup back to create_task and look something like this:

# 1. Create the tasks independently
search_task = asyncio.create_task(
    self.client.search(
        index=index_param,
        # ... (keep existing search args)
    )
)

count_task = asyncio.create_task(
    self.client.count(
        index=index_param,
        ignore_unavailable=ignore_unavailable,
        body=count_query,
    )
)

# 2. Await the search task as the absolute priority
try:
    es_response = await search_task
except exceptions.NotFoundError:
    raise NotFoundError(f"Collections '{collection_ids}' do not exist")

# 3. Explicitly wait for count with a strict boundary
import os  # normally placed at module top

count_timeout = float(os.getenv("COUNT_TIMEOUT", "0.5"))

if count_timeout > 0 and not count_task.done():
    await asyncio.wait([count_task], timeout=count_timeout)

# 4. Safely extract the count if it finished
matched = None
if count_task.done():
    try:
        matched = count_task.result().get("count")
    except Exception as e:
        logger.error(f"Count task failed: {e}")

This ensures we don't break our error handling, protects the API's overall response times, and provides a safe, bounded way to vastly reduce numberMatched=null responses. Let me know what you think of this approach.

@YuriZmytrakov YuriZmytrakov force-pushed the CAT-1718 branch 4 times, most recently from f277b66 to 33a7aa0 Compare February 24, 2026 12:53
- Use COUNT_TIMEOUT environment variable (default 0.5s)
- Return search results without count if task exceeds timeout
- Log warning when count task times out to help debugging
- Add unit test simulating a delayed count task handled by the timeout
@YuriZmytrakov YuriZmytrakov force-pushed the CAT-1718 branch 3 times, most recently from d9458bb to f73f8ae Compare February 24, 2026 13:04
@YuriZmytrakov
Collaborator Author


Thank you, @jonhealy1! This is a very good suggestion. Although it does not fully resolve the count task issue, it significantly reduces the number of null values returned in /search requests. I tested this solution and observed a dramatic reduction in nulls. Adding this functionality will allow us to better calibrate the count timeout, preventing /search requests from slowing down while also minimizing null values in numberMatched. Additionally, I updated the documentation and added a unit test that simulates a delayed count task and verifies how it is handled by asyncio.wait.
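A sketch of how such a test can work, assuming a deliberately slow count coroutine and an illustrative COUNT_TIMEOUT constant (not the actual test from this PR):

```python
import asyncio

COUNT_TIMEOUT = 0.05  # stand-in for the COUNT_TIMEOUT env var

async def slow_count():
    # Simulate a count query that far exceeds the timeout
    await asyncio.sleep(10)
    return {"count": 42}

async def search_with_bounded_count():
    count_task = asyncio.create_task(slow_count())
    # asyncio.wait returns once the timeout elapses; it neither cancels
    # the task nor raises, so the response can still be built without it
    await asyncio.wait([count_task], timeout=COUNT_TIMEOUT)
    matched = count_task.result().get("count") if count_task.done() else None
    count_task.cancel()  # tidy up the still-pending task
    return matched

# The delayed count misses the deadline, so numberMatched falls back to None
matched = asyncio.run(search_with_bounded_count())
print(matched)  # None
```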

Collaborator

@jonhealy1 jonhealy1 left a comment


Looks great, thank you! A couple of minor things:


if count_timeout > 0 and not count_task.done():
    try:
        print("Waiting for count task to complete...")
Collaborator


This should be logger.debug not print.

CHANGELOG.md Outdated

### Fixed

- Fixed `numberMatched=null` responses by using `asyncio.TaskGroup` to wait for both `search` and `count` tasks to complete in `execute_search`, preventing premature returns when count task hadn't finished.[#610](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/610)
Collaborator


Changelog entry needs to be updated.

Collaborator

@jonhealy1 jonhealy1 left a comment

