fix: numberMatched returns null in heavy search requests#610
fix: numberMatched returns null in heavy search requests#610YuriZmytrakov wants to merge 3 commits intomainfrom
Conversation
ZSzCF
left a comment
There was a problem hiding this comment.
asyncio. tasks moved to TaskGroup just
jonhealy1
left a comment
There was a problem hiding this comment.
@YuriZmytrakov Thank you for looking into this and putting this PR together!
While I definitely understand the desire to guarantee the numberMatched field is populated, I think we need to balance this with the performance implications of forcing the API to wait for slow count queries. In large Elasticsearch/OpenSearch clusters, count queries can be significantly slower than standard searches.
Additionally, we cannot use asyncio.TaskGroup here for a few critical technical reasons:
- Exception Handling: In the original code, if the count task fails, we safely catch it when calling
count_task.result(), log the error, and still return the search hits. If a task inside aTaskGroupfails, it raises anExceptionGroupwhen the context manager exits. This bypasses our downstream error handling and will crash the API with a 500 error instead of failing gracefully. - Aggressive Cancellation: If the count query fails, the
TaskGroupwill immediately cancel the search task, destroying the user's search request entirely.
Proposed Solution: A Bounded Timeout Toggle
Instead of forcing all deployments to wait infinitely (which risks hanging the API worker threads if the database stalls), the safest middle ground may be to introduce a configurable timeout variable (e.g., COUNT_TIMEOUT).
If we set the default to a fast 0.5 seconds, the API gives the count task a reasonable grace period to finish, but strictly aborts the wait if the database is struggling. Deployments that need guaranteed counts can increase this timeout, while those that prioritize raw speed can set it to 0.
Could we update this PR to implement this bounded approach? The logic would revert the TaskGroup back to create_task and look something like this:
# 1. Create the tasks independently
search_task = asyncio.create_task(
self.client.search(
index=index_param,
# ... (keep existing search args)
)
)
count_task = asyncio.create_task(
self.client.count(
index=index_param,
ignore_unavailable=ignore_unavailable,
body=count_query,
)
)
# 2. Await the search task as the absolute priority
try:
es_response = await search_task
except exceptions.NotFoundError:
raise NotFoundError(f"Collections '{collection_ids}' do not exist")
# 3. Explicitly wait for count with a strict boundary
import os
count_timeout = float(os.getenv("COUNT_TIMEOUT", 0.5))
if count_timeout > 0 and not count_task.done():
await asyncio.wait([count_task], timeout=count_timeout)
# 4. Safely extract the count if it finished
if count_task.done():
try:
matched = count_task.result().get("count")
except Exception as e:
logger.error(f"Count task failed: {e}")This ensures we don't break our error handling, protects the API's overall response times, but provides a safe, bounded way to vastly reduce numberMatched=null responses. Let me know what you think of this approach.
f277b66 to
33a7aa0
Compare
- Use COUNT_TIMEOUT environment variable (default 0.5s) - Return search results without count if task exceeds timeout - Log warning when count task times out to help debugging - Add unit test for count delay simulation and solved by timeout
d9458bb to
f73f8ae
Compare
Thank you, @jonhealy1 This is a very good suggestion. Although it does not fully resolve the issue of the count task, it significantly reduces the number of |
jonhealy1
left a comment
There was a problem hiding this comment.
Looks great - thank you - a couple of minor things
|
|
||
| if count_timeout > 0 and not count_task.done(): | ||
| try: | ||
| print("Waiting for count task to complete...") |
There was a problem hiding this comment.
This should be logger.debug not print.
CHANGELOG.md
Outdated
|
|
||
| ### Fixed | ||
|
|
||
| - Fixed `numberMatched=null` responses by using `asyncio.TaskGroup` to wait for both `search` and `count` tasks to complete in `execute_search`, preventing premature returns when count task hadn't finished.[#610](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/610) |
There was a problem hiding this comment.
Changelog entry needs to be updated.
5c133ab to
84d2991
Compare
jonhealy1
left a comment
There was a problem hiding this comment.
Thanks @YuriZmytrakov
Description:
In heavy search load, the
execute_searchfunction returnsnumberMatched = nullin small fraction of request because thesearchtask does not wait for thecounttask to complete before returning the results.This issue occurs under heavy load in approximately 5–10% of requests. It was tested using a script that executed a butch of requests. To resolve this bug, the
searchtask must wait for thecounttask to complete before returning results. This can be achieved by implementingasyncio.TaskGroup, which ensures that all tasks finish execution. After introducingasyncio.TaskGroup, thenumberMatched = nullissue no longer occurs.PR Checklist:
pre-commit run --all-files)make test)