Refactor BatchLogRecordProcessor and associated tests#4535

Merged
lzchen merged 7 commits into open-telemetry:main from DylanRussell:refactor_blrp
Apr 24, 2025
Conversation

@DylanRussell
Contributor

@DylanRussell DylanRussell commented Apr 9, 2025

Description

Refactor BatchLogRecordProcessor, keeping the existing behavior mostly the same. This PR cleans up the code, including the tests, and also adds some new tests.

One exception is forceFlush, which now calls export synchronously from the main thread and waits for it to finish.

Previously, forceFlush would wait timeout_millis for the worker thread to make and finish an export call; if an export call was already in progress, it would wait for the subsequent export call to finish. It returned true if that export call completed in time and false otherwise. It didn't cancel the request after the timeout; it just stopped waiting for it to finish.

I think ideally forceFlush.timeout_millis (and also shutdown.timeout_millis) should be the time after which the export call(s) get cancelled. But for that to work we need to be able to pass a timeout to export, as proposed in #4183. Until then I think we should ignore it and document that it doesn't work.

I'm not sure what forceFlush should return; currently I have it return nothing (same as JavaScript). It could always return True, to signify that export was called until the queue was empty. Or it could return True if all export calls succeeded and False otherwise, stopping after the first failed export, like Go does.

I think my proposed behavior is more in line with the spec too.

Note that the default for forceFlush.timeout_millis came from the OTEL_BLRP_EXPORT_TIMEOUT environment variable, which is supposed to configure "the maximum allowed time to export data from the BatchLogRecordProcessor". I propose we leave this env var unused for now and document that it doesn't do anything. This flag seems redundant with the OTLP Exporter timeout env vars anyway. Maybe in other languages the BatchLogRecordProcessor isn't the default one used for auto instrumentation, so it makes more sense for it to be configurable?

Type of change

Please delete options that are not relevant.

  • [x] Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Added lots of unit tests.

Does This PR Require a Contrib Repo Change?

  • [ ] Yes. - Link to PR:
  • [x] No.

Checklist:

  • [x] Followed the style guidelines of this project
  • [ ] Changelogs have been updated
  • [x] Unit tests have been added
  • [ ] Documentation has been updated

@DylanRussell DylanRussell requested a review from a team as a code owner April 9, 2025 20:06
@xrmx xrmx moved this to Ready for review in Python PR digest Apr 14, 2025
@aabmass
Member

aabmass commented Apr 16, 2025

This flag seems redundant with the OTLP Exporter timeout env vars anyway. Maybe in other languages the BatchLogRecordProcessor isn't the default one used for auto instrumentation, so it makes more sense for it to be configurable?

The OTEL_BLRP_EXPORT_TIMEOUT should work with all exporters, not just OTLP. I think the intention of having a separate one for OTLP is to specifically target OTLP exporters in case there are multiple BLRP instances. It's definitely a bit clunky, though.

Member

@aabmass aabmass left a comment

This looks like a huge improvement to the complexity of the threading code 😃

I'd like to get some more eyes on this, since concurrency bugs can be really subtle.

@aabmass
Member

aabmass commented Apr 16, 2025

I think the failing Windows run is pretty typical of what we see with sleep() in tests: https://github.com/open-telemetry/opentelemetry-python/actions/runs/14366019090/job/40279304137?pr=4535. It might pass on a future run, but please try to improve the flakiness if you can.
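A common way to reduce this kind of flakiness is to replace a single fixed sleep() with a bounded poll, so slow CI runners get more time without making fast runs slower. A minimal helper (names are illustrative, not from the test suite):

```python
import threading
import time


def wait_until(predicate, timeout=5.0, interval=0.01):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns the predicate's final value, so a test can simply assert on it.
    More robust on slow CI runners than asserting after one fixed sleep().
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()


# Example: wait for a background thread to set a flag.
done = threading.Event()
threading.Timer(0.05, done.set).start()
assert wait_until(done.is_set)
```

Where the code under test exposes a real synchronization primitive (an Event or Condition), waiting on that directly is better still; polling is the fallback when only observable side effects are available.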

Member

@pmcollins pmcollins left a comment


LGTM, thanks for these important changes!

My only feedback is that it might also be helpful to eventually factor some classes out. For example, self._queue and self._queue_lock are often used together and could perhaps live in their own class. More generally, we're doing batching for both spans and logs -- could one generic batcher handle both signals?

@DylanRussell
Contributor Author

Added a buffer to the test that flaked, thanks for pointing that out. Hopefully it passes this time.

@DylanRussell
Contributor Author

My only feedback here is maybe it would also be helpful to try to eventually factor some classes out. For example, self._queue and self._queue_lock are often used together and perhaps could be in their own class. Also, more generally, we're doing batching for spans and logs -- could we use one generic batcher that could handle both signals?

Sounds good! I will look into this. I was planning to fix the BatchSpanProcessor code, which works the exact same way, so a generic batch class makes a lot of sense. I think I'll do that in a separate PR though; this one is already getting big.

@DylanRussell
Contributor Author

Can someone add the Skip Changelog tag? I don't think this needs a changelog, since it's basically just a refactor and doesn't change behavior.

@DylanRussell
Contributor Author

Alright, I think this is good to merge; it just needs the Skip Changelog tag and then for someone to push it.

Contributor

@lzchen lzchen left a comment

Much cleaner than before, thanks!

@lzchen lzchen added the Skip Changelog PRs that do not require a CHANGELOG.md entry label Apr 23, 2025
@lzchen lzchen merged commit 00329e0 into open-telemetry:main Apr 24, 2025
477 of 481 checks passed
@github-project-automation github-project-automation bot moved this from Ready for review to Done in Python PR digest Apr 24, 2025
DylanRussell added a commit to DylanRussell/opentelemetry-python that referenced this pull request Apr 29, 2025
DylanRussell added a commit to DylanRussell/opentelemetry-python that referenced this pull request Apr 30, 2025
liustve added a commit to aws-observability/aws-otel-python-instrumentation that referenced this pull request Feb 4, 2026
Automated update of OpenTelemetry dependencies.

**Build Status:** ❌
[failure](https://github.com/aws-observability/aws-otel-python-instrumentation/actions/runs/21465140126)

**Updated versions:**
- [OpenTelemetry
Python](https://github.com/open-telemetry/opentelemetry-python/releases/tag/v1.39.1):
1.39.1
- [OpenTelemetry
Contrib](https://github.com/open-telemetry/opentelemetry-python-contrib/releases/tag/v0.60b1):
0.60b1
-
[opentelemetry-sdk-extension-aws](https://pypi.org/project/opentelemetry-sdk-extension-aws/2.1.0/):
2.1.0
-
[opentelemetry-propagator-aws-xray](https://pypi.org/project/opentelemetry-propagator-aws-xray/1.0.2/):
1.0.2

**Upstream releases with breaking changes:**
Note: the mechanism to detect upstream breaking changes is not perfect.
Be sure to check all new releases and understand if any additional
changes need to be addressed.

**opentelemetry-python:**
- [Version
1.35.0/0.56b0](https://github.com/open-telemetry/opentelemetry-python/releases/tag/v1.35.0)

**opentelemetry-python-contrib:**
- [Version
1.34.0/0.55b0](https://github.com/open-telemetry/opentelemetry-python-contrib/releases/tag/v0.55b0)

*Description of changes:*

- Un-reverts changes done in this PR:
#531

- Removed patches for Bedrock following the changes applied in these
PRs:
open-telemetry/opentelemetry-python-contrib#3544,
open-telemetry/opentelemetry-python-contrib#3548,
open-telemetry/opentelemetry-python-contrib#3875,
open-telemetry/opentelemetry-python-contrib#3990

- Removes patches for Secrets Manager, SNS, and Step Functions following
the changes applied in these PRs:
open-telemetry/opentelemetry-python-contrib#3734,
open-telemetry/opentelemetry-python-contrib#3737,
open-telemetry/opentelemetry-python-contrib#3765,

- Removes patches for Starlette following the changes applied in this
PR:
open-telemetry/opentelemetry-python-contrib#3456

- Changes imports and implementation of `OTLPAwsLogExporter`,
`AwsCloudWatchOtlpBatchLogRecordProcessor`, and
`CompactConsoleLogExporter` following these PRs:
open-telemetry/opentelemetry-python#4580,
open-telemetry/opentelemetry-python#4535,
open-telemetry/opentelemetry-python#4562,
open-telemetry/opentelemetry-python#4647,
open-telemetry/opentelemetry-python#4676

- Removes a few AWS semantic conventions from `_aws_attribute_keys` and
replaces them with equivalent ones from upstream following the changes
in this PR:
open-telemetry/opentelemetry-python#4791

- Fix Lambda instrumentation test to set `AWS_LAMBDA_FUNCTION_NAME` env
var following changes in:
open-telemetry/opentelemetry-python-contrib#3183

- Adds a few more contract tests to verify upstream's botocore
instrumentation library


By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

---------

Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Thomas Pierce <thp@amazon.com>
Co-authored-by: Steve Liu <liustve@amazon.com>

5 participants