Update CF_TIP_BOOT_FAILED_COUNT Metric #17

Open
AYUSHMAAN-B wants to merge 8 commits into ittiam-systems:master from AYUSHMAAN-B:Update_Cuttlefish_Metric

@AYUSHMAAN-B
Collaborator

Add 'is_candidate' label to 'CF_TIP_BOOT_FAILED_COUNT' metric to distinguish prod vs candidate instances. Update 'get_instance_name()' to return the GCE VM ID for accurate instance tracking.

@AYUSHMAAN-B marked this pull request as draft November 12, 2025 03:23
@AYUSHMAAN-B marked this pull request as ready for review November 12, 2025 03:25
if adb.get_device_state() != 'device':
  if environment.is_android_cuttlefish():
    logs.info('Trying to boot cuttlefish instance using stable build.')
    # Increment the boot failure count with the candidate field.
Collaborator

No need to add this comment for now.

Collaborator Author

Done

if environment.is_running_on_app_engine():
  return environment.get_value('GAE_INSTANCE', '')

# For Cuttlefish instances, get the GCE-assigned VM instance ID
Collaborator

No need to add this here; create a separate function for it. We don't want to modify an existing function. Changing the logic here might have adverse effects on other ClusterFuzz components that were not supposed to be affected by this change.

It seems you are retrieving the instance ID here, but it is not being transmitted through the metric. Why are we doing this?

Collaborator Author

> No need to add this here; create a separate function for it. We don't want to modify an existing function. Changing the logic here might have adverse effects on other ClusterFuzz components that were not supposed to be affected by this change.

Done. I have reverted the changes in get_instance_name().

> It seems you are retrieving the instance ID here, but it is not being transmitted through the metric. Why are we doing this?

I have now created get_instance_ID() in monitor.py to get the instance ID directly, so that it can be used in the metric.
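
For illustration, a minimal sketch of what a standalone helper of this kind could look like, assuming the bot runs on a GCE VM where the metadata server is reachable. The endpoint and the Metadata-Flavor header are the documented GCE ones; the function names, the injectable `fetch` parameter, and the None fallback are assumptions made for this example and are not the code in this PR.

```python
import urllib.request
from typing import Callable, Optional

# Documented GCE metadata endpoint that returns the numeric VM ID.
METADATA_URL = (
    'http://metadata.google.internal/computeMetadata/v1/instance/id')


def _fetch_from_metadata_server(url: str, timeout: float = 2.0) -> str:
  """Read a single value from the GCE metadata server."""
  request = urllib.request.Request(url, headers={'Metadata-Flavor': 'Google'})
  with urllib.request.urlopen(request, timeout=timeout) as response:
    return response.read().decode('utf-8').strip()


def get_gce_instance_id(
    fetch: Callable[[str], str] = _fetch_from_metadata_server
) -> Optional[str]:
  """Return the GCE VM ID, or None when it cannot be determined.

  `fetch` is injectable (an assumption of this sketch) so the lookup can be
  exercised without network access.
  """
  try:
    return fetch(METADATA_URL) or None
  except OSError:
    # urllib.error.URLError and socket timeouts are OSError subclasses, so
    # hosts outside GCE end up here instead of raising.
    return None
```

Keeping the lookup in its own function leaves get_instance_name() untouched, which is the separation requested in the review.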

Collaborator

@aditya-wazir left a comment

Running: git diff --name-only FETCH_HEAD
| src/clusterfuzz/_internal/base/utils.py
| src/clusterfuzz/_internal/metrics/monitoring_metrics.py
| src/clusterfuzz/_internal/platforms/android/flash.py
| src/clusterfuzz/_internal/tests/core/metrics/monitor_test.py
Running: pylint --score=no --jobs=0 --ignore=protos,tests,grammars clusterfuzz
Running: pylint --score=no --jobs=0 --ignore=protos,grammars --max-line-length=240 --disable no-member clusterfuzz._internal.tests
| ************* Module clusterfuzz._internal.tests.core.metrics.monitor_test
| src/clusterfuzz/_internal/tests/core/metrics/monitor_test.py:123:11: E0602: Undefined variable 'is_candidate' (undefined-variable)
| src/clusterfuzz/_internal/tests/core/metrics/monitor_test.py:124:12: E0602: Undefined variable 'is_candidate' (undefined-variable)
| src/clusterfuzz/_internal/tests/core/metrics/monitor_test.py:129:63: E0602: Undefined variable 'is_candidate' (undefined-variable)
| Return code is non-zero (2).
Running: yapf -p -d src/clusterfuzz/_internal/base/utils.py src/clusterfuzz/_internal/metrics/monitoring_metrics.py src/clusterfuzz/_internal/platforms/android/flash.py src/clusterfuzz/_internal/tests/core/metrics/monitor_test.py
Running: isort --dont-order-by-type --force-single-line-imports --force-sort-within-sections --line-length=80 -p handlers -p libs -p clusterfuzz -c src/clusterfuzz/_internal/base/utils.py src/clusterfuzz/_internal/metrics/monitoring_metrics.py src/clusterfuzz/_internal/platforms/android/flash.py src/clusterfuzz/_internal/tests/core/metrics/monitor_test.py
Linting failed, see errors above.
Error: Process completed with exit code 1.

The run/Basic Test check is failing with a lint error. Going forward, please run the unit tests and pylint before requesting further review. Thanks.

Updated Cuttlefish boot metric unit test and added two new unit tests
for Candidate fleet. Reverted changes in utils.py.
Added get_instance_ID() in monitor.py to get the GCE VM ID directly.
@AYUSHMAAN-B force-pushed the Update_Cuttlefish_Metric branch from f945014 to e3eeb08 November 12, 2025 12:38

# For Cuttlefish instances, get the GCE-assigned VM instance ID
# for accurate instance tracking.
def get_instance_ID():
Collaborator

This function is missing a docstring.

if instance_id:
  return instance_id

return utils.get_instance_name()
Collaborator

return utils.get_instance_name(): why is this needed?

Collaborator Author

I had added it as a fail-safe. I have now merged the get_instance_id() implementation directly into flash.py.

_monitored_resource.labels['project_id'] = utils.get_application_id()

- _monitored_resource.labels['instance_id'] = utils.get_instance_name()
+ _monitored_resource.labels['instance_id'] = get_instance_ID()
Collaborator

Why are we doing this? It will affect the rest of the metrics too. AFAIK, this should not be done here.

Collaborator Author

Done. Reverted the changes.
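
The concern above can be illustrated with a small stand-in, assuming a single monitored resource is shared by every metric the process reports. The names `SHARED_RESOURCE_LABELS` and `build_time_series` are hypothetical and only model the idea; they are not ClusterFuzz APIs.

```python
# Stand-in for a process-wide monitored resource that every metric reuses.
SHARED_RESOURCE_LABELS = {
    'project_id': 'my-project',
    'instance_id': 'bot-host-1',
}


def build_time_series(metric_name, metric_labels):
  """Return the data one metric would report in this simplified model."""
  return {
      'metric': metric_name,
      # Per-metric labels: only this metric sees them.
      'metric_labels': dict(metric_labels),
      # Resource labels: copied from the shared resource for every metric.
      'resource_labels': dict(SHARED_RESOURCE_LABELS),
  }
```

In this model, adding `instance_id` to one metric's own labels changes only that metric, whereas overwriting `SHARED_RESOURCE_LABELS['instance_id']` changes what every metric reports from then on.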

# Add 'is_candidate' field to distinguish between prod and
# candidate instances.
monitor.BooleanField('is_candidate'),
monitor.BooleanField('is_succeeded'),
Collaborator

Let's define monitor.StringField('instance_id') here. Check whether StringField is the most appropriate data type.

Collaborator Author

Done. Added the 'instance_id' field to the metric as suggested. I added it as a StringField because it is used as a string everywhere in the repo.

if environment.is_android_cuttlefish():
  logs.info('Trying to boot cuttlefish instance using stable build.')
  monitoring_metrics.CF_TIP_BOOT_FAILED_COUNT.increment({
      'build_id': build_info['bid'],
Collaborator

Add instance_id to the CF_TIP_BOOT_FAILED_COUNT metric. That way we have a consistent mapping among the is_candidate, instance_id, and is_succeeded keys, which can then be used in a PromQL query to get exact stats.

Collaborator Author

Done.
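
To show why carrying all of these keys on one metric helps with querying, here is a small in-memory stand-in for a labeled counter. `LabeledCounter`, `FIELDS`, and the sample label values are hypothetical and exist only for this sketch; the real metric is defined through ClusterFuzz's monitor module and queried from the monitoring backend.

```python
from collections import Counter
from typing import Dict, Tuple

# Label names discussed in this review.
FIELDS = ('build_id', 'instance_id', 'is_candidate', 'is_succeeded')


class LabeledCounter:
  """Tiny in-memory model of a counter metric with a fixed label set."""

  def __init__(self, fields: Tuple[str, ...]):
    self._fields = fields
    self._counts: Counter = Counter()

  def increment(self, labels: Dict[str, object]) -> None:
    """Add one to the series identified by a complete set of labels."""
    if set(labels) != set(self._fields):
      raise ValueError('labels must match the declared fields exactly')
    self._counts[tuple(labels[name] for name in self._fields)] += 1

  def total(self, **match: object) -> int:
    """Sum every series whose labels equal the given key/value pairs."""
    result = 0
    for key, count in self._counts.items():
      series = dict(zip(self._fields, key))
      if all(series[name] == value for name, value in match.items()):
        result += count
    return result


counter = LabeledCounter(FIELDS)
counter.increment({'build_id': '100', 'instance_id': 'vm-a',
                   'is_candidate': True, 'is_succeeded': False})
counter.increment({'build_id': '100', 'instance_id': 'vm-a',
                   'is_candidate': True, 'is_succeeded': False})
counter.increment({'build_id': '100', 'instance_id': 'vm-b',
                   'is_candidate': False, 'is_succeeded': True})
```

With every label present on each increment, a filter such as `counter.total(is_candidate=True, is_succeeded=False)` plays the role that a label matcher would play in a PromQL query.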

self.mock.get_device_state.return_value = 'device'
flash.flash_to_latest_build_if_needed()
args = call_queue.get(timeout=20)
time_series = args['time_series']
Collaborator

What about get_instance_id(), which we have added? Is there a test for that API?

Collaborator Author

Creating a unit test for get_instance_id() would have required a new flash_test.py file. That file did not exist before, and no functions in flash.py have unit tests. Since the function is used only once, I merged its logic directly into the single point of use to avoid creating a new file.
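
For reference, a sketch of how a standalone helper could be unit tested with `unittest.mock`, had it been kept. The helper below is hypothetical: its signature, the injected `read_metadata` callable, and the fallback behavior are assumptions for this example and do not reproduce the code in this PR.

```python
import unittest
from unittest import mock


def get_instance_id(read_metadata, fallback_name):
  """Hypothetical helper: prefer the VM ID, else fall back to a host name."""
  try:
    instance_id = read_metadata()
  except OSError:
    instance_id = None
  return instance_id or fallback_name


class GetInstanceIdTest(unittest.TestCase):
  """Tests for the hypothetical get_instance_id() helper."""

  def test_returns_vm_id_when_available(self):
    reader = mock.Mock(return_value='1234567890')
    self.assertEqual(get_instance_id(reader, 'host-1'), '1234567890')
    reader.assert_called_once_with()

  def test_falls_back_when_metadata_is_unreachable(self):
    reader = mock.Mock(side_effect=OSError('unreachable'))
    self.assertEqual(get_instance_id(reader, 'host-1'), 'host-1')

  def test_falls_back_on_empty_response(self):
    reader = mock.Mock(return_value='')
    self.assertEqual(get_instance_id(reader, 'host-1'), 'host-1')


def run_tests():
  """Run the test case and return the unittest result object."""
  suite = unittest.defaultTestLoader.loadTestsFromTestCase(GetInstanceIdTest)
  return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` executes the three cases without touching the network, because the metadata lookup is replaced by a mock in each test.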

Added 'instance_id' in the CF_TIP_BOOT_FAILED_COUNT metric. Reverted
changes in monitor.py. Merged 'get_instance_id()' implementation
directly in flash.py to avoid creating a new unit test file.

# For Cuttlefish instances, get the GCE-assigned VM instance ID
# for accurate instance tracking.
instance_id = None
Collaborator

Can we move this to line 176? Here it will be invoked even if the device is not Cuttlefish, and I don't want any regression or failure caused by our code. Please change it if this sounds good to you.
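
A sketch of the kind of guard being requested, assuming the instance ID lookup is passed in as a callable. `boot_failure_labels` and its parameters are hypothetical names used only to show that the lookup runs solely on the Cuttlefish path.

```python
def boot_failure_labels(is_cuttlefish, build_id, is_candidate,
                        resolve_instance_id):
  """Build labels for a boot failure, or None for non-Cuttlefish devices.

  `resolve_instance_id` is only called inside the Cuttlefish branch, so other
  device types never trigger the lookup.
  """
  if not is_cuttlefish:
    return None
  return {
      'build_id': build_id,
      'instance_id': resolve_instance_id(),
      'is_candidate': is_candidate,
      'is_succeeded': False,
  }
```

Moving the lookup inside the branch means a failure in it cannot affect devices that never needed the ID.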

Refactoring code related to getting GCE VM Instance ID

@patch(
    'clusterfuzz._internal.metrics.monitor.monitoring_v3.MetricServiceClient')
def test_cuttlefish_boot_success_metric(self, mock_client):
Collaborator

_for_production_fleet should be added to the name of this test and the one below. It is better to keep naming consistent across the tests we add.

@AYUSHMAAN-B force-pushed the Update_Cuttlefish_Metric branch from c4218ee to 997b2ac December 1, 2025 09:56