bypass panic intended for wlan firmware debugging #102

Open
jyoung8607 wants to merge 2 commits into commaai:master from jyoung8607:patch-1

jyoung8607 commented Feb 10, 2026

Potential mitigation for commaai/openpilot#35788. It looks like the WLAN firmware is crashing, and in the process, setting a development flag to force a kernel panic for debugging purposes. We can do without the panic.

The force_err_fatal flag was added in a standalone commit that explains how it works: https://android.googlesource.com/kernel/msm/+/3c2c7bf20432119e11105bd161ea5ddcedf4f116%5E%21/

If I'm understanding right, the sequence of events:

  1. Either the WLAN FW or the driver tries to color outside the lines talking between WLAN and the host

  2. ARM-SMMU frowns upon this; this is the visible start of the event:

[38516.422393] arm-smmu 15000000.apps-smmu: Unhandled context fault: iova=0xac9e40d8, fsr=0x40000402, fsynr=0x80003, cb=5
[38516.422424] arm-smmu 15000000.apps-smmu: FAR    = 00000000ac9e40d8
[38516.422445] arm-smmu 15000000.apps-smmu: FSR    = 40000402 [TF SS ]
[38516.422465] arm-smmu 15000000.apps-smmu: soft iova-to-phys=0x0000000000000000
[38516.422484] arm-smmu 15000000.apps-smmu: SOFTWARE TABLE WALK FAILED! Looks like 15000000.apps-smmu accessed an unmapped address!
[38516.422504] arm-smmu 15000000.apps-smmu: hard iova-to-phys (ATOS) failed
[38516.422522] arm-smmu 15000000.apps-smmu: SID=0x40
[38516.422541] arm-smmu 15000000.apps-smmu: Unhandled arm-smmu context fault!
  3. WLAN firmware seems to notice this or otherwise crash

  4. WLAN firmware raises an optional debug-me development flag while dying:

[38516.424565] icnss: Received force error fatal request from FW
  5. The icnss kernel driver is designed to panic upon receipt of this flag, and does so:
[38516.667361] icnss: ASSERT at line 2460
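
The sequence above can be checked mechanically against a captured dmesg. The following is a minimal userspace sketch, not driver code: it assumes the three marker strings appear exactly as in the log excerpts above, and the names `crash_stage`, `classify_line`, and `matches_crash_signature` are hypothetical helpers introduced only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Stages of the crash sequence, in the order they show up in dmesg. */
enum crash_stage {
    STAGE_NONE = 0,
    STAGE_SMMU_FAULT,     /* arm-smmu reports an unhandled context fault */
    STAGE_FW_FORCE_FATAL, /* firmware asked for a forced fatal error */
    STAGE_ICNSS_ASSERT    /* driver hit its assert and is about to panic */
};

/* Map one dmesg line to a stage by substring match.
 * The marker strings are copied from the log excerpts in this thread. */
static enum crash_stage classify_line(const char *line)
{
    if (strstr(line, "arm-smmu") && strstr(line, "Unhandled context fault"))
        return STAGE_SMMU_FAULT;
    if (strstr(line, "icnss: Received force error fatal request from FW"))
        return STAGE_FW_FORCE_FATAL;
    if (strstr(line, "icnss: ASSERT at line"))
        return STAGE_ICNSS_ASSERT;
    return STAGE_NONE;
}

/* Return 1 if the lines contain all three stages in order, else 0.
 * Unrelated lines between the stages are ignored. */
static int matches_crash_signature(const char *const *lines, size_t n)
{
    enum crash_stage next = STAGE_SMMU_FAULT;

    for (size_t i = 0; i < n; i++) {
        if (classify_line(lines[i]) != next)
            continue;
        if (next == STAGE_ICNSS_ASSERT)
            return 1;
        next = (enum crash_stage)(next + 1);
    }
    return 0;
}
```

With the change in this PR applied, the third marker would no longer be logged, so a scanner like this would need to look for the replacement message instead.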

The SMMU should have saved us from the scribbling. After that, we can skip the panic and hopefully just proceed with recovery. Even if subsystem restart doesn't work, we'd much rather just lose WLAN than panic.
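
For reading the fault itself, the `FSR = 40000402 [TF SS ]` line can be decoded by hand. Below is a small userspace sketch of that decoding. The bit positions are an assumption taken from the mainline Linux arm-smmu register definitions and may not match every vendor kernel; `decode_fsr` is a hypothetical helper, not a kernel function.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Write the names of the set FSR flag bits into out, each followed by a
 * space, mirroring the "[TF SS ]" style of the log line.
 * Bit positions are assumed from mainline arm-smmu headers. */
static void decode_fsr(uint32_t fsr, char *out, size_t cap)
{
    static const struct {
        uint32_t bit;
        const char *name;
    } flags[] = {
        { 1u << 1,  "TF"    }, /* translation fault */
        { 1u << 2,  "AFF"   }, /* access flag fault */
        { 1u << 3,  "PF"    }, /* permission fault */
        { 1u << 4,  "EF"    }, /* external fault */
        { 1u << 30, "SS"    }, /* stalled status */
        { 1u << 31, "MULTI" }, /* multiple faults recorded */
    };
    size_t used = 0;

    if (cap == 0)
        return;
    out[0] = '\0';

    for (size_t i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
        if (!(fsr & flags[i].bit))
            continue;
        int n = snprintf(out + used, cap - used, "%s ", flags[i].name);
        if (n < 0 || (size_t)n >= cap - used)
            break; /* stop once the buffer is full */
        used += (size_t)n;
    }
}
```

Bits that are not in the table, such as `0x400` in the value logged here, are ignored by this sketch.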

Comment out ICNSS_ASSERT and log error message.
 if (priv->force_err_fatal)
-	ICNSS_ASSERT(0);
+	// ICNSS_ASSERT(0);
+	icnss_pr_err("comma hax: skipping BUG_ON and proceeding with FW crash recovery");

jyoung8607 (Author) commented Feb 10, 2026

Feel free to adjust this to suit your telemetry.

@adeebshihadeh (Contributor)

@robbederks can you check this out?

jyoung8607 (Author) commented Feb 10, 2026

Added a reset of that flag once we log the event. Nothing else reset it before, since setting it resulted in a panic.

If we had total confidence this mitigation would work, we could remove this flag code entirely, but right now it serves our interests to keep the kernel logic identical up until the panic.
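
The log-then-reset behavior described here can be modelled in userspace as follows. This is a sketch under stated assumptions, not the driver code: `struct model_priv`, `model_handle_fw_crash`, and the `skipped_panics` counter are hypothetical, and only the `force_err_fatal` flag name comes from the PR.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the driver's private state.
 * Only force_err_fatal corresponds to a name used in the PR. */
struct model_priv {
    bool force_err_fatal;        /* set when FW requests a forced fatal error */
    unsigned int skipped_panics; /* hypothetical counter, e.g. for telemetry */
};

/* Model of the patched path: instead of asserting, log the event,
 * clear the flag so it cannot leak into a later crash, and let
 * recovery continue. Returns true if a panic was skipped. */
static bool model_handle_fw_crash(struct model_priv *priv)
{
    if (!priv->force_err_fatal)
        return false;

    fprintf(stderr, "skipping BUG_ON and proceeding with FW crash recovery\n");
    priv->force_err_fatal = false; /* nothing else resets it */
    priv->skipped_panics++;
    return true;
}
```

Without the reset, a second unrelated firmware crash would still see the stale flag and be logged as another forced-fatal event.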

If we have access to newer qcom modem/wifi firmware, that would also be good to try. When's the last time we've even touched it? There's no public repo history before commaai/agnos-builder#429.

robbederks (Collaborator) commented Feb 13, 2026

This isn't the only place it will cause the kernel to panic, but in combination with #100 it might limp on a bit longer.

What would really help though is some way to reproduce this. It's (still) a very prevalent crash in the field, but I haven't been able to reproduce it on a desk at all.

From looking through a few events, it mostly seems to happen when the device is connecting to a known network that has just come into range, but simulating that on a desk has yielded little so far. Constant connect/disconnect loops through NetworkManager or D-Bus also didn't trigger anything for me. Maybe it's network-dependent?

Re: a newer kernel driver: I tried backporting a newer version (iirc from a better-maintained Xiaomi SDM845 kernel) in #98, but that also had a few instances of the same bug after a few days on the ephot micis. Not sure if we have a newer firmware image; will take a look.

@jyoung8607 (Author)

Re: how likely this patch is to help, I am largely in agreement with you.

I agree that being Wi-Fi connected is a factor. But the most recent report I worked on wasn't a device coming back into range; it was driving along a major highway with a hotspot in the car. No roaming and no weak-signal issues. It's not very likely it ever saw another usable network, though that's technically possible if it remembers a popular public SSID.

I sense (gut, not telemetry) that it's environmental. It's rare and not many people experience it, but those who do have reported it happening multiple times. It's probably not a hardware variance; it feels like software. So it's something about or around the user.

The Wi-Fi world has changed a lot since SDM845. We never did get WPA3 working, so the firmware may just be that old. You can see everything it complains about in dmesg. It's going to see all kinds of new crap in beacon and other management frames that didn't exist back then, and it will try to parse them even when not associated.

Couple thoughts:

  • For a repro, maybe put it in some complex or hostile environments? https://github.com/alipay/Owfuzz
  • Take ALL the matching panics in your telemetry and throw their last known GPS location on a map. See if the map has interesting clusters.
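
The clustering idea in the second bullet can be sketched without a map at all, by binning coordinates into a coarse grid and finding the most crowded cell. This is only an illustration: the struct, the function names, and the cell size are all hypothetical, and real telemetry would need its own loading and a proper distance measure.

```c
#include <stddef.h>

/* One last-known location per matching panic. */
struct gps_point {
    double lat_deg;
    double lon_deg;
};

/* floor(value / cell) without math.h, so negatives bin correctly. */
static long cell_index(double value, double cell)
{
    double q = value / cell;
    long i = (long)q; /* truncates toward zero */

    if ((double)i > q)
        i--; /* step down for negative non-integers */
    return i;
}

/* Count the points that share a grid cell with pts[target], itself included. */
static size_t count_in_same_cell(const struct gps_point *pts, size_t n,
                                 size_t target, double cell_deg)
{
    long lat = cell_index(pts[target].lat_deg, cell_deg);
    long lon = cell_index(pts[target].lon_deg, cell_deg);
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (cell_index(pts[i].lat_deg, cell_deg) == lat &&
            cell_index(pts[i].lon_deg, cell_deg) == lon)
            count++;
    }
    return count;
}

/* Return the index of the first point in the most crowded cell and
 * store that cell's population in *count_out. O(n^2), fine for a sketch. */
static size_t busiest_cell(const struct gps_point *pts, size_t n,
                           double cell_deg, size_t *count_out)
{
    size_t best = 0, best_count = 0;

    for (size_t i = 0; i < n; i++) {
        size_t c = count_in_same_cell(pts, n, i, cell_deg);
        if (c > best_count) {
            best_count = c;
            best = i;
        }
    }
    *count_out = best_count;
    return best;
}
```

A fixed grid splits clusters that straddle a cell edge, so a cell with a surprisingly high count is a hint to look closer rather than a result.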

@andiradulescu

Why not just

# CONFIG_ICNSS_DEBUG is not set

instead of

CONFIG_ICNSS_DEBUG=y

so all ICNSS_ASSERT become no-ops?

@jyoung8607 (Author)

You could, but given the uncertain efficacy and lack of controlled repro, I wanted to retain logging of the event.
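
The trade-off between the two approaches can be shown with a userspace model of the three assert shapes. The real `ICNSS_ASSERT` definition is not quoted in this thread, so the shapes below are assumptions inferred from the `ASSERT at line 2460` log and the "skipping BUG_ON" message; all `MODEL_*` and `model_*` names are hypothetical.

```c
#include <stdio.h>

static int model_bug_count; /* stands in for BUG_ON(1), which would panic */
static int model_log_count; /* counts error lines that reach the log */

/* Assumed shape with the debug option enabled: log, then BUG. */
#define MODEL_ASSERT_DEBUG(cond) do {                             \
        if (!(cond)) {                                            \
            fprintf(stderr, "ASSERT at line %d\n", __LINE__);     \
            model_log_count++;                                    \
            model_bug_count++;                                    \
        }                                                         \
    } while (0)

/* Assumed shape with the debug option disabled: compiles to nothing,
 * so the panic goes away but so does the log line. */
#define MODEL_ASSERT_NODEBUG(cond) do { } while (0)

/* Shape of the approach in this PR: keep a log line, drop the BUG. */
#define MODEL_ASSERT_LOG_ONLY(cond) do {                          \
        if (!(cond)) {                                            \
            fprintf(stderr, "skipping BUG_ON at line %d\n",       \
                    __LINE__);                                    \
            model_log_count++;                                    \
        }                                                         \
    } while (0)
```

Only the log-only shape leaves a trace in dmesg while letting the system keep running, which is what makes the event countable in telemetry.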

@robbederks (Collaborator)

Ordered a supported SDR to try the openwifi fuzzer. Good idea, @jyoung8607!
