bypass panic intended for wlan firmware debugging #102
jyoung8607 wants to merge 2 commits into commaai:master
Comment out ICNSS_ASSERT and log error message.
  	if (priv->force_err_fatal)
- 		ICNSS_ASSERT(0);
+ 		// ICNSS_ASSERT(0);
+ 		icnss_pr_err("comma hax: skipping BUG_ON and proceeding with FW crash recovery");
Feel free to adjust this to suit your telemetry.
|
@robbederks can you check this out? |
|
Added a reset of that flag once we log the event. Nothing else reset it before, since setting it resulted in a panic. If we had total confidence this mitigation would work, we could remove this flag code entirely, but right now it serves our interests to keep the kernel logic identical up until the panic. If we have access to newer qcom modem/wifi firmware, that would also be good to try. When's the last time we've even touched it? There's no public repo history before commaai/agnos-builder#429. |
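For reference, a minimal userland sketch of the behavior described above: log the event, then clear the flag, instead of asserting. The struct `icnss_priv_mock`, the function `handle_fw_crash`, and the `fprintf` stand-in for `icnss_pr_err` are assumptions made for illustration, not the actual driver code.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the driver's private state; only the one
 * field discussed in this PR is modeled. */
struct icnss_priv_mock {
    bool force_err_fatal;
};

/* Hypothetical handler: returns true if the flag was set (and logged). */
static bool handle_fw_crash(struct icnss_priv_mock *priv)
{
    if (!priv->force_err_fatal)
        return false;

    /* fprintf stands in for icnss_pr_err in this userland sketch. */
    fprintf(stderr,
            "comma hax: skipping BUG_ON and proceeding with FW crash recovery\n");

    /* Reset after logging; per the comment above, nothing else clears it. */
    priv->force_err_fatal = false;
    return true;
}
```

Calling `handle_fw_crash` twice on a struct whose flag starts out set should log once and return false the second time, since the first call clears the flag.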
|
This isn't the only place it will cause the kernel to panic, but in combination with #100 it might limp on a bit longer. What would really help though is some way to reproduce this. It's (still) a very prevalent crash in the field, but I haven't been able to reproduce it on a desk at all. From looking through a few events, it mostly seems to happen when the device is connecting to a known network that's just come into range, but simulating that on a desk has yielded little so far. Constant connect/disconnect loops through NetworkManager or D-Bus didn't do anything for me either. Maybe network dependent? Re: a newer kernel driver: I've tried backporting a newer version (iirc from a Xiaomi sdm845 kernel that was more maintained) in #98, but that also had a few instances of the same bug after a few days on the ephot micis. Not sure if we have a newer firmware image, will take a look.
|
Re: how likely this patch is to help, I am largely in agreement with you. I agree that being Wi-Fi connected is a factor. But, the most recent report I worked on wasn't coming back into range, but driving along with a hotspot in the car, on a major highway. No roaming and no weak signal issues. Not very likely it ever sees another usable network, though technically possible if it remembers using a popular public SSID. I sense (gut, not telemetry) that it's environmental. It's rare, not many people experience it, but those that do have reported it happening multiple times. It's probably not a hardware variance, it feels like software. So it's something about or around the user. The Wi-Fi world has changed a lot since SDM845. We never did get WPA3 working, so the firmware may just be that old. You can see everything it complains about in dmesg. It's going to see all kinds of new crap in beacon and other management frames that didn't exist back then, which it's going to see and try to parse even if not associated. Couple thoughts:
|
|
Why not just instead of so all ICNSS_ASSERT become no-ops? |
|
You could, but given the uncertain efficacy and lack of controlled repro, I wanted to retain logging of the event. |
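To illustrate the trade-off, here is a hedged userland sketch of an assert-style macro that records the failure instead of compiling away to nothing. `SOFT_ASSERT` and `soft_assert_hits` are hypothetical names; the real `ICNSS_ASSERT` definition is not shown in this thread, so this is not a drop-in replacement for it.

```c
#include <stdio.h>

/* Hypothetical counter so the event stays visible to telemetry. */
static unsigned int soft_assert_hits;

/* Hypothetical macro: log and count a failed condition, then continue
 * instead of panicking. */
#define SOFT_ASSERT(cond)                                           \
    do {                                                            \
        if (!(cond)) {                                              \
            soft_assert_hits++;                                     \
            fprintf(stderr, "soft assert failed: %s (%s:%d)\n",     \
                    #cond, __FILE__, __LINE__);                     \
        }                                                           \
    } while (0)
```

A pure no-op macro would silence every call site at once. This variant keeps each failure countable and logged, which is the property the reply above wants to retain.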
|
Ordered a supported SDR to try the openwifi fuzzer. Good idea, @jyoung8607!
Potential mitigation for commaai/openpilot#35788. It looks like the WLAN firmware is crashing, and in the process, setting a development flag to force a kernel panic for debugging purposes. We can do without the panic.
The force_err_fatal flag was added in a standalone commit that explains how it works: https://android.googlesource.com/kernel/msm/+/3c2c7bf20432119e11105bd161ea5ddcedf4f116%5E%21/
If I'm understanding right, the sequence of events is:
1. Either the WLAN FW or the driver tries to color outside the lines talking between WLAN and the host
2. ARM-SMMU frowns upon this; this is the visible start of the event
3. WLAN firmware seems to notice this, or otherwise crashes
4. WLAN firmware raises an optional debug-me development flag while dying
The SMMU should have saved us from the scribbling. After that, we can skip the panic and hopefully just proceed with recovery. Even if subsystem restart doesn't work, we'd much rather just lose WLAN than panic.