fix: add deepcopy to not clone rm and cond_rm#423
fix: add deepcopy to not clone rm and cond_rm#423jazken wants to merge 3 commits intoapache:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR modifies Casbin’s Assertion model copying behavior to avoid deep-copying live role-manager state (rm/cond_rm) during policy reloads, addressing production race conditions observed when using watchers under concurrent load.
Changes:
- Add a custom
Assertion.__deepcopy__implementation. - Exclude
rmandcond_rmfrom deep copy to avoid recursive/cyclic graph traversal and concurrent mutation errors. - Add
copyimport to support the new deepcopy logic.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for k, v in self.__dict__.items(): | ||
| if k in ("rm", "cond_rm"): | ||
| setattr(result, k, None) | ||
| else: | ||
| setattr(result, k, copy.deepcopy(v, memo)) |
There was a problem hiding this comment.
Setting rm/cond_rm to None during deepcopy changes behavior when auto_build_role_links is disabled (or role links aren’t rebuilt after load_policy()): enforce_ex() builds the g() function from ast.rm/ast.cond_rm, so None makes g() fall back to name1 == name2 and effectively ignores any existing role manager state. Consider preserving the existing rm/cond_rm references (shallow copy / direct assignment) instead of forcing them to None, while still avoiding deep-copying their internal graphs.
| def __deepcopy__(self, memo): | ||
| """Custom deepcopy implementation that excludes rm and cond_rm attributes. | ||
| This stems from an issue with the watcher concurrent reloading causing an edge case of deepcopy error. | ||
| """ | ||
| cls = self.__class__ | ||
| result = cls.__new__(cls) | ||
| memo[id(self)] = result | ||
| for k, v in self.__dict__.items(): | ||
| if k in ("rm", "cond_rm"): | ||
| setattr(result, k, None) | ||
| else: | ||
| setattr(result, k, copy.deepcopy(v, memo)) | ||
| return result |
There was a problem hiding this comment.
This change introduces custom Assertion.__deepcopy__ behavior that affects load_policy() (which uses copy.deepcopy(self.model)), but there doesn’t appear to be a unit test covering the new deepcopy semantics (e.g., that rm/cond_rm are not deep-copied and that role-link behavior remains correct). Adding a focused test that exercises copy.deepcopy(Assertion) (and/or load_policy() on a model containing a RoleManager/ConditionalRoleManager) would help prevent regressions.
Background Problem
We currently use pycasbin in our Django backend for checking access control. We scaled up the number of casbin backends and workers and introduced a Redis watcher. It worked perfectly in our development environment but it failed in our production environment. After much debugging, we realised it was partially due to the slower hardware in our production environment which created a race condition.
Why?
Python's default deepcopy traverses the entire object graph recursively. RoleManager (rm) and ConditionalRoleManager (cond_rm) contain bidirectionally linked Role objects where role.roles and role.users reference each other, forming a cyclic graph. Under concurrent load, two failure modes arise:
Python's memo dict should catch this, but the cycle goes through reduce/getstate reconstruction paths that register the memo entry too late, allowing infinite recursion before the cycle is detected.
If a concurrent thread calls rm.add_link() via build_role_links() at the same moment, Python raises this error because the dict is mutated mid-iteration. This race window is wider on slower hardware where deepcopy takes longer, which is why the failure is non-deterministic and hardware-dependent.
Why excluding rm and cond_rm is safe
Both attributes are always rebuilt immediately after load_policy() completes via build_role_links() and build_conditional_role_links(). The deepcopy in load_policy() is only a snapshot of policy data (rules, tokens, values) — it never needs to carry live role manager state. Setting them to None in the copy is equivalent to the state of a freshly constructed Assertion before any role links have been built.
Transparency
The above is debugged together with the help of Claude, but the traceback and errors are validated by human.