Use npz instead of pickle for Thai W2P weights #1328
Change np.load to disallow pickling for security.
@copilot Make the np.load call that disallows pickle (allow_pickle=False) work.
@wannaphong I've opened a new pull request, #1329, to work on those changes. Once the pull request is ready, I'll request review from you.
* Initial plan
* Fix np.load allow_pickle=False to work with .npz NpzFile format
- Replace .item().get(key) with [key] dict-style access on NpzFile
- Remove variables instance attribute; use local variable instead
- Add type annotation for variables local var as np.lib.npyio.NpzFile
- Add allow_pickle=False to embeddings.npy load in words_spelling_correction.py
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
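The switch from `.item().get(key)` to `[key]` access can be illustrated with a minimal sketch. The array names below are hypothetical and are not taken from the Thai W2P archive; the point is only that an `.npz` file holds plain arrays, so it loads with `allow_pickle=False` and is read like a dict.

```python
import os
import tempfile

import numpy as np

# Hypothetical array names; the real Thai W2P archive may use different keys.
weights = {
    "enc_emb": np.arange(6, dtype=np.float32).reshape(2, 3),
    "dec_emb": np.ones((2, 2), dtype=np.float32),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.npz")
    # np.savez stores each array as its own plain .npy member, so no pickle is involved.
    np.savez(path, **weights)

    # NpzFile supports dict-style access. The older .item().get(key) pattern
    # only applies to a 0-d object array that wraps a pickled dict.
    with np.load(path, allow_pickle=False) as variables:
        enc_emb = variables["enc_emb"]
        keys = sorted(variables.files)
```

Using the `NpzFile` as a context manager closes the underlying zip file once the needed arrays have been read into memory.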
@copilot How to fix this? Traceback (most recent call last):
@wannaphong I've opened a new pull request, #1330, to work on those changes. Once the pull request is ready, I'll request review from you.
The thai_w2p corpus v0.2 is stored as a .npy pickled dict. Loading it with allow_pickle=False raises ValueError.
Detect file format by extension:
- .npz: use allow_pickle=False (secure, for future corpus versions)
- .npy (legacy): use allow_pickle=True + dict validation
Also add `import os` for os.path.splitext().
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
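A rough sketch of the extension-based dispatch described in this commit message follows. The function name `load_weights` is hypothetical and this is not the code in `w2p.py`; it only shows the idea of choosing the pickle policy from the file extension and validating the legacy result.

```python
import os
import tempfile

import numpy as np


def load_weights(path):
    """Load a dict of arrays, choosing the pickle policy from the extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".npz":
        # Plain arrays only: pickle stays disabled.
        with np.load(path, allow_pickle=False) as archive:
            return {key: archive[key] for key in archive.files}
    # Legacy .npy: a 0-d object array wrapping a pickled dict.
    # Unpickling can execute arbitrary code, so this is for trusted files only.
    loaded = np.load(path, allow_pickle=True).item()
    if not isinstance(loaded, dict):
        raise ValueError("Expected a dict of arrays in the legacy .npy file.")
    return loaded


with tempfile.TemporaryDirectory() as tmp:
    legacy_path = os.path.join(tmp, "legacy.npy")
    np.save(legacy_path, {"w": np.zeros(2)})  # saved as a pickled object array

    # Without pickle, NumPy refuses to load the object array.
    try:
        np.load(legacy_path, allow_pickle=False)
        raised_value_error = False
    except ValueError:
        raised_value_error = True

    legacy_weights = load_weights(legacy_path)
```

The `isinstance` check is only a shape check on the result; it does not make unpickling an untrusted file safe.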
Fix np.load allow_pickle=False breaking thai_w2p corpus loading
@copilot The dev branch is updated. Sync the changes. Review all the changes in file loading, including the pickle handling. Pay attention to security issues. When loading a pickle file, emit a warning.
Co-authored-by: bact <128572+bact@users.noreply.github.com>
This release focuses on security issues related to corpus file loading, including improved pickle handling and defensive file loading.
Disallow pickle in np.load, warn on legacy pickle loading, sync dev defensive file loading
Pull request overview
Updates corpus/model loading paths to reduce unsafe deserialization risks, primarily by tightening np.load usage and hardening some corpus CSV parsing.
Changes:
- Add `.npz`-first loading for Thai W2P weights and warn on legacy pickle-based `.npy` loading.
- Set `allow_pickle=False` for embedding matrix loading in spelling correction.
- Minor hardening/refactor of corpus CSV field validation and doc/changelog updates.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pythainlp/transliterate/w2p.py | Add .npz loading path and legacy .npy warning/pickle fallback for Thai W2P weights. |
| pythainlp/tag/_tag_perceptron.py | Update load() docstring to reflect JSON model format. |
| pythainlp/spell/words_spelling_correction.py | Disallow pickle when loading embeddings with np.load. |
| pythainlp/corpus/common.py | Refactor validation logic for CSV fields before processing/warning. |
| CHANGELOG.md | Add 5.3.1 entry and update compare links. |
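For the embeddings change, a small sketch may help show why `allow_pickle=False` is safe there: a purely numeric matrix saved with `np.save` contains no Python objects, so it loads with pickle disabled. The file name and matrix shape below are illustrative assumptions.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.npy")
    # A purely numeric matrix is stored in the raw .npy format, without pickle.
    matrix = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
    np.save(path, matrix)

    # Passing allow_pickle=False explicitly documents the intent, even though
    # it has been the default for np.load since NumPy 1.16.3.
    embeddings = np.load(path, allow_pickle=False)
```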
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Pull request overview
This PR tightens security around model/corpus loading by disabling legacy pickle-based .npy weight loading by default, adding a .npz-first loading path for Thai W2P, and requiring explicit opt-in via PYTHAINLP_ALLOW_UNSAFE_PICKLE with warnings.
Changes:
- Add `.npz` loading for Thai W2P weights and block legacy pickle-based `.npy` loading unless `PYTHAINLP_ALLOW_UNSAFE_PICKLE` is set (with warning).
- Introduce `is_unsafe_pickle_allowed()` and re-export it via `pythainlp` and `pythainlp.tools`.
- Update tests and documentation to reflect the new unsafe-pickle opt-in behavior.
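The opt-in gate summarized above could look roughly like the sketch below. Only the environment variable name comes from this PR; the helper names, the set of accepted values, and the exception type are assumptions made for illustration and may differ from the actual `is_unsafe_pickle_allowed()` in `pythainlp/tools/path.py`.

```python
import os
import warnings

import numpy as np

# The variable name comes from the PR; the accepted values are an assumption.
ENV_VAR = "PYTHAINLP_ALLOW_UNSAFE_PICKLE"
_TRUTHY = {"1", "true", "yes"}


def is_unsafe_pickle_allowed_sketch(environ=None):
    """Return True only if the user explicitly opted in (hypothetical logic)."""
    env = os.environ if environ is None else environ
    return env.get(ENV_VAR, "").strip().lower() in _TRUTHY


def load_legacy_npy(path, environ=None):
    """Load a pickled-dict .npy file only when the opt-in gate is open."""
    if not is_unsafe_pickle_allowed_sketch(environ):
        raise RuntimeError(
            f"Refusing to unpickle {path!r}; set {ENV_VAR}=1 to allow it."
        )
    # Warn every time the unsafe path is taken, as requested in the review.
    warnings.warn(
        f"Loading {path!r} with pickle enabled; only do this for trusted files.",
        UserWarning,
        stacklevel=2,
    )
    loaded = np.load(path, allow_pickle=True).item()
    if not isinstance(loaded, dict):
        raise ValueError("Expected a dict of arrays in the legacy file.")
    return loaded
```

Passing `environ` as a parameter keeps the gate easy to unit test without mutating the real process environment.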
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/extra/testx_transliterate.py | Adjust W2P-related tests to opt into legacy pickle loading and add coverage for blocked/allowed behavior. |
| tests/core/test_transliterate.py | Minor formatting change. |
| tests/core/test_tools.py | Add unit tests for is_unsafe_pickle_allowed(). |
| tests/compact/testc_util.py | Opt into unsafe pickle for tests that trigger W2P-backed functionality. |
| pythainlp/transliterate/w2p.py | Implement .npz-first loading; gate legacy .npy pickle loading behind env var + warning. |
| pythainlp/tools/path.py | Add is_unsafe_pickle_allowed() helper. |
| pythainlp/tools/__init__.py | Re-export is_unsafe_pickle_allowed(). |
| pythainlp/tag/_tag_perceptron.py | Update docstrings/comments to reflect JSON model save/load. |
| pythainlp/spell/words_spelling_correction.py | Load embeddings with allow_pickle=False. |
| pythainlp/corpus/common.py | Make CSV field validation checks more explicit/defensive. |
| pythainlp/__init__.py | Re-export is_unsafe_pickle_allowed() at top-level. |
| README_TH.md | Document PYTHAINLP_ALLOW_UNSAFE_PICKLE. |
| README.md | Document PYTHAINLP_ALLOW_UNSAFE_PICKLE. |
| CHANGELOG.md | Add 5.3.1 entry and note improved pickle handling. |
Updated model name and refactored variable loading to use .npz format exclusively, removing legacy .npy handling.
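Moving to `.npz` exclusively implies a one-time conversion of the legacy weights. The sketch below shows one way such a conversion could be done; the function name is hypothetical, it is not the script used for the Thai W2P model, and it should only be run on a file from a trusted source because the legacy read still unpickles.

```python
import os
import tempfile

import numpy as np


def convert_legacy_npy_to_npz(src, dst):
    """Rewrite a pickled-dict .npy file as a pickle-free .npz archive."""
    legacy = np.load(src, allow_pickle=True).item()  # trusted file only
    if not isinstance(legacy, dict):
        raise TypeError("Expected the legacy file to contain a dict of arrays.")
    arrays = {str(name): np.asarray(value) for name, value in legacy.items()}
    for name, array in arrays.items():
        # Object arrays would still require pickle, which defeats the purpose.
        if array.dtype == object:
            raise TypeError(f"{name!r} is an object array and cannot be stored safely.")
    np.savez(dst, **arrays)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "legacy.npy")
    dst = os.path.join(tmp, "converted.npz")
    np.save(src, {"w": np.arange(4, dtype=np.float32)})
    convert_legacy_npy_to_npz(src, dst)

    # The converted archive loads with pickle disabled.
    with np.load(dst, allow_pickle=False) as converted:
        w = converted["w"]
```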
@bact I converted the Thai W2P model to
Removed unused imports from w2p.py
@wannaphong I have removed most tests related to
Pull request overview
This PR hardens corpus/model loading against unsafe pickle deserialization by default, adds an explicit opt-in gate via PYTHAINLP_ALLOW_UNSAFE_PICKLE, and updates Thai W2P to load .npz weights.
Changes:
- Add `is_unsafe_pickle_allowed()` and re-export it via `pythainlp` and `pythainlp.tools`.
- Switch Thai W2P weights loading to a `.npz`-based format with `allow_pickle=False`.
- Update tests and docs/README/changelog to reflect the new unsafe-pickle opt-in behavior.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/core/test_transliterate.py | Minor formatting change at EOF. |
| tests/core/test_tools.py | Adds unit tests for is_unsafe_pickle_allowed(). |
| pythainlp/transliterate/w2p.py | Switches Thai W2P weight loading to .npz keys and disables pickle. |
| pythainlp/tools/path.py | Introduces is_unsafe_pickle_allowed() env-var gate. |
| pythainlp/tools/__init__.py | Re-exports is_unsafe_pickle_allowed. |
| pythainlp/tag/_tag_perceptron.py | Updates docstrings/comments to reflect JSON model persistence. |
| pythainlp/spell/words_spelling_correction.py | Hardens np.load by setting allow_pickle=False. |
| pythainlp/corpus/common.py | Tightens CSV field validation checks before processing. |
| pythainlp/__init__.py | Re-exports is_unsafe_pickle_allowed at top-level API. |
| README_TH.md | Documents PYTHAINLP_ALLOW_UNSAFE_PICKLE. |
| README.md | Documents PYTHAINLP_ALLOW_UNSAFE_PICKLE. |
| CHANGELOG.md | Adds 5.3.1 security notes and updates compare links. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
We no longer use pickle. Do not advertise this env var. Keep it internally for future use. (may remove in 6.0.0)
@wannaphong feel free to merge when you are ready
This PR hardens corpus/model loading against unsafe pickle deserialization and updates Thai W2P to load `.npz` weights.
Changes:
- Switch Thai W2P weights loading to a `.npz`-based format with `allow_pickle=False`.
- Add `is_unsafe_pickle_allowed()` and re-export it via `pythainlp` and `pythainlp.tools`.
- Update tests and docs/README/changelog to reflect the new unsafe-pickle opt-in behavior.
Passed code styles and structures
Passed code linting checks and unit tests