Skip to content

Use npz instead of pickle for Thai W2P weights#1328

Merged
wannaphong merged 30 commits intodevfrom
improve-w2p
Mar 14, 2026
Merged

Use npz instead of pickle for Thai W2P weights#1328
wannaphong merged 30 commits intodevfrom
improve-w2p

Conversation

@wannaphong
Copy link
Member

@wannaphong wannaphong commented Mar 13, 2026

This PR hardens corpus/model loading against unsafe pickle deserialization and updates Thai W2P to load .npz weights.

Changes:

  • Switch Thai W2P weights loading to a .npz-based format with allow_pickle=False.

  • Add is_unsafe_pickle_allowed() and re-export it via pythainlp and pythainlp.tools.

  • Update tests and docs/README/changelog to reflect the new unsafe-pickle opt-in behavior.

  • Passed code styles and structures

  • Passed code linting checks and unit test

Change np.load to disallow pickling for security.
@wannaphong
Copy link
Member Author

@copilot Make np.load that disallow allow_pickle working.

Copy link
Contributor

Copilot AI commented Mar 13, 2026

@wannaphong I've opened a new pull request, #1329, to work on those changes. Once the pull request is ready, I'll request review from you.

)

* Initial plan

* Fix np.load allow_pickle=False to work with .npz NpzFile format

- Replace .item().get(key) with [key] dict-style access on NpzFile
- Remove variables instance attribute; use local variable instead
- Add type annotation for variables local var as np.lib.npyio.NpzFile
- Add allow_pickle=False to embeddings.npy load in words_spelling_correction.py

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
@wannaphong
Copy link
Member Author

@copilot How to fix this?

Traceback (most recent call last):
File "/home/runner/work/pythainlp/pythainlp/tests/compact/testc_util.py", line 48, in test_check_khuap_klam
self.assertTrue(check_khuap_klam("กราบ"))
File "/home/runner/work/pythainlp/pythainlp/pythainlp/util/khuap_klam.py", line 63, in check_khuap_klam
reading = pronunciate(word, engine="w2p").replace("\u0e3a", "")
File "/home/runner/work/pythainlp/pythainlp/pythainlp/transliterate/core.py", line 217, in pronunciate
from pythainlp.transliterate.w2p import pronunciate
File "/home/runner/work/pythainlp/pythainlp/pythainlp/transliterate/w2p.py", line 261, in
_THAI_W2P: "Thai_W2P" = Thai_W2P()
File "/home/runner/work/pythainlp/pythainlp/pythainlp/transliterate/w2p.py", line 96, in init
self._load_variables()
File "/home/runner/work/pythainlp/pythainlp/pythainlp/transliterate/w2p.py", line 103, in _load_variables
variables: "np.lib.npyio.NpzFile" = np.load(
File "/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/numpy/lib/_npyio_impl.py", line 484, in load
return format.read_array(fid, allow_pickle=allow_pickle,
File "/opt/hostedtoolcache/Python/3.9.25/x64/lib/python3.9/site-packages/numpy/lib/format.py", line 822, in read_array
raise ValueError("Object arrays cannot be loaded when "
ValueError: Object arrays cannot be loaded when allow_pickle=False

Copy link
Contributor

Copilot AI commented Mar 13, 2026

@wannaphong I've opened a new pull request, #1330, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 2 commits March 13, 2026 22:21
The thai_w2p corpus v0.2 is stored as a .npy pickled dict.
Loading it with allow_pickle=False raises ValueError.

Detect file format by extension:
- .npz: use allow_pickle=False (secure, for future corpus versions)
- .npy (legacy): use allow_pickle=True + dict validation

Also add `import os` for os.path.splitext().

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Fix np.load allow_pickle=False breaking thai_w2p corpus loading
@coveralls
Copy link

coveralls commented Mar 14, 2026

Coverage Status

coverage: 65.93% (-0.003%) from 65.933%
when pulling 500c6e5 on improve-w2p
into d9dec9e on dev.

@bact bact added corpus corpus/dataset-related issues security and removed corpus corpus/dataset-related issues labels Mar 14, 2026
@bact bact added this to the 5.3.1 milestone Mar 14, 2026
@bact
Copy link
Member

bact commented Mar 14, 2026

@copilot The dev branch is updated. Sync the changes.

Review all the changes in file loading, including the pickle handling. Pay attention to security issues.

When loading a pickle file, emit a warning.

Copy link
Contributor

Copilot AI commented Mar 14, 2026

@bact I've opened a new pull request, #1331, to work on those changes. Once the pull request is ready, I'll request review from you.

@bact bact changed the title [WIP] Update np.load to disallow allow_pickle [WIP] Update np.load to handle allow_pickle Mar 14, 2026
@bact bact changed the title [WIP] Update np.load to handle allow_pickle [WIP] Update np.load to handle pickle Mar 14, 2026
Copilot AI and others added 4 commits March 14, 2026 03:30
Co-authored-by: bact <128572+bact@users.noreply.github.com>
This release focuses on security issues related to corpus file loading, including improved pickle handling and defensive file loading.
Disallow pickle in np.load, warn on legacy pickle loading, sync dev defensive file loading
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates corpus/model loading paths to reduce unsafe deserialization risks, primarily by tightening np.load usage and hardening some corpus CSV parsing.

Changes:

  • Add .npz-first loading for Thai W2P weights and warn on legacy pickle-based .npy loading.
  • Set allow_pickle=False for embedding matrix loading in spelling correction.
  • Minor hardening/refactor of corpus CSV field validation and doc/changelog updates.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pythainlp/transliterate/w2p.py Add .npz loading path and legacy .npy warning/pickle fallback for Thai W2P weights.
pythainlp/tag/_tag_perceptron.py Update load() docstring to reflect JSON model format.
pythainlp/spell/words_spelling_correction.py Disallow pickle when loading embeddings with np.load.
pythainlp/corpus/common.py Refactor validation logic for CSV fields before processing/warning.
CHANGELOG.md Add 5.3.1 entry and update compare links.

You can also share your feedback on Copilot code review. Take the survey.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@bact bact marked this pull request as ready for review March 14, 2026 05:30
@bact bact requested a review from Copilot March 14, 2026 05:30
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR tightens security around model/corpus loading by disabling legacy pickle-based .npy weight loading by default, adding a .npz-first loading path for Thai W2P, and requiring explicit opt-in via PYTHAINLP_ALLOW_UNSAFE_PICKLE with warnings.

Changes:

  • Add .npz loading for Thai W2P weights and block legacy pickle-based .npy loading unless PYTHAINLP_ALLOW_UNSAFE_PICKLE is set (with warning).
  • Introduce is_unsafe_pickle_allowed() and re-export it via pythainlp and pythainlp.tools.
  • Update tests and documentation to reflect the new unsafe-pickle opt-in behavior.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tests/extra/testx_transliterate.py Adjust W2P-related tests to opt into legacy pickle loading and add coverage for blocked/allowed behavior.
tests/core/test_transliterate.py Minor formatting change.
tests/core/test_tools.py Add unit tests for is_unsafe_pickle_allowed().
tests/compact/testc_util.py Opt into unsafe pickle for tests that trigger W2P-backed functionality.
pythainlp/transliterate/w2p.py Implement .npz-first loading; gate legacy .npy pickle loading behind env var + warning.
pythainlp/tools/path.py Add is_unsafe_pickle_allowed() helper.
pythainlp/tools/init.py Re-export is_unsafe_pickle_allowed().
pythainlp/tag/_tag_perceptron.py Update docstrings/comments to reflect JSON model save/load.
pythainlp/spell/words_spelling_correction.py Load embeddings with allow_pickle=False.
pythainlp/corpus/common.py Make CSV field validation checks more explicit/defensive.
pythainlp/init.py Re-export is_unsafe_pickle_allowed() at top-level.
README_TH.md Document PYTHAINLP_ALLOW_UNSAFE_PICKLE.
README.md Document PYTHAINLP_ALLOW_UNSAFE_PICKLE.
CHANGELOG.md Add 5.3.1 entry and note improved pickle handling.

You can also share your feedback on Copilot code review. Take the survey.

bact and others added 2 commits March 14, 2026 12:45
Updated model name and refactored variable loading to use .npz format exclusively, removing legacy .npy handling.
@wannaphong
Copy link
Member Author

@bact I was convert Thai W2P model to .npz file.

@bact
Copy link
Member

bact commented Mar 14, 2026

@wannaphong I have removed most tests related to PYTHAINLP_ALLOW_UNSAFE_PICKLE then

@bact bact changed the title [WIP] Only load pickle when PYTHAINLP_ALLOW_UNSAFE_PICKLE is set [WIP] Disallow pickle loading in np.load Mar 14, 2026
@bact bact requested a review from Copilot March 14, 2026 06:08
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens corpus/model loading against unsafe pickle deserialization by default, adds an explicit opt-in gate via PYTHAINLP_ALLOW_UNSAFE_PICKLE, and updates Thai W2P to load .npz weights.

Changes:

  • Add is_unsafe_pickle_allowed() and re-export it via pythainlp and pythainlp.tools.
  • Switch Thai W2P weights loading to a .npz-based format with allow_pickle=False.
  • Update tests and docs/README/changelog to reflect the new unsafe-pickle opt-in behavior.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/core/test_transliterate.py Minor formatting change at EOF.
tests/core/test_tools.py Adds unit tests for is_unsafe_pickle_allowed().
pythainlp/transliterate/w2p.py Switches Thai W2P weight loading to .npz keys and disables pickle.
pythainlp/tools/path.py Introduces is_unsafe_pickle_allowed() env-var gate.
pythainlp/tools/init.py Re-exports is_unsafe_pickle_allowed.
pythainlp/tag/_tag_perceptron.py Updates docstrings/comments to reflect JSON model persistence.
pythainlp/spell/words_spelling_correction.py Hardens np.load by setting allow_pickle=False.
pythainlp/corpus/common.py Tightens CSV field validation checks before processing.
pythainlp/init.py Re-exports is_unsafe_pickle_allowed at top-level API.
README_TH.md Documents PYTHAINLP_ALLOW_UNSAFE_PICKLE.
README.md Documents PYTHAINLP_ALLOW_UNSAFE_PICKLE.
CHANGELOG.md Adds 5.3.1 security notes and updates compare links.

You can also share your feedback on Copilot code review. Take the survey.

bact and others added 3 commits March 14, 2026 13:13
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
We no longer use pickle. Do not advertise this env var. Keep it internally for future use. (may remove in 6.0.0)
@sonarqubecloud
Copy link

@bact
Copy link
Member

bact commented Mar 14, 2026

@wannaphong feel free to merge when you are ready

@wannaphong wannaphong changed the title [WIP] Disallow pickle loading in np.load Disallow pickle loading in np.load Mar 14, 2026
@bact bact changed the title Disallow pickle loading in np.load Use npz instead of pickle for Thai W2P weights Mar 14, 2026
@wannaphong wannaphong merged commit 6df48fb into dev Mar 14, 2026
28 of 29 checks passed
@wannaphong wannaphong deleted the improve-w2p branch March 14, 2026 06:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove the use of pickle for model files

5 participants