
Commit 6df48fb

Authored by wannaphong, Copilot, and bact
Use npz instead of pickle for Thai W2P weights (#1328)
* Update np.load to disallow allow_pickle: change np.load to disallow
  pickling for security.
* Fix np.load allow_pickle=False to work correctly with .npz format (#1329)
* Initial plan
* Fix np.load allow_pickle=False to work with the .npz NpzFile format
  - Replace .item().get(key) with [key] dict-style access on NpzFile
  - Remove the variables instance attribute; use a local variable instead
  - Add a type annotation for the variables local as np.lib.npyio.NpzFile
  - Add allow_pickle=False to the embeddings.npy load in
    words_spelling_correction.py
* Initial plan
* Fix np.load allow_pickle for legacy .npy corpus (ValueError fix).
  The thai_w2p corpus v0.2 is stored as a .npy pickled dict; loading it with
  allow_pickle=False raises ValueError. Detect the file format by extension:
  - .npz: use allow_pickle=False (secure, for future corpus versions)
  - .npy (legacy): use allow_pickle=True plus dict validation
  Also add `import os` for os.path.splitext().
* Initial plan
* Sync dev, add pickle warning, fix docstring and code style
* Update CHANGELOG for release 5.3.1. This release focuses on security
  issues related to corpus file loading, including improved pickle handling
  and defensive file loading.
* Potential fix for pull request finding (applied three times)
* Only load pickle file when PYTHAINLP_W2P_ALLOW_LEGACY_PICKLE is set
* Add PYTHAINLP_W2P_ALLOW_LEGACY_PICKLE: 1 to unit test env
* Rename PYTHAINLP_W2P_ALLOW_LEGACY_PICKLE to PYTHAINLP_ALLOW_UNSAFE_PICKLE;
  add is_unsafe_pickle_allowed() function
* Sort imports
* Add test for PYTHAINLP_ALLOW_UNSAFE_PICKLE
* Fix import
* Set PYTHAINLP_ALLOW_UNSAFE_PICKLE for tone detector test
* Refactor is_unsafe_pickle_allowed()
* Refactor model loading to use the .npz format only: update the model name
  and remove legacy .npy handling.
* Clean up w2p.py by removing unused os and warnings imports
* Remove PYTHAINLP_ALLOW_UNSAFE_PICKLE tests
* Update CHANGELOG.md
* Fix imports
* Potential fix for pull request finding (applied twice)
* Update CHANGELOG.md
* Remove PYTHAINLP_ALLOW_UNSAFE_PICKLE from docs: we no longer use pickle,
  so do not advertise this env var. Keep it internally for future use
  (may remove in 6.0.0).

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Co-authored-by: Arthit Suriyawongkul <arthit@gmail.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
1 parent d9dec9e commit 6df48fb
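The core change described above, moving the Thai W2P weights from a pickled-dict .npy file to an .npz archive, can be sketched as follows. This is an illustrative conversion only: the file path, array shapes, and helper code here are assumptions, not part of the commit. The key renaming (dots to underscores, e.g. `encoder.emb.weight` to `encoder_emb_weight`) mirrors the renamed keys visible in the w2p.py diff.

```python
import os
import tempfile

import numpy as np

# Hypothetical legacy weights dict, as it would sit inside the pickled .npy.
weights = {
    "encoder.emb.weight": np.zeros((29, 64), dtype=np.float32),
    "decoder.fc.bias": np.zeros(74, dtype=np.float32),
}

path = os.path.join(tempfile.mkdtemp(), "thai_w2p.npz")

# np.savez uses each keyword-argument name as the stored array name, so
# rename dotted keys to underscore form before saving.
np.savez(path, **{k.replace(".", "_"): v for k, v in weights.items()})

# Loading the archive needs no pickle at all: NpzFile gives dict-style access.
with np.load(path, allow_pickle=False) as variables:
    enc_emb = variables["encoder_emb_weight"]
```

Arrays read inside the `with` block remain usable after the archive is closed, which is why the diff can assign them to instance attributes inside the context manager.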

File tree

9 files changed (+126 −68 lines)


CHANGELOG.md

Lines changed: 12 additions & 1 deletion
@@ -15,7 +15,17 @@ and this project adheres to
 [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

 - Full release notes: <https://github.com/PyThaiNLP/pythainlp/releases>
-- Commit history: <https://github.com/PyThaiNLP/pythainlp/compare/v5.2.0...5.3.0>
+- Commit history: <https://github.com/PyThaiNLP/pythainlp/compare/v5.3.0...v5.3.1>
+
+## [5.3.1] - 2026-03-15
+
+This release focuses on security issues related to corpus file loading.
+
+### Security
+
+- thai2fit: Use JSON model instead of pickle (#1325)
+- Defensive corpus loading: validate fields before processing (#1327)
+- w2p: Use npz model instead of pickle (#1328)

 ## [5.3.0] - 2026-03-10

@@ -202,6 +212,7 @@ The minimum requirement is now Python 3.9.

 - See <https://github.com/PyThaiNLP/pythainlp/releases/tag/v5.0.0>

+[5.3.1]: https://github.com/PyThaiNLP/pythainlp/compare/v5.3.0...v5.3.1
 [5.3.0]: https://github.com/PyThaiNLP/pythainlp/compare/v5.2.0...v5.3.0
 [5.2.0]: https://github.com/PyThaiNLP/pythainlp/compare/v5.1.2...v5.2.0
 [5.1.2]: https://github.com/PyThaiNLP/pythainlp/compare/v5.1.1...v5.1.2

pythainlp/__init__.py

Lines changed: 6 additions & 1 deletion
@@ -57,6 +57,7 @@
     "correct",
     "is_offline_mode",
     "is_read_only_mode",
+    "is_unsafe_pickle_allowed",
     "pos_tag",
     "romanize",
     "spell",
@@ -78,6 +79,10 @@
     subword_tokenize,
     word_tokenize,
 )
-from pythainlp.tools.path import is_offline_mode, is_read_only_mode
+from pythainlp.tools.path import (
+    is_offline_mode,
+    is_read_only_mode,
+    is_unsafe_pickle_allowed,
+)
 from pythainlp.transliterate import romanize, transliterate
 from pythainlp.util import collate, thai_strftime

pythainlp/corpus/common.py

Lines changed: 6 additions & 2 deletions
@@ -319,7 +319,7 @@ def thai_dict() -> dict[str, list[str]]:
     for row in reader:
         word = row.get("word")
         meaning = row.get("meaning")
-        if not word or not word.strip() or not meaning or not meaning.strip():
+        if not (word and word.strip() and meaning and meaning.strip()):
             warnings.warn(
                 f"Skipping thai_dict entry with missing or empty field(s): {dict(row)!r}",
                 UserWarning,
@@ -406,7 +406,11 @@ def thai_synonyms() -> dict[str, Union[list[str], list[list[str]]]]:
         word = row.get("word")
         pos = row.get("pos")
         synonym = row.get("synonym")
-        if not word or not word.strip() or not pos or not pos.strip() or not synonym or not synonym.strip():
+        if not (
+            word and word.strip()
+            and pos and pos.strip()
+            and synonym and synonym.strip()
+        ):
             warnings.warn(
                 f"Skipping thai_synonyms entry with missing or empty field(s): {dict(row)!r}",
                 UserWarning,
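The rewritten guard in this diff is a pure De Morgan transformation of the original condition: `not a or not b` becomes `not (a and b)`. A quick standalone check of the equivalence (the helper names here are illustrative):

```python
from itertools import product


def old_check(word, meaning):
    # Original form: a chain of negated disjuncts.
    return not word or not word.strip() or not meaning or not meaning.strip()


def new_check(word, meaning):
    # Rewritten form: one negated conjunction (De Morgan's law).
    return not (word and word.strip() and meaning and meaning.strip())


# Exhaustively compare over the interesting cases: missing (None), empty,
# whitespace-only, and valid values.
samples = [None, "", "   ", "word"]
for w, m in product(samples, repeat=2):
    assert old_check(w, m) == new_check(w, m)
```

Both reject a row when any field is missing, empty, or whitespace-only, so the change is readability-only.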

pythainlp/spell/words_spelling_correction.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -84,7 +84,9 @@ def _load_embeddings(self) -> tuple[list[str], NDArray[np.float32]]:
8484
"""Loads embeddings matrix and vocabulary list."""
8585
import numpy as np
8686

87-
input_matrix = np.load(os.path.join(self.model_dir, "embeddings.npy"))
87+
input_matrix = np.load(
88+
os.path.join(self.model_dir, "embeddings.npy"), allow_pickle=False
89+
)
8890
words = []
8991
vocab_path = os.path.join(self.model_dir, "vocabulary.txt")
9092
with open(vocab_path, encoding="utf-8") as f:
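Why `allow_pickle=False` matters for this load: a plain numeric array such as an embeddings matrix loads fine without pickle, while a file containing pickled Python objects is rejected with `ValueError` instead of silently deserializing attacker-controlled bytes. A small demonstration with temporary files and made-up shapes:

```python
import os
import tempfile

import numpy as np

d = tempfile.mkdtemp()
plain = os.path.join(d, "embeddings.npy")
pickled = os.path.join(d, "objects.npy")

# A plain float32 matrix: no pickle needed to store or load it.
np.save(plain, np.zeros((3, 4), dtype=np.float32))
# An object array (here wrapping a dict) can only be stored via pickle.
np.save(pickled, np.array({"a": 1}, dtype=object), allow_pickle=True)

# Numeric data loads normally with pickle disabled.
matrix = np.load(plain, allow_pickle=False)

# Pickled object data is refused, which is the security point of the change.
try:
    np.load(pickled, allow_pickle=False)
    raised = False
except ValueError:
    raised = True
```

Since `embeddings.npy` holds only a numeric matrix, adding `allow_pickle=False` costs nothing and closes the deserialization hole.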

pythainlp/tag/_tag_perceptron.py

Lines changed: 4 additions & 4 deletions
@@ -166,8 +166,8 @@ def train(
         ``nr_iter`` controls the number of Perceptron training iterations.

         :param sentences: A list of (words, tags) tuples.
-        :param save_loc: If not ``None``, saves a pickled model in this \
-            location.
+        :param save_loc: If not ``None``, saves the model as a JSON file in \
+            this location.
         :param nr_iter: Number of training iterations.
         """
         import random
@@ -200,7 +200,7 @@ def train(
             random.shuffle(sentences_list)
         self.model.average_weights()

-        # save the model
+        # save the model as JSON
         if save_loc is not None:
             data: dict[str, Union[dict, list]] = {}
             data["weights"] = self.model.weights
@@ -210,7 +210,7 @@ def train(
             json.dump(data, f, ensure_ascii=False)

     def load(self, loc: str) -> None:
-        """Load a pickled model.
+        """Load a saved model from a JSON file.
         :param str loc: model path
         """
         try:
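The docstring fixes above reflect that the perceptron tagger already persists its model as JSON, not pickle. A hedged sketch of that save/load pattern — the field names and values here are illustrative, not the tagger's exact schema:

```python
import json
import os
import tempfile

# Illustrative model state: weights, a tag dictionary, and the class list.
data = {
    "weights": {"bias PRON": {"PRON": 1.5, "VERB": -0.25}},
    "tagdict": {"the": "DET"},
    "classes": ["DET", "PRON", "VERB"],
}

path = os.path.join(tempfile.mkdtemp(), "model.json")

# Save: ensure_ascii=False keeps Thai text readable in the file,
# matching the json.dump call in the diff.
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

# Load: unlike pickle.load, json.load cannot execute code from the file.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)
```

The trade-off is that JSON only round-trips plain dicts, lists, strings, and numbers, so any set- or tuple-valued state must be converted on save and restored on load.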

pythainlp/tools/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -8,6 +8,7 @@
     "get_pythainlp_path",
     "is_offline_mode",
     "is_read_only_mode",
+    "is_unsafe_pickle_allowed",
     "safe_print",
     "warn_deprecation",
 ]
@@ -20,4 +21,5 @@
     get_pythainlp_path,
     is_offline_mode,
     is_read_only_mode,
+    is_unsafe_pickle_allowed,
 )

pythainlp/tools/path.py

Lines changed: 16 additions & 0 deletions
@@ -102,6 +102,22 @@ def is_read_only_mode() -> bool:
     return False


+def is_unsafe_pickle_allowed() -> bool:
+    """Return whether loading legacy pickle-based corpus files is allowed.
+
+    Pickle deserialisation can execute arbitrary code if the file has been
+    tampered with, so it is **disabled by default**.
+    Set the ``PYTHAINLP_ALLOW_UNSAFE_PICKLE`` environment variable to
+    a truthy value (e.g. ``"1"``) only when you trust the corpus file and
+    understand the risk.
+
+    :return: ``True`` if legacy pickle loading is allowed, ``False`` otherwise.
+    :rtype: bool
+    """
+    val = os.getenv("PYTHAINLP_ALLOW_UNSAFE_PICKLE", "")
+    return val.strip().lower() in ("1", "true", "yes", "on")
+
+
 def is_offline_mode() -> bool:
     """Return whether PyThaiNLP is operating in offline mode.
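The env-var parsing added here normalizes the value (strip whitespace, lowercase) before comparing against a small set of truthy spellings, and treats an unset variable as false. A standalone copy of that logic to show the behavior at the edges:

```python
import os


def is_unsafe_pickle_allowed() -> bool:
    # Same logic as the function added in pythainlp/tools/path.py:
    # unset -> "", then strip + lowercase, then membership test.
    val = os.getenv("PYTHAINLP_ALLOW_UNSAFE_PICKLE", "")
    return val.strip().lower() in ("1", "true", "yes", "on")


# Whitespace and case are tolerated.
os.environ["PYTHAINLP_ALLOW_UNSAFE_PICKLE"] = " TRUE "
assert is_unsafe_pickle_allowed()

# Any unrecognized value is falsy, not an error.
os.environ["PYTHAINLP_ALLOW_UNSAFE_PICKLE"] = "0"
assert not is_unsafe_pickle_allowed()

# Unset defaults to False: unsafe pickle stays off unless opted into.
os.environ.pop("PYTHAINLP_ALLOW_UNSAFE_PICKLE")
assert not is_unsafe_pickle_allowed()
```

Fail-closed is the right default here: a typo in the variable's value disables unsafe loading rather than enabling it.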

pythainlp/transliterate/w2p.py

Lines changed: 51 additions & 55 deletions
@@ -22,7 +22,7 @@
     "-พจใงต้ืฮแาฐฒฤูศฅถฺฎหคสุขเึดฟำฝยลอ็ม" + " ณิฑชฉซทรํฬฏ–ัฃวก่ปผ์ฆบี๊ธฌญะไษ๋นโภ?"
 )

-_MODEL_NAME: str = "thai_w2p"
+_MODEL_NAME: str = "thai_w2p_npz"
@@ -40,9 +40,9 @@ class _Hparams:
 hp: _Hparams = _Hparams()


-def _load_vocab() -> tuple[
-    dict[str, int], dict[int, str], dict[str, int], dict[int, str]
-]:
+def _load_vocab() -> (
+    tuple[dict[str, int], dict[int, str], dict[str, int], dict[int, str]]
+):
     g2idx = {g: idx for idx, g in enumerate(hp.graphemes)}
     idx2g = dict(enumerate(hp.graphemes))
@@ -60,7 +60,6 @@ class Thai_W2P:
     p2idx: dict[str, int]
     idx2p: dict[int, str]
     checkpoint: Optional[str]
-    variables: "NDArray"
     enc_emb: "NDArray"
     enc_w_ih: "NDArray"
     enc_w_hh: "NDArray"
@@ -84,9 +83,7 @@ def __init__(self) -> None:
         self.p2idx: dict[str, int]
         self.idx2p: dict[int, str]
         self.g2idx, self.idx2g, self.p2idx, self.idx2p = _load_vocab()
-        self.checkpoint: Optional[str] = get_corpus_path(
-            _MODEL_NAME, version="0.2"
-        )
+        self.checkpoint: Optional[str] = get_corpus_path(_MODEL_NAME, version="0.2")
         if not self.checkpoint:
             raise FileNotFoundError(
                 f"corpus-not-found name={_MODEL_NAME!r}\n"
@@ -98,55 +95,54 @@

     def _load_variables(self) -> None:
         import numpy as np
-
         if self.checkpoint is None:
             raise RuntimeError("checkpoint path is not set")
-        self.variables: "NDArray" = np.load(self.checkpoint, allow_pickle=True)
-        # (29, 64). (len(graphemes), emb)
-        self.enc_emb: "NDArray" = self.variables.item().get(
-            "encoder.emb.weight"
-        )
-        # (3*128, 64)
-        self.enc_w_ih: "NDArray" = self.variables.item().get(
-            "encoder.rnn.weight_ih_l0"
-        )
-        # (3*128, 128)
-        self.enc_w_hh: "NDArray" = self.variables.item().get(
-            "encoder.rnn.weight_hh_l0"
-        )
-        # (3*128,)
-        self.enc_b_ih: "NDArray" = self.variables.item().get(
-            "encoder.rnn.bias_ih_l0"
-        )
-        # (3*128,)
-        self.enc_b_hh: "NDArray" = self.variables.item().get(
-            "encoder.rnn.bias_hh_l0"
-        )
-
-        # (74, 64). (len(phonemes), emb)
-        self.dec_emb: "NDArray" = self.variables.item().get(
-            "decoder.emb.weight"
-        )
-        # (3*128, 64)
-        self.dec_w_ih: "NDArray" = self.variables.item().get(
-            "decoder.rnn.weight_ih_l0"
-        )
-        # (3*128, 128)
-        self.dec_w_hh: "NDArray" = self.variables.item().get(
-            "decoder.rnn.weight_hh_l0"
-        )
-        # (3*128,)
-        self.dec_b_ih: "NDArray" = self.variables.item().get(
-            "decoder.rnn.bias_ih_l0"
-        )
-        # (3*128,)
-        self.dec_b_hh: "NDArray" = self.variables.item().get(
-            "decoder.rnn.bias_hh_l0"
-        )
-        # (74, 128)
-        self.fc_w: "NDArray" = self.variables.item().get("decoder.fc.weight")
-        # (74,)
-        self.fc_b: "NDArray" = self.variables.item().get("decoder.fc.bias")
+        with np.load(self.checkpoint, allow_pickle=False) as variables:
+            # (29, 64). (len(graphemes), emb)
+            self.enc_emb: "NDArray" = variables["encoder_emb_weight"]
+            # (3*128, 64)
+            self.enc_w_ih: "NDArray" = variables["encoder_rnn_weight_ih_l0"]
+            # (3*128, 128)
+            self.enc_w_hh: "NDArray" = variables["encoder_rnn_weight_hh_l0"]
+            # (3*128,)
+            self.enc_b_ih: "NDArray" = variables["encoder_rnn_bias_ih_l0"]
+            # (3*128,)
+            self.enc_b_hh: "NDArray" = variables["encoder_rnn_bias_hh_l0"]
+
+            # (74, 64). (len(phonemes), emb)
+            self.dec_emb: "NDArray" = variables["decoder_emb_weight"]
+            # (3*128, 64)
+            self.dec_w_ih: "NDArray" = variables["decoder_rnn_weight_ih_l0"]
+            # (3*128, 128)
+            self.dec_w_hh: "NDArray" = variables["decoder_rnn_weight_hh_l0"]
+            # (3*128,)
+            self.dec_b_ih: "NDArray" = variables["decoder_rnn_bias_ih_l0"]
+            # (3*128,)
+            self.dec_b_hh: "NDArray" = variables["decoder_rnn_bias_hh_l0"]
+            # (74, 128)
+            self.fc_w: "NDArray" = variables["decoder_fc_weight"]
+            # (74,)
+            self.fc_b: "NDArray" = variables["decoder_fc_bias"]

     def _sigmoid(self, x: "np.ndarray") -> "np.ndarray":
         import numpy as np
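The access-pattern change in `_load_variables` can be reproduced in isolation. With the legacy format, the dict was pickled inside a 0-d object array, so reading it required `allow_pickle=True` followed by `.item().get(key)`; with `.npz`, the archive is an `NpzFile` that supports `[key]` access directly and never touches pickle. A synthetic reproduction (temporary files and made-up shapes, not the real corpus):

```python
import os
import tempfile

import numpy as np

d = tempfile.mkdtemp()

# Old layout: a dict pickled inside a 0-d object array (.npy).
legacy = os.path.join(d, "legacy.npy")
np.save(legacy, {"encoder.emb.weight": np.zeros((29, 64), dtype=np.float32)})
arr = np.load(legacy, allow_pickle=True)        # requires unsafe pickle
old_emb = arr.item().get("encoder.emb.weight")  # 0-d array -> dict -> value

# New layout: a plain .npz archive, no pickle involved.
modern = os.path.join(d, "modern.npz")
np.savez(modern, encoder_emb_weight=np.zeros((29, 64), dtype=np.float32))
with np.load(modern, allow_pickle=False) as variables:
    new_emb = variables["encoder_emb_weight"]   # NpzFile dict-style access
```

Dropping the `variables` instance attribute also lets the `with` block close the underlying zip file as soon as all arrays are read, instead of keeping an open handle alive on the object.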

tests/core/test_tools.py

Lines changed: 26 additions & 4 deletions
@@ -8,7 +8,11 @@
 import warnings
 from unittest.mock import patch

-from pythainlp import is_offline_mode, is_read_only_mode
+from pythainlp import (
+    is_offline_mode,
+    is_read_only_mode,
+    is_unsafe_pickle_allowed,
+)
 from pythainlp.tools import (
     get_full_data_path,
     get_pythainlp_data_path,
@@ -26,9 +30,7 @@
 class ToolsTestCase(unittest.TestCase):
     def test_path(self):
         data_filename = "ttc_freq.txt"
-        self.assertTrue(
-            get_full_data_path(data_filename).endswith(data_filename)
-        )
+        self.assertTrue(get_full_data_path(data_filename).endswith(data_filename))
         self.assertIsInstance(get_pythainlp_data_path(), str)
         self.assertIsInstance(get_pythainlp_path(), str)

@@ -272,3 +274,23 @@ def test_get_pythainlp_data_path_no_makedirs_in_read_only(self):
             "Data directory should not be created in read-only mode",
         )

+    def test_is_unsafe_pickle_allowed(self):
+        """Test is_unsafe_pickle_allowed() reflects PYTHAINLP_ALLOW_UNSAFE_PICKLE env var."""
+        # Truthy values
+        for truthy in ("1", "true", "True", "TRUE", "yes", "YES", "on", "ON"):
+            with patch.dict(os.environ, {"PYTHAINLP_ALLOW_UNSAFE_PICKLE": truthy}):
+                self.assertTrue(
+                    is_unsafe_pickle_allowed(),
+                    f"Expected True for PYTHAINLP_ALLOW_UNSAFE_PICKLE={truthy!r}",
+                )
+        # Falsy values
+        for falsy in ("", "0", "false", "False", "FALSE", "no", "NO", "off", "OFF"):
+            with patch.dict(os.environ, {"PYTHAINLP_ALLOW_UNSAFE_PICKLE": falsy}):
+                self.assertFalse(
+                    is_unsafe_pickle_allowed(),
+                    f"Expected False for PYTHAINLP_ALLOW_UNSAFE_PICKLE={falsy!r}",
+                )
+        # Unset: should default to False
+        with patch.dict(os.environ, {}, clear=False):
+            os.environ.pop("PYTHAINLP_ALLOW_UNSAFE_PICKLE", None)
+            self.assertFalse(is_unsafe_pickle_allowed())
