chore: migrate sentence segmentation from NLTK to spaCy#75

Open
Efreet408 wants to merge 8 commits into development from chore/update-text-processing-lib

@Efreet408

Description of changes

  • Migrated sentence segmentation from NLTK to spaCy
  • Removed repeated download calls; spaCy dependency is now declared in pyproject.toml
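
As an illustration of the first bullet, here is a minimal sketch of rule-based sentence segmentation with spaCy. It assumes a blank English pipeline with the built-in `sentencizer` component, which needs no separate model download. The PR does not show which pipeline the project actually uses, and `split_sentences` is a hypothetical helper name.

```python
import spacy

# Assumption: a blank English pipeline plus the rule-based sentencizer is enough.
# Nothing is downloaded at runtime; the pipeline is built once at import time.
_NLP = spacy.blank("en")
_NLP.add_pipe("sentencizer")  # splits on sentence-final punctuation


def split_sentences(text: str) -> list[str]:
    """Return the sentences found in `text`, stripped of surrounding whitespace."""
    doc = _NLP(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("Hello world. How are you?"):
        print(sentence)
```

With NLTK, `nltk.sent_tokenize` depends on tokenizer data fetched through `nltk.download(...)`, which is presumably the source of the repeated download calls mentioned above. A trained spaCy pipeline may segment more accurately than the sentencizer, at the cost of installing a model package.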

Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Efreet408 requested a review from Allob as a code owner on March 4, 2026, 17:14

[tool.poetry.dependencies]
- python = "^3.11"
+ python = ">=3.11, <3.15"

- pydantic = "^2.7.1"
- nltk = "^3.8.1"
+ pydantic = {version = ">=2.13.0b2", allow-prereleases = true}
+ spacy = ">=3.8.9,<4"

srsly = ">=2.5.2"
murmurhash = ">=1.0.14"
cymem = ">=2.0.12"
preshed = ">=3.0.11"

pyproject.toml Outdated
s3fs = {version = "^2024.3.1", optional = true}
- pydantic = "^2.7.1"
- nltk = "^3.8.1"
+ pydantic = {version = ">=2.13.0b2", allow-prereleases = true}
@Efreet408

Running on Python 3.14 fails with pydantic.v1.errors.ConfigError: unable to infer type for attribute "REGEX".

This issue is tracked in spaCy's GitHub repository (explosion/spaCy#13902, explosion/spaCy#13895).

Unfortunately, the required Pydantic changes are so far available only in a prerelease. From the release notes at https://github.com/pydantic/pydantic/releases/tag/v2.13.0b1: "Latest V1.10.26 release under the pydantic.v1 namespace. This version includes support for Python 3.14."
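
To make the constraint concrete, here is a minimal sketch of a startup check. It assumes, based only on the comment above, that Python 3.14 needs a Pydantic 2.13 release (prereleases included). The helper names `release_tuple` and `pydantic_ok_for_python` and the simplified version parsing are hypothetical and not part of this PR.

```python
import re
import sys
from importlib import metadata


def release_tuple(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '2.13.0b2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def pydantic_ok_for_python(pydantic_version: str, python_version: tuple[int, int]) -> bool:
    """Assumption: Python 3.14+ needs Pydantic 2.13 or later; older Pythons are unaffected."""
    if python_version < (3, 14):
        return True
    # Prerelease suffixes (b1, b2, ...) are ignored, so 2.13.0b2 counts as 2.13.
    return release_tuple(pydantic_version) >= (2, 13)


if __name__ == "__main__":
    try:
        installed = metadata.version("pydantic")
    except metadata.PackageNotFoundError:
        print("pydantic is not installed")
    else:
        ok = pydantic_ok_for_python(installed, sys.version_info[:2])
        status = "ok" if ok else "too old for this Python"
        print(f"pydantic {installed}: {status}")
```

A check like this only reports the mismatch early; the dependency pin in pyproject.toml remains the actual fix.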
