@lewismc lewismc commented Jan 12, 2026

This PR is an attempt to address NUTCH-3110 and in the process supersede #850.
Essentially, it upgrades Apache Tika from the shaded artifacts to the official Tika 3.2.3 release, addressing compatibility issues and restoring full functionality. Noteworthy proposed changes:

  • Both plugins (language-identifier & parse-tika) exclude slf4j-api to prevent class loader conflicts (NUTCH-3108)
  • Duplicate outlinks: Changed HashMap to LinkedHashMap in DOMContentUtils.java to preserve link insertion order while deduplicating.
  • UTF-16 encoding test: Fixed double BOM issue in TestHtmlParser.java where Java's UTF-16 encoder was adding a second BOM.
  • Boilerpipe support: Restored boilerpipe content extraction using the new tika-handler-boilerpipe module.
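
To illustrate the outlink and UTF-16 items above, here is a minimal, self-contained sketch. It is not the code from DOMContentUtils.java or TestHtmlParser.java: the class name `TikaUpgradeSketch`, the method names, and the `String[]{url, anchor}` link representation are all hypothetical and chosen only for illustration. It shows (a) how a LinkedHashMap deduplicates by key while keeping first-seen order, and (b) how encoding with Java's `UTF-16` charset already emits a BOM, so text that starts with U+FEFF ends up with two.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Nutch or Tika.
class TikaUpgradeSketch {

  /**
   * Deduplicates links by URL and returns the URLs in first-seen order.
   * Each link is assumed to be a {url, anchorText} pair (illustrative only).
   */
  static List<String> dedupeLinks(List<String[]> links) {
    // LinkedHashMap iterates in insertion order; a plain HashMap
    // gives no ordering guarantee, so outlink order could vary.
    Map<String, String> byUrl = new LinkedHashMap<>();
    for (String[] link : links) {
      // Keep the first occurrence of each URL; later duplicates are ignored.
      byUrl.putIfAbsent(link[0], link[1]);
    }
    return new ArrayList<>(byUrl.keySet());
  }

  /**
   * Reproduces the double-BOM situation: the text already starts with
   * U+FEFF, and the UTF-16 encoder prepends its own big-endian BOM.
   */
  static byte[] encodeWithManualBom(String text) {
    return ("\uFEFF" + text).getBytes(StandardCharsets.UTF_16);
  }

  /** Lets the UTF-16 encoder add the single BOM on its own. */
  static byte[] encodeWithEncoderBom(String text) {
    return text.getBytes(StandardCharsets.UTF_16);
  }
}
```

With this sketch, `encodeWithManualBom("a")` yields the bytes FE FF FE FF 00 61, while `encodeWithEncoderBom("a")` yields FE FF 00 61. How the actual test avoids the second BOM may differ from this.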

Additionally, a set of new tests should assist in future Tika upgrades:

  • TestBoilerpipeExtraction - Boilerpipe integration tests
  • TestLinkExtractionEdgeCases - Link extraction behavior tests
  • TestEncodingDetection - Charset detection tests
  • TestMetadataExtraction - HTML metadata extraction tests
  • TestParserFailureHandling - Error handling/graceful degradation tests

This PR needs diligent testing in distributed mode before we consider merging.

sebastian-nagel and others added 3 commits March 28, 2025 09:19
Upgrade to shaded Tika packages 3.1.0.0 provided by Tim Allison.
The shaded packages are required to avoid version conflicts when
running in distributed mode caused by incompatible versions of
the commons-io jar shipped with Hadoop and required by Tika,
cf. NUTCH-2959.
Add "text/javascript" as MIME type supported by "parse-js".
Note: fixes parse-js unit tests. Tika 3.1.0 identifies
the Javascript test document as "text/javascript" instead of
"application/javascript".

2 participants