Releases: jhy/jsoup
jsoup 1.16.2
Improvements
- Optimized the performance of complex CSS selectors, by adding a cost-based query planner. Evaluators are sorted by their relative execution cost, and executed in order of lower to higher cost. This speeds the matching process by ensuring that simpler evaluations (such as a tag name match) are conducted prior to more complex evaluations (such as an attribute regex, or a deep child scan with a :has).
- Added support for `<svg>` and `<math>` tags (and their children). This includes tag namespaces and case preservation on applicable tags and attributes. #2008
- When converting jsoup Documents to W3C Documents in `W3CDom`, HTML documents will be placed in the `http://www.w3.org/1999/xhtml` namespace by default, per the HTML5 spec. This can be controlled by setting `W3CDom#namespaceAware(boolean false)`. #1848
- Speed optimized the Structural Evaluators by memoizing previous evaluations. Particularly the `~` (any preceding sibling) and `:nth-of-type` selectors are improved. #1956
- Tweaked the performance of the `Element` `nextElementSibling`, `previousElementSibling`, `firstElementSibling`, `lastElementSibling`, `firstElementChild`, and `lastElementChild` methods. They now filter/skip in place in the child-node list, rather than allocating and scanning a complete filtered Element list.
- Optimized internal methods that previously called `Element.children()` to use filter/skip child-node list accessors instead, reducing new Element List allocations.
- Tweaked the performance of parsing `:pseudo` selectors.
- When using the `:empty` pseudo-selector, blank text nodes are now considered empty. Previously, an element containing any whitespace was not considered empty. #1976
- In forms, `<input type="image">` should be excluded from `Element.formData()` (and hence from form submissions). #2010
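The cost-based ordering described in the first improvement above can be sketched in plain Java. This is a minimal illustration of the idea only, not jsoup's implementation: the class `CostOrderedAnd`, the nested `Costed` type, the cost values, and the use of strings as candidates are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; none of these names are jsoup classes.
final class CostOrderedAnd {
    /** A check paired with an estimated relative cost (lower means cheaper). */
    static final class Costed {
        final String name;
        final int cost;
        final Predicate<String> test;

        Costed(String name, int cost, Predicate<String> test) {
            this.name = name;
            this.cost = cost;
            this.test = test;
        }
    }

    private final List<Costed> ordered;
    /** Names of the checks that actually ran, in execution order. */
    final List<String> trace = new ArrayList<>();

    CostOrderedAnd(List<Costed> checks) {
        ordered = new ArrayList<>(checks);
        ordered.sort(Comparator.comparingInt(c -> c.cost)); // cheapest first
    }

    /** All checks must pass; evaluation stops at the first failure. */
    boolean matches(String candidate) {
        for (Costed c : ordered) {
            trace.add(c.name);
            if (!c.test.test(candidate)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        CostOrderedAnd and = new CostOrderedAnd(Arrays.asList(
            new Costed("regex", 10, s -> s.matches(".*\\d+.*")),
            new Costed("tag", 1, s -> s.startsWith("div"))));
        // The cheap "tag" check fails first, so the regex never runs.
        System.out.println(and.matches("span#a1") + " " + and.trace); // prints "false [tag]"
    }
}
```

Because the cheap check rejects most candidates, the expensive check only runs on the few that survive it, which is where the speed-up comes from.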
Bug Fixes
- Bugfix: `form` elements and empty elements (such as `img`) did not have their attributes de-duplicated. #1950
- If `Document.OutputSettings` was cloned from a clone, an NPE would be thrown when used. #1964
- In `Jsoup.connect(String url)`, URL paths containing a `%2B` were incorrectly recoded to a '+', or a '+' was recoded to a ' '. Fixed by reverting to the previous behavior of not encoding supplied paths, other than normalizing to ASCII. #1952
- In `Jsoup.connect(String url)`, strings containing supplemental characters (e.g. emoji) were not URL escaped correctly.
- In `Jsoup.connect(String url)`, the `ConstrainableInputStream` would clear Thread interrupts when reading the body. This precluded callers from spawning a thread, running a number of requests for a length of time, then joining that thread after interrupting it. #1991
- When tracking HTML source positions, the closing tags for `H1...H6` elements were not tracked correctly. #1987
- In `Jsoup.connect()`, a `DELETE` method request did not support a request body. #1972
- When calling `Element.cssSelector()` on an extremely deeply nested element, a `StackOverflowError` could occur. Further, a `StackOverflowError` may occur when running the query. #2001
- Appending a node back to its original `Element` after `empty()` would throw an Index out of bounds exception. Also, now the child nodes that were removed have their parent node cleared, fully detaching them from the original parent. #2013
- In `Connection` when adding headers, the value may have been assumed to be an incorrectly decoded `ISO_8859_1` string, and re-encoded as `UTF-8`. The value is now left as-is.
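As background to the `%2B` / `+` and supplemental-character fixes above, the following sketch uses only the JDK's form-encoding helpers (`java.net.URLDecoder` and `java.net.URLEncoder`), not jsoup. It shows why `+` and `%2B` are not interchangeable once a form-style decoder is involved, and what a correctly escaped emoji looks like. The class name `PlusAndEmojiEscapes` is hypothetical, and the `Charset` overloads assume Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it uses JDK form-encoding rules, not jsoup's URL handling.
final class PlusAndEmojiEscapes {
    /** Decodes using application/x-www-form-urlencoded rules. */
    static String formDecode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    /** Encodes using application/x-www-form-urlencoded rules. */
    static String formEncode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(formDecode("a+b"));   // '+' becomes a space: "a b"
        System.out.println(formDecode("a%2Bb")); // %2B becomes a literal plus: "a+b"
        // U+1F600 is one code point stored as two Java chars (a surrogate pair);
        // it must be escaped as its four UTF-8 bytes, not one escape per char.
        System.out.println(formEncode("\uD83D\uDE00")); // %F0%9F%98%80
    }
}
```

Recoding either character in a path silently changes what the server receives, which is consistent with the fix of leaving supplied paths un-encoded.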
Changes
- Removed previously deprecated methods `Document.normalise()`, `Element.forEach(org.jsoup.helper.Consumer<>)`, `Node.forEach(org.jsoup.helper.Consumer<>)`, and the `org.jsoup.helper.Consumer` interface; the latter being a previously required compatibility shim prior to Android's de-sugaring support.
- The previous compatibility shim `org.jsoup.UncheckedIOException` is deprecated in favor of the now supported `java.io.UncheckedIOException`. If you are catching the former, modify your code to catch the latter instead. #1989
- Blocked `noscript` tags from being added to Safelists, due to incompatibilities between parsers with and without script-mode enabled.
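The exception migration noted above amounts to changing the type named in the `catch` clause. The sketch below shows the target shape using only the JDK. The class `UncheckedCatchSketch` and its `load` method are hypothetical stand-ins for whatever call in your code may throw. No jsoup method is invoked here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch; load() stands in for any call that wraps an IOException.
final class UncheckedCatchSketch {
    static String load(boolean fail) {
        if (fail) {
            throw new UncheckedIOException(new IOException("simulated read failure"));
        }
        return "ok";
    }

    /** Catches java.io.UncheckedIOException, the type callers should now name. */
    static String loadOrFallback(boolean fail) {
        try {
            return load(fail);
        } catch (UncheckedIOException e) {
            // getCause() returns the wrapped IOException
            return "fallback: " + e.getCause().getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(loadOrFallback(false)); // ok
        System.out.println(loadOrFallback(true));  // fallback: simulated read failure
    }
}
```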
jsoup 1.16.1
jsoup Java HTML Parser release 1.16.1
Improvements
- In `Jsoup.connect(String url)`, natively support URLs with Unicode characters in the path or query string, without having to be escaped by the caller. #1914
- Calling `Node.remove()` on a node with no parent is now a no-op, vs a validation error. #1898
Bug Fixes
- Aligned the HTML Tree Builder processing steps for `AfterBody` and `AfterAfterBody` to the updated WHATWG standard, to not pop the stack to close `<body>` or `<html>` elements. This prevents an errant `</html>` closing the preceding structure. Also added appropriate error message outputs in this case. #1851
- Corrected support for ruby elements (`<ruby>`, `<rp>`, `<rt>`, and `<rtc>`) to current spec. #1294
- When using `Node.before(Node)` or `Node.after(Node)`, if the incoming node was a sibling of the context node, the incoming node may be inserted into the wrong relative location. #1898
- In `Jsoup.connect(String url)`, if the input URL had components that were already `%` escaped, they would be escaped again, causing errors when fetched. #1902
- When tracking input source positions, text in tables that was fostered had invalid positions. #1927
- If the `Document.OutputSettings` class was initialized, and then `Entities.escape(String)` called, an NPE may be thrown due to a class loading circular dependency. #1910
- When pretty-printing, the first inline `Element` or `Comment` in a block would not be wrap-indented if it were preceded by a blank text node. #1906
- When pretty-printing a `<pre>` containing block tags, those tags were incorrectly indented. #1891
- When pretty-printing nested inlineable blocks (such as a `<p>` in a `<td>`), the inner element should be indented. #1926
- `<br>` tags should be wrap-indented when in block tags (and not when in inline tags). #1911
- The contents of a sufficiently large `<textarea>` with un-escaped HTML closing tags may be incorrectly parsed to an empty node. #1929
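The `Node.before(Node)` / `Node.after(Node)` item above describes a common class of bug when moving an item within its own sibling list. The sketch below reproduces it with a plain `List<String>`. It is not jsoup's code, and the names `SiblingMoveSketch`, `moveBeforeStale`, and `moveBefore` are hypothetical. The assumption illustrated is that the insertion index is computed before the incoming item is detached, so it goes stale once the removal shifts later items left.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a stale-index bug; strings stand in for sibling nodes.
final class SiblingMoveSketch {
    /** Buggy variant: the index is looked up before the incoming item is detached. */
    static void moveBeforeStale(List<String> siblings, String incoming, String context) {
        int index = siblings.indexOf(context); // stale once 'incoming' is removed
        siblings.remove(incoming);
        siblings.add(index, incoming);
    }

    /** Fixed variant: detach first, then look up where the context item now sits. */
    static void moveBefore(List<String> siblings, String incoming, String context) {
        siblings.remove(incoming);
        int index = siblings.indexOf(context);
        if (index < 0) throw new IllegalArgumentException("context is not a sibling");
        siblings.add(index, incoming);
    }

    public static void main(String[] args) {
        List<String> stale = new ArrayList<>(Arrays.asList("a", "b", "c", "d"));
        moveBeforeStale(stale, "a", "c");
        System.out.println(stale); // [b, c, a, d]: "a" landed after "c"

        List<String> fixed = new ArrayList<>(Arrays.asList("a", "b", "c", "d"));
        moveBefore(fixed, "a", "c");
        System.out.println(fixed); // [b, a, c, d]: "a" is directly before "c"
    }
}
```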
jsoup 1.15.4
jsoup Java HTML Parser release 1.15.4
jsoup 1.15.4 is out now, and includes a bunch of improvements, particularly when pretty-printing HTML, and bug fixes.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
Download jsoup now.
Improvements
- Added the ability to escape CSS selectors (tags, IDs, classes) to match elements that don't follow regular CSS syntax. For example, to match by classname `<p class="one.two">`, use `document.select("p.one\\.two");` #838
- When pretty-printing, wrap text that follows a `<br>` tag. #1858
- When pretty-printing, normalize newlines that follow self-closing tags in custom tags. #1852
- When pretty-printing, collapse non-significant whitespace between a block and an inline tag. #1802
- In `Element.forEach()` and `Node.forEachNode()`, use `java.util.function.Consumer` instead of the previous Android compatibility shim `org.jsoup.helper.Consumer`. Subsequently, the latter has been deprecated. #1870
- Added a new method `Document.forms()`, to conveniently retrieve a `List<FormElement>` containing the `<form>` elements in a document.
- Added a new method `Document.expectForm()`, to find the first matching `FormElement`, or blow up trying.
Bug Fixes
- URLs containing certain characters were not escaped correctly, and would throw a `MalformedURLException` when fetched. #1873
- `Element.cssSelector()` would create invalid selectors for elements where the tag name, ID, or classnames needed to be escaped (e.g. if a class name contained a `:` or `.`). #1742
- `Element.text()` should have a space between a block and an inline element. #1877
- Form data on a previous request was copied to a new request in `newRequest()`, resulting in an accumulation of form data when executing multi-step form submissions, or data sent to later requests incorrectly. Now, `newRequest()` only copies session related settings (cookies, proxy settings, user-agent, etc) but not the request data nor the body. #1778
- Fixed an issue in `Safelist.removeAttributes()` which could throw a `ConcurrentModificationException` when using the `:all` pseudo-attribute.
- Given extremely deeply nested HTML, a number of methods in `Element` could throw a `StackOverflowError` due to excessive recursion. Namely: `#data()`, `#hasText()`, `#parents()`, and `#wrap(html)`. #1864
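The recursion issue in the last item above is typically avoided by walking the tree with an explicit stack held on the heap. The sketch below shows that general technique on a toy tree. It is not jsoup's implementation: the `IterativeTreeWalk` class, its `Node` type, and the `chain` helper are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch; this Node is a toy type, not org.jsoup.nodes.Node.
final class IterativeTreeWalk {
    static final class Node {
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String text) {
            this.text = text;
        }
    }

    /** True if any node in the subtree has non-blank text. Uses a heap-allocated stack, not recursion. */
    static boolean hasText(Node root) {
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (!n.text.trim().isEmpty()) return true;
            for (Node child : n.children) stack.push(child);
        }
        return false;
    }

    /** Builds a single-child chain of blank nodes of the given depth, ending in a leaf with leafText. */
    static Node chain(int depth, String leafText) {
        Node root = new Node("");
        Node cur = root;
        for (int i = 1; i < depth; i++) {
            Node next = new Node("");
            cur.children.add(next);
            cur = next;
        }
        cur.children.add(new Node(leafText));
        return root;
    }

    public static void main(String[] args) {
        // A depth like this would likely overflow the call stack in a naive recursive walk.
        System.out.println(hasText(chain(200_000, "hi"))); // true
        System.out.println(hasText(chain(200_000, " ")));  // false
    }
}
```

The traversal depth is then bounded by available heap rather than by the thread's call-stack size.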
Changes
- Deprecated the unused `Document.normalise()` method. Normalization occurs during the HTML tree construction, and no longer as a distinct phase.
My sincere thanks to everyone who contributed patches, suggestions, and bug reports. If you have any suggestions for the next release, I would love to hear them; please get in touch with me directly.
You can also follow me (@jhy@tilde.zone) on Mastodon / Fediverse to receive occasional notes about jsoup releases.
jsoup 1.15.3
jsoup 1.15.3 is out now, and includes a security fix for potential XSS attacks, along with other bug fixes and improvements, including more descriptive validation error messages.
jsoup 1.15.2
jsoup 1.15.2 is out now with a bunch of improvements and bug fixes.
jsoup 1.15.1
jsoup 1.15.1 is out now with a bunch of improvements and bug fixes.
jsoup 1.14.3
jsoup 1.14.3 is out now, adding native XPath selector support, improved <template> support, and also includes a bunch of bug fixes, improvements, and performance enhancements.
See the release announcement for the full changelog.
jsoup 1.14.2
Caught by the fuzz! jsoup 1.14.2 is out now, and includes a set of parser bug fixes and improvements for handling rough HTML and XML, as identified by the Jazzer JVM fuzzer. This release also includes other fixes and improvements.
See the release announcement for the full changelog.
jsoup 1.14.1
jsoup 1.14.1 is out now, with simple request session management, increased parse robustness, and a ton of other improvements, speed-ups, and bug fixes.
See the full announcement for all the details on what's changed.
jsoup 1.13.1
See the release notes.
<dependency>
  <!-- jsoup HTML parser library @ https://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.13.1</version>
</dependency>