
html: alternative to cheerio #899

@myfreeer

Summary

Find or build an alternative to cheerio with better performance (CPU or memory usage) that does not break most of the cheerio API currently in use.

Goals

  • Performance
    • Parsing and serializing time should be much shorter than with the current version of cheerio (1.0.0-rc.12)
    • Memory usage of a parsed document should not be larger than with the current version of cheerio
  • Bundle size
    • node_modules of dependents should not be much larger than with the current version of cheerio (1.0.0-rc.12)
    • cheerio and the alternative should not be installed by default
  • Compatibility
    • Most extra costs for developers, such as creating or disposing of wasm instances, should be handled inside website-scrap-engine
    • Most cheerio selectors should just work as is.
    • Most of the cheerio API should just work as is.
    • Should work in any host environment.
    • Should work for HTML, SVG, and sitemap XML.
  • Documentation
    • A breaking release should be made.
    • Should clarify the changed install method in the README, if any
    • Unsupported cheerio APIs should be documented.
    • Benchmark results should be documented.
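
To make the performance goals checkable, a small harness along these lines could time any candidate against cheerio on the same input. This is only a sketch: `HtmlEngine`, `BenchResult`, and `benchEngine` are hypothetical names that do not exist in website-scrap-engine, and the clock is injectable so the harness itself can be tested deterministically.

```typescript
// Hypothetical adapter shape: each candidate (cheerio or an alternative)
// is wrapped as a parse function plus a serialize function.
interface HtmlEngine<Doc> {
  name: string;
  parse(html: string): Doc;
  serialize(doc: Doc): string;
}

interface BenchResult {
  name: string;
  runs: number;
  parseMs: number;     // mean time per parse, in clock units (ms by default)
  serializeMs: number; // mean time per serialize, in clock units
}

function benchEngine<Doc>(
  engine: HtmlEngine<Doc>,
  html: string,
  runs: number = 20,
  now: () => number = () => Date.now(),
): BenchResult {
  if (runs <= 0) throw new RangeError("runs must be positive");
  // One untimed warm-up pass to reduce one-time startup costs in the numbers.
  engine.serialize(engine.parse(html));
  let parseTotal = 0;
  let serializeTotal = 0;
  for (let i = 0; i < runs; i++) {
    const t0 = now();
    const doc = engine.parse(html);
    const t1 = now();
    engine.serialize(doc);
    const t2 = now();
    parseTotal += t1 - t0;
    serializeTotal += t2 - t1;
  }
  return {
    name: engine.name,
    runs,
    parseMs: parseTotal / runs,
    serializeMs: serializeTotal / runs,
  };
}
```

For cheerio, the adapter would presumably be `{ name: "cheerio", parse: (html) => load(html), serialize: ($) => $.html() }`. Memory usage is not covered by this sketch; in Node.js, sampling `process.memoryUsage().heapUsed` before and after parsing is one possible way to approximate it.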

Non-Goals

  • It is not a goal to use native bindings or external processes for this.
  • It is not a goal to use an embedded browser.
  • It is not a goal to change to a fully streaming workflow.
  • It is not a goal to use threaded parsing or serializing.
  • It is not a goal to use GPU, FPGA, or ASIC for this.
  • It is not a goal to make a text() method that exactly matches the behavior in browser.
  • It is not a goal to do charset conversions out of js.

Motivation

The current version of cheerio (1.0.0-rc.12) relies on parse5 to parse and serialize HTML documents. However, parse5 is not very fast, especially for large HTML files, and it adds a significant amount to the node_modules size of dependents. Moreover, JavaScript itself is not a very fast language for manipulating complex data structures such as DOM trees. Profiling shows that HTML parsing takes a considerable share of the time in the website-scrap-engine workflow. Therefore, it would be desirable to find or build an alternative to cheerio that has better performance (CPU or memory usage) without breaking most of the API currently in use.

Other notes

wasm is supported by most Node.js runtimes;
html5ever is a fast parser and serializer written in Rust that can be compiled to wasm, but the resulting wasm binary is too large;
there is some existing work built on html5ever, such as https://crates.io/crates/dom_query, https://crates.io/crates/scraper, https://github.com/importcjj/nipper, https://crates.io/crates/htmler, https://crates.io/crates/kuchiki; some of these are unmaintained but are still good references;
existing work in C/C++, such as libxml2, could also be compiled with wasi-sdk (help me find more alternatives here);
using wasm as a parser only should also be considered: create bytecode from the HTML, then load the bytecode in JS with the existing cheerio code.
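
As an illustration of the parser-only idea, the JS side could rebuild a tree from a flat opcode stream plus a string table. The bytecode layout below (`OP_OPEN`, `OP_TEXT`, `OP_CLOSE`) is invented for this sketch and is not an existing format; the node shape only loosely resembles what cheerio works with and would need to be adapted to its real node classes.

```typescript
// Hypothetical opcodes, assumed to be emitted by a wasm parser:
//   OP_OPEN  nameIdx attrCount (keyIdx valueIdx)*
//   OP_TEXT  stringIdx
//   OP_CLOSE
const OP_OPEN = 1;
const OP_TEXT = 2;
const OP_CLOSE = 3;

interface ElementNode {
  type: "tag";
  name: string;
  attribs: Record<string, string>;
  children: TreeNode[];
}
interface TextNode {
  type: "text";
  data: string;
}
type TreeNode = ElementNode | TextNode;

function decodeTree(ops: Uint32Array, strings: readonly string[]): TreeNode[] {
  const roots: TreeNode[] = [];
  const stack: ElementNode[] = []; // currently open elements
  let i = 0;

  // Read the next number, failing loudly on a truncated stream.
  const next = (): number => {
    const v = ops[i++];
    if (v === undefined) throw new Error("truncated opcode stream");
    return v;
  };
  // Read the next number and resolve it through the string table.
  const str = (): string => {
    const s = strings[next()];
    if (s === undefined) throw new Error("string index out of range");
    return s;
  };
  // Attach a node to the innermost open element, or to the root list.
  const append = (node: TreeNode): void => {
    const parent = stack[stack.length - 1];
    (parent ? parent.children : roots).push(node);
  };

  while (i < ops.length) {
    const op = next();
    if (op === OP_OPEN) {
      const name = str();
      const attrCount = next();
      const attribs: Record<string, string> = {};
      for (let a = 0; a < attrCount; a++) {
        const key = str();
        attribs[key] = str();
      }
      const el: ElementNode = { type: "tag", name, attribs, children: [] };
      append(el);
      stack.push(el);
    } else if (op === OP_TEXT) {
      append({ type: "text", data: str() });
    } else if (op === OP_CLOSE) {
      if (!stack.pop()) throw new Error("unbalanced close opcode");
    } else {
      throw new Error("unknown opcode " + op);
    }
  }
  const unclosed = stack.pop();
  if (unclosed) throw new Error("unclosed element: " + unclosed.name);
  return roots;
}
```

With this layout, only two buffers cross the wasm boundary per document, and all tree objects are ordinary JS objects that the garbage collector manages as usual. Whether decoding is cheaper than parsing with parse5 directly would still have to be measured.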
To better handle GC, one might create a command encoder with a cheerio-like API and load the recorded commands into the wasm module, but this approach makes it impossible to branch on runtime conditions.
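
A minimal sketch of such a command encoder, assuming a hypothetical `Command` format and `CommandEncoder` class (neither exists in cheerio or website-scrap-engine), might look like this. JSON is used for the batch only to keep the example readable; a real implementation would more likely use a binary encoding.

```typescript
// Hypothetical command set; a real one would cover far more of the cheerio API.
type Command =
  | { op: "select"; selector: string }
  | { op: "attr"; name: string; value: string }
  | { op: "remove" };

class CommandEncoder {
  private readonly commands: Command[] = [];

  // Start a new selection; subsequent commands apply to it.
  find(selector: string): this {
    this.commands.push({ op: "select", selector });
    return this;
  }

  // Record an attribute write on the current selection.
  attr(name: string, value: string): this {
    this.commands.push({ op: "attr", name, value });
    return this;
  }

  // Record removal of the current selection.
  remove(): this {
    this.commands.push({ op: "remove" });
    return this;
  }

  // Serialize the whole batch so it can be handed to the wasm module at once.
  finish(): string {
    return JSON.stringify(this.commands);
  }
}

// Usage: nothing touches a document here; the calls are only recorded.
const batch = new CommandEncoder()
  .find("a[href]").attr("rel", "noopener")
  .find("script").remove()
  .finish();
```

The encoder has no getters such as a one-argument `attr(name)`: the document lives only inside the wasm module, so no value is available to JS while commands are being recorded. That is the limitation noted above, since code like "rewrite this link only if its href matches" cannot be expressed unless the condition itself is encoded as a command.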

Metadata

Labels: BREAKING (issues that can cause breaking changes), enhancement (New feature or request), help wanted (Extra attention is needed)
