
Stop adding data: and mailto: URIs to the database #483

@JustAnotherArchivist

As of wpull 2.0.3, data: and mailto: URIs get added to the database, although neither serves any purpose. Not only are these schemes unsupported, there is also nothing to retrieve for them anyway. tel: URIs (currently entirely unsupported and treated as relative paths instead) should likely be handled the same way.
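To illustrate the kind of check being requested, here is a minimal sketch of a scheme filter that could run before a URL is queued. It is not wpull's actual code: `SKIPPED_SCHEMES` and `should_skip` are hypothetical names, and where such a check would belong inside wpull is left open.

```python
import re

# RFC 3986 scheme syntax: a letter followed by letters, digits, "+", "-" or ".".
_SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):")

# Hypothetical list of schemes for which there is nothing to retrieve.
SKIPPED_SCHEMES = frozenset({"data", "mailto", "tel"})


def should_skip(url: str) -> bool:
    """Return True if the URL uses a scheme that should never be stored."""
    match = _SCHEME_RE.match(url.strip())
    # Schemes are case-insensitive, so compare in lowercase.
    return bool(match) and match.group(1).lower() in SKIPPED_SCHEMES
```

Checking the scheme before resolving the link against the page URL would also keep tel: links from being misread as relative paths.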

As an extreme real-world example of the impact: an ArchiveBot job's database grew to 106 GB over the past couple of days due to data: URIs embedded in every page. After purging these URIs with the following commands (likely not the most efficient approach)

sqlite3 wpull.db 'SELECT id FROM url_strings WHERE url LIKE "data:%"' | sed 's,^.*$,UPDATE url_strings SET url = "data:<removed-&>" WHERE id = &\;,' >cmds
sqlite3 wpull.db <cmds
sqlite3 wpull.db VACUUM

the database size dropped to 860 MB.
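For anyone needing the same cleanup, a possibly more efficient variant is a single UPDATE statement followed by VACUUM. The sketch below uses Python's standard sqlite3 module and assumes only what the commands above show: a `url_strings` table with `id` and `url` columns. The function name `purge_data_urls` is hypothetical, and this has not been run against a real wpull database, so work on a copy.

```python
import sqlite3


def purge_data_urls(db_path: str) -> int:
    """Replace every data: URL with a short placeholder, then compact the file.

    Returns the number of rows rewritten.
    """
    conn = sqlite3.connect(db_path)
    try:
        # "with conn" commits the UPDATE on success and rolls back on error.
        with conn:
            cursor = conn.execute(
                "UPDATE url_strings "
                "SET url = 'data:<removed-' || id || '>' "
                "WHERE url LIKE 'data:%'"
            )
            changed = cursor.rowcount
        # VACUUM cannot run inside a transaction, so it follows the commit.
        conn.execute("VACUUM")
    finally:
        conn.close()
    return changed
```

As in the original commands, each placeholder embeds the row's `id`, so the rewritten values stay distinct from one another.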
