Investigate incremental WACZ storage #37

@wvengen

At the moment, one needs twice the storage space for all requests and responses (once for the WARC, once for the WACZ). As the required size is not always known beforehand, and can vary from one crawl to another, it would be helpful if this also worked when that much storage space is not available.

Is it possible to write the WACZ incrementally while the spider is running, and if so, how?
Ideally this would happen directly on object storage, but even just skipping the intermediate WARC files would already be an improvement.
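
One possible direction, as a rough sketch rather than a tested design: a WACZ is a ZIP file, and Python's `zipfile` can stream data into a single entry while it is still open, so records could be appended to `archive/data.warc.gz` inside the WACZ as they arrive, without writing a separate WARC first. Everything below is an assumption for illustration: the function name `write_wacz_incrementally` is hypothetical, the payloads are placeholder bytes rather than real WARC records, `datapackage.json` is an empty stub, and the index and pages files a valid WACZ needs are left out. Writing to object storage would additionally need a file-like sink that uploads in parts, which is not shown.

```python
import gzip
import io
import zipfile


def write_wacz_incrementally(sink, records):
    """Stream records into archive/data.warc.gz inside a ZIP written to `sink`.

    `records` is any iterable of bytes (placeholders for serialized WARC records).
    Returns the number of records written.
    """
    count = 0
    # ZIP_STORED: the entry is already gzip-compressed, so the ZIP layer only stores it.
    with zipfile.ZipFile(sink, mode="w", compression=zipfile.ZIP_STORED) as zf:
        # force_zip64 because the final size is unknown while the crawl is running.
        with zf.open("archive/data.warc.gz", mode="w", force_zip64=True) as entry:
            for payload in records:
                # One gzip member per record; concatenated members form a valid gzip stream.
                entry.write(gzip.compress(payload))
                count += 1
        # Placeholder only: a real WACZ needs a proper datapackage.json, index and pages.
        zf.writestr("datapackage.json", "{}")
    return count


if __name__ == "__main__":
    buf = io.BytesIO()
    written = write_wacz_incrementally(buf, (b"record-%d\n" % i for i in range(3)))
    print(written, "records,", len(buf.getvalue()), "bytes")
```

A limitation of this approach is that only one entry can be open for writing at a time, so index and page data would have to be collected separately during the crawl and written once the WARC entry is closed.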

Labels: enhancement