Data-discovery and index #19

@gauteh

Description

In gauteh/hidefix#8 a couple of different DBs have been benchmarked. Deserializing the full index of a large file (4 GB) takes about 8 µs (on my laptop); the index is about 8 MB and takes about 100 to 150 ns to read from memory-mapped local databases (sled, heed). Reading it (8 MB binary) from redis, SQLite or similar takes about 3 to 6 ms, which is maybe a bit too high. It would be interesting to also try Postgres.
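Micro-benchmarks like those referenced above can be reproduced with `std::time::Instant`; here is a minimal sketch (the `time_us` helper and the payload are illustrative, not the actual hidefix index or benchmark harness):

```rust
use std::time::Instant;

/// Time a closure, returning its result and the elapsed time in microseconds.
/// Illustrative helper; real benchmarks should use criterion or similar.
fn time_us<T>(f: impl FnOnce() -> T) -> (T, u128) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed().as_micros())
}
```

For stable numbers at the nanosecond scale mentioned above, a proper harness (e.g. criterion, as used in the hidefix benchmarks) is needed, since a single `Instant` measurement is dominated by noise.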

  1. We need to keep data-discovery and dataset removal/update in mind:
  • I think datasets should be registered, not auto-discovered by the data server: the registration could be run by another dedicated service that auto-detects/scrapes sources.
  • When a data file turns out to be missing, or its mtime has changed, we return an error, possibly notifying the scraper service.
  2. I think we have to assume internal network latency is OK; I don't see how we can do much about that, except keeping communication to a minimum.
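The missing-file/changed-mtime check could look roughly like this (a std-only sketch; `SourceStatus` and `check_source` are hypothetical names, not hidefix APIs):

```rust
use std::fs;
use std::path::Path;
use std::time::SystemTime;

/// Hypothetical result of validating a cached dataset against its source file.
#[derive(Debug, PartialEq)]
enum SourceStatus {
    Fresh,    // mtime matches the one recorded at indexing time
    Modified, // file changed since indexing: refresh from the central DB
    Missing,  // file gone: return an error, possibly notify the scraper service
}

/// Compare the mtime recorded when the dataset was registered with the
/// file's current mtime on disk.
fn check_source(path: &Path, indexed_mtime: SystemTime) -> SourceStatus {
    match fs::metadata(path).and_then(|m| m.modified()) {
        Ok(mtime) if mtime == indexed_mtime => SourceStatus::Fresh,
        Ok(_) => SourceStatus::Modified,
        Err(_) => SourceStatus::Missing,
    }
}
```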

A solution could be:

  • Keep a central DB with the index, DAS, DDS and list of datasets. This could be an SQL server or whatever; it is only written to by the scraper.
  • Each worker has a local cache of datasets (index, DAS, DDS), e.g. heed, or maybe even just in-memory. Rather than verifying against the central DB that a dataset still exists, it checks the mtime of the source on request; if the mtime has changed, it updates the cache from the central server. In the case of NcML aggregates this will not be discovered.
  • When the central DB is changed, cache clearing is triggered at the workers. Retrieving new data from the central server is pretty cheap. This will handle NcML changes.
  • This makes it possible to extend to cloud data sources, since the central DB would then point to e.g. an S3 URL.
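The worker-side cache plus central invalidation could be sketched as follows, with a generation counter standing in for the "central DB changed" signal. All names are illustrative (an in-memory `HashMap` stands in for heed; `WorkerCache` is not a hidefix type):

```rust
use std::collections::HashMap;

/// Hypothetical cached per-dataset metadata: index bytes plus DAS/DDS text.
#[derive(Clone)]
struct CachedDataset {
    index: Vec<u8>,
    das: String,
    dds: String,
}

/// In-memory stand-in for a worker's local cache (e.g. heed in practice).
struct WorkerCache {
    entries: HashMap<String, CachedDataset>,
    /// Generation of the central DB this cache was filled from.
    generation: u64,
}

impl WorkerCache {
    fn new() -> Self {
        WorkerCache { entries: HashMap::new(), generation: 0 }
    }

    /// Look up a dataset, clearing the whole cache first if the central DB
    /// has advanced its generation (the "cache clearing is triggered" step).
    /// A miss means the worker must refetch from the central server.
    fn get(&mut self, key: &str, central_generation: u64) -> Option<&CachedDataset> {
        if central_generation != self.generation {
            self.entries.clear();
            self.generation = central_generation;
        }
        self.entries.get(key)
    }

    /// Fill the cache after a miss, e.g. from the central DB.
    fn insert(&mut self, key: String, dataset: CachedDataset) {
        self.entries.insert(key, dataset);
    }
}
```

Clearing everything on a generation bump is the simplest scheme; per-dataset invalidation would avoid refetching unchanged datasets at the cost of a more chatty protocol.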

Unfortunately this complicates things significantly, but I don't see how to avoid it when scaling up. It would be nice to still support a stand-alone server that does not need a central DB, but just caches locally and discovers datasets itself in some way. That would make it significantly easier to test the server out.

Some reasons:

  • Storing the full index of all datasets on every worker takes a lot of space and needs to be kept in sync.
  • Keeping the index on a network disk is probably too slow, and embedded databases like SQLite are still too slow, so a memory-mapped DB is needed anyway.
  • Indexing on demand is too slow, especially for aggregated datasets.

Since data is usually on network disks, caching data could possibly be done using a large file-system cache, or maybe something like https://docs.rs/freqfs/latest/freqfs/index.html.

Metadata

Labels

enhancement (New feature or request), help wanted (Extra attention is needed), question (Further information is requested)
