In gauteh/hidefix#8 a couple of different DBs have been benchmarked. Deserializing the full index of a large file (4 GB) takes about 8 µs on my laptop; the index is about 8 MB and takes about 100-150 ns to read from memory-mapped local databases (sled, heed). Reading it (8 MB binary) from redis, sqlite or similar takes about 3 to 6 ms, which is maybe a bit too high. It would be interesting to also try postgres.
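A minimal sketch of that local-read path, assuming the index is stored as serde-serialized bytes in sled keyed by dataset path; the `DatasetIndex` type and the key are placeholders, not hidefix's actual types:

```rust
use serde::Deserialize;
use std::time::Instant;

#[derive(Deserialize)]
struct DatasetIndex {
    // Placeholder for chunk offsets/sizes; hidefix's real index is richer.
    chunks: Vec<(u64, u64)>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the local embedded store (sled keeps hot pages in memory).
    let db = sled::open("/tmp/index-cache")?;

    // Read the serialized index (~8 MB for the 4 GB file mentioned above).
    let t0 = Instant::now();
    let bytes = db.get("some-dataset.nc")?.expect("index not registered");
    println!("read:        {:?}", t0.elapsed());

    // Deserialize the full index.
    let t1 = Instant::now();
    let _index: DatasetIndex = bincode::deserialize(&bytes)?;
    println!("deserialize: {:?}", t1.elapsed());

    Ok(())
}
```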
We need to keep data-discovery and dataset removal/update in mind:
- I think datasets should be registered, not auto-discovered by the data-server: the registration could be run by another dedicated service that auto-detects/scrapes sources.
- When a data-file turns out to be missing, or its mtime has changed, we return an error and possibly notify the scraper-service (see the sketch after this list).
- I think we have to assume internal network latency is acceptable; I don't see how we can do much about that, except keeping communication to a minimum.
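A minimal sketch of that existence/mtime check, using only std; the names are hypothetical and the caller would decide how to notify the scraper-service:

```rust
use std::path::Path;
use std::time::SystemTime;

#[derive(Debug)]
enum DatasetError {
    Missing,
    Modified,
    Io(std::io::Error),
}

/// Verify that the registered source file still exists and has not been modified.
fn verify_source(path: &Path, registered_mtime: SystemTime) -> Result<(), DatasetError> {
    let meta = match std::fs::metadata(path) {
        Ok(m) => m,
        Err(e) if e.kind() == std::io::ErrorKind::NotFound => return Err(DatasetError::Missing),
        Err(e) => return Err(DatasetError::Io(e)),
    };

    let mtime = meta.modified().map_err(DatasetError::Io)?;
    if mtime != registered_mtime {
        // Caller could notify the scraper-service here before failing the request.
        return Err(DatasetError::Modified);
    }

    Ok(())
}
```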
A solution could be:
- Keep a central DB with the index, DAS, DDS and list of datasets. This could be an SQL server or similar; it is only written to by the scraper.
- Each worker has a local cache of datasets (index, DAS, DDS), e.g. in heed, or maybe even just in memory. Instead of verifying against the central DB that a dataset still exists, it checks the mtime of the source on each request; if the mtime has changed, it updates the cache from the server (see the cache sketch after this list). In the case of NCML-aggregates such changes will not be discovered this way.
- When the central DB is changed, cache clearing is triggered at the workers; retrieving new data from the central server is pretty cheap. This will handle NCML-changes.
- This will make it possible to extend to cloud data-sources, since the central DB would then point to e.g. an S3 URL.
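A rough sketch of the worker-side flow under those assumptions; `CentralDb`, `CachedDataset` and the field layout are hypothetical, and the in-memory map stands in for e.g. heed:

```rust
use std::collections::HashMap;
use std::time::SystemTime;

#[derive(Clone)]
struct CachedDataset {
    mtime: SystemTime,
    index: Vec<u8>, // serialized index
    das: String,
    dds: String,
}

/// Abstraction over the central store (SQL server or similar), written to by the scraper only.
trait CentralDb {
    fn fetch(&self, id: &str) -> Option<CachedDataset>;
}

struct WorkerCache<C: CentralDb> {
    central: C,
    local: HashMap<String, CachedDataset>,
}

impl<C: CentralDb> WorkerCache<C> {
    /// Return the dataset, refreshing from the central DB if the source mtime changed.
    fn get(&mut self, id: &str, source_mtime: SystemTime) -> Option<CachedDataset> {
        match self.local.get(id) {
            Some(d) if d.mtime == source_mtime => Some(d.clone()),
            _ => {
                let fresh = self.central.fetch(id)?;
                self.local.insert(id.to_string(), fresh.clone());
                Some(fresh)
            }
        }
    }

    /// Called when the central DB signals a change (e.g. an NCML update).
    fn clear(&mut self) {
        self.local.clear();
    }
}
```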
Unfortunately this complicates things significantly, but I don't see how to avoid it when scaling up. It would be nice to still support a stand-alone server that does not need a central db, but just caches locally and discovers datasets itself in some way. That would make it significantly easier to test the server out.
Some reasons:
- Storing the full index of all datasets on every worker takes a lot of space and needs to be kept in sync.
- Keeping the index on a network disk is probably too slow, and embedded databases like SQLite are still too slow, so a memory-mapped DB is needed anyway.
- Indexing on demand is too slow, especially for aggregated datasets.
Since data is usually on network disks, caching the data itself could possibly be done using a large file-system cache or maybe something like https://docs.rs/freqfs/latest/freqfs/index.html.
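As a sketch, simply memory-mapping the data file lets the OS file-system cache keep hot chunks in memory; I have not sketched freqfs here, since its API would need a closer look. This uses the memmap2 crate, and the function name is just illustrative:

```rust
use memmap2::Mmap;
use std::fs::File;

/// Read one chunk of the data file, relying on the OS page cache for repeated reads.
fn read_chunk(path: &str, offset: usize, len: usize) -> std::io::Result<Vec<u8>> {
    let file = File::open(path)?;
    // Safety: the file must not be truncated or modified while mapped.
    let map = unsafe { Mmap::map(&file)? };
    Ok(map[offset..offset + len].to_vec())
}
```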