Data-discovery and index #19

@gauteh

Description

In gauteh/hidefix#8 a couple of different DBs have been benchmarked. Deserializing the full index of a large file (4 GB) takes about 8 µs (on my laptop); the index is about 8 MB and takes about 100 to 150 ns to read from memory-mapped local databases (sled, heed). Reading it (8 MB binary) from redis, SQLite or similar takes about 3 to 6 ms, which is maybe a bit too high. It would be interesting to also try Postgres.
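Micro-benchmarks like those referenced above can be reproduced with `std::time::Instant`; here is a minimal sketch (the `time_us` helper and the payload are illustrative, not the actual hidefix index or benchmark harness):

```rust
use std::time::Instant;

/// Time a closure, returning its result and the elapsed time in microseconds.
/// Illustrative helper; real benchmarks should use criterion or similar.
fn time_us<T>(f: impl FnOnce() -> T) -> (T, u128) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed().as_micros())
}
```

For stable numbers at the nanosecond scale mentioned above, a proper harness (e.g. criterion, as used in the hidefix benchmarks) is needed, since a single `Instant` measurement is dominated by noise.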

  1. We need to keep data-discovery and dataset removal/update in mind:
  • I think datasets should be registered, not auto-discovered by the data server: the registration could be run by another dedicated service that auto-detects/scrapes sources.
  • When a data file turns out to be missing, or its mtime has changed, we return an error, possibly notifying the scraper service.
  2. I think we have to assume internal network latency is OK; I don't see how we can do much about that, except keeping communication to a minimum.
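The missing-file/changed-mtime check could look roughly like this (a std-only sketch; `SourceStatus` and `check_source` are hypothetical names, not hidefix APIs):

```rust
use std::fs;
use std::path::Path;
use std::time::SystemTime;

/// Hypothetical result of validating a cached dataset against its source file.
#[derive(Debug, PartialEq)]
enum SourceStatus {
    Fresh,    // mtime matches the one recorded at indexing time
    Modified, // file changed since indexing: refresh from the central DB
    Missing,  // file gone: return an error, possibly notify the scraper service
}

/// Compare the mtime recorded when the dataset was registered with the
/// file's current mtime on disk.
fn check_source(path: &Path, indexed_mtime: SystemTime) -> SourceStatus {
    match fs::metadata(path).and_then(|m| m.modified()) {
        Ok(mtime) if mtime == indexed_mtime => SourceStatus::Fresh,
        Ok(_) => SourceStatus::Modified,
        Err(_) => SourceStatus::Missing,
    }
}
```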

A solution could be:

  • Keep a central DB with the index, DAS, DDS and list of datasets. This could be an SQL server or whatever; it is only written to by the scraper.
  • Each worker has a local cache of datasets (index, DAS, DDS), e.g. heed, or maybe even just in-memory. Rather than verifying against the central DB that a dataset still exists, it checks the mtime of the source on request; if the mtime has changed, it updates the cache from the central server. In the case of NcML aggregates this will not be discovered.
  • When the central DB is changed, cache clearing is triggered at the workers. Retrieving new data from the central server is pretty cheap. This will handle NcML changes.
  • This makes it possible to extend to cloud data sources, since the central DB would then point to e.g. an S3 URL.
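The worker-side cache plus central invalidation could be sketched as follows, with a generation counter standing in for the "central DB changed" signal. All names are illustrative (an in-memory `HashMap` stands in for heed; `WorkerCache` is not a hidefix type):

```rust
use std::collections::HashMap;

/// Hypothetical cached per-dataset metadata: index bytes plus DAS/DDS text.
#[derive(Clone)]
struct CachedDataset {
    index: Vec<u8>,
    das: String,
    dds: String,
}

/// In-memory stand-in for a worker's local cache (e.g. heed in practice).
struct WorkerCache {
    entries: HashMap<String, CachedDataset>,
    /// Generation of the central DB this cache was filled from.
    generation: u64,
}

impl WorkerCache {
    fn new() -> Self {
        WorkerCache { entries: HashMap::new(), generation: 0 }
    }

    /// Look up a dataset, clearing the whole cache first if the central DB
    /// has advanced its generation (the "cache clearing is triggered" step).
    /// A miss means the worker must refetch from the central server.
    fn get(&mut self, key: &str, central_generation: u64) -> Option<&CachedDataset> {
        if central_generation != self.generation {
            self.entries.clear();
            self.generation = central_generation;
        }
        self.entries.get(key)
    }

    /// Fill the cache after a miss, e.g. from the central DB.
    fn insert(&mut self, key: String, dataset: CachedDataset) {
        self.entries.insert(key, dataset);
    }
}
```

Clearing everything on a generation bump is the simplest scheme; per-dataset invalidation would avoid refetching unchanged datasets at the cost of a more chatty protocol.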

Unfortunately this complicates things significantly, but I don't see how to avoid it when scaling up. It would be nice to still support a stand-alone server that does not need a central DB, but just caches locally and discovers datasets itself in some way. That would make it significantly easier to test the server out.

Some reasons:

  • Storing the full index of all datasets on every worker takes a lot of space and needs to be kept in sync.
  • Keeping the index on a network disk is probably too slow, and embedded databases like SQLite are still too slow, so a memory-mapped DB is needed anyway.
  • Indexing on demand is too slow, especially for aggregated datasets.

Since data is usually on network disks, caching data could possibly be done using a large file-system cache, or maybe something like https://docs.rs/freqfs/latest/freqfs/index.html.

Metadata

Labels

enhancement (New feature or request), help wanted (Extra attention is needed), question (Further information is requested)
