Fill and write each array before creating the next one, to save memory. #177
qwertystop wants to merge 3 commits into jcjohnson:master
Conversation
Realized another improvement: Preprocessor can now use numpy arrays of
scripts/preprocess.py
Outdated
@@ -45,33 +45,44 @@
# Choose the datatype based on the vocabulary size
dtype = np.uint8
if len(token_to_idx) > 255:
Minor comment #1: use a single level of branching:

if len(..) > 4294967295:
    ....
elif len(..) > 65535:
    ...
else:
    ...
h.create_dataset(set_name, data=arr)

# Write data to HDF5 file
Minor comment #2: Indentation doesn't match the rest of the file.
LGTM; this is a useful improvement, though I haven't tested it. Two minor comments left inline.
Fixed the commented issues. I have tested it / am currently testing it: the preprocessor ran cleanly on a 1.8 GB input.txt with 256 "characters" (it's raw bytes rather than encoded text), and the network is currently calculating validation loss for iteration 4000, running at default settings. It certainly doesn't seem like it's going to break. Perhaps when this is done I'll work out how to make it read two bytes per "character" and test it on Red Book audio (16-bit stereo) to bump up the data type, then four bytes, treating both channels together as a single sample, to bump it up again. (If someone else wants to do that, it'd be done faster; I've used up my spare-time budget getting this far.)
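Reading two bytes per "character", as floated above, could be sketched like this (my own illustration, not part of the PR; it treats a headerless 16-bit little-endian PCM file as a stream of uint16 tokens, and the file path and function name are made up):

```python
import numpy as np

def tokens_from_pcm16(path):
    # Interpret each 16-bit little-endian sample as one "character".
    # Red Book audio is 16-bit, so the vocabulary can reach 65536 distinct
    # tokens, which pushes the index arrays up to a uint32 dtype.
    return np.fromfile(path, dtype='<u2')

# Demo: write a tiny fake "audio" file and read it back as tokens.
samples = np.array([0, 1000, 65535], dtype='<u2')
samples.tofile('/tmp/demo.pcm')
print(tokens_from_pcm16('/tmp/demo.pcm'))
```

The four-byte variant mentioned above would be the same idea with `dtype='<u4'`, reading both stereo channels of one sample frame as a single token.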
Anybody want to merge this? It looks perfect as a solution for a 2 GB dataset in 4 GB of RAM.
The previous version said "we'll have to do something more clever for huge datasets."
I tried this with a huge dataset (the entire five-CD Okami soundtrack, WAV-formatted with headers stripped out, tracks separated by five seconds of silence, repeated three times and shuffled... about 10 GB, but then the preprocessor maps each byte to a uint32 so it gets magnified a bit).
I wouldn't call it "something clever," but it seems to work. All I did was re-organize things so that instead of making all the numpy arrays at once, filling them, and writing them all to h5 files, it makes one, fills it, writes it, garbage-collects it, then goes on to the next.
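The reorganization described above can be sketched as follows (a minimal standalone illustration; the split names, file path, and toy data are my assumptions, not the PR's exact code):

```python
import gc
import numpy as np
import h5py

# Stand-in token streams for each split; in the real preprocessor these
# come from tokenizing input.txt.
token_streams = {
    'train': range(6),
    'val': range(3),
    'test': range(2),
}

with h5py.File('/tmp/demo.h5', 'w') as h:
    for set_name, stream in token_streams.items():
        # Build one split's array, write it, then free it before starting
        # the next, so only one split is resident in memory at a time.
        arr = np.fromiter(stream, dtype=np.uint8)
        h.create_dataset(set_name, data=arr)
        del arr
        gc.collect()

with h5py.File('/tmp/demo.h5', 'r') as h:
    print({k: h[k].shape for k in h})
```

The key difference from the pre-PR flow is the loop body: allocate, write, and release each array in turn, rather than holding all three splits in memory before the first HDF5 write.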