Fill and write each array before creating the next one, to save memory. #177
qwertystop wants to merge 3 commits into jcjohnson:master
Conversation
Realized another improvement: Preprocessor can now use numpy arrays of
scripts/preprocess.py
Outdated
@@ -45,33 +45,44 @@
# Choose the datatype based on the vocabulary size
dtype = np.uint8
if len(token_to_idx) > 255:
Minor comment #1: use a single level of branching:

if len(..) > 4294967295:
    ....
elif len(..) > 65535:
    ...
else:
    ...
h.create_dataset(set_name, data=arr)

# Write data to HDF5 file
Minor comment #2: Indentation doesn't match the rest of the file.
LGTM; this is a useful improvement, though I haven't tested it. Two minor comments left inline.
Fixed the commented issues. I have tested it / am currently testing it: the preprocessor ran cleanly on a 1.8 GB input.txt with 256 "characters" (it's raw bytes rather than encoded text), and the network is currently calculating validation loss for iteration 4000, running at default settings. It certainly doesn't seem like it's going to break. Perhaps when this is done I'll work out how to make it read two bytes per "character" and test it on Red Book audio (16-bit stereo) to bump up the data type, then four bytes, treating both channels together as a single sample, to bump it up again. (If someone else wants to do that, it'd be done faster; I've used up my spare-time budget getting this far.)
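Reading two bytes per "character", as floated above, could be sketched like this (my own illustration, not part of the PR; it treats a headerless 16-bit little-endian PCM file as a stream of uint16 tokens, and the file path and function name are made up):

```python
import numpy as np

def tokens_from_pcm16(path):
    # Interpret each 16-bit little-endian sample as one "character".
    # Red Book audio is 16-bit, so the vocabulary can reach 65536 distinct
    # tokens, which pushes the index arrays up to a uint32 dtype.
    return np.fromfile(path, dtype='<u2')

# Demo: write a tiny fake "audio" file and read it back as tokens.
samples = np.array([0, 1000, 65535], dtype='<u2')
samples.tofile('/tmp/demo.pcm')
print(tokens_from_pcm16('/tmp/demo.pcm'))
```

The four-byte variant mentioned above would be the same idea with `dtype='<u4'`, reading both stereo channels of one sample frame as a single token.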
Anybody want to merge this? It looks perfect as a solution for a 2 GB dataset in 4 GB of RAM.
The previous version said "we'll have to do something more clever for huge datasets."
I tried this with a huge dataset (the entire five-CD Okami soundtrack, WAV-formatted with headers stripped out, tracks separated by five seconds of silence, repeated three times and shuffled... about 10 GB, but then the preprocessor maps each byte to a uint32 so it gets magnified a bit).
I wouldn't call it "something clever," but it seems to work. All I did was re-organize things so that instead of making all the numpy arrays at once, filling them, and writing them all to h5 files, it makes one, fills it, writes it, garbage-collects it, then goes on to the next.
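The reorganization described above can be sketched as follows (a minimal standalone illustration; the split names, file path, and toy data are my assumptions, not the PR's exact code):

```python
import gc
import numpy as np
import h5py

# Stand-in token streams for each split; in the real preprocessor these
# come from tokenizing input.txt.
token_streams = {
    'train': range(6),
    'val': range(3),
    'test': range(2),
}

with h5py.File('/tmp/demo.h5', 'w') as h:
    for set_name, stream in token_streams.items():
        # Build one split's array, write it, then free it before starting
        # the next, so only one split is resident in memory at a time.
        arr = np.fromiter(stream, dtype=np.uint8)
        h.create_dataset(set_name, data=arr)
        del arr
        gc.collect()

with h5py.File('/tmp/demo.h5', 'r') as h:
    print({k: h[k].shape for k in h})
```

The key difference from the pre-PR flow is the loop body: allocate, write, and release each array in turn, rather than holding all three splits in memory before the first HDF5 write.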