March 29, 2017
Parallel and distributed algorithms
origin of neural nets renaissance
parallel computing (hardware and software)
brain is a parallel computer
large number of slow elements
we simulate it on a digital computer
neural nets are amenable to parallelism
hardware neural nets will come eventually
Accuracy may be the primary goal
Efficiency may be necessary to get there
larger net takes more RAM
state of the art always takes a week to train
high end GPU has 12 GB of RAM
what CPU or GPU should I purchase?
what software should I use?
how can I maximize performance?
“multiple instruction multiple data” (MIMD)
e.g. multiple cores in a single Xeon chip
“single instruction multiple data” (SIMD)
aka “vector arithmetic” because adding two vectors is an example of SIMD
not to be confused with “vectorized” code in MATLAB/Python etc.
easy to harness for minibatch
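A conceptual sketch of the SIMD idea above, in plain Python. Python itself does not issue SIMD instructions, so this only models the behavior: one "instruction" operates on every lane of a register at once. The lane count `SIMD_WIDTH = 8` and the helper names `simd_add` and `vector_add` are assumptions made for illustration.

```python
# Conceptual model only: plain Python does not execute real SIMD instructions.
SIMD_WIDTH = 8  # assumed lane count, e.g. eight Float32 values per register


def simd_add(a, b):
    """Model one vector 'add' instruction: every lane is added at once."""
    assert len(a) == len(b) == SIMD_WIDTH
    return [x + y for x, y in zip(a, b)]


def vector_add(u, v):
    """Add two equal-length vectors and count the vector instructions used."""
    assert len(u) == len(v) and len(u) % SIMD_WIDTH == 0
    out, instructions = [], 0
    for i in range(0, len(u), SIMD_WIDTH):
        out.extend(simd_add(u[i:i + SIMD_WIDTH], v[i:i + SIMD_WIDTH]))
        instructions += 1
    return out, instructions


# A minibatch of 32 examples: the same operation is applied to each example,
# so the examples can fill the lanes of the vector unit.
batch_a = [float(i) for i in range(32)]
batch_b = [1.0] * 32
result, n_instr = vector_add(batch_a, batch_b)
```

With 32 values and 8 lanes, the loop issues 4 vector instructions instead of 32 scalar ones, which is why a minibatch maps naturally onto SIMD hardware.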
GTX 1080 Ti
3584 cores @ 1.48 GHz
11 GB RAM
250 W
$699
much more expensive
24 cores @ 2.4 GHz
165 W
60 MB cache
max turbo frequency 3.4 GHz
8-way configuration possible
$8898
eight Float32 (single precision)
four Float64 (double precision)
TensorFlow warnings: not compiled for SSE, AVX, FMA, etc
Intel Xeon Phi (manycore)
64 cores @ 1.3 GHz
215 W
$2438
4-way configuration possible
floating point operations per second
one vector instruction per cycle (with FMA)
CPU peak: number of cores × clock speed × SIMD size × 2 (FMA)
GPU peak: number of cores × clock speed × 2 (FMA)
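The peak-FLOPS formulas above can be checked with a few lines of arithmetic. This sketch plugs in the GTX 1080 Ti and 24-core Xeon figures listed earlier, with a SIMD size of eight Float32 values. It assumes, as the notes do, one FMA vector instruction per core per cycle; real chips may issue more, so treat the results as rough estimates. The function name `peak_flops` is made up for this example.

```python
def peak_flops(cores, clock_hz, simd_width=1, flops_per_fma=2):
    """Theoretical peak: cores x clock x SIMD width x 2 (an FMA is 2 FLOPs).

    Assumes one (vector) FMA instruction per core per cycle.
    """
    return cores * clock_hz * simd_width * flops_per_fma


# GPU: each core is scalar, so the SIMD width is 1.
gpu_peak = peak_flops(cores=3584, clock_hz=1.48e9)

# CPU: eight Float32 lanes per vector instruction (assumed SIMD size).
cpu_peak = peak_flops(cores=24, clock_hz=2.4e9, simd_width=8)

print(f"GPU: {gpu_peak / 1e12:.2f} TFLOPS")
print(f"CPU: {cpu_peak / 1e12:.2f} TFLOPS")
```

Under these assumptions the GPU comes out near 10.6 TFLOPS and the CPU near 0.92 TFLOPS in single precision.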
people are even trying half precision or less these days
FLOPS gap between GPU and CPU is decreasing
good software required to utilize FLOPS efficiently
now approaching 100% of theoretical max FLOPS
CPU-RAM communication is slow
GPU to host RAM communication is even slower
not a problem for the brain
computation and memory are intermingled
could include SSD and HDD
speed vs. capacity tradeoff
maximize reuse of data in fast memory
BLAS GEMM (general matrix matrix multiplication)
fewer FLOPs for small kernels
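One way to picture "maximize reuse of data in fast memory" is blocked (tiled) matrix multiplication, the idea that optimized GEMM routines build on. The sketch below is plain Python and shows only the loop structure, not real performance. The block size and the function names `matmul_naive` and `matmul_blocked` are assumptions for illustration and are not taken from any BLAS library.

```python
def matmul_naive(A, B):
    """Reference triple loop: C = A @ B for lists of lists."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C


def matmul_blocked(A, B, block=2):
    """Same product, computed tile by tile.

    Each (block x block) tile of A and B is reused for a whole tile of C,
    which is what lets a real implementation keep it in cache or registers.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for p0 in range(0, k, block):
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        acc = C[i][j]
                        for p in range(p0, min(p0 + block, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C


A = [[float(i + j) for j in range(4)] for i in range(4)]
B = [[float(i * j) for j in range(4)] for i in range(4)]
C_blocked = matmul_blocked(A, B)
```

Both functions perform the same number of FLOPs; the blocked version only reorders them so that a small working set is touched repeatedly before moving on.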
multiprocessor memory types
“nodes” that don’t share memory
network is too large to fit in GPU memory
network fits, but want faster training
divide network into modules
distribute modules across multiple compute nodes
modules should be weakly connected as internode communication is slow
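A toy sketch of the module-per-node idea above, in plain Python. Each hypothetical `Node` holds one module (here a small dense layer with ReLU), and the only data crossing a node boundary is the activation vector, so narrow interfaces between modules mean less internode traffic. The class, the function `forward_pipeline`, and the weights are invented for illustration; there is no real networking here.

```python
class Node:
    """Hypothetical compute node holding one module: a dense layer with ReLU."""

    def __init__(self, weights):
        self.weights = weights  # one row per output unit, one column per input

    def forward(self, x):
        return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
                for row in self.weights]


def forward_pipeline(nodes, x):
    """Run the modules in order and count the floats sent between nodes."""
    floats_sent = 0
    for i, node in enumerate(nodes):
        x = node.forward(x)
        if i < len(nodes) - 1:
            floats_sent += len(x)  # activations handed to the next node
    return x, floats_sent


node_a = Node([[1.0, -1.0, 0.5], [0.0, 2.0, 1.0]])  # 3 inputs -> 2 outputs
node_b = Node([[1.0, 1.0]])                         # 2 inputs -> 1 output
output, floats_sent = forward_pipeline([node_a, node_b], [1.0, 2.0, 3.0])
```

Here only two floats cross the boundary between the nodes per example. A wider interface between modules would raise that count and, with it, the communication cost.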
each node computes gradient
every node maintains the same weights
weights can be slightly out of sync across nodes
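A minimal sketch of the gradient-per-node scheme above, assuming a synchronous update on a one-parameter model y = w * x. Each simulated node computes a gradient on its own data shard, the gradients are averaged, and every node applies the same update so the weight copies stay identical. The slightly-out-of-sync (asynchronous) variant is not shown. The names `gradient` and `data_parallel_step`, the toy data, and the learning rate are assumptions for illustration.

```python
def gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(weights, shards, lr=0.01):
    """One synchronous step: local gradients, average, identical update."""
    grads = [gradient(w, s) for w, s in zip(weights, shards)]
    avg = sum(grads) / len(grads)  # stands in for an all-reduce across nodes
    return [w - lr * avg for w in weights]


# Toy data generated from y = 3x, split evenly across two simulated nodes.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[0:4], data[4:8]]
weights = [0.0, 0.0]  # one copy of the weight per node

for _ in range(200):
    weights = data_parallel_step(weights, shards)
```

Because the shards have equal size, the averaged gradient equals the gradient over the full batch, so the result matches single-node training on all the data.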