Skip to content

Latest commit

 

History

History
136 lines (134 loc) · 3.93 KB

File metadata and controls

136 lines (134 loc) · 3.93 KB

March 29, 2017

Parallel and distributed algorithms

Motivations

Historical currents

origin of neural nets renaissance

parallel computing (hardware and software)

Computer vs. brain

brain is a parallel computer

large number of slow elements

we simulate it on a digital computer

neural nets are amenable to parallelism

hardware neural nets will come eventually

Practical

Accuracy may be the primary goal

Efficiency may be necessary to get there

larger net takes more RAM

longer training time

longer inference time

efficiency

speed

state of the art always takes a week to train

memory

high end GPU has 12 GB of RAM

practical knowledge

what CPU or GPU should I purchase?

what software should I use?

how can I maximize performance?

kinds of parallelism

MIMD parallelism

multiple instructions, multiple data

e.g. multiple cores in a single Xeon chip

SIMD parallelism

“single instruction multiple data”

aka “vector arithmetic” because adding two vectors is an example of SIMD

not to be confused with “vectorized” code in MATLAB/Python etc.

examples

single core of Intel CPU

GPU

easy to harness for minibatch

hardware examples

consumer grade

GTX 1080 Ti 3584 cores @ 1.48 GHz 11 GB RAM 250 W $699

server grade

much more expensive

CPU

Intel Xeon (multicore)

24 cores @ 2.4 GHz 165W 60 MB cache max turbo frequency 3.4 GHz 8-way configuration possible $8898

AVX and AVX2

256 bits
eight Float32 (single precision)
four Float64 (double precision)
TensorFlow warnings: not compiled for SSE, AVX, FMA, etc

Intel Xeon Phi (manycore)

Xeon Phi 7210 “Knights Landing”

64 cores @ 1.3 GHz 215 W $2438 4-way configuration possible

AVX512

512 bits, 16 Float32

FLOPS

floating point operations per second

theoretical maximum

Intel

one vector instruction per cycle (with FMA)

number of cores x clock speed * SIMD size * 2 (FMA)

E7-8894 v4 0.9 TFLOPS

Xeon Phi 7210 5.3 TFLOPS

NVIDIA

number of cores x clock speed * 2 (FMA)

GTX 1080 11 TFLOPS

use single precision!

people are even trying half precision or less these days

FLOPS gap between GPU and CPU is decreasing

good software required to utilize FLOPS efficiently

now approaching 100% of theoretical max FLOPs

NVIDIA cuDNN

Intel MKL

what is the difficulty?

von Neumann bottleneck

CPU-RAM communication is slow

GPU-RAM communication is even slower

not a problem for the brain

computation and memory are intermingled

memory hierarchy

register file

L1, L2, L3 cache

shared RAM

could include SSD and HDD

speed vs. capacity tradeoff

maximize reuse of data in fast memory

multilayer perceptron

BLAS GEMM (general matrix matrix multiplication)

convolutional net

im2col + GEMM

direct convolution

FFT

IFFT of product of FFTs

less FLOPs for small kernels

FFT reuse

outgoing

incoming

forward and backward

multiprocessor memory types

shared memory

distributed memory

distributed computing

“nodes” that don’t share memory

two motivations

network is too large to fit in GPU memory

network fits, but want faster training

model parallelism

divide network into modules

distribute modules across multiple compute nodes

modules should be weakly connected as internode communication is slow

data parallelism

minibatch

each node computes gradient

combine gradient updates

synchronous

every node maintains the same weights

asynchronous

weights can be slightly out of sync across nodes