Neural Networks (The Machine-Learning Kind) BCS 247 March 2019
Neurons http://biomedicalengineering.yolasite.com/neurons.php Networks https://en.wikipedia.org/wiki/network_theory#/media/file:social_network_analysis_visualization.png Neural Networks? https://en.wikibooks.org/wiki/artificial_neural_networks/activation_functions
Artificial Neurons
Neural output is summarized as a single non-negative number (analogous to a firing rate)
Inputs from other neurons are weighted (dendrites) and summed (soma)
Output (axon) consists of this sum passed through a nonlinear activation function f() (analogous to a spiking threshold); f() enforces non-negativity
[diagram of the computational abstraction: weighted inputs w1x1, w2x2, w3x3 feed a sum Σ followed by f]
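A minimal sketch of this computational abstraction (assuming NumPy, ReLU as the non-negative activation f(), and made-up weights and inputs):

```python
import numpy as np

def relu(z):
    """Nonlinear activation f(); keeps the output non-negative like a firing rate."""
    return np.maximum(0.0, z)

def neuron(x, w):
    """Weight the inputs (dendrites), sum them (soma), pass through f() (axon output)."""
    return relu(np.dot(w, x))

x = np.array([0.5, -1.0, 2.0])   # inputs x1, x2, x3 from other neurons
w = np.array([0.8, 0.2, -0.4])   # synaptic weights w1, w2, w3
print(neuron(x, w))              # a single non-negative output number
```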
McCulloch-Pitts Neurons (~1943)
Single neurons as logic gates (AND, OR, NOT)
Networks of them can implement any truth table
http://ecee.colorado.edu/~ecen4831/lectures/nnet2.html
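A sketch of McCulloch-Pitts-style threshold units acting as logic gates (assuming binary inputs and a hard threshold; the particular weights and thresholds here are illustrative choices, not taken from the original paper):

```python
import numpy as np

def threshold_unit(x, w, theta):
    """Fire (output 1) if the weighted sum of binary inputs reaches the threshold."""
    return int(np.dot(w, x) >= theta)

AND = lambda a, b: threshold_unit([a, b], w=[1, 1], theta=2)
OR  = lambda a, b: threshold_unit([a, b], w=[1, 1], theta=1)
NOT = lambda a:    threshold_unit([a],    w=[-1],   theta=0)

# Truth tables: composing such units implements any Boolean function.
for a in (0, 1):
    for b in (0, 1):
        print(f"a={a} b={b}  AND={AND(a, b)}  OR={OR(a, b)}  NOT(a)={NOT(a)}")
```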
Perceptrons (~1957) Output = classification (weights = separating hyperplane) https://blog.dbrgn.ch/2013/3/26/perceptrons-in-python/ The XOR problem ~1969 http://ecee.colorado.edu/~ecen4831/lectures/nnet3.html
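A sketch of the perceptron learning rule (assuming NumPy, labels in {-1, +1}, and a fixed learning rate). On the XOR data below the weights never settle, which is the XOR problem; on linearly separable data the same loop finds a separating hyperplane:

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    """Classic perceptron rule: nudge the hyperplane toward each misclassified point."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # wrong side of the hyperplane
                w += lr * yi * xi
                b += lr * yi
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, -1])                    # XOR labels: not linearly separable
print(train_perceptron(X, y))                   # the weights keep thrashing; no solution exists
```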
Multi-Layer Perceptrons (sigmoid activation) http://matlabgeeks.com/tips-tutorials/neural-networks-a-multilayer-perceptron-in-matlab/
Universal Approximation Theorem (~1989): a single hidden layer (with enough units) is sufficient to make an MLP a universal approximator of continuous functions
deeplearning.net/tutorial/mlp.html
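A minimal one-hidden-layer MLP forward pass (assuming NumPy, sigmoid hidden units, and random untrained weights). The theorem guarantees that a wide enough hidden layer can approximate any continuous function, but says nothing about how to find the weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    h = sigmoid(W1 @ x + b1)   # single hidden layer of sigmoid units
    return W2 @ h + b2         # linear read-out

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 2, 16, 1
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_out, n_hidden)), np.zeros(n_out)
print(mlp_forward(np.array([0.3, -0.7]), W1, b1, W2, b2))
```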
Gartner Hype Cycle How many hidden layer neurons? Learnability? https://en.wikipedia.org/wiki/hype_cycle#/media/file:gartner_hype_cycle.svg
Deep Learning: an MLP with one hidden layer is shallow; a deep network stacks many hidden layers
http://neuralnetworksanddeeplearning.com
1980s vs. 2010s: what changed?
1980s: single hidden layer; fully-connected architecture; sigmoid activation function; small data and toy problems
2010s: multiple hidden layers (depth); specialized architectures, e.g. convolution and recurrence; ReLU activation function; Big Data and fast computers; some theoretical progress
Backpropagation
Credit assignment problem: in a deep network, will adjusting a given weight help or hurt?
How MLPs and deeper networks are trained
Originally created for supervised learning problems
Nothing but the chain rule from calculus, applied to a cost function
Backpropagation
$y = h(g(f(x)))$
$y = h(g(f(x, \theta_f), \theta_g), \theta_h)$
[diagram: $x \to f \to g \to h \to y$]
How will a change in $\theta_f$ affect the output $y$?
- Difficult to predict in the forward direction: changes in $f$ affect $g$, which in turn affects $h$
- Chain rule:
$\frac{\partial y}{\partial \theta_f} = \frac{\partial f}{\partial \theta_f}\,\frac{\partial g}{\partial f}\,\frac{\partial h}{\partial g}\,\frac{\partial y}{\partial h}$
$\frac{\partial y}{\partial \theta_g} = \frac{\partial g}{\partial \theta_g}\,\frac{\partial h}{\partial g}\,\frac{\partial y}{\partial h}$
[diagram: $x \to f \to g \to h \to y$, each block annotated with its local gradients $\partial f/\partial x$, $\partial f/\partial \theta_f$, $\partial g/\partial f$, $\partial g/\partial \theta_g$, $\partial h/\partial g$, $\partial h/\partial \theta_h$, $\partial y/\partial h$]
Each layer computes local gradient information of its output with respect to (i) its input and (ii) its parameters
Compute backwards from the output: the chain rule is the product of a local gradient with everything in front of it
Since all computations are local, we can easily compute gradients for $h(g(f(x)))$ or $f(f(g(f(x))))$, etc.
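A sketch of this backward pass for y = h(g(f(x, theta_f))) using only local gradients and the chain rule (assuming scalar inputs and simple made-up layers, purely for illustration):

```python
import numpy as np

# Forward pass: each "layer" returns its output plus its local gradients.
def f(x, theta_f):
    out = theta_f * x
    return out, {"wrt_input": theta_f, "wrt_theta": x}

def g(a):
    out = np.tanh(a)
    return out, {"wrt_input": 1.0 - np.tanh(a) ** 2}

def h(b):
    out = b ** 2
    return out, {"wrt_input": 2.0 * b}

x, theta_f = 1.5, 0.8
a, f_local = f(x, theta_f)
b, g_local = g(a)
y, h_local = h(b)

# Backward pass: multiply each local gradient by everything "in front" of it.
dy_db = h_local["wrt_input"]                  # dy/db
dy_da = dy_db * g_local["wrt_input"]          # dy/da       = dy/db * db/da
dy_dtheta_f = dy_da * f_local["wrt_theta"]    # dy/dtheta_f = dy/da * da/dtheta_f
print(y, dy_dtheta_f)
```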
Stochastic Gradient Descent (SGD)
Gradient Descent (GD) minimizes a function by making small adjustments to the parameters in the direction opposite the gradient
The backpropagation algorithm computes the gradient with respect to each weight in the network
A neural network's loss function is typically a sum of per-data-point losses over a very large number of data points!
Solution: get a noisy estimate of the gradient from a batch of data
Take a small step along the noisy (negative) gradient direction
Take smaller and smaller steps over time (as in simulated annealing)
Lots of tricks for making this process work better in practice
Stochastic Gradient Descent (SGD)
$\mathrm{Loss} = \sum_{i=1}^{N} \mathrm{error}(f(x_i; w),\, \hat{y}_i)$
$\nabla_w \mathrm{Loss} = \sum_{i=1}^{N} \nabla_w\, \mathrm{error}(f(x_i; w),\, \hat{y}_i)$
$\widehat{\nabla_w \mathrm{Loss}} = \sum_{b \in \mathrm{batch}} \nabla_w\, \mathrm{error}(f(x_b; w),\, \hat{y}_b), \qquad \mathbb{E}\left[\widehat{\nabla_w \mathrm{Loss}}\right] = \nabla_w \mathrm{Loss}$
$w \leftarrow w - \eta\, \widehat{\nabla_w \mathrm{Loss}}$
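A sketch of SGD on a squared-error loss (assuming NumPy, a linear model f(x; w) = w·x, a synthetic dataset, and a decaying step size; all names and constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                    # data points x_i
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=1000)      # targets y_hat_i

w = np.zeros(3)
batch_size, lr = 32, 0.1
for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size  # noisy estimate of the gradient
    w -= lr / (1 + 0.01 * step) * grad            # smaller and smaller steps over time
print(w)                                          # ends up close to w_true
```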
Specialized Architectures
Fully-Connected
Convolutional Neural Networks
Convolutional Neural Networks http://deeplearning.net/tutorial/lenet.html http://on-demand.gputechconf.com/gtc/2015/webinar/deep-learning-geoint.pdf
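A sketch of the sliding-window operation a convolutional layer performs (assuming NumPy only, a single input channel, and a made-up vertical-edge kernel; like most CNN libraries, this computes cross-correlation rather than a flipped-kernel convolution):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image, taking a dot product at every location."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).random((8, 8))       # toy one-channel "image"
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])                 # responds to vertical edges
feature_map = np.maximum(0.0, conv2d(image, kernel))  # convolution followed by ReLU
print(feature_map.shape)                              # (6, 6)
```

Because the same small kernel is applied at every image location, the layer shares weights across space, which is what distinguishes it from a fully-connected layer.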
Recurrent Neural Networks
[unrolled diagram: at each time step t, the hidden state h_t receives the input x_t and the previous hidden state h_{t-1}, and produces the output y_t]
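A minimal recurrent forward pass matching the unrolled diagram (assuming NumPy, a tanh recurrence, and random untrained weights):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 4, 8, 2, 5
W_xh = rng.normal(size=(n_hidden, n_in))      # input -> hidden
W_hh = rng.normal(size=(n_hidden, n_hidden))  # hidden -> hidden (recurrence)
W_hy = rng.normal(size=(n_out, n_hidden))     # hidden -> output

xs = rng.normal(size=(T, n_in))               # a toy input sequence x_1 .. x_T
h = np.zeros(n_hidden)                        # initial hidden state h_0
for x_t in xs:
    h = np.tanh(W_xh @ x_t + W_hh @ h)        # h_t depends on x_t and h_{t-1}
    y_t = W_hy @ h                            # output y_t read out from h_t
    print(y_t)
```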
Are neural networks neural? They are feedforward, organized into strict layers, deterministic, trained with supervised learning and backpropagation, and they violate Dale's law. But can they still tell us something about representations in the brain?