Horseshoe Priors for Bayesian Neural Networks
Soumya Ghosh (IBM Research, MIT-IBM Watson AI Lab), Jiayu Yao (Harvard), Finale Doshi-Velez (Harvard)
Bayesian Neural Networks
y = h_W(x) + noise
Inputs x; layers, parameterized by weights W; outputs y.
Being Bayesian: prior p(W); posterior p(W | y, x); posterior predictive p(y* | x*, y, x).
Why Bother?
Need to guard against unintended consequences. Need to know when the model doesn't know.
Predictive Uncertainty
[Figure: predictive means with uncertainty bands on toy data]
Larger Data and Modern Architectures
Convolutional Neural Network (LeNet variant)
Train: 60,000 handwritten digits. Test: 10,000 held-out digits. Test error ≈ 1%.
Are the predictions robust?
[Figure: out-of-distribution and corrupted inputs to which the network still assigns confident predictions, e.g. 0.85, 0.99, 0.95]
Bayesian Neural Networks
Distribution on weights. Posterior predictive:
p(y* | x*, y_train, x_train) = ∫ p(y* | W, x*) p(W | y_train, x_train) dW
source: Ghosh et al., AAAI 2016; Balan et al., NIPS 2016
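The predictive integral above is typically approximated by Monte Carlo, averaging the network's output over posterior weight samples. A minimal NumPy sketch (not from the talk; the `posterior_samples` here are stand-ins drawn from a prior, where real ones would come from MCMC or a fitted variational approximation):

```python
import numpy as np

rng = np.random.default_rng(0)

def h(W, x):
    """A toy single-hidden-layer network; W packs both weight matrices."""
    W1, w2 = W
    return np.tanh(x @ W1) @ w2

# Stand-in for samples from p(W | y_train, x_train).
posterior_samples = [
    (rng.normal(size=(1, 16)), rng.normal(size=(16, 1))) for _ in range(200)
]

def predictive_mean_var(x_star):
    """Monte Carlo estimate of the predictive integral over weights."""
    preds = np.stack([h(W, x_star) for W in posterior_samples])
    return preds.mean(axis=0), preds.var(axis=0)

mu, var = predictive_mean_var(np.array([[0.5]]))
```

The sample variance of the predictions is what gives the uncertainty bands in the figures.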
Bayesian Neural Networks
[Figure: an ambiguous digit; the BNN spreads predictive probability across classes 5, 9, 3 (≈0.30, 0.28, 0.26)]
experiment inspired by: Gal et al., 2015
Distribution over Functions
f(x) = h_W(x)
W is a random variable, so f is a random variable.
BNNs - applications
Model stochastic functions: Depeweg et al., ICLR 2017
Model uncertainty in deterministic functions: Killian et al., NIPS 2017
Predictive uncertainties for active learning and sequential decision making: Hernández-Lobato et al., ICML 2015; Gal et al., ICML 2017; Joshi et al., CVPR 2017; Zhang et al., AISTATS 2018; Depeweg et al., ICML 2018; Riquelme et al., ICLR 2018
Alternate distribution over functions
Gaussian Processes: f(x) ~ GP(m(x), K(x, x'))
Bayesian Neural Networks: f(x) = h_W(x)
Rest of this talk, noisy data model: y(x) | f(x) ~ N(f(x), σ²) (Gaussian likelihoods)
Gaussian Processes, f(x) ~ GP(m(x), K(x, x')):
- Exact inference* (*only for Gaussian likelihoods)
- Scales poorly with n
- Well calibrated uncertainties
- Constraining the space of functions (f ∈ C) can be difficult
Bayesian Neural Networks, f(x) = h_W(x):
- Approximate inference
- Scales well
- Predictive uncertainties can be poor
- Some constraints are easy; depends on C
Gaussian Processes, f(x) ~ GP(m(x), K(x, x')):
Completely specified by m(x) and K(x, x').
Intuitive, well understood parameterization.
Bayesian Neural Networks, f(x) = h_W(x):
Need to specify architecture, non-linearity, and a prior on the weights p(W).
Implied distribution on functions is poorly understood.
Predictive Uncertainties?
Single layer network, with prior W ~ N(0, I).
(Same results across many initialization strategies.)
What is happening?
Prior uncertainty
f(x) = b + Σ_{j=1}^J w_j φ(x; u_j), with w_j ~ N(0, σ_w²) and b ~ N(0, σ_b²).
E_w[f(x)] = 0
E_w[f(x) f(x')] = σ_b² + J σ_w² E_u[φ(x; u_j) φ(x'; u_j)]
Computing with Infinite Networks, C. Williams, NIPS 1997; Bayesian Learning for Neural Networks, R. Neal, LNS, 1996
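The covariance identity above can be checked empirically. A hedged NumPy sketch, assuming a one-dimensional input and tanh units φ(x; u_j) = tanh(u_j x) with no hidden biases (a simplification of the slide's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_w2, sigma_b2, J, n_draws = 1.0, 1.0, 500, 2000
x = 0.3

# Draw random functions f(x) = b + sum_j w_j * tanh(u_j * x) from the prior.
u = rng.normal(size=(n_draws, J))
w = rng.normal(scale=np.sqrt(sigma_w2), size=(n_draws, J))
b = rng.normal(scale=np.sqrt(sigma_b2), size=n_draws)
f = b + (w * np.tanh(u * x)).sum(axis=1)

empirical_var = f.var()
# The formula at x' = x: sigma_b^2 + J * sigma_w^2 * E_u[phi(x; u)^2]
predicted_var = sigma_b2 + J * sigma_w2 * np.mean(np.tanh(u * x) ** 2)
```

With σ_w² fixed, the prior variance of f(x) grows linearly in the number of hidden units J.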
Predictive Uncertainties?
Single layer network, with prior W ~ N(0, I).
(Same results across many initialization strategies.)
What is happening? Bigger network → more parameters for the same data → more parameter uncertainty → higher predictive variance.
Bounding prior variance
E_w[f(x) f(x')] = σ_b² + J σ_w² E_u[φ(x; u_j) φ(x'; u_j)]
Could scale by J: σ_w² = a / J (C. Williams, NIPS 1997; R. Neal, LNS, 1996),
or force J to be small, by turning units off.
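The σ_w² = a/J scaling can be checked the same way: with the weight variance shrunk as units are added, the prior variance of f(x) stays roughly constant in J. A sketch under the same toy assumptions (1-D input, tanh units, no hidden biases):

```python
import numpy as np

rng = np.random.default_rng(1)
a, n_draws, x = 1.0, 2000, 0.3

def function_variance(J):
    # Prior variance of f(x) when the weight variance is scaled as a / J.
    u = rng.normal(size=(n_draws, J))
    w = rng.normal(scale=np.sqrt(a / J), size=(n_draws, J))
    f = (w * np.tanh(u * x)).sum(axis=1)
    return f.var()

v_small, v_large = function_variance(10), function_variance(1000)
```

Both estimates land near a · E_u[tanh(u x)²], independent of J.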
Horseshoe Priors for Model Selection
The horseshoe prior is a scale mixture of normals:
w_k ~ N(0, τ_k² v²), τ_k ~ C⁺(0, 1)
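Sampling from the horseshoe shows why it suits model selection: the half-Cauchy scale mixture concentrates most weights very near zero while its heavy tails still allow occasional large ones. A small NumPy sketch (the global scale v is fixed to 1 here for illustration; the thresholds are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
v = 1.0  # global scale, fixed for illustration

# tau_k ~ C+(0, 1): absolute value of a standard Cauchy draw
tau = np.abs(rng.standard_cauchy(n))
# w_k ~ N(0, tau_k^2 * v^2)
w = rng.normal(scale=tau * v)

frac_tiny = np.mean(np.abs(w) < 0.05)   # weights shrunk toward zero
frac_large = np.mean(np.abs(w) > 10.0)  # weights left essentially unshrunk
```

A Gaussian prior with comparable width would produce essentially no draws beyond 10; the horseshoe produces both a spike at zero and a heavy tail.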
Group Horseshoe Priors for BNNs
Horseshoe BNN:
For each layer l, draw a global scale: v_l ~ C⁺(0, b_g).
For node k in layer l, draw a local scale: τ_kl ~ C⁺(0, b_0).
For each weight incident on the node: w_{k',k,l} ~ N(0, τ_kl² v_l²).
Inference: stochastic gradient variational Bayes (BBVI) with reparameterized gradients.
L(φ) = E_{q(W,τ,v;φ)}[ln p(y | W, x) + ln p(W, τ, v)] + H[q(W, τ, v; φ)]
The expectation through the NN is intractable.
Model Selection in Bayesian Neural Networks via Horseshoe Priors; Ghosh & Doshi-Velez, 2017. Bayesian Compression for Deep Learning; Louizos et al., 2017
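The generative process above can be sketched directly. The function below is a hypothetical helper (not the authors' code) that samples one layer's weights under the group horseshoe prior, with one local scale per output node:

```python
import numpy as np

rng = np.random.default_rng(0)

def half_cauchy(scale, size, rng):
    """Draw from C+(0, scale) as the absolute value of a Cauchy draw."""
    return np.abs(scale * rng.standard_cauchy(size))

def sample_horseshoe_layer(n_in, n_out, b_g=1.0, b_0=1.0, rng=rng):
    """One layer under the group horseshoe prior: a global scale v_l for
    the layer, a local scale tau_kl per node, and Gaussian weights with
    standard deviation tau_kl * v_l on the weights incident on node k."""
    v_l = half_cauchy(b_g, None, rng)
    tau = half_cauchy(b_0, n_out, rng)
    W = rng.normal(scale=tau * v_l, size=(n_in, n_out))
    return W, tau, v_l

W, tau, v_l = sample_horseshoe_layer(5, 20)
```

Because τ_kl multiplies every weight into node k, shrinking τ_kl toward zero turns the whole unit off, which is the model-selection mechanism.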
Local Reparameterization
Continuous weights and variances → reparameterization trick.
Naive application: draw a weight matrix W_l^(s) ~ q(W) (# inputs × # outputs) and compute pre-activations B_l = A W_l^(s).
Local reparameterization: draw the pre-activations B_l^(s) ~ q(B) directly; provably lower variance.
For certain q(W), the form of the implied q(B = AW) is known.
Variational Dropout and the Local Reparameterization Trick; Kingma et al., 2015
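For a fully factorized Gaussian q(W) (one such "certain q(W)"), the implied distribution of the pre-activations B = AW is itself Gaussian with mean Aμ and variance (A²)σ² elementwise, so B can be sampled directly with one noise draw per data point. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 8, 4, 3
A = rng.normal(size=(n, d_in))                       # minibatch of inputs
mu = rng.normal(size=(d_in, d_out))                  # q(W) means
sigma2 = rng.uniform(0.1, 1.0, size=(d_in, d_out))   # q(W) variances

# Naive: sample W ~ q(W), compute B = A W (one W shared by the whole batch).
# Local: sample B directly from its implied Gaussian, per data point.
b_mean = A @ mu
b_var = (A ** 2) @ sigma2
eps = rng.normal(size=(n, d_out))
B = b_mean + np.sqrt(b_var) * eps
```

Averaging naive samples A @ (μ + σε) recovers exactly these per-point means and variances, which is what makes the two estimators unbiased for the same objective.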
Variational Family
Fully factorized variational approximation:
q(W, τ, v; φ) = Π_{i,j,l} N(w_{ijl} | μ_{ijl}, σ_{ijl}²) · Π_{k,l} q(τ_kl | φ_kl) · Π_l q(v_l | φ_l)
Louizos et al., 2017; Ghosh & Doshi-Velez, 2017
But horseshoe shrinkage stems from coupling between weights and scales. Retaining this structure is important for strong shrinkage!
Group Horseshoe Priors for BNNs
Regularized Horseshoe BNN:
For each layer l, draw a global scale: v_l ~ C⁺(0, b_g).
For node k in layer l, draw a local scale: τ_kl ~ C⁺(0, b_0).
For each weight incident on the node: w_{k',k,l} ~ N(0, τ_kl² v_l²).
Inference: stochastic gradient variational Bayes with structured (rather than naive fully factorized) variational approximations.
Regularized Horseshoe
p(w_{k',k,l} | τ_kl, v_l, c) ∝ N(w_{k',k,l} | 0, τ_kl² v_l²) · N(w_{k',k,l} | 0, c²)
Equivalently, w_{k',k,l} | c, τ_kl, v_l ~ N(0, σ̃_kl²), where 1/σ̃_kl² = 1/c² + 1/(τ_kl² v_l²).
Piironen & Vehtari, 2017
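The effective variance σ̃_kl² is the harmonic combination of τ_kl² v_l² and c²: for small scales it matches the plain horseshoe, and for large scales it saturates at c², which is what tames the half-Cauchy's heavy tails. A one-line check:

```python
import numpy as np

def reg_hs_variance(tau, v, c):
    """Effective weight variance under the regularized horseshoe:
    1/var = 1/c^2 + 1/(tau^2 * v^2)."""
    return 1.0 / (1.0 / c**2 + 1.0 / (tau**2 * v**2))

small = reg_hs_variance(tau=0.01, v=1.0, c=2.0)  # ~ tau^2 v^2: shrinkage kept
large = reg_hs_variance(tau=1e6, v=1.0, c=2.0)   # ~ c^2: tail capped
```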
Regularized Horseshoe BNNs
1/σ̃_kl² = 1/c² + 1/(τ_kl² v_l²)
[Figure: random functions from single hidden layer (tanh) networks with 50 and 500 units, under horseshoe (HS) and regularized horseshoe (reg-HS) priors]
Regularized Horseshoe BNNs
UCI Regression Benchmarks (Hernández-Lobato and Adams, 2015)
[Figure: relative improvement, (x − y) / max(x, y), plotted against log(n)]
reg-HS BNNs improve predictive performance over HS BNNs for smaller datasets.
Structured Variational Approximation
Consider the weights incident on a unit. Non-centered parameterization:
β_kl ~ N(0, I), w_kl = τ_kl v_l β_kl
Layer-specific structured variational approximations over the weights and (log) scales:
q(β_l) = Matrix-Normal(M, U, V), with U = hhᵀ plus a diagonal term.
The low dimensional covariance maintains posterior structure between weights and scales.
Local re-parameterization applies: q(β_kl | φ_kl) is again matrix normal.
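Sampling from a matrix normal with a rank-one-plus-diagonal row covariance is cheap via Cholesky factors: X = M + L_U Z L_Vᵀ with Z iid standard normal has E[(X − M)(X − M)ᵀ] = tr(V)·U. A sketch (the specific U below is illustrative, not the paper's exact parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_matrix_normal(M, U, V, rng):
    """Draw X ~ MN(M, U, V) as X = M + chol(U) @ Z @ chol(V).T, Z iid N(0,1)."""
    Lu, Lv = np.linalg.cholesky(U), np.linalg.cholesky(V)
    Z = rng.normal(size=M.shape)
    return M + Lu @ Z @ Lv.T

p, q = 4, 3
M = np.zeros((p, q))
h = rng.normal(size=(p, 1))
U = h @ h.T + 0.1 * np.eye(p)  # rank-one plus diagonal row covariance
V = np.eye(q)
X = sample_matrix_normal(M, U, V, rng)
```

The rank-one-plus-diagonal structure keeps the number of covariance parameters linear in the layer width while still coupling the rows.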
Synthetic Data: Better Fits
[Figure: posterior predictive fits from a 1000-unit regularized horseshoe network, factorized vs. structured VI, with 20, 100, and 200 training points]
Both use regularized horseshoe priors.
Structured vs Factorized
Structured vs Factorized
Five hundred training points.
UCI Regression Tasks
Structured variational approximation → stronger shrinkage, similar predictive performance.
Predictive performance: comparisons with Variational Matrix Gaussian (Louizos & Welling, ICML 2016).
Pruning rule uses the variational posterior: prune a unit if q(τ_kl v_l < δ) > p_0.
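The pruning rule is easy to apply given samples of the scales from the variational posterior. A toy sketch with hypothetical log-normal posteriors over the scales (the δ and p₀ values here are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_mask(tau_samples, v_samples, delta=1e-3, p0=0.9):
    """Prune node k if q(tau_kl * v_l < delta) > p0, with the probability
    estimated from samples of the variational posterior over the scales."""
    scale = tau_samples * v_samples  # shape: (n_samples, n_nodes)
    return (scale < delta).mean(axis=0) > p0

# Toy posterior samples: node 0 is strongly shrunk, node 1 is active.
tau = np.stack([np.exp(rng.normal(-10, 0.1, size=1000)),
                np.exp(rng.normal(0, 0.1, size=1000))], axis=1)
v = np.exp(rng.normal(0, 0.1, size=(1000, 1)))
mask = prune_mask(tau, v)
```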
Summary
(Regularized) horseshoe priors for BNNs can assist with model selection: they recover small networks with performance similar to larger networks.
Careful modeling of posterior structure between weights and scales is essential for reliable shrinkage.
We are hiring!
http://mitibmwatsonailab.mit.edu/careers/
http://www.research.ibm.com/labs/cambridge/
75 Binney Street, Cambridge, MA 02142