Horseshoe Priors for Bayesian Neural Networks


Horseshoe Priors for Bayesian Neural Networks
Soumya Ghosh (IBM Research, MIT-IBM Watson AI Lab), Jiayu Yao (Harvard), Finale Doshi-Velez (Harvard)

Bayesian Neural Networks

y = h_W(x) + noise

Being Bayesian: place a prior p(W) on the weights, infer the posterior p(W | y, x), and predict with p(y* | x*, y, x).

Inputs x → layers parameterized by weights W → outputs y

Why Bother? Need to guard against unintended consequences. Need to know when the model doesn't know.

Predictive Uncertainty

[figure: toy regression data with predictive uncertainty]

Larger Data and Modern Architectures. Convolutional neural network (LeNet variant). Train: 60,000 handwritten digits (MNIST); test: 10,000 held-out digits. Test error ≈ 1%.

Are the predictions robust?

[figure: ambiguous digit images on which the network is overconfident, with softmax probabilities such as 0.85, 0.99, 0.95]

Bayesian Neural Networks

A distribution on weights induces the posterior predictive:

p(y* | x*, y_train, x_train) = ∫ p(y* | W, x*) p(W | y_train, x_train) dW

source: Ghosh et al., AAAI 2016; Balan et al., NIPS 2016
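
The predictive integral above is intractable for a neural network, but it is commonly approximated by Monte Carlo: draw weight samples from the (approximate) posterior and average the network's predictions. A minimal sketch, where `sample_posterior_weights` and `forward` are hypothetical stand-ins for a posterior sampler and the network h_W:

```python
import numpy as np

def predictive(x_star, sample_posterior_weights, forward, n_samples=100):
    """Monte Carlo estimate of the posterior predictive mean and variance.

    sample_posterior_weights() is a hypothetical stand-in for draws from
    p(W | y_train, x_train); forward(W, x) computes h_W(x).
    """
    preds = np.array([forward(sample_posterior_weights(), x_star)
                      for _ in range(n_samples)])
    return preds.mean(axis=0), preds.var(axis=0)
```

The predictive variance reported here is exactly the spread across weight samples — the quantity the deck uses to judge whether the model "knows when it doesn't know".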

Bayesian Neural Networks

[figure: the same ambiguous digits; the BNN spreads its predictive probability (e.g. 0.30, 0.28, 0.26 across classes 5, 9, 3) instead of being overconfident]

experiment inspired by: Gal et al., 2015

Distribution over Functions

f(x) = h_W(x): with W a random variable, f(x) is itself a random variable — the prior on weights induces a distribution over functions.

BNNs — applications

- Model stochastic functions (Depeweg et al., ICLR 2017)
- Model uncertainty in deterministic functions (Killian et al., NIPS 2017)
- Predictive uncertainties for active learning and sequential decision making (Hernández-Lobato et al., ICML 2015; Gal et al., ICML 2017; Joshi et al., CVPR 2017; Zhang et al., AISTATS 2018; Depeweg et al., ICML 2018; Riquelme et al., ICLR 2018)

Alternate distribution over functions

Gaussian processes: f(x) ~ GP(m(x), K(x, x')). Bayesian neural networks: h_W(x).

Rest of this talk — noisy data model: y(x) | f(x) ~ N(f(x), σ²), i.e. Gaussian likelihoods.

Gaussian processes, f(x) ~ GP(m(x), K(x, x')):
- Exact inference (* only for Gaussian likelihoods)
- Scales poorly with n
- Well-calibrated uncertainties
- Constraining the space of functions (f ~ GP(·, ·), f ∈ C) can be difficult

Bayesian neural networks, h_W(x):
- Approximate inference
- Scales well
- Predictive uncertainties can be poor
- Some constraints are easy; depends on C
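
To make "a distribution over functions" concrete, here is a small sketch of drawing function values from a zero-mean GP prior at a grid of inputs; the squared-exponential kernel is an illustrative choice, not one specified in the slides:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance K(x, x') on 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_gp_prior(x, n_samples=3, jitter=1e-8, seed=0):
    """Draw f(x) ~ GP(0, K) jointly at the input locations x."""
    K = rbf_kernel(x, x) + jitter * np.eye(len(x))  # jitter for stability
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(len(x)), K, size=n_samples)
```

Each row of the returned array is one random function evaluated on the grid — the GP analogue of drawing one W from p(W) and plotting h_W.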

Gaussian processes, f(x) ~ GP(m(x), K(x, x')), are completely specified by the mean m(x) and kernel K(x, x') — an intuitive, well-understood parameterization.

Bayesian neural networks, h_W(x), require specifying an architecture, a non-linearity, and a prior on weights p(W) — and the implied distribution on functions is poorly understood.

Predictive Uncertainties?

Single-layer network, with prior W ~ N(0, I). (Same results across many initialization strategies.) What is happening?

Prior uncertainty

f(x) = b + Σ_{j=1..J} w_j φ(x; u_j),  w_j ~ N(0, σ_w²),  b ~ N(0, σ_b²)

E_w[f(x)] = 0
E_w[f(x) f(x')] = σ_b² + J σ_w² E_u[φ(x; u_j) φ(x'; u_j)]

Computing with Infinite Networks, C. Williams, NIPS 1997; Bayesian Learning for Neural Networks, R. Neal, LNS, 1996
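
The growth of the prior covariance with width J can be checked empirically: at a fixed input, the variance of f(x) is σ_b² + J σ_w² E_u[φ²], linear in J. A sketch, using tanh hidden units as an illustrative choice of φ(x; u_j):

```python
import numpy as np

def sample_prior_functions(x, J, n_draws=2000, sigma_w=1.0, sigma_b=1.0, seed=0):
    """Draw f(x) = b + sum_j w_j * tanh(u_j * x) under the stated priors.

    tanh features with random input weights u_j are an illustrative choice;
    the slide's phi(x; u_j) is any hidden-unit basis function.
    """
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(n_draws, J))            # hidden-unit parameters
    w = sigma_w * rng.normal(size=(n_draws, J))  # output weights w_j
    b = sigma_b * rng.normal(size=(n_draws, 1))  # bias b
    return b + (w * np.tanh(u * x)).sum(axis=1, keepdims=True)

# Prior variance at a fixed input grows linearly in J unless sigma_w is rescaled.
v10 = sample_prior_functions(1.0, J=10).var()
v100 = sample_prior_functions(1.0, J=100).var()
```

This is exactly the effect the next slide names: a bigger network with a fixed-scale weight prior has a wider prior (and hence predictive) variance.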

Predictive Uncertainties?

Single-layer network, with prior W ~ N(0, I). (Same results across many initialization strategies.) What is happening? Bigger network → more parameters, same data → more parameter uncertainty → higher predictive variance.

Bounding prior variance

f(x) = b + Σ_{j=1..J} w_j φ(x; u_j),  w_j ~ N(0, σ_w²),  b ~ N(0, σ_b²)

E_w[f(x) f(x')] = σ_b² + J σ_w² E_u[φ(x; u_j) φ(x'; u_j)]

Either scale the weight variance with the width, σ_w² = a/J (C. Williams, NIPS 1997; R. Neal, LNS, 1996), or force the effective J to be small by turning units off.

Horseshoe Priors for Model Selection

The horseshoe prior is a scale mixture of normals:

w_k | λ_k ~ N(0, λ_k² v²),  λ_k ~ C⁺(0, 1)
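
Sampling from the horseshoe is easy because of its scale-mixture form: draw a half-Cauchy local scale, then a conditionally Gaussian weight. A minimal sketch:

```python
import numpy as np

def sample_horseshoe(n, v=1.0, seed=0):
    """Draw w_k | lambda_k ~ N(0, lambda_k^2 v^2), lambda_k ~ C+(0, 1).

    A half-Cauchy draw is the absolute value of a standard Cauchy draw.
    """
    rng = np.random.default_rng(seed)
    lam = np.abs(rng.standard_cauchy(size=n))  # local scales, C+(0, 1)
    return rng.normal(scale=lam * v)           # conditionally Gaussian weights

w = sample_horseshoe(10000)
# The draws show the horseshoe's signature: a spike of near-zero weights
# (strong shrinkage) together with a few very large ones (heavy tails).
```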

Group Horseshoe Priors for BNNs

Horseshoe BNN. For each layer l, draw a global scale: v_l ~ C⁺(0, b_g). For each node k in layer l, draw a local scale: λ_kl ~ C⁺(0, b_0). For each weight incident on node k: w_k'k,l ~ N(0, λ_kl² v_l²).

Inference: stochastic gradient variational Bayes (BBVI) with reparameterized gradients,

L(φ) = E_{q(W,λ,v;φ)}[ln p(y | W, x) + ln p(W, λ, v)] + H[q(W, λ, v; φ)]

For a neural network the expectation is intractable and is estimated by sampling.

Model Selection in Bayesian Neural Networks via Horseshoe Priors, Ghosh & Doshi-Velez, 2017; Bayesian Compression for Deep Learning, Louizos et al., 2017
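
The generative side of the group horseshoe can be sketched directly — one half-Cauchy scale per layer, one per node, and weights conditionally Gaussian. Shapes and scale parameters here are illustrative:

```python
import numpy as np

def sample_group_horseshoe_layer(n_in, n_out, b_g=1.0, b_0=1.0, seed=0):
    """Sample one layer's weights under the group horseshoe prior:

    v_l ~ C+(0, b_g) is a layer-wide scale, lambda_kl ~ C+(0, b_0) is one
    scale per output node k, and every weight incident on node k is drawn
    as N(0, lambda_kl^2 * v_l^2).
    """
    rng = np.random.default_rng(seed)
    v = b_g * np.abs(rng.standard_cauchy())              # global layer scale
    lam = b_0 * np.abs(rng.standard_cauchy(size=n_out))  # one scale per node
    return rng.normal(scale=lam * v, size=(n_in, n_out))

W = sample_group_horseshoe_layer(5, 3)
```

Because all weights into a node share one λ_kl, a small λ_kl shrinks the node's entire incoming weight vector toward zero — this is what lets the prior switch whole units off.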

Local Reparameterization

Continuous weights and variances → reparameterization trick. Naive application: sample a full weight matrix W_l^(s) ~ q(W) (size #inputs × #outputs). Local reparameterization: sample the pre-activations B_l^(s) ~ q(B) instead, where B_l = A W_l for layer inputs A — provably lower variance. For certain q(W), the form of the implied q(B = A W_l) is known.

Variational dropout and the local reparameterization trick, Kingma et al., 2015
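
A sketch of the trick for a linear layer with a factorized Gaussian q(W): given the inputs A, B = A W is itself Gaussian, so the pre-activations can be sampled directly without ever sampling W:

```python
import numpy as np

def local_reparam_linear(A, mu_W, sigma_W, seed=0):
    """Sample pre-activations B = A @ W without sampling W itself.

    For factorized Gaussian q(W) with mean mu_W and std sigma_W, the implied
    q(B | A) is Gaussian with mean A @ mu_W and variance (A**2) @ sigma_W**2,
    so we reparameterize the (batch, n_out) pre-activations directly.
    """
    rng = np.random.default_rng(seed)
    mean = A @ mu_W
    std = np.sqrt((A ** 2) @ (sigma_W ** 2))
    return mean + std * rng.normal(size=mean.shape)
```

The per-example noise this produces is what lowers the gradient variance relative to sharing one sampled W across the whole minibatch.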

Variational Family

Fully factorized variational approximation:

q(W, λ, v; φ) = Π_{i,j,l} N(w_ijl | μ_ijl, σ²_ijl) · Π_{k,l} q(λ_kl | φ_kl) · Π_l q(v_l | φ_vl)

(Louizos et al., 2017; Ghosh & Doshi-Velez, 2017)

But horseshoe shrinkage stems from the coupling between weights and scales — retaining this structure is important for strong shrinkage!

Group Horseshoe Priors for BNNs

Regularized Horseshoe BNN. For each layer l, draw a global scale: v_l ~ C⁺(0, b_g). For each node k in layer l, draw a local scale: λ_kl ~ C⁺(0, b_0). For each weight incident on node k: w_k'k,l ~ N(0, λ_kl² v_l²).

Inference: structured stochastic gradient variational Bayes, rather than a naive fully factorized variational approximation.

Regularized Horseshoe

p(w_k'k,l | λ_kl, v_l, c) ∝ N(w_k'k,l | 0, λ_kl² v_l²) · N(w_k'k,l | 0, c²)

Equivalently, w_k'k,l | c, λ_kl, v_l ~ N(w_k'k,l | 0, λ̃_kl² v_l²), with

1 / (λ̃_kl² v_l²) = 1/c² + 1/(λ_kl² v_l²)

Piironen & Vehtari, 2017
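
The regularized scale λ̃_kl v_l acts as a soft cap: when the horseshoe scale λ_kl v_l is far below c it is left essentially unchanged, and when it is far above c it saturates at c. A small sketch of the formula:

```python
import numpy as np

def regularized_scale(lam, v, c):
    """Combine the horseshoe scale lam*v with the Gaussian slab width c via

        1 / (lam_tilde^2 v^2) = 1/c^2 + 1/(lam^2 v^2),

    returning lam_tilde * v. Large lam*v is capped near c; small lam*v
    passes through essentially unchanged.
    """
    hs_var = (lam * v) ** 2
    reg_var = (c ** 2 * hs_var) / (c ** 2 + hs_var)  # harmonic combination
    return np.sqrt(reg_var)
```

This is why, unlike the plain horseshoe, the regularized horseshoe's heavy Cauchy tails cannot produce arbitrarily large weights.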

Regularized Horseshoe BNNs

1 / (λ̃_kl² v_l²) = 1/c² + 1/(λ_kl² v_l²)

[figure: random functions drawn from single-hidden-layer (tanh) networks with horseshoe and regularized-horseshoe priors, at 50 and 500 hidden units]

Regularized Horseshoe BNNs

$$\frac{1}{\tilde{\tau}_{kl}^2 v_l^2} = \frac{1}{c^2} + \frac{1}{\tau_{kl}^2 v_l^2}$$

UCI Regression Benchmarks (Hernández-Lobato and Adams, 2015)

[Figure: relative improvement of reg-HS over HS, $(x - y)/\max(|x|, |y|)$, plotted against $\log(n)$.]

reg-HS BNNs improve predictive performance over HS BNNs for smaller datasets.
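The relative-improvement metric reported in the benchmark comparison is a one-liner; a small illustrative sketch (names are my own):

```python
def relative_improvement(x, y):
    """Relative improvement of x over y: (x - y) / max(|x|, |y|).
    Symmetric normalization keeps the value bounded in [-1, 1]."""
    return (x - y) / max(abs(x), abs(y))

print(relative_improvement(2.0, 1.0))  # 0.5
print(relative_improvement(1.0, 2.0))  # -0.5
```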

Structured Variational Approximation

Non-centered parameterization of the weights $w_{kl}$ incident on a unit:

$$\beta_{kl} \sim \mathcal{N}(0, I), \qquad w_{kl} = \tau_{kl} v_l \beta_{kl}$$

Layer-specific structured variational approximation, jointly over the weights $\beta_{kl}$ (# outputs $\times$ # inputs) and the log scales $\ln \tau_{kl}$:

$$q(b) = \text{Matrix-Normal}(M, U, V), \qquad U = hh^\top + \Psi$$

The low-dimensional (rank-one plus diagonal) covariance maintains posterior structure between weights and scales; training uses local re-parameterization.
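A minimal NumPy sketch of the non-centered parameterization (names are illustrative; the half-Cauchy draws stand in for the horseshoe scale priors): sampling standard-normal $\beta$ and multiplying by the scales afterward decouples the weights from their scales, which is what lets the variational posterior capture their dependence.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_weights_noncentered(n_out, n_in, tau, v_l):
    """Non-centered horseshoe parameterization: draw standard-normal
    beta and scale it by the per-unit and per-layer scales, so inference
    targets (beta, log scales) rather than the strongly-coupled weights."""
    beta = rng.standard_normal((n_out, n_in))  # beta_kl ~ N(0, 1)
    return beta * tau * v_l                    # w_kl = tau_kl * v_l * beta_kl

# Per-unit scales from a half-Cauchy (heavy tails -> a few large scales).
tau = np.abs(rng.standard_cauchy((5, 1)))
w = sample_weights_noncentered(5, 3, tau, v_l=0.5)
print(w.shape)  # (5, 3)
```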

Synthetic Data: Better Fits

[Figure: posterior predictive fits from 1000-unit single-hidden-layer networks, factorized vs. structured VI, both using regularized horseshoe priors, trained on 20, 100, and 200 points.]

Structured vs Factorized

Structured vs Factorized: 500 training points

UCI Regression Tasks

The structured variational approximation yields stronger shrinkage with similar predictive performance.

The pruning rule uses the variational posterior: prune a unit when $q(\tau_{kl} v_l < \epsilon) > p_0$.

Predictive performance: comparisons with Variational Matrix Gaussian (Louizos & Welling, ICML 2016).
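The pruning rule can be sketched in a few lines. This is an assumption-laden illustration, not the talk's implementation: I assume a log-normal variational posterior over the scale $\tau_{kl} v_l$ (i.e. a Gaussian over its log), so $q(\tau_{kl} v_l < \epsilon)$ is a standard-normal CDF evaluated at $\log \epsilon$; the thresholds `eps` and `p0` are placeholders.

```python
import numpy as np
from math import erf, sqrt

def prune_mask(mu_log_scale, sigma_log_scale, eps=1e-3, p0=0.9):
    """Prune a unit when the variational posterior over its scale
    tau_kl * v_l places more than p0 of its mass below eps.
    With a log-normal posterior over the scale, this probability is
    Phi((log(eps) - mu) / sigma), the standard-normal CDF."""
    z = (np.log(eps) - np.asarray(mu_log_scale)) / np.asarray(sigma_log_scale)
    p_small = np.array([0.5 * (1.0 + erf(zi / sqrt(2.0))) for zi in z])
    return p_small > p0

# Hypothetical posterior means/stds of log(tau_kl * v_l) for three units.
mu = np.array([-10.0, 0.0, -2.0])
sig = np.array([0.5, 0.5, 0.5])
print(prune_mask(mu, sig))  # only the strongly shrunk first unit is pruned
```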

Summary

(Regularized) horseshoe priors for BNNs can assist with model selection.
They recover small networks with predictive performance similar to that of larger networks.
Careful modeling of the posterior structure between weights and scales is essential for reliable shrinkage.

We are hiring!
http://mitibmwatsonailab.mit.edu/careers/
http://www.research.ibm.com/labs/cambridge/
75 Binney Street, Cambridge, MA 02142