Reinforcement Learning Through the Optimization Lens
Benjamin Recht, University of California, Berkeley
trustable, scalable, predictable
Reinforcement Learning is the study of how to use past data to enhance the future manipulation of a dynamical system — Control Theory!
Disciplinary Biases
Control Theory (AE/CE/EE/ME): continuous, model → action, IEEE Transactions
Reinforcement Learning (CS): discrete, data → action, Science Magazine
Disciplinary Biases
Today's talk will try to unify these camps and point out how to merge their perspectives.
Main research challenge: What are the fundamental limits of learning systems that interact with the physical environment? How well must we understand a system in order to control it?
Theoretical foundations: statistical learning theory, robust control theory, core optimization.
Control theory is the study of dynamical systems with inputs.
x_{t+1} = A x_t + B u_t
y_t = C x_t + D u_t
The simplest such systems are linear systems.
x_t is called the state, and the dimension of the state is called the degree, d.
u_t is called the input, and its dimension is p.
y_t is called the output, and its dimension is q.
For today, we will only consider C = I, D = 0 (x_t observed).
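To make the recursion concrete, here is a minimal simulation sketch (Python/NumPy assumed; the particular matrices are illustrative, not from the talk):

```python
import numpy as np

def simulate_linear_system(A, B, x0, inputs):
    """Roll out x_{t+1} = A x_t + B u_t and return the state trajectory."""
    x = np.asarray(x0, dtype=float)
    traj = [x]
    for u in inputs:
        x = A @ x + B @ np.atleast_1d(u)
        traj.append(x)
    return np.stack(traj)

# Example: a d=2 state system with a scalar input (p=1); C = I, D = 0, so y_t = x_t.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
states = simulate_linear_system(A, B, x0=[1.0, 0.0], inputs=0.1 * np.ones(10))
print(states.shape)   # (11, 2): x_0 through x_10
```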
Reinforcement Learning is the study of discrete dynamical systems with inputs.
p(x_{t+1} | past) = p(x_{t+1} | x_t, u_t)
p(y_t | past) = p(y_t | x_t, u_t)
Simplest example: Partially Observed Markov Decision Process (POMDP).
x_t is the state, and it takes values in [d].
u_t is called the input, and takes values in [p].
y_t is called the output, and takes values in [q].
For today, we will only consider the case when x_t is observed (MDP).
<latexit sha1_base64="otgopnlc3lpbujxkzhalqk3gehs=">aaacoxicbvhbahsxejw3tzs9oe1jx0rnwqhx7izc8xiibaet5mg9oanyyzkrhdsiwmmrrsvm8u/0a/ra/kx/plrhgdrugobwzlw0z/jksudx/kcv3bp95+69vfv7dx4+evykffd03blvbq6fucze5ubqsy1dkqtwsriiza7wir961+gx39e6afq3wlsyljdvciifukcydm9m4dpij7zrs6q3vouh1/nzta+szw+extfupkpdrn2j+/eq+c5i1qdd1jhidlrpuddcl6hjkhbulmqvptvykklhcn/shvygrmckowa1lojserxwkr8mtmenxoania/yfytqkj1blhnilifmbltryp9pi0+t47swuvkewlwpmnjfyfdgi15ii4luigaqvoa/cjedc4kckxttvr0rfbub1hovptafbrgk5mqhka6pbkmbreopuin+fbtjz3i6oxs1tg3k7ns5lch9s3aufbitha6sbnu/c86p+knctz6/7py+xz9mjz1nl1ixjewno2uf2yanmwa/2e/2i/2ootgnabb9uu6nwuuaz2wjotffdanq+g==</latexit> <latexit sha1_base64="swbjujk950ux4q9mabdzrzvkgic=">aaacjxicbvfdsxtbfj2s9aoxndg+fpoynfgus9gvrr/aytocffbb0agql+xu5cyznj1dzu6whcx9nx1t/0//twdjbjn4yebwzv2ye0+ckwnj9/9vvkuxyyuray+r669eb9tqm1vxns2nwlzivwpuy7copmy2svj4mxmejfz4e99/k/wbn2istpuvjtime+hr2zmcyffr/e0wkmgvgpmpn/lxpoyi7/ewzyok6g2/6u+cl4jgchpsgufrziw866yit1ctugbtj/azcgswjixccfuut5ibuic+dhzukkani8kky77tmc7vpcy9txzcpq0oilf2lmqumwea2hmtjj/tojn1jsnc <latexit sha1_base64="+5ynezhvzc7ginha9qex+xhlpxg=">aaacihicbvfdsxtbfj1sbavph7e++jiychfk2bxb+icicu2dd4qncsmy3j3cjbdnz5ezu8ww+fv6qj+p/6azmyum6ywbm+fc75swmhyh4e9g8grt9zu36xvnd+8/fnxsbx26cxlpffzurnn7l4jdtqz7tkzxrraiwarxnr0/q/xbn2gd5eyhtwummxgbgpec9lts2i4t <latexit sha1_base64="dox/ktybitgjchwuzwtodyh8jia=">aaacghicbvfdsxtbfj1s1apt/aipvgwgiykkuyk09elaqr98udqqjeu4o7ljls7obmfufspi7+hr/vn+ Controller Design G u xt x t+1 = Ax t + Bu t x K u t = t ( t ) A dynamical system is connected in feedback with a controller that tries to get the closed loop to behave. Actions decided based on observed trajectories t =(u 1,...,u t 1, x 0,...,x t ) A mapping from trajectory to action is called a policy, t ( t ) Optimal control: find policy that minimizes some objective.
<latexit sha1_base64="vs+14vgxeycwqa4/abiirwhhyzg=">aaadgnicbvjnb9naelxnv0n5sohizuvelyooshescfspoia49fbe01bkgmu9gser7q6t3tfkspxpupjhucguxpg3rfmjsmjilmbfe/n2z8zpiyxfmpzlb1euxrt+y+tmz/vw7tt3uzv3tm1egg4jnsvcnkfmghqarihqwnlhgkluwll6cdjwz5/awjhre1wuecs21sitnkgdku5xmsju6iozwxz1jwxdosrn55uswijxgwqys6hioevt6k2dajwq4zjauiuv7kf1xxnymgb/nucgthcpgjgdyuxpa2ohogws5k79okrjpsn+qgfqvndolnehr9fcojiia5w6fpskfvfs7yxdcblkm4napoe1czzs+dgd5lxuojflzu04cgumnr0klse1wvoogl9guxi7vdmfnq6w86zji4dmsjyb92kks/tfioopaxcqdcpmmnada8d/cemss+dxjxrrimh+evfwsoi5azzdjsiar7lwcengulcspmogcxqrxlll6v0ax+mkmpda8hwca6jeorrmqauomnbnv9vbisx5wlqlr83k/rdotqh7r8vuob0cuf9e722i3uki9ffvjqdphle4jn4/7r28alez5t3whnp9l/keeqfeo+/yg3nc3/yj/4x/mvgsfau+bz8upyhf1tz3vil4+rs0rp43</late <latexit sha1_base64="otgopnlc3lpbujxkzhalqk3gehs=">aaacoxicbvhbahsxejw3tzs9oe1jx0rnwqhx7izc8xiibaet5mg9oanyyzkrhdsiwmmrrsvm8u/0a/ra/kx/plrhgdrugobwzlw0z/jksudx/kcv3bp95+69vfv7dx4+evykffd03blvbq6fucze5ubqsy1dkqtwsriiza7wir961+gx39e6afq3wlsyljdvciifukcydm9m4dpij7zrs6q3vouh1/nzta+szw+extfupkpdrn2j+/eq+c5i1qdd1jhidlrpuddcl6hjkhbulmqvptvykklhcn/shvygrmckowa1lojserxwkr8mtmenxoania/yfytqkj1blhnilifmbltryp9pi0+t47swuvkewlwpmnjfyfdgi15ii4luigaqvoa/cjedc4kckxttvr0rfbub1hovptafbrgk5mqhka6pbkmbreopuin+fbtjz3i6oxs1tg3k7ns5lch9s3aufbitha6sbnu/c86p+knctz6/7py+xz9mjz1nl1ixjewno2uf2yanmwa/2e/2i/2ootgnabb9uu6nwuuaz2wjotffdanq+g==</latexit> <latexit sha1_base64="dox/ktybitgjchwuzwtodyh8jia=">aaacghicbvfdsxtbfj1s1apt/aipvgwgiykkuyk09elaqr98udqqjeu4o7ljls7obmfufspi7+hr/vn+ Optimal control h i PT minimize E e t=1 C t(x t, u t ) x G x t u e s.t. x t+1 = f t (x t, u t, e t ) u t = t ( t ) Ct is the cost. If you maximize, it s called a reward. et is a noise process ft is the state-transition function t =(u 1,...,u t 1, x 0,...,x t ) is an observed trajectory t ( t ) is the policy. This is the optimization decision variable.
<latexit sha1_base64="q0lcccsuikrn3vfvuntoazacfvw=">aaacnhicbvhtahnbfj2svwv9su1poqwgmcuydougf4rssy1yoajjc8myzm7ejennz5azo5kw7cp4np61d9k3ctajybivdhpmnpsx9960kmjign42gjtbd+/d336w8/dr4ydpm7vp+ly7w6hhtdtmkmuwpfdqq4esrgodle8lxkbxh2v98gcyk7t6jrmc4pynlrgjztbtsfpvncnxdvtrd/s07rlsop9+e1wl+7dqdgwm0r4kzvbydedgn0g0bc2ytitktxepm81ddgq5znyoordaugqgbzdq7qydhylxazaggyek5wdjct5rrv96jqmjbfxrsofsvxely62d5an3zblo7lpwk//tbg5h7+nsqmihkl4onhksoqb1eggmdhcumw8yn8l/lfijm4yjh+jklxnuavhkj+xukcf1bmusxcka5kklmdoh6q7kt0jk+o0ps8/feij/vz+2ltsnyizqds79pttbhrnfslq+/k3qp+xgytf6+rz1dlxcztz5tl6qnonio3jezsgf6rfofpjf5de5cfadk+bz8gxhgjswmxtkxyl+h5o0zuo=</latexit> <latexit sha1_base64="+5ynezhvzc7ginha9qex+xhlpxg=">aaacihicbvfdsxtbfj1sbavph7e++jiychfk2bxb+icicu2dd4qncsmy3j3cjbdnz5ezu8ww+fv6qj+p/6azmyum6ywbm+fc75swmhyh4e9g8grt9zu36xvnd+8/fnxsbx26cxlpffzurnn7l4jdtqz7tkzxrraiwarxnr0/q/xbn2gd5eyhtwummxgbgpec9lts2i4t Optimal control G u xt x t+1 = F(u t, u t 1, u t 2,...) x K u t = t ( t ) t A dynamical system is connected in feedback with a controller that tries to get the closed loop to behave. Optimal control: find policy that minimizes some objective. Major challenge: how to perform optimal control when the system is unknown? Today: Reinvent RL attempting to answer this question
HVAC / room-temperature control example.
Detailed fluid-dynamics PDE model: ∂_t(ρu) + ∇·(ρu⊗u + pI) = ∇·σ + ρg
Coarse lumped thermal model: M Ṫ = Q + ṁ_s c_p (T_s − T)
sensor → state → action
Identify everything: PDE control, high-performance aerodynamics.
Identify a coarse model: model predictive control.
"We don't need no stinking models!": reinforcement learning, PID control?
We need robust fundamentals to distinguish these approaches.
But PID control works.
[Bode diagram: magnitude (dB) vs. frequency (rad/sec); gain crossover point; log-log slope ≈ -1.5 over one decade; ±6 dB band]
2 parameters suffice for 95% of all control applications.
How much needs to be modeled for more advanced control?
Can we learn to compensate for poor models, changing conditions?
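As a concrete illustration of the two-parameter claim, a minimal PI-control loop on a made-up first-order plant (the gains and plant constants are illustrative, not from the talk):

```python
import numpy as np

def pi_control(setpoint, kp, ki, a=0.9, b=0.1, steps=200):
    """Proportional-integral control of a toy first-order plant y_{t+1} = a*y_t + b*u_t."""
    y, integral = 0.0, 0.0
    trace = []
    for _ in range(steps):
        error = setpoint - y
        integral += error
        u = kp * error + ki * integral   # the two tuned parameters
        y = a * y + b * u                # plant response
        trace.append(y)
    return np.array(trace)

print(pi_control(setpoint=1.0, kp=2.0, ki=0.5)[-3:])   # output settles at the setpoint
```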
<latexit sha1_base64="vs+14vgxeycwqa4/abiirwhhyzg=">aaadgnicbvjnb9naelxnv0n5sohizuvelyooshescfspoia49fbe01bkgmu9gser7q6t3tfkspxpupjhucguxpg3rfmjsmjilmbfe/n2z8zpiyxfmpzlb1euxrt+y+tmz/vw7tt3uzv3tm1egg4jnsvcnkfmghqarihqwnlhgkluwll6cdjwz5/awjhre1wuecs21sitnkgdku5xmsju6iozwxz1jwxdosrn55uswijxgwqys6hioevt6k2dajwq4zjauiuv7kf1xxnymgb/nucgthcpgjgdyuxpa2ohogws5k79okrjpsn+qgfqvndolnehr9fcojiia5w6fpskfvfs7yxdcblkm4napoe1czzs+dgd5lxuojflzu04cgumnr0klse1wvoogl9guxi7vdmfnq6w86zji4dmsjyb92kks/tfioopaxcqdcpmmnada8d/cemss+dxjxrrimh+evfwsoi5azzdjsiar7lwcengulcspmogcxqrxlll6v0ax+mkmpda8hwca6jeorrmqauomnbnv9vbisx5wlqlr83k/rdotqh7r8vuob0cuf9e722i3uki9ffvjqdphle4jn4/7r28alez5t3whnp9l/keeqfeo+/yg3nc3/yj/4x/mvgsfau+bz8upyhf1tz3vil4+rs0rp43</late <latexit sha1_base64="otgopnlc3lpbujxkzhalqk3gehs=">aaacoxicbvhbahsxejw3tzs9oe1jx0rnwqhx7izc8xiibaet5mg9oanyyzkrhdsiwmmrrsvm8u/0a/ra/kx/plrhgdrugobwzlw0z/jksudx/kcv3bp95+69vfv7dx4+evykffd03blvbq6fucze5ubqsy1dkqtwsriiza7wir961+gx39e6afq3wlsyljdvciifukcydm9m4dpij7zrs6q3vouh1/nzta+szw+extfupkpdrn2j+/eq+c5i1qdd1jhidlrpuddcl6hjkhbulmqvptvykklhcn/shvygrmckowa1lojserxwkr8mtmenxoania/yfytqkj1blhnilifmbltryp9pi0+t47swuvkewlwpmnjfyfdgi15ii4luigaqvoa/cjedc4kckxttvr0rfbub1hovptafbrgk5mqhka6pbkmbreopuin+fbtjz3i6oxs1tg3k7ns5lch9s3aufbitha6sbnu/c86p+knctz6/7py+xz9mjz1nl1ixjewno2uf2yanmwa/2e/2i/2ootgnabb9uu6nwuuaz2wjotffdanq+g==</latexit> <latexit sha1_base64="dox/ktybitgjchwuzwtodyh8jia=">aaacghicbvfdsxtbfj1s1apt/aipvgwgiykkuyk09elaqr98udqqjeu4o7ljls7obmfufspi7+hr/vn+ Optimal control h i PT minimize E e t=1 C t(x t, u t ) x G x t u e s.t. x t+1 = f t (x t, u t, e t ) u t = t ( t ) Ct is the cost. If you maximize, it s called a reward. et is a noise process ft is the state-transition function t =(u 1,...,u t 1, x 0,...,x t ) is an observed trajectory t ( t ) is the policy. This is the optimization decision variable.
<latexit sha1_base64="mr94ezqth17vzwjopx3thjsdtck=">aaacg3icbvfbaxnred5ztdz6aaqpvhwmqousdktbfsguffshdxwnlsrlmd2zjeppztlntiqu+sw+6o/y33g2jwasbwy+vm/uu5saaqfp71zy5+69nfu7d/yepnr8zl998prbcjvx2fnoo39vqebnfntm <latexit sha1_base64="ljvoamm0zxufsbpjh2gmnfj3dfs=">aaacl3icbvfdaxnbfj2svwv9avwtvgwgiquju0wwfvccfu1dh1psbcfzl7utm2tofcwzdyvhmx/gr/fvf4r/xtk0bzn4yebwzrn3zr03l5t0fmd/gtgdjbv37m8+2hr46pgtp9s7z756wzqbxwgvdzc5eftsyjckkbwshilofv7kvx9r/ei7oi+toadpgamgkzfdkyaclw03+77uwuxv4tm3c55k1xvrktfelly/72phpbjmfl <latexit sha1_base64="oi5ov9kcoehyn9bwcwwjat8txu4=">aaac4hicbvhlihnbfk1ux2p7yujstwfqrkyyxslmufagfhqxixgnm5buqnx1taeyquqm6ryknp0brsstn+xop3fpdrjlknih4hdouy+6n6uudbjhv4lwytvr12/s3ixu3b5z915v9/5nv9zwwfcuqrtngxegpiehslrwxlngolnwll286fszl2cdlm0nnfeqal4yozgco6fgptubn7jpwvqkjhku0jsz5mjlri0yfuiztzioxgiw+t+xy2rppo02k+ikyqfa85fdto7suttu9enbvai6ddgk9mkqtse7qzrkpag1gbskozdicyvpwy1koacnktpbxcufl2dkoeeaxnos1tlsx57j6as0/hmkc/zyrso1c3odeacfdoo2ty78nzaqcxkuntjunyiry0atwlesabdjmkslatxcay6s9lnsmewwc/sxwouyqf2bwptjm6unfguog6zcgvrusqeouttdr5p3uin6krtht2qxxb+ql9vje29lide9o/hnnk+3zp4gbhp922d4fpbywd686b+/xl <latexit sha1_base64="+ojsm2yurvuosb1y5f0dyh5w9ea=">aaaco3icbvhbbtnaen2ywym3fb55wrgbioscjzbahkcvqikhpbroakxyisbritpqem3tjqsek5/b1/akh8hfse4digkjrfbonllpwmlyhia/osgvq9eu39i5uxvr9p2 Newton s Laws z t+1 = z t + v t v t+1 = v t + a t ma t = u t minimize TX 1 t=0 subject to x t+1 = 1 (xt ) 1 > apple 1 1 0 1 x t + apple 0 1/m u t x t = apple zt v t
<latexit sha1_base64="mr94ezqth17vzwjopx3thjsdtck=">aaacg3icbvfbaxnred5ztdz6aaqpvhwmqousdktbfsguffshdxwnlsrlmd2zjeppztlntiqu+sw+6o/y33g2jwasbwy+vm/uu5saaqfp71zy5+69nfu7d/yepnr8zl998prbcjvx2fnoo39vqebnfntm <latexit sha1_base64="c3vasflhsyp2zenoal07qvhm1sk=">aaaci3icbvfdsxtbfj1sbwttq7e++ncxoaeqoyrdeaqlgvhbffdbulofzf3utm6swznzzeaujcz5nx1tf5d/xtmyqpp0wsdhnpsx9540v9jrgd7ugmcrz1+8xh219vrn2/wn+ua7ny4r <latexit sha1_base64="oi5ov9kcoehyn9bwcwwjat8txu4=">aaac4hicbvhlihnbfk1ux2p7yujstwfqrkyyxslmufagfhqxixgnm5buqnx1taeyquqm6ryknp0brsstn+xop3fpdrjlknih4hdouy+6n6uudbjhv4lwytvr12/s3ixu3b5z915v9/5nv9zwwfcuqrtngxegpiehslrwxlngolnwll286fszl2cdlm0nnfeqal4yozgco6fgptubn7jpwvqkjhku0jsz5mjlri0yfuiztzioxgiw+t+xy2rppo02k+ikyqfa85fdto7suttu9enbvai6ddgk9mkqtse7qzrkpag1gbskozdicyvpwy1koacnktpbxcufl2dkoeeaxnos1tlsx57j6as0/hmkc/zyrso1c3odeacfdoo2ty78nzaqcxkuntjunyiry0atwlesabdjmkslatxcay6s9lnsmewwc/sxwouyqf2bwptjm6unfguog6zcgvrusqeouttdr5p3uin6krtht2qxxb+ql9vje29lide9o/hnnk+3zp4gbhp922d4fpbywd686b+/xl <latexit sha1_base64="+ojsm2yurvuosb1y5f0dyh5w9ea=">aaaco3icbvhbbtnaen2ywym3fb55wrgbioscjzbahkcvqikhpbroakxyisbritpqem3tjqsek5/b1/akh8hfse4digkjrfbonllpwmlyhia/osgvq9eu39i5uxvr9p2 Newton s Laws z t+1 = z t + v t v t+1 = v t + a t ma t = u t minimize TX (x t ) 2 1 t=0 subject to x t+1 = +ru 2 t apple 1 1 0 1 x t + apple 0 1/m u t x t = apple zt v t
<latexit sha1_base64="j4lebcdojzuwdwucyrzfilabtyq=">aaadhhicbvlfb9mwehbcr1eydpdii0ufggxvcsdba5xgamhdhjzot0lnfjmu01qznci+obyr/wqv/co8iv6r+g+wuydrjpoiu3zf3dl3n/nkcanr9dsil12+cvxaxvxojzubt253t+4cmblwli1okup9khpdbfdsbbweo6k0izix7dg/e+p5489mg16qiswqlkoyvbzglicdsu63jgdtrizrmiwak0ttswrezq3kikv+htx4iu4kgvme23dnilgb46tqhnq4scmgj6awmyvb3jwo8tyd08f40hu8g+vl30fve82nm0itpo1u+td3neeudcdu8ac/bov2flrzlinowtskvvnw7ux9agn4yhc3qq+1dpbtbwkykwktmqiqidhjokogde2au8hcmlvhfafnzmrgllremppa5uyb/mahe1yu2n0k8bl9t8isacxc5i7t78ascx78hzeuoxizwq6qgpii5wcvtcbqyi8pnndnkiifcwjv3n0v0xlx6wyn4sopy94voyut2hmtoc0nba0vmadnhggysmkvn8q+50lgt0qzvo/v+cu6tp7efsunhmytffds1kmlyu6qeh39f4ojp/046sehz3u7e600g+geuo+2uyxeof30ar2gealbzvasebumwq/h9/bh+pm8nqzamrtoxcjffwdk/f3k</latexit> <latexit sha1_base64="mr94ezqth17vzwjopx3thjsdtck=">aaacg3icbvfbaxnred5ztdz6aaqpvhwmqousdktbfsguffshdxwnlsrlmd2zjeppztlntiqu+sw+6o/y33g2jwasbwy+vm/uu5saaqfp71zy5+69nfu7d/yepnr8zl998prbcjvx2fnoo39vqebnfntm <latexit sha1_base64="c3vasflhsyp2zenoal07qvhm1sk=">aaaci3icbvfdsxtbfj1sbwttq7e++ncxoaeqoyrdeaqlgvhbffdbulofzf3utm6swznzzeaujcz5nx1tf5d/xtmyqpp0wsdhnpsx9540v9jrgd7ugmcrz1+8xh219vrn2/wn+ua7ny4r <latexit sha1_base64="oi5ov9kcoehyn9bwcwwjat8txu4=">aaac4hicbvhlihnbfk1ux2p7yujstwfqrkyyxslmufagfhqxixgnm5buqnx1taeyquqm6ryknp0brsstn+xop3fpdrjlknih4hdouy+6n6uudbjhv4lwytvr12/s3ixu3b5z915v9/5nv9zwwfcuqrtngxegpiehslrwxlngolnwll286fszl2cdlm0nnfeqal4yozgco6fgptubn7jpwvqkjhku0jsz5mjlri0yfuiztzioxgiw+t+xy2rppo02k+ikyqfa85fdto7suttu9enbvai6ddgk9mkqtse7qzrkpag1gbskozdicyvpwy1koacnktpbxcufl2dkoeeaxnos1tlsx57j6as0/hmkc/zyrso1c3odeacfdoo2ty78nzaqcxkuntjunyiry0atwlesabdjmkslatxcay6s9lnsmewwc/sxwouyqf2bwptjm6unfguog6zcgvrusqeouttdr5p3uin6krtht2qxxb+ql9vje29lide9o/hnnk+3zp4gbhp922d4fpbywd686b+/xl <latexit sha1_base64="+ojsm2yurvuosb1y5f0dyh5w9ea=">aaaco3icbvhbbtnaen2ywym3fb55wrgbioscjzbahkcvqikhpbroakxyisbritpqem3tjqsek5/b1/akh8hfse4digkjrfbonllpwmlyhia/osgvq9eu39i5uxvr9p2 Simplest Example: LQR minimize s.t. E h 1 T P T t=1 x t Qx t + u t Ru t i x t+1 = Ax t + Bu t + e t minimize TX (x t ) 2 1 t=0 subject to x t+1 = x t = +ru 2 t apple 1 1 0 1 x t + apple zt v t apple 0 1/m u t
The Linearization Principle
If a machine learning algorithm does crazy things when restricted to linear models, it's going to do crazy things on complex nonlinear models too.
Would you believe someone had a good SAT solver if it couldn't solve 2-SAT?
This has been a fruitful research direction:
Recurrent neural networks (Hardt, Ma, R. 2016)
Generalization and Margin in Neural Nets (Zhang et al. 2017)
Residual Networks (Hardt and Ma 2017)
Bayesian Optimization (Jamieson et al. 2017)
Adaptive gradient methods (Wilson et al. 2017)
<latexit sha1_base64="j4lebcdojzuwdwucyrzfilabtyq=">aaadhhicbvlfb9mwehbcr1eydpdii0ufggxvcsdba5xgamhdhjzot0lnfjmu01qznci+obyr/wqv/co8iv6r+g+wuydrjpoiu3zf3dl3n/nkcanr9dsil12+cvxaxvxojzubt253t+4cmblwli1okup9khpdbfdsbbweo6k0izix7dg/e+p5489mg16qiswqlkoyvbzglicdsu63jgdtrizrmiwak0ttswrezq3kikv+htx4iu4kgvme23dnilgb46tqhnq4scmgj6awmyvb3jwo8tyd08f40hu8g+vl30fve82nm0itpo1u+td3neeudcdu8ac/bov2flrzlinowtskvvnw7ux9agn4yhc3qq+1dpbtbwkykwktmqiqidhjokogde2au8hcmlvhfafnzmrgllremppa5uyb/mahe1yu2n0k8bl9t8isacxc5i7t78ascx78hzeuoxizwq6qgpii5wcvtcbqyi8pnndnkiifcwjv3n0v0xlx6wyn4sopy94voyut2hmtoc0nba0vmadnhggysmkvn8q+50lgt0qzvo/v+cu6tp7efsunhmytffds1kmlyu6qeh39f4ojp/046sehz3u7e600g+geuo+2uyxeof30ar2gealbzvasebumwq/h9/bh+pm8nqzamrtoxcjffwdk/f3k</latexit> <latexit sha1_base64="mr94ezqth17vzwjopx3thjsdtck=">aaacg3icbvfbaxnred5ztdz6aaqpvhwmqousdktbfsguffshdxwnlsrlmd2zjeppztlntiqu+sw+6o/y33g2jwasbwy+vm/uu5saaqfp71zy5+69nfu7d/yepnr8zl998prbcjvx2fnoo39vqebnfntm <latexit sha1_base64="c3vasflhsyp2zenoal07qvhm1sk=">aaaci3icbvfdsxtbfj1sbwttq7e++ncxoaeqoyrdeaqlgvhbffdbulofzf3utm6swznzzeaujcz5nx1tf5d/xtmyqpp0wsdhnpsx9540v9jrgd7ugmcrz1+8xh219vrn2/wn+ua7ny4r <latexit sha1_base64="oi5ov9kcoehyn9bwcwwjat8txu4=">aaac4hicbvhlihnbfk1ux2p7yujstwfqrkyyxslmufagfhqxixgnm5buqnx1taeyquqm6ryknp0brsstn+xop3fpdrjlknih4hdouy+6n6uudbjhv4lwytvr12/s3ixu3b5z915v9/5nv9zwwfcuqrtngxegpiehslrwxlngolnwll286fszl2cdlm0nnfeqal4yozgco6fgptubn7jpwvqkjhku0jsz5mjlri0yfuiztzioxgiw+t+xy2rppo02k+ikyqfa85fdto7suttu9enbvai6ddgk9mkqtse7qzrkpag1gbskozdicyvpwy1koacnktpbxcufl2dkoeeaxnos1tlsx57j6as0/hmkc/zyrso1c3odeacfdoo2ty78nzaqcxkuntjunyiry0atwlesabdjmkslatxcay6s9lnsmewwc/sxwouyqf2bwptjm6unfguog6zcgvrusqeouttdr5p3uin6krtht2qxxb+ql9vje29lide9o/hnnk+3zp4gbhp922d4fpbywd686b+/xl <latexit sha1_base64="+ojsm2yurvuosb1y5f0dyh5w9ea=">aaaco3icbvhbbtnaen2ywym3fb55wrgbioscjzbahkcvqikhpbroakxyisbritpqem3tjqsek5/b1/akh8hfse4digkjrfbonllpwmlyhia/osgvq9eu39i5uxvr9p2 Simplest Example: LQR minimize s.t. E h 1 T P T t=1 x t Qx t + u t Ru t i x t+1 = Ax t + Bu t + e t minimize TX (x t ) 2 1 t=0 subject to x t+1 = x t = +ru 2 t apple 1 1 0 1 x t + apple zt v t apple 0 1/m u t
<latexit sha1_base64="j4lebcdojzuwdwucyrzfilabtyq=">aaadhhicbvlfb9mwehbcr1eydpdii0ufggxvcsdba5xgamhdhjzot0lnfjmu01qznci+obyr/wqv/co8iv6r+g+wuydrjpoiu3zf3dl3n/nkcanr9dsil12+cvxaxvxojzubt253t+4cmblwli1okup9khpdbfdsbbweo6k0izix7dg/e+p5489mg16qiswqlkoyvbzglicdsu63jgdtrizrmiwak0ttswrezq3kikv+htx4iu4kgvme23dnilgb46tqhnq4scmgj6awmyvb3jwo8tyd08f40hu8g+vl30fve82nm0itpo1u+td3neeudcdu8ac/bov2flrzlinowtskvvnw7ux9agn4yhc3qq+1dpbtbwkykwktmqiqidhjokogde2au8hcmlvhfafnzmrgllremppa5uyb/mahe1yu2n0k8bl9t8isacxc5i7t78ascx78hzeuoxizwq6qgpii5wcvtcbqyi8pnndnkiifcwjv3n0v0xlx6wyn4sopy94voyut2hmtoc0nba0vmadnhggysmkvn8q+50lgt0qzvo/v+cu6tp7efsunhmytffds1kmlyu6qeh39f4ojp/046sehz3u7e600g+geuo+2uyxeof30ar2gealbzvasebumwq/h9/bh+pm8nqzamrtoxcjffwdk/f3k</latexit> Simplest Example: LQR minimize s.t. E h 1 T P T t=1 x t Qx t + u t Ru t i x t+1 = Ax t + Bu t + e t Suppose (A,B) unknown What is the optimal estimation/design scheme? How many samples are needed for near optimal control?
<latexit sha1_base64="vs+14vgxeycwqa4/abiirwhhyzg=">aaadgnicbvjnb9naelxnv0n5sohizuvelyooshescfspoia49fbe01bkgmu9gser7q6t3tfkspxpupjhucguxpg3rfmjsmjilmbfe/n2z8zpiyxfmpzlb1euxrt+y+tmz/vw7tt3uzv3tm1egg4jnsvcnkfmghqarihqwnlhgkluwll6cdjwz5/awjhre1wuecs21sitnkgdku5xmsju6iozwxz1jwxdosrn55uswijxgwqys6hioevt6k2dajwq4zjauiuv7kf1xxnymgb/nucgthcpgjgdyuxpa2ohogws5k79okrjpsn+qgfqvndolnehr9fcojiia5w6fpskfvfs7yxdcblkm4napoe1czzs+dgd5lxuojflzu04cgumnr0klse1wvoogl9guxi7vdmfnq6w86zji4dmsjyb92kks/tfioopaxcqdcpmmnada8d/cemss+dxjxrrimh+evfwsoi5azzdjsiar7lwcengulcspmogcxqrxlll6v0ax+mkmpda8hwca6jeorrmqauomnbnv9vbisx5wlqlr83k/rdotqh7r8vuob0cuf9e722i3uki9ffvjqdphle4jn4/7r28alez5t3whnp9l/keeqfeo+/yg3nc3/yj/4x/mvgsfau+bz8upyhf1tz3vil4+rs0rp43</late Optimal control h i PT minimize E e t=1 C t(x t, u t ) s.t. x t+1 = f t (x t, u t, e t ) x G x t u e u t = t ( t ) generic solutions with known dynamics: Batch Optimization Dynamic Programming
<latexit sha1_base64="vs+14vgxeycwqa4/abiirwhhyzg=">aaadgnicbvjnb9naelxnv0n5sohizuvelyooshescfspoia49fbe01bkgmu9gser7q6t3tfkspxpupjhucguxpg3rfmjsmjilmbfe/n2z8zpiyxfmpzlb1euxrt+y+tmz/vw7tt3uzv3tm1egg4jnsvcnkfmghqarihqwnlhgkluwll6cdjwz5/awjhre1wuecs21sitnkgdku5xmsju6iozwxz1jwxdosrn55uswijxgwqys6hioevt6k2dajwq4zjauiuv7kf1xxnymgb/nucgthcpgjgdyuxpa2ohogws5k79okrjpsn+qgfqvndolnehr9fcojiia5w6fpskfvfs7yxdcblkm4napoe1czzs+dgd5lxuojflzu04cgumnr0klse1wvoogl9guxi7vdmfnq6w86zji4dmsjyb92kks/tfioopaxcqdcpmmnada8d/cemss+dxjxrrimh+evfwsoi5azzdjsiar7lwcengulcspmogcxqrxlll6v0ax+mkmpda8hwca6jeorrmqauomnbnv9vbisx5wlqlr83k/rdotqh7r8vuob0cuf9e722i3uki9ffvjqdphle4jn4/7r28alez5t3whnp9l/keeqfeo+/yg3nc3/yj/4x/mvgsfau+bz8upyhf1tz3vil4+rs0rp43</late RL Methods x G x t u e h i PT minimize E e t=1 C t(x t, u t ) approximate dynamic programming s.t. x t+1 = f t (x t, u t, e t ) u t = t ( t ) model-based direct policy search How to solve optimal control when the model f is unknown? Model-based: fit model from data Model-free - Approximate dynamic programming: estimate cost from data - Direct policy search: search for actions from data
<latexit sha1_base64="wtso4cvxqkkm5sp9ikogdmu8yxs=">aaadghicbvjnb9qwehxcv7t8behixwjftrxvnkfileoliolg0emr3bbsoksod7jr1xeie4j2ifjhupjhucgu3pg3ontuynczydl4vednz4ytqkmlqfdh82/cvhx7zszm5+69+w8edrcendm8nakgile5uui4bsu1dfgigovcam8sbefj5vhdn38by2wut3feqjtxizapfbwdfhe/swqmulfcgd6vk6xqdsusffzlustmfowablowczwmsfwujoepshhebjnffr6e9edtehrjfxbjbhnjdjnymswisdbgdndqwmyc+nly0woaxmt3odgzzjz1g0ewqjojhrzx6tdq4/zvcbcxdijf0pukbjmeaemk3viins5fmyfgobi1ozaomhj2kiucv2jpoedikk9g5flnm7brtehmtz85zezt3lilks7qf09upln2nivo2ftfrnin+d9uvgk6h1vsfywcflcxpawimnnmnhqsdqhuc5dwyar7kxvtbrhan8clwxbebyilsqpzqaxix7cckpyh4q60gbmxuqmqei+vop+4tvs4gdg162wbuv9wtita3wp3s/tomtgnjfxt/3py9miqbopw48ve4zt2nbvkcxlk+iqkr8gh+uboyjaib9pb8/a91/43/4f/0/91jfw99sxjsht+77+2e/1q</late <latexit sha1_base64="qv2lanekunbubcf2z1ek6m/o+og=">aaacnxicbvhbahsxejw3tzs9oe1jhypqcg4jzrcuksfqc+2dksmt44b3wwblss2ilyq0g2ww/0k/pq/tf/rvqnvcqo0oca7nzeuzp7bkeorj363o1u07d+/t3d9/8pdr4yftg6cx3lro4eayzdxlar6v1dggsqovrumoc4xd4updow+v0xlp9ddawmxkmgo5kqiouhm7o89rokqwpavrnznz9bqcncna03gv0ye/4qkoig934l68cr4lkjxoshwc5wetlb0buzwossjwfptelriahemhclmfvh4ticuy4ihadsx6rf6ttosvajpme+pc08rx7l8vnztel8oizjzam7+tnet/tfffk9osltpwhfrcdjpuipphzx34wdoupbybghay/jwlgtgqfk64mwxv26ly2ksev1okm8ytvtgchatsi5ugdbnv/veqxb <latexit sha1_base64="z65z1fewznivdrnx0njfmyzk9iw=">aaaczhicbvfnb9naen2yrxk+0nlksiicpakn7aqjxooqgqshqiqcnjvi15psnvaq67w1o64sbxzlx/fdohof/8a6nyikjlts2/fezozojaspdpr+95z36/adu/e27rcfphz0+elne+fc5kvmfmbymeulmrguheidfcj5rae5zgpjh+ord7u+vobaifx9wxnbowwsjaacatoq7gzdfncg16clvft0iiagkzatkv5lvqshkbpy4pffxdrt/acii8xm3v85te8bx28w414z4+5icxnqjjtdv+8vg26coafd0srzvn2kwknoyowrzbkmgqv+gzefjyjjxrxd0vac2bukfosggoybyc4nunexjpnqaa7duuix7l8zfjjj5tnyotpa1kxrnfk/bvti9dcyqhulcsvugk1lstgn9tjprgjoum4dakafeytlkwhg6ia+0mvzu+bs5sd2virb8glfyyxouimjdccmhkp/zt8ikelnuiaeictfp6orw8u99yirapzo3gbv7obzlsryh/8mod/ob34/+ps6e/y2wc0weuaekx4jybtytd6smzigjhwjp8hp8ss79dczxnvj9vpnzloyet7x35/24uw=</latexit> <latexit sha1_base64="7wws5p4/hlk3102z185pb4izzt8=">aaadjnicbvjnbxmxepuuxyv8pxdgwmuiokrvarwlkoasqajaofrqrnnwipev13e2vm3vyp6telb7f7jyr7ghxi2fgjdzeekyydlovzlnzzynhrqwwvcn51+7fupmra3bntt3791/0n1+egbz0ja+zlnmzuvklzdc8yeikpyimjyqvplz9pkw4c+vulei16cwl3isakbfrdakdkq6x0nkm6eragyd15wudyeonj9vsmihxgde4x1mfivpmlzv64tkimeusd6bebglsioyrpwnu3yyqh+wwh6zwc4xiptcteirzqmigp2zq96lajza5iqayir+dua9vfrowhxtyicnsch6bghddwjx4/ansbcxbuei8gystukptxgsbhsxgeesvfwdk9taurqweds5eexyn3bpeuhzjc34ykwakm7jarhbgj9zybhpcuoobrxa/+2oqlj2rljx2wzjrnmn+d9uvmlkvvwjxztanvtencklhhw3rugxmjybnluemipcwzgbukmzodtxbllof5yttflnsi1ypuzrqiqzgopay0frozupqimhjf5itcxhjxn/wcfb0p03ihng94/dn9g7g8xokgh9/zvj2fmgcopow4vewevwmi30bd1ffrshl+gavucnaiiy99gbeo+8i/+l/83/7v9ylvpe2/miryt/6zcbuwox</latexit> h i PT minimize E e t=1 C t(x t, u t ) s.t. x t+1 = f(x t, u t, e t ) u t = t ( t ) Model-based RL Collect some simulation data. Should have x t+1 '(x t, u t )+ t Fit dynamics with supervised learning: ˆ' = arg min ' NX 1 t=0 x t+1 '(x t, u t ) 2 Solve approximate problem: h i PT minimize E! t=1 C t(x t, u t ) s.t. x t+1 = '(x t, u t )+! t u t = ( t )
<latexit sha1_base64="euyqlm8oqoqnwpvqldlbjjbjanm=">aaadnxicbvjlbxmxepyurxietehixskikhrfuwgjlpvkacehhxastfk8jbyon7fqe1f2lcry+7u48jc4cenc+qt400uicsnzm/7m5znpasgfhsj6horxrl67fmprzuvw7tt3t9s794y2lw3ja5bl3jyl1hipnb+aamnpcsopsiu/ts9e1/7tt9xykes+laqekdrvihomgofg7w8k5vohhtwglionzduiks3ntgktlpjck7ylirrq7preiokmfgt+mqidwaiiisistd3bikiewyhkhjixv65fywjlnwqhcxxex/mxnd/bj7xg+7hc3j7u+rjmqkjt1nahw7ec+9t9umih+fwtdfshe83h0cjct5onj9udqbstbw8acwn0ucph450gizoclypryjjao4qjahjfdgst3m9fwl5qdkgnforntrw3ivuuuskppdlbww780ycx6l8zjiprfyr1kfvu7lqvbv/ng5wqvuyc0eujxlplrlkpmes45g1phoem5miblbnh34rzjpp1g2d3pcuydshzyiruxmrb8glfqyxmwvapwg6kelb9vo6dkbj/pnrixs3ox68vw7v33oipapu057+qfrwr7amj19e/aqyfdeoog5887xwendrsoqfoidpdmxqbdtf7diwgiaw7qs8ybmpwa/gj/bn+ugwngybnplqr8pcfxx4jrw==</latexit> <latexit sha1_base64="uo+1skyh9ob/xn5keqmafdwhdva=">aaac03icbvfdixmxfe3hr3x92k4++hisqotsokxqfvtu0ieck253f5pustn3pmgtzgxyr7beerff/vn+ch+dr/pupq1gwy8edufcj9xzp4wsdnu9h43o2vubn2/t3n69c/fe/b3m/omtl5dwwfdkkrdnu+5asqndlkjgrlda9vtb6ftida2ffglrzg6ocv7awppmyfqkjogancfhlinlylcqbpz7iilisc1sy4vntmaan/dpo3ladcrpvoib8ilnmupmaq+lqao2gyxp0wfqoklyc96vklmym2fn0mz1ur1f0g0qr0clrojost8ysyqxpqadqnhnrngvwlhnfqvquo2y0khbxqxpybsg4rrc2c+cqoitwcq0zw14bumc/bfcc+3cxe9dzr2d29rq8n/aqmt05dhlu5qiriwhpawimnpavppicwlvpaaurax/pwlgg4syzf+bsuhdgfjbx Simple Example: LQR minimize s.t. lim T!1 E h Obvious strategy : Estimate (A,B), build control. 1 T x t+1 = Ax t + Bu t + e t P T t=1 x t Qx t + u t Ru t i Gaussian noise Run an experiment for T steps with random input. Then minimize (A,B) P T i=1 kx i+1 Ax i Bu i k 2 If T Õ 2 (d + p) min( c ) 2 where c = A c A + BB controllability Gramian then A Â B ˆB and w.h.p. [Dean, Mania, Matni, R.,Tu, 2017] [Mania, R., Simchowitz, Tu, 2018]
<latexit sha1_base64="vs+14vgxeycwqa4/abiirwhhyzg=">aaadgnicbvjnb9naelxnv0n5sohizuvelyooshescfspoia49fbe01bkgmu9gser7q6t3tfkspxpupjhucguxpg3rfmjsmjilmbfe/n2z8zpiyxfmpzlb1euxrt+y+tmz/vw7tt3uzv3tm1egg4jnsvcnkfmghqarihqwnlhgkluwll6cdjwz5/awjhre1wuecs21sitnkgdku5xmsju6iozwxz1jwxdosrn55uswijxgwqys6hioevt6k2dajwq4zjauiuv7kf1xxnymgb/nucgthcpgjgdyuxpa2ohogws5k79okrjpsn+qgfqvndolnehr9fcojiia5w6fpskfvfs7yxdcblkm4napoe1czzs+dgd5lxuojflzu04cgumnr0klse1wvoogl9guxi7vdmfnq6w86zji4dmsjyb92kks/tfioopaxcqdcpmmnada8d/cemss+dxjxrrimh+evfwsoi5azzdjsiar7lwcengulcspmogcxqrxlll6v0ax+mkmpda8hwca6jeorrmqauomnbnv9vbisx5wlqlr83k/rdotqh7r8vuob0cuf9e722i3uki9ffvjqdphle4jn4/7r28alez5t3whnp9l/keeqfeo+/yg3nc3/yj/4x/mvgsfau+bz8upyhf1tz3vil4+rs0rp43</late h i PT minimize E e t=1 C t(x t, u t ) s.t. x t+1 = f t (x t, u t, e t ) u t = t ( t ) Approximate Dynamic Programming
<latexit sha1_base64="rli7jlp75guz6h4r9bivhozu6ym=">aaadmhicbvjnj9mwehxc11k+undkyqhytdqqshasxcqtwnby2enxtlsr1svyxke11nyie4jsovcjupjh4is48itwukgwlsnfgr838+yzlzitwkiqfpf8a9dv3ly1c7t15+69+w/auw9pbzobxicslak5j6nlumg+aqgsn2eguxvlfhzfhnb82udurej1gfyznym60cirjikdovzxevof0cu1hq6qusqqrvscfquswijxivd4dxnfyrnh5dsq4ktybkbe5ioqyrhwh8b4mijueue/j6ch990xccdyvb9wpwleygkzqhpvo4bbreh4cdwe4urvc587gf7nigqhhevyww5zfsqtroyarfvbhot589qo3qkgwtrwdhi2sqc1myp2vrmzpyxxxaot1nppggqwc3igmoru9nzyjliluubtl2qquj2v6y1x+jld5jhjjfs04dx6b0djlburfbvkel92k6vb/3hthjjxs1lolaeu2evfss4xpli2dm+f4qzkyiwugeheitmsgsragxvllrv2xtmvscoi14klc76bsijauadadookxu9vhgkp8xuqlt6unfvdotma7r4rcwg2f+z+ht3bknaghjvr305onw/cybcevogcvg6s2ugp0vpursf6iq7qozrce8s8j96rn/jo/c/+n/+h//oy1peankfosvi/fgnhzqxm</latexit> <latexit sha1_base64="vmo6japptd64iivb8nxgxouc0de=">aaackhicbvfdsxtbfj1srdw02qipvgwnhyswsfsk9avovwgpplhqohcx5e7kbji4o7vm3jwejq/9nx1tf47/prnjcibxwsdhnpsx9544v9ks7z/uvgcbzzdfbg3xx77a2x3d2nvv2awwarsiu5m5icgikhq7jenhtw4q0ljhdxx3vunx92iszpqvtximuxhqmugb5kiocbjdi8qrd8g0nw7zz/wsslrjodgur42m3/fnwddbsabntojlak8w3g4yuasossiwth/4oyulgjjc4br+w1jmqdzbepsoakjrhuvsiyl/65gbtzljniy+yx9xljbao0ljl5kcjeyqvpfpaf2ckqowldovclwyd0okxsnj1un4qboupcyogdds/zwlergq5a63ngxwo0extek <latexit sha1_base64="q+v5fer10yz4ggdyxoob2qde5v8=">aaacw3icbvhbattaef2rtzs9oe1jx5aagksmkukheqmepqf9ccwltrowhvitr/ai1ursjornor/qz7sv7yd0zbtq2x1yojxzzmznjq2kmbigpzrenbv37j84ehj46pgtp8+6r89hpqw1hyevzalvu2zacgvdfcjhttlailtctzpftprnn9bgloorliuiczztihocoaos7qdryvpgxwt0je4korjbn/qiyf1fvw7osemyztpuxjajhwyiicmxbxooo8bp1r4+bmfei9kc46tbcwfhkug+idagrzzxnrx14sm05hubcrlkxoyjsmlymo2cs2goj7wbivgczwdsogifmniubm/oa8dmavzq9xtsfftvhmwfmcsidc52drortet/thgn2wlshapqbmxxjbjauixpu0u6fro4yqudjgvh/kr5ngng0e16q8uqdgv8axk7qjxg5rr2wikl1myrbrbgqrvt2q9csvqfkuov2h3/vv3zvvbfi5la079yb1xbntkdjnpd/z4ynqyicbb9ftm7f7c5zqf5sv4rn0tkltknh8k1grjovpof5bf57v16uac9xfu9zibnbdkkr/kdepdecg==</latexit> <latexit sha1_base64="pkddr1xni4717umkasm7i9gzej0=">aaac0hicbvhlattafb2rryr9oe0ym6gmybfjpfjon4xqtlsllnkhnyakxgh8jq8ajctmvberontbv+pn9au6bf+gi8eb2u6fgcm59zh33ksswqdn/ew5n27eun1nb//g7r37dx72dx9ntvlrdhneyljfjsyafaomkfdczawbfymeiyq/7fsll6cnknvnxfyqfsxtihwcoaxifhbwis6hibi6zl36iozmz2ehvnzult210ilor7vlj2lymjwnsfo2jrtoqwkpbnqan/mx3w7t68wrug6ortbhko4pvlg3crol/duykhwcx4e9kjyvvc5aizfmmmd3kowaplfwce1bwbuogm9zbogfihvgomblqkufwmzg01lbp5cu2h8rglyysywsm9ltyra1jvyfftsyvowaoaoaqfgrqwktkza0s5tohaaocmkb41ryv1i+z5pxtmzvtfn1robvbnisaiv4oymtvuicnbokasyyun1wztshjf3elkfnncfxqm3bycm3ihnormf2usrdsbyh8bft3wxtz2pfg/sfng9oxq9ps0eoybmyjd55qu7ie3jojosth+qx+u3+ob+dhfpv+xav6vtwny/jrjjf/wjmcoov</latexit> <latexit sha1_base64="bgz4tmjmns5hqxzoe92twkwzkl8=">aaacghicbvfbsxtbfj6s9rzv0t72zwgqepc4k4iickkf9sghljooxcwcnzwkq2zn15mzkrdkd/ja/qz+m87gfjqkbw58fn+5nyhv0plv/y55ax/wnza3tss7u3v7b5xdo7znmiowjrkvmkcilcq Dynamic Programming h i PT minimize E e t=1 C t(x t, u t )+C f (x T+1 ) s.t. Terminal value: V T+1 (x) =C f (x T+1 ) Recursive formula (recurse backwards): V k (x) =min k ( k ) = arg min u Optimal Policy: x t+1 = f t (x t, u t, e t ), x 1 = x u t = t ( t ) u =: V 1 (x) C k (x, u)+e e [V k+1 (f k (x, u, e))] Value function Cost-to-go C k (x k, u)+e e [V k+1 (f k (x k, u, e))]
<latexit sha1_base64="djbuqd6sqjxgss29bfvvuzlzhwe=">aaadjhicbvjnb9naef2brxk+uhanlisiukeq2qgjjfsplfzw6cgfpk0up9z6m0lw3v1bu2pkypnvcowpcemcupbbwkdgigkjwfvmzcybnr0nmrqwg+cx51+6foxqty3rrrs3b92+0968e2zt3hay8fsm5jrhfqtqmecbek4za0wlek6s83d1/oqzgcts3cd5bipfplpmbgfoqlj9lupgkntjjghzqpsyakuqsytscs2u+aivfuwjxxcwjovbfumy4dcyuypl3amrs7l/pkxoeepzu3pun3sb5gvvy306r4j7zuvf/rpfrkxnoiqipovtyrfuudi5bse0q982intnoctyikcpmxvg7u7qdrzg10hyga5prbdveqnonpjcguyumbxdmmhw5orqcalu3nxcxvg5m8lqqc0u2fg5enmkpnlmme5s4z6ndmh+w1eyze1cjs6zfio7gqvj/8wgou5ej0qhsxxb84tgk1xstgm9jjowbjjkuqomg+husvmmgcbrlxopy0i7a740svnkwvb0dcusxainc6qfvezoeqryvzcsfmla0sn6o3+jtryob+2lqud77nd9mfrjwrjbslj6/ovg+eu3dlrh0cvo7l6zmg3ygdwkwyqkr8gu+ub6zec4d9974+17b/5x/7v/w/95kep7tc09smt+7z8pdp9v</latexit> <latexit sha1_base64="dy8oi6n3bn2d1dkwloypctkty8u=">aaaea3iclvplbtnafhushiw8wtjb5opallqlsisk2bsvl2drrvkstflgicatitpqegznjfgikzcs+rj2ic0fwn/waywtb5gedspzpj733jfvdzbwprtr/iyvk1euxru+c6n689bto3d39+71vzxkqnsk5rg8clcinana00xzepfiiqoa0/pg8k1up/9epwkx6op5qv0ih4jngmhauqo90pf+yhsfelljtnw2ox4mkgjiznilk9lxrkdbyn5zyoled6aqd/c7byidsfccnk1w9qcgp+ra2n0mkiycah+hav14m35aqyzmynninstw0reqfiik8r/rcl+6kakzlb8o1mefhoytdmdrxra+tfurk3ong7ghpepuk7eesowgzf9vel6eymgeh0ea4orkrgmpcernv2ynm7k52q25lxdxybt4bag5xwnbafpohjm0okitjpuaeg6ifyolzottripsrrnmlnfibxbm2zrvfoutwrpljgess3sjdqv2bw+di6xmuwcv+qqotvto/ss2spxkhw+ysfjnbvkmmqqcdaz5fskysuo0n1uaiws2vibtldhrdlfxsixij5ssdwjmqwakhtmnluuzltisiuoim5f3zd4zzuejfgpo85mtrdzsbm68zsht6vdu/hciusw2a/e2p/826b+1plfldz7vtl4wo9lxhjqpnibjoc+de+ed03z6din9kj8o18qpk58rxyvfkt+x0nkp8lnvrj3kj9+u9u7g</latexit> <latexit sha1_base64="adb8qmdtu4tflqnxv6h14vnn31g=">aaacnxicbvfdaxnbfj2svwv9svxrhw4girunuyk0l0lrsvsqssr8fnj1utu5sybozi4zdyvhyv/w1/rv/4f/prpjfkzihyhdofdj7j1xpqql3/9b8+7t3h/wcpfr3umnt589r++/6ns0nwj7ilwpuyzbopiaeyrj4wvmejjy4sc+/llqg59orex1l+yzhglmtbxlaesoqn7sr93mlcq674pfif/ek/jjle+s0b0t1rt+y18g3wzbbrqsik60xwuvrqnie9qkffg7dpymwgimsafwsxevw8xaxmmehw5qsncgxxklbx/jmbefp8y9txzj/ltrqgltpildzgi0tztasf5pg+y0pg4lqbocuivvohguokw8va8fsyoc1nwbeea6v3ixbqoc3bxxpix7zyjwnilmuzyiheegq2hgbhxpkrkqutyqojnk8e+glw/lyztuvne2ljunciljvms7q/thvrizjng8/zbof2gffiv49rfx8rmyzpe9yq9zkwxsij2wc9zhpsbyl3bdfrm/3oh31wt7f6tur1b <latexit sha1_base64="xgebh7uy6gt6q+hya5tae+6qm3i=">aaacjxicbvfdaxnbfj2srdba2krfcr4mbiepenalpx2ouqrqpvqhypig0nw5o7ljhszoljn3jwgpv8zx/t/+g2ftccbxwjchc+73jtmllfn+74r3agv78zodp7vp9vafh1rrl3o2zy3arkhvavoxwfrsy5ckkexnbigjfd7g04+lfvsnjzwp7ta8wzcbszyjkyacfvupe1gnmys6tf6eu+/rew9hnrjf1brf8hfgn0gwbhw2thzuq4r3w1tkcwoscqwdbh5gyqggpfb4v3uxw8xatggmawc1jgjdyjhcpx/jmcefpcy9txzb/htrqgltpimdzwi Simplest Example: LQR minimize s.t. E h PT 1 t=1 x t Qx t + u t Ru t + x T P Tx T i x t+1 = Ax t + Bu t + e t Dynamic Programming: V T 1 (x T 1 )=mine x T u T 1 1Qx T 1 + u T 1Ru T 1 + V T (x T ) =min u T 1 apple xt 1 u T 1 V T (x T )=x T P T x T apple Q 0 0 R + A B PT A B apple xt 1 u T 1 + 2 Tr(P T ) u T 1 = (B P T B + R) 1 B P T Ax T =: K T 1 x T 1 V T (x T 1 )=x T 1P T 1 x T 1 P T 1 = Q + A P T A A P T B (R + B P T B) 1 B P T A
<latexit sha1_base64="euyqlm8oqoqnwpvqldlbjjbjanm=">aaadnxicbvjlbxmxepyurxietehixskikhrfuwgjlpvkacehhxastfk8jbyon7fqe1f2lcry+7u48jc4cenc+qt400uicsnzm/7m5znpasgfhsj6horxrl67fmprzuvw7tt3t9s794y2lw3ja5bl3jyl1hipnb+aamnpcsopsiu/ts9e1/7tt9xykes+laqekdrvihomgofg7w8k5vohhtwglionzduiks3ntgktlpjck7ylirrq7preiokmfgt+mqidwaiiisistd3bikiewyhkhjixv65fywjlnwqhcxxex/mxnd/bj7xg+7hc3j7u+rjmqkjt1nahw7ec+9t9umih+fwtdfshe83h0cjct5onj9udqbstbw8acwn0ucph450gizoclypryjjao4qjahjfdgst3m9fwl5qdkgnforntrw3ivuuuskppdlbww780ycx6l8zjiprfyr1kfvu7lqvbv/ng5wqvuyc0eujxlplrlkpmes45g1phoem5miblbnh34rzjpp1g2d3pcuydshzyiruxmrb8glfqyxmwvapwg6kelb9vo6dkbj/pnrixs3ox68vw7v33oipapu057+qfrwr7amj19e/aqyfdeoog5887xwendrsoqfoidpdmxqbdtf7diwgiaw7qs8ybmpwa/gj/bn+ugwngybnplqr8pcfxx4jrw==</latexit> Simplest Example: LQR minimize s.t. lim T!1 E h 1 T x t+1 = Ax t + Bu t + e t P T t=1 x t Qx t + u t Ru t i When (A,B) known, optimal to build control ut = Kxt u t = (B PB + R) 1 B PAx t =: Kx t P = Q + A PA A PB (R + B PB) 1 B PA Discrete Algebraic Riccati Equation Dynamic programming has simple form because quadratics are miraculous. Solution is independent of noise variance. For finite time horizons, we could solve this with a variety of batch solvers. Note that the solution is only time invariant on the infinite time horizon.
<latexit sha1_base64="vs+14vgxeycwqa4/abiirwhhyzg=">aaadgnicbvjnb9naelxnv0n5sohizuvelyooshescfspoia49fbe01bkgmu9gser7q6t3tfkspxpupjhucguxpg3rfmjsmjilmbfe/n2z8zpiyxfmpzlb1euxrt+y+tmz/vw7tt3uzv3tm1egg4jnsvcnkfmghqarihqwnlhgkluwll6cdjwz5/awjhre1wuecs21sitnkgdku5xmsju6iozwxz1jwxdosrn55uswijxgwqys6hioevt6k2dajwq4zjauiuv7kf1xxnymgb/nucgthcpgjgdyuxpa2ohogws5k79okrjpsn+qgfqvndolnehr9fcojiia5w6fpskfvfs7yxdcblkm4napoe1czzs+dgd5lxuojflzu04cgumnr0klse1wvoogl9guxi7vdmfnq6w86zji4dmsjyb92kks/tfioopaxcqdcpmmnada8d/cemss+dxjxrrimh+evfwsoi5azzdjsiar7lwcengulcspmogcxqrxlll6v0ax+mkmpda8hwca6jeorrmqauomnbnv9vbisx5wlqlr83k/rdotqh7r8vuob0cuf9e722i3uki9ffvjqdphle4jn4/7r28alez5t3whnp9l/keeqfeo+/yg3nc3/yj/4x/mvgsfau+bz8upyhf1tz3vil4+rs0rp43</late <latexit sha1_base64="q+v5fer10yz4ggdyxoob2qde5v8=">aaacw3icbvhbattaef2rtzs9oe1jx5aagksmkukheqmepqf9ccwltrowhvitr/ai1ursjornor/qz7sv7yd0zbtq2x1yojxzzmznjq2kmbigpzrenbv37j84ehj46pgtp8+6r89hpqw1hyevzalvu2zacgvdfcjhttlailtctzpftprnn9bgloorliuiczztihocoaos7qdryvpgxwt0je4korjbn/qiyf1fvw7osemyztpuxjajhwyiicmxbxooo8bp1r4+bmfei9kc46tbcwfhkug+idagrzzxnrx14sm05hubcrlkxoyjsmlymo2cs2goj7wbivgczwdsogifmniubm/oa8dmavzq9xtsfftvhmwfmcsidc52drortet/thgn2wlshapqbmxxjbjauixpu0u6fro4yqudjgvh/kr5ngng0e16q8uqdgv8axk7qjxg5rr2wikl1myrbrbgqrvt2q9csvqfkuov2h3/vv3zvvbfi5la079yb1xbntkdjnpd/z4ynqyicbb9ftm7f7c5zqf5sv4rn0tkltknh8k1grjovpof5bf57v16uac9xfu9zibnbdkkr/kdepdecg==</latexit> <latexit sha1_base64="pkddr1xni4717umkasm7i9gzej0=">aaac0hicbvhlattafb2rryr9oe0ym6gmybfjpfjon4xqtlsllnkhnyakxgh8jq8ajctmvberontbv+pn9au6bf+gi8eb2u6fgcm59zh33ksswqdn/ew5n27eun1nb//g7r37dx72dx9ntvlrdhneyljfjsyafaomkfdczawbfymeiyq/7fsll6cnknvnxfyqfsxtihwcoaxifhbwis6hibi6zl36iozmz2ehvnzult210ilor7vlj2lymjwnsfo2jrtoqwkpbnqan/mx3w7t68wrug6ortbhko4pvlg3crol/duykhwcx4e9kjyvvc5aizfmmmd3kowaplfwce1bwbuogm9zbogfihvgomblqkufwmzg01lbp5cu2h8rglyysywsm9ltyra1jvyfftsyvowaoaoaqfgrqwktkza0s5tohaaocmkb41ryv1i+z5pxtmzvtfn1robvbnisaiv4oymtvuicnbokasyyun1wztshjf3elkfnncfxqm3bycm3ihnormf2usrdsbyh8bft3wxtz2pfg/sfng9oxq9ps0eoybmyjd55qu7ie3jojosth+qx+u3+ob+dhfpv+xav6vtwny/jrjjf/wjmcoov</latexit> h i PT minimize E e t=1 C t(x t, u t ) s.t. x t+1 = f t (x t, u t, e t ) u t = t ( t ) Approximate Dynamic Programming Recursive formula: V k (x) =min u C k (x, u)+e e [V k+1 (f k (x, u, e))] Optimal Policy: k ( k ) = arg min u C k (x k, u)+e e [V k+1 (f k (x k, u, e))]
<latexit sha1_base64="c96nrog7aifhrw62vbldao2qo+c=">aaadihicbvjdb9mwfhxc1ygfa+grf4ukqrntlsck8vkpmba87geiuk2qs8hxndsa7ut2dwqj8md45y/whniex4ptzrjtuvkk63pppfa9j0khhyug+o35n27eun1n527n3v0hd3e7vuenni8n4xowy9ycj9ryktsfgadjzwvdquokp0suj5r62rdurmj1z1gwpfi00yivjikd4u53kvbm6ioaq5d1jwxdisrjf5uswijxldd4dxnfyz4k1bs65ktyfkbeliquybtwf0tofjayzfqpegh4alci4acmyz8ykc0hiqsvtemynnil1/k8rpeip9fca97wswcpu8oifgjagdyahcl1rh1d3o0hw2avedsj26sp2jije15ezjkrfdfajlv2ggyfre4objpcjvpaxlb2stm+dammituowm21xs8cmsnpbtynaa/qfzsqqqxdqsqxm/3yzvod/q82lsf9fvvcfyvwza4uskujicenrxgmdgcgly6hzaj3vszm1fagzsi1w1babwdrk1sluguwz/ggkmebhjrqclduueamqt4lkfenqi0+bsy6rjrzpjx4kzib9udy/s16f4vsdak317+dnl4yhsew/piyp37twrodnqcnaibcdijg6am6qrpevj536i291/43/4f/0/91rfw9tucxwgv/z1+kywfm</latexit> <latexit sha1_base64="w0y04clrch/hfcnyey3rxb6ejpw=">aaacynicbvfnixnbeo2mx+v6ldwjl8ygjbjcjah6erzx0umok5rsqmyyejo1k2a7e4bugklo5ua/8pd49kp/wp4kikksahi896qqqyqrplayht87wbxrn27eorp9fofuvfspuicpp7asdycjl2vpljnmqqonexqo4biywfqm4sk7omv1iy9grcj1z1xvkchwajelztbtaxcyteockcx6ywf9twmldorqhp71l8n6qj/rjeofhossc++a1eets8hxrv/m5q17cinbbesxwctt9sjrua56ckit6jftnkcnnssel7xwojflzu0scitmhdmouitmok4tvixfsqjmhmqmwczupx9dn3pmtvps+kerrtl/mxxt1q5u5p3tghzfa8n/abma81eje7qqettfnmprsbgk7tlpxbjgkfcemg6e/yvlc2yyr7/yns7r2hxwnuncstacl3pyyyuu0tbpwkdfhg6ncu+flpqt05ao2x3/ux3zvu6/fyvaoxz7u+rbgdkfjnpf/ygyph9f4sj6+kj3+mz7mipymdwhfrkrl+sufcdnzei4+uz+kj/kvzaotlak3myadly5j8hobf9/a3dt4nc=</latexit> <latexit sha1_base64="gpskemdf6g9ld6tv+mg9j/edltg=">aaacyhicbvfdi9nafj3gr3x96uqjl4nfslcurar9erzxuwqfvtz2f5oqjtobdnizszi50zaqf/+vp8unx/vfogkr2nyla4dzzr137r1zjyxfmpzr865dv3hz1shtwzt3791/0d96olflbtimesllc5kxc1jogknaczevaayycrfz1umnx3wby0wpp+oygksxqotcciaosvvncsx8rubf05izilzcp03d0hn/mawd+ozgbvokxyrhpmuad23aqbtlyhfkj+la9ppopiqgii0o5pik/ue4cldb90g0aqoyibp0qjfes5lxcjryyaydrmgfscmmci6hpyxrcxxjv6yaqyoakbbjs5q+pu8dm6n5adztsffsvxknu9yuveac3rh2v+vi/2ntgvnxssn0vsnovm6u15jisbtv0pkwwfeuhwdccpdxyufmmi5u4vtdvrur4futnitac17oyievueddhgkbfro6m6p5l6sk50xbetrt+k/qynay/1yuau3w1f1vb3tmd5bod/37ypj8fiwj6nolwfgbzwkoygpyhpgkii/jmflazsiycpkd/cs/yg/vo1d5x73l2ur1njmpyfz43/4ai/xgwg==</latexit> <latexit sha1_base64="/+atgfozwlrx7ggx96wxxxb/mzq=">aaacuhicbvhbahsxejw3l6tpzwkf+yjqcg4jzrcugvou6kl7kieu1k7axrazsrww1g1pttgs/qb+tv+bv6l27ujsd0bwdm6zkwymt1j4jopbvntv/oohb4epjh4/efrsefv4xdcb0je+yeyad5od51jopkcbkt9yx0hlkl/n836tx//kzgujv+ps8lrbocvummbaze3+mbsxobr0f9n8hi7bwmcwtf9fz8qaoqvra73jroanyaqrdpbk7u7ci5ug+ydzga7zxfv23erhe8nkxtuycd6pkthiwofdwsrfhy1lzy2worr8fkagxx1and2u6jvatojuuha00oa9m1gb8n6p8ubugdo/q9xk/7rridpztblalsg1wz80lsvfq+vr0ylwnkfcbgdmifbxymbgggey8nyrtw3l2vyn1alugpkj32elltbbid1hbulxxvwfhzt0g2hpl0uxw39qkfvl3u+ieojplsmw9cmeoswk2r3/phi+7svxl/n6rnpxcboaq/kkvczdkpd35ij8ivdkqbj5rx6tp+q2+hd9iipirk1ra5pzkmxf5p4cvujzow==</latexit> P 1 minimize E e t=1 t C(x t, u t ) s.t. x t+1 = f(x t, u t, e t ) u t = ( t ) discount factor Approximate Dynamic Programming Bellman Equation: V (x) =minc(x, u)+ E e [V (f(x, u, e))] u Optimal Policy: (x) = arg min C(x, u)+ E e [V (f(x, u, e))] u Generate algorithms using the insight: V (x k ) C(x k, u k )+ V (x k+1 )+ k
<latexit sha1_base64="c96nrog7aifhrw62vbldao2qo+c=">aaadihicbvjdb9mwfhxc1ygfa+grf4ukqrntlsck8vkpmba87geiuk2qs8hxndsa7ut2dwqj8md45y/whniex4ptzrjtuvkk63pppfa9j0khhyug+o35n27eun1n527n3v0hd3e7vuenni8n4xowy9ycj9ryktsfgadjzwvdquokp0suj5r62rdurmj1z1gwpfi00yivjikd4u53kvbm6ioaq5d1jwxdisrjf5uswijxldd4dxnfyz4k1bs65ktyfkbeliquybtwf0tofjayzfqpegh4alci4acmyz8ykc0hiqsvtemynnil1/k8rpeip9fca97wswcpu8oifgjagdyahcl1rh1d3o0hw2avedsj26sp2jije15ezjkrfdfajlv2ggyfre4objpcjvpaxlb2stm+dammituowm21xs8cmsnpbtynaa/qfzsqqqxdqsqxm/3yzvod/q82lsf9fvvcfyvwza4uskujicenrxgmdgcgly6hzaj3vszm1fagzsi1w1babwdrk1sluguwz/ggkmebhjrqclduueamqt4lkfenqi0+bsy6rjrzpjx4kzib9udy/s16f4vsdak317+dnl4yhsew/piyp37twrodnqcnaibcdijg6am6qrpevj536i291/43/4f/0/91rfw9tucxwgv/z1+kywfm</latexit> <latexit sha1_base64="azszeztxx0p7gmuxegkrcofcx2k=">aaacvnicbvhbihnbeo2mt914y+qjl41bsdcemuxqfyg4igr7sismu5azhp5ozatzvgzdnziw5jv8gh980v+xj4lgegsaduecquqqykophibhz1zw6/adu/eojtv3hzx89lhz8mtitgu5jlmrxl5nzieugsyoumj1aygptmjvdnpw6fffwdph9fdclpaovmirc87qu2nn8/flbzgo+vqtpdualzrwdodzvn9yptwsygk5tmlcmkuynaqb0msb9wd6/dikyo5jo+10w2g4dnoioi3okm1cpcetjj4zxinqycvzbhqfjsy1syi4hfu7rhyujn+waqyeaqbajfv65hv94zkzzy31tynds/9m1ew5t1szdzbjuh2tif+ntsvm3ys10gwfopmmuv5jioy2c6qzyygjxhraubx+r5tpmwuc/zp3uqxrl8b3jqkxlrbczgcplbhayzzpabutupmq/iikpf+ydvs8wfjf1zdt5n57uqh0g3n/s90/mpudrpvrpwst02eudqplv93ru+1pjsgz8pz0serekxh5rc7imhdynfwgv8jvybtkgqrmxhq0tjlpyu4eiz+p/nre</latexit> <latexit sha1_base64="w0y04clrch/hfcnyey3rxb6ejpw=">aaacynicbvfnixnbeo2mx+v6ldwjl8ygjbjcjah6erzx0umok5rsqmyyejo1k2a7e4bugklo5ua/8pd49kp/wp4kikksahi896qqqyqrplayht87wbxrn27eorp9fofuvfspuicpp7asdycjl2vpljnmqqonexqo4biywfqm4sk7omv1iy9grcj1z1xvkchwajelztbtaxcyteockcx6ywf9twmldorqhp71l8n6qj/rjeofhossc++a1eets8hxrv/m5q17cinbbesxwctt9sjrua56ckit6jftnkcnnssel7xwojflzu0scitmhdmouitmok4tvixfsqjmhmqmwczupx9dn3pmtvps+kerrtl/mxxt1q5u5p3tghzfa8n/abma81eje7qqettfnmprsbgk7tlpxbjgkfcemg6e/yvlc2yyr7/yns7r2hxwnuncstacl3pyyyuu0tbpwkdfhg6ncu+flpqt05ao2x3/ux3zvu6/fyvaoxz7u+rbgdkfjnpf/ygyph9f4sj6+kj3+mz7mipymdwhfrkrl+sufcdnzei4+uz+kj/kvzaotlak3myadly5j8hobf9/a3dt4nc=</latexit> <latexit sha1_base64="jqwqz2wojja3hljcmszje1ktl10=">aaacwxicbvfdixmxfe3hr3x92k4++hissi2wmiocviglvfshd1u0uwudowtso9owswzibqrlmd/lr9fh/svm2gq29uli4zx7b3lutusplibhz1zw6/adu/eo7h8/epjo8un79mmllzzhmogflmx1yixiowgcaivclwaysivcptfdrr/6bsakqn/fvqmjyrkwmeamptvrj8bdzd/16lvh5n5jazwzprinfcnfmlyf61kfdswhw2mshj5v7qym427wfpsh13dnvdiifihjrn0jb+e66cgitqbdtnexo20l8bzgtofglpm10ygsmamyqcel1mexs1ayfsnymhqomqkbvgvbnx3hmtnncuoprrpm/62omlj2pvkf2xix+1pd/k+boszejpxqpupqfpnq5itfgjyzphnhgkncecc4ef6vlc+yyrz9phdewfcuge84qzzoc17myy+vuetdpgkbfro6cvv9ellsl0xbompm/ff1bru5+0hkam1/5nepewfjfihr/vgpwewrqrqoovhrzvn77wqoydpynhrjrn6qc/kzxjaj4eq7+uf+kd/bmbbbgzhnatda1jwloxfufwccm9xj</latexit> <latexit sha1_base64="gpskemdf6g9ld6tv+mg9j/edltg=">aaacyhicbvfdi9nafj3gr3x96uqjl4nfslcurar9erzxuwqfvtz2f5oqjtobdnizszi50zaqf/+vp8unx/vfogkr2nyla4dzzr137r1zjyxfmpzr865dv3hz1shtwzt3791/0d96olflbtimesllc5kxc1jogknaczevaayycrfz1umnx3wby0wpp+oygksxqotcciaosvvncsx8rubf05izilzcp03d0hn/mawd+ozgbvokxyrhpmuad23aqbtlyhfkj+la9ppopiqgii0o5pik/ue4cldb90g0aqoyibp0qjfes5lxcjryyaydrmgfscmmci6hpyxrcxxjv6yaqyoakbbjs5q+pu8dm6n5adztsffsvxknu9yuveac3rh2v+vi/2ntgvnxssn0vsnovm6u15jisbtv0pkwwfeuhwdccpdxyufmmi5u4vtdvrur4futnitac17oyievueddhgkbfro6m6p5l6sk50xbetrt+k/qynay/1yuau3w1f1vb3tmd5bod/37ypj8fiwj6nolwfgbzwkoygpyhpgkii/jmflazsiycpkd/cs/yg/vo1d5x73l2ur1njmpyfz43/4ai/xgwg==</latexit> <latexit 
sha1_base64="9joy7emzi33vky9ahbtpwqdn0gg=">aaackxicbvfdaxnbfj1s/wjrv9o+6sngebioybcitg9cuehbprro2kj2cxcnn5uhm7plzf1jwplir/fv/43/prnxbzn4yebwzv2ye09akokodh+3gp179x883n3bf/t4ydnn7ypdk5exvubq5cq3nyk4vnlgkcqpvcksgk4vxqe372v9+htaj3pzlryfjhoyi6dsahlq3h4rf7i77/g3paabxvqacvuu+wv3flz2xu1o2a9xwbdb1iaoa+jifnbk4kkuso2ghalnrlfyufkbjskulvfj0meb4hyyhhloqknlqtuas/7kmxm+za1/hvik/beiau3cqqc+uwpn3kzwk//trivnt5nkmqiknolpogmpoow8vgmfsiuc1midefb6v3ixawuc/oxwpqx6fyjwnqnmpzein+agq2hofjzpkdriu29vfzrk8s9ghd+x2yz+qr5tlxc/yeysoz739pjevri3jno8/za4oulhyt+6fn0zvgus2wxp2uv P 1 minimize E e t=1 t C(x t, u t ) s.t. x t+1 = f(x t, u t, e t ) u t = ( t ) Approximate Dynamic Programming Q(x, u) =C(x, u)+e e [ V (f(x, u, e))] Bellman Equation: V (x) =minc(x, u)+ E e [V (f(x, u, e))] u Q(x, u) =C(x, u)+ E e applemin u 0 Q(f(x, u, e), u 0 ) Optimal Policy: (x) = arg min C(x, u)+ E e [V (f(x, u, e))] u (x) = arg min Q(x, u) u
<latexit sha1_base64="c96nrog7aifhrw62vbldao2qo+c=">aaadihicbvjdb9mwfhxc1ygfa+grf4ukqrntlsck8vkpmba87geiuk2qs8hxndsa7ut2dwqj8md45y/whniex4ptzrjtuvkk63pppfa9j0khhyug+o35n27eun1n527n3v0hd3e7vuenni8n4xowy9ycj9ryktsfgadjzwvdquokp0suj5r62rdurmj1z1gwpfi00yivjikd4u53kvbm6ioaq5d1jwxdisrjf5uswijxldd4dxnfyz4k1bs65ktyfkbeliquybtwf0tofjayzfqpegh4alci4acmyz8ykc0hiqsvtemynnil1/k8rpeip9fca97wswcpu8oifgjagdyahcl1rh1d3o0hw2avedsj26sp2jije15ezjkrfdfajlv2ggyfre4objpcjvpaxlb2stm+dammituowm21xs8cmsnpbtynaa/qfzsqqqxdqsqxm/3yzvod/q82lsf9fvvcfyvwza4uskujicenrxgmdgcgly6hzaj3vszm1fagzsi1w1babwdrk1sluguwz/ggkmebhjrqclduueamqt4lkfenqi0+bsy6rjrzpjx4kzib9udy/s16f4vsdak317+dnl4yhsew/piyp37twrodnqcnaibcdijg6am6qrpevj536i291/43/4f/0/91rfw9tucxwgv/z1+kywfm</latexit> <latexit sha1_base64="azszeztxx0p7gmuxegkrcofcx2k=">aaacvnicbvhbihnbeo2mt914y+qjl41bsdcemuxqfyg4igr7sismu5azhp5ozatzvgzdnziw5jv8gh980v+xj4lgegsaduecquqqykophibhz1zw6/adu/eojtv3hzx89lhz8mtitgu5jlmrxl5nzieugsyoumj1aygptmjvdnpw6fffwdph9fdclpaovmirc87qu2nn8/flbzgo+vqtpdualzrwdodzvn9yptwsygk5tmlcmkuynaqb0msb9wd6/dikyo5jo+10w2g4dnoioi3okm1cpcetjj4zxinqycvzbhqfjsy1syi4hfu7rhyujn+waqyeaqbajfv65hv94zkzzy31tynds/9m1ew5t1szdzbjuh2tif+ntsvm3ys10gwfopmmuv5jioy2c6qzyygjxhraubx+r5tpmwuc/zp3uqxrl8b3jqkxlrbczgcplbhayzzpabutupmq/iikpf+ydvs8wfjf1zdt5n57uqh0g3n/s90/mpudrpvrpwst02eudqplv93ru+1pjsgz8pz0serekxh5rc7imhdynfwgv8jvybtkgqrmxhq0tjlpyu4eiz+p/nre</latexit> <latexit sha1_base64="jqwqz2wojja3hljcmszje1ktl10=">aaacwxicbvfdixmxfe3hr3x92k4++hissi2wmiocviglvfshd1u0uwudowtso9owswzibqrlmd/lr9fh/svm2gq29uli4zx7b3lutusplibhz1zw6/adu/eo7h8/epjo8un79mmllzzhmogflmx1yixiowgcaivclwaysivcptfdrr/6bsakqn/fvqmjyrkwmeamptvrj8bdzd/16lvh5n5jazwzprinfcnfmlyf61kfdswhw2mshj5v7qym427wfpsh13dnvdiifihjrn0jb+e66cgitqbdtnexo20l8bzgtofglpm10ygsmamyqcel1mexs1ayfsnymhqomqkbvgvbnx3hmtnncuoprrpm/62omlj2pvkf2xix+1pd/k+boszejpxqpupqfpnq5itfgjyzphnhgkncecc4ef6vlc+yyrz9phdewfcuge84qzzoc17myy+vuetdpgkbfro6cvv9ellsl0xbompm/ff1bru5+0hkam1/5nepewfjfihr/vgpwewrqrqoovhrzvn77wqoydpynhrjrn6qc/kzxjaj4eq7+uf+kd/bmbbbgzhnatda1jwloxfufwccm9xj</latexit> <latexit sha1_base64="9joy7emzi33vky9ahbtpwqdn0gg=">aaackxicbvfdaxnbfj1s/wjrv9o+6sngebioybcitg9cuehbprro2kj2cxcnn5uhm7plzf1jwplir/fv/43/prnxbzn4yebwzv2ye09akokodh+3gp179x883n3bf/t4ydnn7ypdk5exvubq5cq3nyk4vnlgkcqpvcksgk4vxqe372v9+htaj3pzlryfjhoyi6dsahlq3h4rf7i77/g3paabxvqacvuu+wv3flz2xu1o2a9xwbdb1iaoa+jifnbk4kkuso2ghalnrlfyufkbjskulvfj0meb4hyyhhloqknlqtuas/7kmxm+za1/hvik/beiau3cqqc+uwpn3kzwk//trivnt5nkmqiknolpogmpoow8vgmfsiuc1midefb6v3ixawuc/oxwpqx6fyjwnqnmpzein+agq2hofjzpkdriu29vfzrk8s9ghd+x2yz+qr5tlxc/yeysoz739pjevri3jno8/za4oulhyt+6fn0zvgus2wxp2uv <latexit sha1_base64="881ddg/md6xwd8c/m07dn8n1mzg=">aaacu3icbvfnb9naen2yj5bylckry4oinvwjyezi5yjuuqqcemgfasvfljxebjzfu2trdxylsvkp+dxcepwy1m4qjggklz7eezozm5nvulgmwx+d4nbto3d3du/t3x/w8nhj7v6ts1s6w/iilbi01xlyloxmixqo+xvlokhm8qusog30q6/cwfhqz7ioekig12iqgkcn0u77i/48lqyulq5pdfvlyjk9/usd0tghpydgsui0dgdl2ituxvg0hlid1qc9m+32wmhybt0g0qr0ycro0/1oek9k5htxycryo47ccpmadaom+xivdpzxwari+dhddyrbpg4hxtixnpnqawn800hb9t+mgps1c5v5pwkc2u2tif+njr1oxye10jvdrtlno6mtfevabi9ohoem5cidyeb4v1i2awmm/y7xurs1k87wjqnntgtwtvggk3gobjxposoqupmq/ickpj9aw3om8hn+ux3zru6/e7laozjzh9shw2z/kghz/dvg8uuwcofrxaveydvvaxbjm/kc9elejskj+ujoyygw8o18jz/jr+bnwiivgbyxbp1vzloyfoh7dske2bk=</latexit> <latexit 
sha1_base64="a3vylzvfult7aea2lan03t/af8q=">aaadaxicbvjnbxmxepuux234sssfiytphjqoabslkmofvklicoiheastli1wjneswlg9k3sweq32xju/wg1x5zfwn/gfennunakjwxp6783ym+nhjoxfipjt+bdu37l7b2u7dv/bw0ep6zu75zbndycet2vqlofmghqaeihqwmvmgkmhhivh9ktslz6dsslvn3cewucxsryjwrk6kq5/2+7graqytowqnhwpy+ysnrbzenqib2gzpiwawyt242tpkpmblkna6tssmmlmyt/+gezjpprtlnbxke+xdlngmt0iy3a+34qmge+wvyvrjaatlijugnajgmqzz/gon4islocknhljro2hqyadghkuxejzi3ilgentnoa+g5opsinimbasvnbmqkepcucjxba3mwqmrj2roxnwd7frwkx+t+vnoho9kitocgtnry4a5zjisqsd0eqy4cjndjbuhhsr5rnmgee3qzvbfruz4cudflncc54msmzknkfhjrsaiglddvw8f1lsj0xbelon+vp1zsu5+u6mbdr2qfsourvhdgsj18e/cc5fdskge3zfny7fllezrz6rpdikitkix+qdosm9wskf76n33nvzv/rf/r/+zyur7y1znpcv8h/9bfbf8pc=</latexit> P 1 minimize E e t=1 t C(x t, u t ) s.t. x t+1 = f(x t, u t, e t ) u t = ( t ) Approximate Dynamic Programming Q(x, u) =C(x, u)+e e [ V (f(x, u, e))] Bellman Equation: Q(x, u) =C(x, u)+ E e applemin u 0 Q(f(x, u, e), u 0 ) Optimal Policy: (x) = arg min Q(x, u) u Q(x k, u k ) C(x k, u k )+ min u 0 Q(x k+1, u 0 )+ k Q-learning: Q new (x k, u k )=(1 )Q old (x k, u k ) C(x k, u k )+ min Q old (x k+1, u 0 ) u 0
<latexit sha1_base64="vs+14vgxeycwqa4/abiirwhhyzg=">aaadgnicbvjnb9naelxnv0n5sohizuvelyooshescfspoia49fbe01bkgmu9gser7q6t3tfkspxpupjhucguxpg3rfmjsmjilmbfe/n2z8zpiyxfmpzlb1euxrt+y+tmz/vw7tt3uzv3tm1egg4jnsvcnkfmghqarihqwnlhgkluwll6cdjwz5/awjhre1wuecs21sitnkgdku5xmsju6iozwxz1jwxdosrn55uswijxgwqys6hioevt6k2dajwq4zjauiuv7kf1xxnymgb/nucgthcpgjgdyuxpa2ohogws5k79okrjpsn+qgfqvndolnehr9fcojiia5w6fpskfvfs7yxdcblkm4napoe1czzs+dgd5lxuojflzu04cgumnr0klse1wvoogl9guxi7vdmfnq6w86zji4dmsjyb92kks/tfioopaxcqdcpmmnada8d/cemss+dxjxrrimh+evfwsoi5azzdjsiar7lwcengulcspmogcxqrxlll6v0ax+mkmpda8hwca6jeorrmqauomnbnv9vbisx5wlqlr83k/rdotqh7r8vuob0cuf9e722i3uki9ffvjqdphle4jn4/7r28alez5t3whnp9l/keeqfeo+/yg3nc3/yj/4x/mvgsfau+bz8upyhf1tz3vil4+rs0rp43</late h i PT minimize E e t=1 C t(x t, u t ) s.t. x t+1 = f t (x t, u t, e t ) u t = t ( t ) Direct Policy Search
<latexit sha1_base64="bnc2p7snadzkkyaqpftniewfeuq=">aaact3icbvhbbtnaen2ys9tws+grlxurkjvqzcmkyltvkochd+gstlicovv6bi+6xpvdmwpq+x/4gl6bv2gdgokkjdts0tlzn6huamn3f/e8gzdv3d7z3evfuxvv/opb/sntw1rgwlqwqjdnkbcgumoukbsclwzehik4iy7etprznzawc/2zlixmc5fqtfakctricbxgkkkuhtfi2drknf29mee9qk9c1geukiui+mpzjw74m87dsyajqwped0hhxdjimpth/sr4ngg6mgsdtrb7vxkyf7lkqznuwtpz4jc0d+uipykmh1ywsievr min <latexit sha1_base64="zcjli+x7yefdo <latexit sha1_base64="xncvduignbply0pxcjg0898xebs=">aaac4nicbvhlitrafk1kfizjq0exbgobiq3sjiogimlga0vm0ai9m9ajovk5syqpvelvzta9it/gttz6v678fhdwutthd3uh4hdofds9j6mlmoj7pxx358rva9d3b+zdvhx7zt3b/r1juzwaw5rxstknctmghyipcprwwmtgzslhjdl71esn56cnqnqnnncqlsxxihocoaxiqrkqlkhg33vhodnyalirfuhdkmgrjo2blm5r7/l5x7eljwq4cyef8c5hdfket38sohrkkqfrraeweyerjqddf+wvgm6dyawgzbwten+jwrtitqkkuwtgzak/xqi1jqwx0o2fjyga8toww8xcxuowubu4s0cfwsalwaxtu0gx7l8vlsunmzejzez3nztat/5pmzwypytaoeogqfhlokyrfcvah5mmqgnhobeacs3sxykvmgycrrvruxa9a+brm7qxjrk8smgdlxibmlnsajzmqh6r9q2qkn5kytcj/si/vdu2l73xihdohh9zv9vok9kaemyefxsch4wdfxx8edi8flmyzpc8ia+jrwlylbysd2rcpost7+sn4zo7bup+dr+4x5eprroquu/wwv32c33a6oi=</latexit> min <latexit sha1_base64="54fqrxzqyz <latexit sha1_base64="cihlzcnuhll5+u92cyhc02yuj8o=">aaacuxicbvfbi9nafj7g2269dfxrl8gidefkiokcl4uu6mm+vls7c00ik8lpctzjjmycynaqp+sv8xx9nu66ewzrgqmf33fuj6kuwvl9q4f34+at23f29od3791/8hb08ojulrwrmjelks15iiwo1danjaxnlqfrjarokov3nx72hyzfun+lvqvritkns5schbwpjsmemtsnmeas2kapdrgffqjjppr8ogz5cx4wgvikat60cbxg4sxhj/bogijo+7r4npan/tr4lgh6mga9zekdqrsmpawl0csvshyr+bvfrhyhvnaow9pcjesfygdhobyf2khzr9vyz45j+bi0zjxxnftvrimka1df4ik72e221ph/0xy1ld9edeqqjtdyutgyvpxk <latexit sha1_base64="nigzqa0cicsnamj1ol8sebl/xgu=">aaaczxicbvfnb9naen2yrzz8pxdksiicjrkk7aqjsr1ufagolqictjfiy1pvjvaq67w1o66alubkv+j/cockv4f1ahbjggmlp/dm3uzmjkuubn3/e8e7dv3gzvs7u93bd+7eu9/be3biikpzmpbcfnqamanskjigqantugplewmnydmrrj89b21eot7hsoqoz6ksc8ezoirutcmeuqes05otaytl3d0nc6fig54zjrkgq+ltguymsysxb+ryloplw7/isj7rcjyjwewqrt0q1ly1int9f+svgm6doav90sy43ute4bzgvq4kuwtgzak/xmjzoeas6m5ygsgzp2mpzbxulact2dukavremxo6klr7cumk/bfcstyyzz64zgyus6k15p+0wywlg8gkvvyiil81wlssykgbfdk50mbrlh1gxav3v8ozphlht/ <latexit sha1_base64="i0aeayt9hihuk/vjtsdd2czegl8=">aaachhicbvfdaxpbfb03bwrtrzr57mtqksii7lyjlyueaqstxyeu1ijoinfhqw6znv1m7gzl8zfktf1r/tedvqtve+hc4zz7fanusuu+/7vkht14epy Sampling to Search min z2r d (z) =<latexit sha1_base64="zcjli+x7yefdo p(z) E p [ (z)] apple<latexit sha1_base64="54fqrxzqyz # E p(z;#) [ (z)] =: J(#) Search over probability distributions Use function approximations that might not capture optimal distribution Can build (incredibly high variance) stochastic gradient estimates by sampling: rj(#) =E p(z;#) [ (z)r # log p(z; #)]
<latexit sha1_base64="dqebqq2yo/bjn7jctcmtxxkfwfa=">aaacq3icbvhdahnbfj6svwvqt6qx3gygwgy07iqokelxb6x0iqwmlwaxchzykh06o7vmnc1nl7yjt+nt+wk+jbnpxcbxwigp7zv/jymutbqevxvery3bd+5u3mtu3x/w8ffr+/grzusjsc9ylzutbcwqqbfpkhsefayhsxqej6efav34di2vuf5o0wljdczajquactsw9xrpj87auioenxcfejqbpulsfzknq8k/ep9pna141eulf9hhcxpyagfdyg58hyql0gyl6w23g3e0ykwzosahwnpbgbquv666fapnzai0wia4hqkohnsqoy2r+yizvuoyer/nxrkmpmdvzlsqwtvnehdzz29xtzr8nzyoafw2rqquskitrhuns8up5/w1+egafksmdoaw0s3krqogblmblnwz1y5qlg1snzdainyek6yiczlgsiuugdt1vtvx <latexit sha1_base64="oo9jp+19oa2n8nsqhpl71jm7sec=">aaacynicbvfbaxnbfj6st7beun30ztaog5cwwwqlrsgqkjkhikytzjdwdvyko3r2dpk5w5osefnf+ut89fx/hlnplcbpgyfvvu/ct1iqaskifra8w7fv3l23s7t3/8hdr4/b+09obfezgunrqmkcjwbrsy1dkqtwrdqiealwndl/3+inf2islpq3mpuy5zdvciifkkpg7efupcfrmk6jczcuicgcf/b/f7r85vsesu08gmtsn3f5df6lpz+6fphox+1o0auwxrdbuaidtrlbel8vr2khqhw1cqxwjskgplh2oavqunilkoslihoy4shbdtnauf7ov+avhjpyswhcc50u2esrnetwzvleeezamd3ugvimbvtr5dcups4rqi2uck0qxangztj5kg0kujmhqbjpeuuiawoc3mrxqixzlyjwjqkvky1fkeigq+isddjsiuugdtnv/veqxb+ctrwvpxn9u13arvy/ykkk+6rv7qq7w87uiohm+rfbyuevdhrhl9ed43er0+ywz+w581ni3rbj9okn2jaj9op9yr/zh6/vgw/m1veuxmsv85stmff9l/yh4fa=</latexit> <latexit sha1_base64="nbbn4o5cmynrkblxni3vhtvu0jm=">aaac23icbvfnixnbeo2mh7uux1k9emkmwgqkzcycwiiskuhhdxhn7kjmcdwdmkyzpt1dd82yyzctn/hqv/ip+de86sgebbqzsadh9xtv1v2vkljjs0hwvendu37j5s7urb3bd+7eu9/df3bii8oihilcfeysaytkahyrjivnpuhie4wnyfnrrj+9qgnlot/svmq4h5mwqrrajpp005c8kpp4nmykv+jzsgfkfpqaehwkivewqamlmjqhwzkx/ulw77w/rfv3ymhzrv1wgp8ujt1emahwwbdbuay9to7hzl8tr9ncvdlqegqshydbsxhtekqhclkxvrzleocww7gdgnk0cb0yzmmfogbk08k444zbsf9w1jbbo88tl5kdzbatnet/thff6yu4lrqsclw4eiitfkecn+7yqtqosm0daggk+ysxgtgzye1g45vv7xlfxit1zawlkkbyyhvdkgfhwqqcpg6mqt9kpfgh0jyfn67/uv3brvbfyjkk+/tylvr3t5ldqsk2/dvg5gaqbopw/bpe0av1anbzi/ay+sxkz9kre8egbmqe+8z+sj/slxd7n7zp3pervk+zrnninsl7+htfc+ml</latexit> <latexit sha1_base64="7ypzmnmmi6uohph3ssog/g0xytm=">aaacy3icbvfdi9nafj3gr931q6upvgwwiqupysioilcooa8rvltdhsaum8ltmuxkemzulm1jh/1x/hfffduf4arbxbzegdicc+bo3hotskllqfc94127fupmrb39g9t37t673z18mlzlbqsorklkc5aarsu1jkiswrpkibsjwtpk/e2rn16gsblun2leyvxapuvmcibhtbvjvzysmnikcey+j4a59bd9hmlifeyb6aim5uiwdi4y45w/epmxczyjs5z6w3s62j92e8egwbxfbeea9ni6htpdthylpagl1cquwdsjg4rixvwuquhyikotvidoicojgxokthgzcmdjnzgm5bpsuoogwbh/3migshzejm5zaov2w2vj/2mtmmyv4kbqqibu4uqhwa04lbxnk6fsoca1dwceke6vxorgqjdlfoovve8kxcykzwwtpsht3givxzibr1qkaqrup2resax4j9cwn7sp/1fd21b238pmkn164har+ztmt5bwo/5dmd4ahmeg/pisd/x6vzo99og9zj4l2xn2zn6zirsxwb6xh+wn++v98ky38l5cwb3o+s5dtlhe19/on+ht</latexit> <latexit sha1_base64="fa1z8kn/imf/v9cyloes1fopwme=">aaacz3icbvfbi9nafj7g27reuvroy2arwpcsikcwlcxe0id96kldxwxcojmejsnojmhmzn1uipjqv/jv+ad81z/gpfvfth4y+pi+c5nznaru0plvf+94v65eu35j6+b2rdt37t7r7tw/skvlbi5foqpzkobfjtwoszlck9ig5inc4+t0vasfn6gxstafaf5ileoq5uwkieff3y97pmybsisp3zrxxfyvdsmzmjqhwaajfc5owsnrjvsxax5qsbte9d+mhoeqsplqfq+ntdok4m7ph/ql4jsgwiiew8yo3ule4bqqvy6ahajrj4ffuls7xlioblbdymij4hrsndioiucb1qstgv7ymvm+k4x7mvic/beihtzaez64zhzhu6615p+0suwzf1etdvkrane5afyptgvvhevtavcqmjsawkj3vy4ymcdi+b4yzdg7rlgysx1easmkka6xis7jgcmtug5st1vvb6vs/d1oyw9aj/+orm0r91/lvjj9cucoqwcbye4gwbr9m+do6tdwh8hhs97+y+vptthd9oj1wcces332jo3yman2jf1gp9kv79d75h32vlymep1lzqo2et7x30+55pi=</latexit> Reinforce Algorithm J(#) :=E p(z;#) [ (z)] Z r # J(#) = (z)r # p(z; #)dz Z = (z) Z r# p(z; #) p(z; #)dz p(z; #) = ( (z)r # log p(z; #)) p(z; #)dz = E p(z;#) [ (z)r # log p(z; #)]
<latexit sha1_base64="dqebqq2yo/bjn7jctcmtxxkfwfa=">aaacq3icbvhdahnbfj6svwvqt6qx3gygwgy07iqokelxb6x0iqwmlwaxchzykh06o7vmnc1nl7yjt+nt+wk+jbnpxcbxwigp7zv/jymutbqevxvery3bd+5u3mtu3x/w8ffr+/grzusjsc9ylzutbcwqqbfpkhsefayhsxqej6efav34di2vuf5o0wljdczajquactsw9xrpj87auioenxcfejqbpulsfzknq8k/ep9pna141eulf9hhcxpyagfdyg58hyql0gyl6w23g3e0ykwzosahwnpbgbquv666fapnzai0wia4hqkohnsqoy2r+yizvuoyer/nxrkmpmdvzlsqwtvnehdzz29xtzr8nzyoafw2rqquskitrhuns8up5/w1+egafksmdoaw0s3krqogblmblnwz1y5qlg1snzdainyek6yiczlgsiuugdt1vtvx <latexit sha1_base64="xncvduignbply0pxcjg0898xebs=">aaac4nicbvhlitrafk1kfizjq0exbgobiq3sjiogimlga0vm0ai9m9ajovk5syqpvelvzta9it/gttz6v678fhdwutthd3uh4hdofds9j6mlmoj7pxx358rva9d3b+zdvhx7zt3b/r1juzwaw5rxstknctmghyipcprwwmtgzslhjdl71esn56cnqnqnnncqlsxxihocoaxiqrkqlkhg33vhodnyalirfuhdkmgrjo2blm5r7/l5x7eljwq4cyef8c5hdfket38sohrkkqfrraeweyerjqddf+wvgm6dyawgzbwten+jwrtitqkkuwtgzak/xqi1jqwx0o2fjyga8toww8xcxuowubu4s0cfwsalwaxtu0gx7l8vlsunmzejzez3nztat/5pmzwypytaoeogqfhlokyrfcvah5mmqgnhobeacs3sxykvmgycrrvruxa9a+brm7qxjrk8smgdlxibmlnsajzmqh6r9q2qkn5kytcj/si/vdu2l73xihdohh9zv9vok9kaemyefxsch4wdfxx8edi8flmyzpc8ia+jrwlylbysd2rcpost7+sn4zo7bup+dr+4x5eprroquu/wwv32c33a6oi=</latexit> <latexit sha1_base64="jj1kekwesntnojcyyygyte4uvyc=">aaacjnicbvftaxnben6cl631ldvp4pffikqg4u7eciiwfeyhfqho2kjyhhobstjkd+/ynstnj+kv8av+hv+ne2kekziw8pa8m8/szosljs9x/lsv3bh56/bw9p2du/fup3jy3n104ovkkeyrqhfulaepmiz2mvjjwekqtk7xnj99bpttc3secvun5ywmbiawxqsaa5w1n1xmmzn0zgtzvxw7pafhu2tizntzuxp34kxitzasqucs4zjbbaxduaeqg <latexit sha1_base64="pp9kp46qo/izm8uhki7xz6n6m8w=">aaacwnicbvfti9naen7gt7v61topflksqgthsurqeohqg/pdcrxt3uetwmq7tdzunnf3clyn/vx+mvuqf8rnr0jfhfh4ej5nznzmkljjs75/3fju3b5z997efvv+g4ephnconpzzojicr6jqhbliwkksgkckseffardyrof5mvvq6oexakws9fealxjlkgo5lqliuxhn0/5j70c8owwvwvcgbpgsz9/xcjjjhu/zueoiik7xdaseqillzwn4u57yjjtdf+avg++cyaw6bbxd+kavhzncvdlqegqshqd+svhtakqhcneok4slibmkohzqq442qpdzl/glx0z4tdduaejldj2jhtzaez44zw6u2w2tif+njsuavolqqcukuiubrtnkcsp4s0q+kqyfqbkdiix0f+uiawoc3ko3uixrlyg2jqmvki1fmcetvtevgxckrcpb6maq+kqqxb+atvxuphn9u13zru4dy1ssptx199t9hbm7slc9/l1w9niq+ipg86vu0fvvafbym/ac9vjaxrmj9pen2ygj9otds9/sj3fsffo+e/bg6rvwou/zrng//wln8t5q</latexit> <latexit sha1_base64="/6wnxjvti3ogtr3oxgqus2ijbjg=">aaacshicbvfna9taef2rx2n6eac95rlebbyagikeggif0aasqw4prrodlcropby2wq3e7ijeff4x/tw9tsf+m64cl8z2bhye78282zmjcyut+f6fhvfo8zonz9aer794+er1rnpzzaxnsyowk3kvm14mfpxu2cvjcnufqchihvdx+rnwr27qwjnrbzqpmmxgrovicibhrc2jwq0yspagqtj3wzr/5p+zll/na1bfuspt9vco3bsn7kbnlt/xz8fxqtahltapi2izeq6gusgz1cquwnsp/ilcyllkoxc6pigtfibsggpfqq0z2rcattnlo44z8lfu3npez+z9igoyaydz7dizomquazx5knyvaxqyvlixjaewd41gpeku83plfcgnclitb0ay6f7krqigblnflnszercofiapbkstrt7ejvbrlrlwpexkqop6qupuksw/grb8xi4t+qc621pun8ixjlt37q6nd1es3ugc5fwvgsv9tub3gi8hrenp89osss22zdosyb/ymttjf6zlbpvbf Reinforce Algorithm J(#) :=E p(z;#) [ (z)] rj(#) =E p(z;#) [ (z)r # log p(z; #)] Sample z k p(z; # k ) Compute G(z k, # k )= (z k )r #k log p(z k ; # k ) Update # k+1 = # k k G(z k, # k )
<latexit sha1_base64="qqxcfhphmdwqztemz4v5odvaqpm=">aaachxicbvhfaxnben6cra1t1vqffvkahbbacfek+qqfbfvqh4qmletomlc3sybu7r27c6xxyh/iq/5p/jfupsmyxigbj++b35owmhyh4z9w8ght/fhg5pot7z2nz563d19cuqkycnuq0iw9tsghjom9jtz4xvqepnv4ld58bpsrw7socvonjyumoywmdukbe2rqbv+qmrkz10fryr <latexit sha1_base64="dqebqq2yo/bjn7jctcmtxxkfwfa=">aaacq3icbvhdahnbfj6svwvqt6qx3gygwgy07iqokelxb6x0iqwmlwaxchzykh06o7vmnc1nl7yjt+nt+wk+jbnpxcbxwigp7zv/jymutbqevxvery3bd+5u3mtu3x/w8ffr+/grzusjsc9ylzutbcwqqbfpkhsefayhsxqej6efav34di2vuf5o0wljdczajquactsw9xrpj87auioenxcfejqbpulsfzknq8k/ep9pna141eulf9hhcxpyagfdyg58hyql0gyl6w23g3e0ykwzosahwnpbgbquv666fapnzai0wia4hqkohnsqoy2r+yizvuoyer/nxrkmpmdvzlsqwtvnehdzz29xtzr8nzyoafw2rqquskitrhuns8up5/w1+egafksmdoaw0s3krqogblmblnwz1y5qlg1snzdainyek6yiczlgsiuugdt1vtvx <latexit sha1_base64="jj1kekwesntnojcyyygyte4uvyc=">aaacjnicbvftaxnben6cl631ldvp4pffikqg4u7eciiwfeyhfqho2kjyhhobstjkd+/ynstnj+kv8av+hv+ne2kekziw8pa8m8/szosljs9x/lsv3bh56/bw9p2du/fup3jy3n104ovkkeyrqhfulaepmiz2mvjjwekqtk7xnj99bpttc3secvun5ywmbiawxqsaa5w1n1xmmzn0zgtzvxw7pafhu2tizntzuxp34kxitzasqucs4zjbbaxduaeqg <latexit sha1_base64="pp9kp46qo/izm8uhki7xz6n6m8w=">aaacwnicbvfti9naen7gt7v61topflksqgthsurqeohqg/pdcrxt3uetwmq7tdzunnf3clyn/vx+mvuqf8rnr0jfhfh4ej5nznzmkljjs75/3fju3b5z997efvv+g4ephnconpzzojicr6jqhbliwkksgkckseffardyrof5mvvq6oexakws9fealxjlkgo5lqliuxhn0/5j70c8owwvwvcgbpgsz9/xcjjjhu/zueoiik7xdaseqillzwn4u57yjjtdf+avg++cyaw6bbxd+kavhzncvdlqegqshqd+svhtakqhcneok4slibmkohzqq442qpdzl/glx0z4tdduaejldj2jhtzaez44zw6u2w2tif+njsuavolqqcukuiubrtnkcsp4s0q+kqyfqbkdiix0f+uiawoc3ko3uixrlyg2jqmvki1fmcetvtevgxckrcpb6maq+kqqxb+atvxuphn9u13zru4dy1ssptx199t9hbm7slc9/l1w9niq+ipg86vu0fvvafbym/ac9vjaxrmj9pen2ygj9otds9/sj3fsffo+e/bg6rvwou/zrng//wln8t5q</latexit> <latexit sha1_base64="/6wnxjvti3ogtr3oxgqus2ijbjg=">aaacshicbvfna9taef2rx2n6eac95rlebbyagikeggif0aasqw4prrodlcropby2wq3e7ijeff4x/tw9tsf+m64cl8z2bhye78282zmjcyut+f6fhvfo8zonz9aer794+er1rnpzzaxnsyowk3kvm14mfpxu2cvjcnufqchihvdx+rnwr27qwjnrbzqpmmxgrovicibhrc2jwq0yspagqtj3wzr/5p+zll/na1bfuspt9vco3bsn7kbnlt/xz8fxqtahltapi2izeq6gusgz1cquwnsp/ilcyllkoxc6pigtfibsggpfqq0z2rcattnlo44z8lfu3npez+z9igoyaydz7dizomquazx5knyvaxqyvlixjaewd41gpeku83plfcgnclitb0ay6f7krqigblnflnszercofiapbkstrt7ejvbrlrlwpexkqop6qupuksw/grb8xi4t+qc621pun8ixjlt37q6nd1es3ugc5fwvgsv9tub3gi8hrenp89osss22zdosyb/ymttjf6zlbpvbf <latexit sha1_base64="ys3klrjikvg9dud1jmmn7l7zgpg=">aaacz3icbvhlbtnafj2yvympprbkmyjcslsibiren5uqqijff6kgbuvswdetm3jk8diaus4nvhbb/orf4afywicwtonoeq40o6nz7mpm3kru0plv/2h5167fuhlr6/b2nbv37u+0dx+c2kiyaoeiuiu5s8cikhqhjenhwwkq8kthazk9bvttczrwfvodzuqmcphqozecyffx+2n4dozsjijrbc+y8wp+j8n4mx6cktmghonudj/hwa+5+b4pixinuge05fgt6htevvla68xtjt/3f8e3qbaehbamqbzbisjxiaocnqkf1o4cv6sodj2ludjfdiuljygmpjhysblso3phwpw/ccyytwrjjia+yk9w1jbbo8stl5kdpxzda8j/aaokjvtrlxvzewpxowhsku4fbxzly2lqkjo5amji91yuujagypm+mmxru0sx8pp6otjsfgncyxvdkafhwqqcpg5+vb+vsvh3oc0/ktou/qqubsn338ipjpv0yc1x9zas3ukcdfs3wcnzfud3g+mxncnxy9vssufsmeuygl1kh+wdg7ahe+w7+8l+sd/esffj++j9vuz1wsuah2wlvg9/ajo45di=</latexit> <latexit 
sha1_base64="jwaab2pnwnllfm05bp8bogpjn2w=">aaac1nicbvfnixnbeo2mx+v6svk9emkmqoiazkrqkjufbt3syuwzg8imq6wnjmm2p2forlksh3gtr/4rf4m/wqte7zle2sqwndzee13v/wpckgnj93+0veuxr1y9tnn998bnw7f32vt3tmxegoedkavcdmdguumna5kkcfgyhgys8hr89qrwt8/rwjnrdzqvmmpgomuqbzcj4jyu3u8vwnmwnewchj/gywhyjk7kqbd4mpawnscqegfof0v+zxnl3mjjp77ipwyonvfc7vh9vym+dyiv6lbvhcf7rshmclfmqekoshyu+avflesphclfblhaleccwqrhdmri0ezvk8wcp3bmwtpcukojn+zfgxvk1s6zsxnmqfo7qdxk/7rrsenzqjk6kam1wa5ks8up53wwpjegbam5aycmdg/lygoupxlxr01pehco1n5szuotrz7gbqtorgycazeyklr+vfvgksxfg7b8se6m9fd1bwu5+1pojnlhr27hurdldgsjnupfbidp+ohfd9497ry+xk1mh91j91mxbewzo2rv2tebmmg+s5/sf/vtdb3p3hfv69lqtvz37rk18r79aak+59q=</latexit> Reinforce Algorithm J(#) :=E p(z;#) [ (z)] Sample z k p(z; # k ) Compute G(z k, # k )= (z k )r #k log p(z k ; # k ) Update # k+1 = # k k G(z k, # k ) Generic algorithm for solving discrete optimization: z 2 { 1, 1} d p(z; #) = dy i=1 exp(z i # i ) exp( # i )+exp(# i ) # k+1 = # k k (z k )(z k + tanh(# k )) Does this solve any discrete problem?
<latexit sha1_base64="vs+14vgxeycwqa4/abiirwhhyzg=">aaadgnicbvjnb9naelxnv0n5sohizuvelyooshescfspoia49fbe01bkgmu9gser7q6t3tfkspxpupjhucguxpg3rfmjsmjilmbfe/n2z8zpiyxfmpzlb1euxrt+y+tmz/vw7tt3uzv3tm1egg4jnsvcnkfmghqarihqwnlhgkluwll6cdjwz5/awjhre1wuecs21sitnkgdku5xmsju6iozwxz1jwxdosrn55uswijxgwqys6hioevt6k2dajwq4zjauiuv7kf1xxnymgb/nucgthcpgjgdyuxpa2ohogws5k79okrjpsn+qgfqvndolnehr9fcojiia5w6fpskfvfs7yxdcblkm4napoe1czzs+dgd5lxuojflzu04cgumnr0klse1wvoogl9guxi7vdmfnq6w86zji4dmsjyb92kks/tfioopaxcqdcpmmnada8d/cemss+dxjxrrimh+evfwsoi5azzdjsiar7lwcengulcspmogcxqrxlll6v0ax+mkmpda8hwca6jeorrmqauomnbnv9vbisx5wlqlr83k/rdotqh7r8vuob0cuf9e722i3uki9ffvjqdphle4jn4/7r28alez5t3whnp9l/keeqfeo+/yg3nc3/yj/4x/mvgsfau+bz8upyhf1tz3vil4+rs0rp43</late <latexit sha1_base64="ry/q9u1/eodhnqdeowrn6r+2wk4=">aaadkxicbvlbbtnaelxnryrlu3idlxurvskiyeziikgisavxur+kanpkwwotn5nk1d21ttuueoy/ifd+hdfglr9hnqrbukaynhvombm7m05zksyg4q8/uht5ytvrw9cbn27eur3d3llzblpccbjwtgbmnguwpnawqiestnmdtkusttkz/zo/oqdjraapcj5drnhei7hgdb2unl/sfczcl8wynq9kkasgvwk2k5xqqolpujfdqhxdazqwr6qkhas7ryivltdgibwfskrci6qpr2q/wfzssxeoezmpxpsu7gwpe7xvzkkfrrxzi+o/6i7ufbsxs9ybwqfi3i4+o/i5pwcgp4cs06cgr6tnjs1w2asxqs4m0sppeas4thb8mi4yxijqycwzdhifocbodgwx4houlosmn7ejdf2qmqibl4vxvushq0zknbn3asql9n+kkilr5yp1ynpqdporwf9xwwlhz+js6lxa0hx50biqbdns74qmhagocu4sxo1wbyv8ygzj6da6dsvcowe+1kk5k7tg2qg2uikznmybflaxoeuuytdcsvkbauso6s39yz1ttbdfiola2z1wv43uxbc7husb47+yh <latexit sha1_base64="o4j6otjoeqzpjitxt8gszcrluky=">aaadnxicbvjnixnbej0zv9b4ldwjl8zgsnglzerqkjwfvfsqw4qb7ei6dj2dmqtz7p6hu2zjhoz3efvvepamxv0l9iqjmsschur3xlv1vxwcswgx3//mb9eu37h5a+92487de/cfnpcfjmyagw5dnsruxmtmghqahihqwkvmgklywnl8evlx51dgrej1gs4zmcg20yirnkgdouzxgsnm6iizw5zliwxzocpof4uswijxgursjlqxnmdx8bamcojwkkykzqykehicu5urqmcjspx0rk4i7cycio+ws42yzxfcaz3r9rbxzvs49ufykios/fufqhvbg23ilo4inbmdiszdxterznaoya7whbsncnpapzlqtvq9/srirhpwtsur7tta9yd0mvjcguyumbxjsj/hxkvdwsw4/nmlgeoxbazj52qmwe6k1ahl8tqhu5kkxh2nzix+g1ewze1sxu5zdc1ucxx4p26cy/jyugid5qiarwslussykmpvzcomcjrl5zbuhhsr4xnmgee33y0qq9wz8i1oikwubu+nsivkxkbhdrsaiglddvw8e1ksj0xbmqhw+id1asu680bmbnrdgftcursjd h i PT minimize E e t=1 C t(x t, u t ) s.t. x t+1 = f t (x t, u t, e t ) u t = t ( t ) Direct Policy Search Policy Gradient h PT i minimize E et,u t t=1 C t(x t, u t ) s.t. x t+1 = f t (x t, u t, e t ) u t p(u x t ; #) probabilistic policy Random Search h PT i minimize E et,! t=1 C t(x t, u t ) s.t. x t+1 = f t (x t, u t, e t ) u t = ( t ; # +!) parameter perturbation Reinforce applied to either problems does not depend on the dynamics Both are Derivative-free algorithms!
<latexit sha1_base64="ry/q9u1/eodhnqdeowrn6r+2wk4=">aaadkxicbvlbbtnaelxnryrlu3idlxurvskiyeziikgisavxur+kanpkwwotn5nk1d21ttuueoy/ifd+hdfglr9hnqrbukaynhvombm7m05zksyg4q8/uht5ytvrw9cbn27eur3d3llzblpccbjwtgbmnguwpnawqiestnmdtkusttkz/zo/oqdjraapcj5drnhei7hgdb2unl/sfczcl8wynq9kkasgvwk2k5xqqolpujfdqhxdazqwr6qkhas7ryivltdgibwfskrci6qpr2q/wfzssxeoezmpxpsu7gwpe7xvzkkfrrxzi+o/6i7ufbsxs9ybwqfi3i4+o/i5pwcgp4cs06cgr6tnjs1w2asxqs4m0sppeas4thb8mi4yxijqycwzdhifocbodgwx4houlosmn7ejdf2qmqibl4vxvushq0zknbn3asql9n+kkilr5yp1ynpqdporwf9xwwlhz+js6lxa0hx50biqbdns74qmhagocu4sxo1wbyv8ygzj6da6dsvcowe+1kk5k7tg2qg2uikznmybflaxoeuuytdcsvkbauso6s39yz1ttbdfiola2z1wv43uxbc7husb47+yh <latexit sha1_base64="upao0ytgznfi8wjp4jzh6wbyake=">aaadbnicbvjda9rafj3er1q/tvoowucizmfdeilukeqhqn3oq8xdtrbjw81kdnfozbjmbsoucd999y/4jr76n/wlvjrzruvueifwoofmutp3jimkmoj7px332vubn29t3n68c/fe/qetryfhji814wowy1yfjmc4fiopukdkp4xmkcwsnytn+7v+csg1ebnq46zguqzjjuacavoqbn058ekeshtegmyjr+jqxrpkpkivnguwv7gbzm/6dn+bxtgty+zquivxbdshs3ncsfrzs6r/mpjtueeiif6ban35mbzxhgejptnin1enm9y41fz7/qloogga0cznhcvbthsmosszrpbjmgyy+avglc0vtpl5zlgaxga7hzefwqgg4yaqfrob02ewseko1/ztsbfsvycqyiyzzyl1zoats6rv5p+0yymj11elvfeiv+yy0aiufhnal4kmqnogcmybmc3sxsmbgaagdl1lxrbzbwdll6mmprist/kkk3gkgixpogygvp2q6kbist+cmvswnvef1cbwsvdojawa7qh9j1rnzwwxeqyofx0cv+offi/4sn3ee9uszom8jk+jrwkyq/bie3jebosrx84t57nzwv3sfnw/ud8vra7tnhlelsr98rszcvet</latexit> Policy Gradient h PT i minimize E et,u t t=1 C t(x t, u t ) s.t. x t+1 = f t (x t, u t, e t ) u t p(u x t ; #) Direct Policy Search probabilistic policy G(, #) = TX C(x t, u t )! XT 1 r # log p # (u t x t ; #)! t=1 t=0
<latexit sha1_base64="j4lebcdojzuwdwucyrzfilabtyq=">aaadhhicbvlfb9mwehbcr1eydpdii0ufggxvcsdba5xgamhdhjzot0lnfjmu01qznci+obyr/wqv/co8iv6r+g+wuydrjpoiu3zf3dl3n/nkcanr9dsil12+cvxaxvxojzubt253t+4cmblwli1okup9khpdbfdsbbweo6k0izix7dg/e+p5489mg16qiswqlkoyvbzglicdsu63jgdtrizrmiwak0ttswrezq3kikv+htx4iu4kgvme23dnilgb46tqhnq4scmgj6awmyvb3jwo8tyd08f40hu8g+vl30fve82nm0itpo1u+td3neeudcdu8ac/bov2flrzlinowtskvvnw7ux9agn4yhc3qq+1dpbtbwkykwktmqiqidhjokogde2au8hcmlvhfafnzmrgllremppa5uyb/mahe1yu2n0k8bl9t8isacxc5i7t78ascx78hzeuoxizwq6qgpii5wcvtcbqyi8pnndnkiifcwjv3n0v0xlx6wyn4sopy94voyut2hmtoc0nba0vmadnhggysmkvn8q+50lgt0qzvo/v+cu6tp7efsunhmytffds1kmlyu6qeh39f4ojp/046sehz3u7e600g+geuo+2uyxeof30ar2gealbzvasebumwq/h9/bh+pm8nqzamrtoxcjffwdk/f3k</latexit> <latexit sha1_base64="aqbp24lviyjw82kpzzckyu6idtc=">aaacpxicbvhbbhmxehwwwwmxpuwrf4sikqkkdhesfaluusr4kfilsvop2a5mnuli1etd2wouajxf4gt4hx/gb/bug0qsrrj8fm5cpdnpoaslmpzdcg7dvnp33s795oohjx7vtvb2bzz3rmbf5co3lylyvfjjnyqpvcwmqpyqveivtyr94hsak3pdo0wbcqztlsdsahkqayunnrgbozjii+uypksjahnv4/oerl7y8+rmr7irx1+qu5m02me3ri1vg2gf2mxlz8leix6nc+ey1cquwdumwoliegxjoxdzhdmlbyhrmolqqw0z2risw1vy554z80lu/nhea/bfibiyaxdz6j0zojnd1cryf9rq0eqwlquuhkewn4umtnhketunppygbamfbycm9h/lygygbplprlwpcxco1jop505lky9xg1u0jwoetegzsf11vx6usvgvoc0/ldmz/vv92krufjbtsfb1qv+zpthy9gujnse/dqzvulhyjc7fto/fr1azw56yz6zdivaohbnp7iz1mwdf2q/2k/0kxgsfg14wuhengquyj2znguqpro <latexit sha1_base64="np4otb/lzbrm8sy8kda7c6pm1ec=">aaaczxicbvfdaxnbfj2sx7v+pfroy2aqurfhvwr9eyovfcxyswkd2xs5ozubdj2zxwbu2ir18+q/8n/47qv+bmfticbxwsdhnpsx99y0lmjigh5vbveuxrt+y+vm9q3bd+7ea+/cp7gfm4z3wselm0jbcik076nayqel4absyu/t84ngp/3mjrwfpszzyuckxlrkggf6kmkp3iexapwyvwl+udny8hzbmokc/lukmdv0j8ygywkkod/oxghudx5bp5ikx4x1wxw8f/lq7rkk0wtpnittttglf0e3qbqehbkmo2snnyqzgjnfntij1g6jsmrrbqyfk7zejp3ljbbzgpohhxout6nqyufnh3smo3lh/nnif+y/fruoa2cq9znnunzda8j/auoh+ctrjxtpkgt2osh3kmjbgz9pjgxnkgceadpc/5wycrhg6f1fmbloxxk2skk1dvqwiunrrmqpgvck5aha6gar6q2qkn4cbemhge/wj+rbnnl3jrglte8p/wn17kayp0i0bv8mohnwi8je9pf5z//18jrb5cf5rlokii/ipnlhjkifmpkn/ca/ya/gq+ccl8h8mjvolwsekjuivv4gvjrkjq==</latexit> Policy Gradient for LQR minimize s.t. E h 1 T P T t=1 x t Qx t + u t Ru t i x t+1 = Ax t + Bu t + e t Greedy strategy : Build control ut = Kxt Sample a bunch of random vectors: t N (0, 2 I) Collect samples from control u t = Kx t + t : = {x 1,...,x T } Compute cost: C( ) = TX t=1 Update: K new K old t C( ) x t Qx t + u t Ru t T 1 X t=0 t x t policy gradient only has access to 0-th order information!!!
<latexit sha1_base64="o4j6otjoeqzpjitxt8gszcrluky=">aaadnxicbvjnixnbej0zv9b4ldwjl8zgsnglzerqkjwfvfsqw4qb7ei6dj2dmqtz7p6hu2zjhoz3efvvepamxv0l9iqjmsschur3xlv1vxwcswgx3//mb9eu37h5a+92487de/cfnpcfjmyagw5dnsruxmtmghqahihqwkvmgklywnl8evlx51dgrej1gs4zmcg20yirnkgdouzxgsnm6iizw5zliwxzocpof4uswijxgursjlqxnmdx8bamcojwkkykzqykehicu5urqmcjspx0rk4i7cycio+ws42yzxfcaz3r9rbxzvs49ufykios/fufqhvbg23ilo4inbmdiszdxterznaoya7whbsncnpapzlqtvq9/srirhpwtsur7tta9yd0mvjcguyumbxjsj/hxkvdwsw4/nmlgeoxbazj52qmwe6k1ahl8tqhu5kkxh2nzix+g1ewze1sxu5zdc1ucxx4p26cy/jyugid5qiarwslussykmpvzcomcjrl5zbuhhsr4xnmgee33y0qq9wz8i1oikwubu+nsivkxkbhdrsaiglddvw8e1ksj0xbmqhw+id1asu680bmbnrdgftcursjd <latexit sha1_base64="6+d+t3hyrysuqvej1zizpuxu/ba=">aaada3icbvjna9wwejxdryt92qs39ik6lxhpstih0fxsaimkhxxs2k0ca8dotbjxrjknna5zhi+99o/0vnrtd+n/6a+ovn6g7g4hbi/3nmy0mxqvghsiw9+ef+fuvfsp1ty3hj56/orpz3pr1bsvpmxac1ho8xexthdfbsbbspnsmyjhgp2nlg8b/eykacml9qwmjuskyrxpocxgqltzbf3owgayvwdxivlotumromhcgptwpo4ztainaitrhjtkppbvr/wfbpnd4mam3zid55lgnk3ke3gh3zbsrbpqu9ty9q2zdrphp5wfxgxrhhtrpe7sts+jxwwtjfnabtfmgiuljnav5vsweioudcsjvsq5gzqoigqmsbpb1fi1y8y4k7q7cvcmvx3demnmvi6cuxkymgwtif+ndsvi9hllvvkbu7qtlfucq4gbleax14ycmdpaqoburzhoibsquf0tvjnllhld6mrev4rtysywwahxoikjdqnjugq6skdccpyzkiopet6bf6pl28jbb55zmnvh7koo3orzlsrahv8qon3tr2e/+vs2e/b+vpo19ak9ragk0dt0gd6iezrafp3xnntd75x/1f/u//b/tlbfm995hhbc//uxpln0yg==</latexit> <latexit sha1_base64="eh+lm1eg20caiqwbiclxhabyf/4=">aaacxnicbvhlattafb2rrzr9oe2ym6gmiimxuii0m5racskii4tgscbsxdx4wh4ygomzq9rggppx/zyuum1/oynhhdruhweo59z3tusllqxbj4537/6dh492hu8+efrs+yvu3stlw1rg4eguqjdxkvhuuuoijcm8lg1cniq8sm+ogv3qfo2vhb6grylxdpmwuymahjv0z4/9qmgxg0f0c4zmsndnbzxsocu/slwe1hqqlr9c8cn/ntcgsqjpiyozwfnrsbu45yljzzunn3r7wtbygd8gyqt6rlwzzk8tr5ncvdlqegqshydbsxht2pfc4xi3qiywig4gw7gdgnk0cb2afcnfombcp4vxtxnfsf9g1jbbu8ht55kdzeym1pd/08yvtt/etdrlrajfxafpptgvvfkkn0idgttcarbgul65miebqw7da1vwuusua5pu80pluuxwg1u0jwooteg5sn1mvr9lpfhn0jafnpv/q7q0jex/kpkkozh1n9x9lwd3khbz/dvgcn8ybspw/f3v8gn7mh32mr1hpgvze3bittgzgzhbvrof7bf77z142qu8r3euxqenecxwzpv2b2p33yw=</latexit> (μ,λ)-evolution Bandit <latexit sha1_base64="kdgzh6hio9nc9jpwvim6pvpovhi=">aaacnhicbvftaxnben5cfwnrs9p6uzdficzqwp0i9ktliqufk1rs0kjyhnobsbj0b+/yns0jr36cv8av9of4b9xli5jegywh55mzz2cmlzs0fia/a8hwg4ephm/v7d55+uz5xn3/ogdzzwr2ra5yc52crsu1dkmswuvcigspwqv0plppv7dormz1jc0kjdmyazmsashtsf1tpzm4bumtjgjxyz6wlktkoo7m3y95pzln6nal1ojjvrg2w0xwtratqymt4ylzr8wdys5chpqeamv7uvhqxhovkrtodwfoyghibsby91bdhjyufxpn+rvpdpkon/5p4gv234osmmtnweozm <latexit sha1_base64="hbpv3rsxivbvxbvtlqyeh9u0gig=">aaack3icbvftaxnben5c1db6llb8jmhiefio4u4klqhstkbclyqmlesomlezxjbuy7e7jw1hvvlr/kp/xn/jxhrbja4mpdzpve9ekukpjn+3oo1bt+9sbt3dvnf/wcnh7z3dc28rj7avrllumgepshrskysfl6vd0lnci/zqbanffepnptvfavpipqewciwfu Random Search h PT i minimize E et,! t=1 C t(x t, u t ) s.t. x t+1 = f t (x t, u t, e t ) u t = ( t ; # +!) Direct Policy Search! TX G(!, #) = C(x t, u t ) r log p(!) t=1 parameter perturbation G (m) (!, #) = 1 m mx i=1 C(# +! i ) C(#! i ) 2! i TX C(#) = C(x t, u t ) t=1! i N (0, I) random finite difference approximation to the gradient aka Strategies SPSA Convex Opt
<latexit sha1_base64="j4lebcdojzuwdwucyrzfilabtyq=">aaadhhicbvlfb9mwehbcr1eydpdii0ufggxvcsdba5xgamhdhjzot0lnfjmu01qznci+obyr/wqv/co8iv6r+g+wuydrjpoiu3zf3dl3n/nkcanr9dsil12+cvxaxvxojzubt253t+4cmblwli1okup9khpdbfdsbbweo6k0izix7dg/e+p5489mg16qiswqlkoyvbzglicdsu63jgdtrizrmiwak0ttswrezq3kikv+htx4iu4kgvme23dnilgb46tqhnq4scmgj6awmyvb3jwo8tyd08f40hu8g+vl30fve82nm0itpo1u+td3neeudcdu8ac/bov2flrzlinowtskvvnw7ux9agn4yhc3qq+1dpbtbwkykwktmqiqidhjokogde2au8hcmlvhfafnzmrgllremppa5uyb/mahe1yu2n0k8bl9t8isacxc5i7t78ascx78hzeuoxizwq6qgpii5wcvtcbqyi8pnndnkiifcwjv3n0v0xlx6wyn4sopy94voyut2hmtoc0nba0vmadnhggysmkvn8q+50lgt0qzvo/v+cu6tp7efsunhmytffds1kmlyu6qeh39f4ojp/046sehz3u7e600g+geuo+2uyxeof30ar2gealbzvasebumwq/h9/bh+pm8nqzamrtoxcjffwdk/f3k</latexit> <latexit sha1_base64="gbej/5jhfqpxmesx5rjmy6uj8yg=">aaacnhicbvhtahnbfj2sx7v+pfptkmegpqbhv4t2z9gcyipungkhu4s7k7vj0nmzzeaoniztg/g0/tuh8w2ctsoyxasdh3pux9x78kpjr3h8uxvdu37j5q2t29t37t67/6c983dojlccb8ioy09zckikxgfjunhawyqyv3isn71t9jovaj00+gvnk8xkmgpzsaeuqhh7ez+ncgsca8033ucveqqqmsgylj90uwk/e8lt7fm43yl78sl4jkiwomowctzeawxpxahfoiahwllreleu1wbjcoux26l3wie4gymoatrqosvqxuyx/flgjrwwnjxnfmh+w1fd6dy8zenmctrz61pd/k8besr2s1rqyhnqctwo8iqt4c15+erafktmaycw <latexit sha1_base64="b7badqzs7+mpsxwxgaronsg5mmk=">aaacmhicbvhtahnbfj2sx7v+tfpp/wwgiyusdougp4mktvckfqqtzndyd3kzuxrmdpm5kw1lhscn8a8+im/jbbrbjf4yojxz5n7mlsbpcfy7e924eev2na272/fup3j4agf38akva6dwqepduvmcpgqyogrijeevqzc5xrp88m2rn31d56m0x3hwywagsdqhbryoi51uamuzejiyncbtbbr5mo/f+y1xgph6ii/2givux4uqmybzgq5yxsnfbidlx6wqdvpwgrwfjxhfwqooswmcb6e1xwr <latexit sha1_base64="kwygz+w7cnkjjvpxrgavefkydqs=">aaachnicbvfdaxnbfj1sty1ra2iffrkmqoisdkvpxwqhfrtsq0stfpil3j3cjenmz5ezozvh6u/xvx+t/8bznijjvdbw5pz7fzncsuth+lsshdx4ehhufvr7/otp8bn64/nazs4i7itmzeymaytkauytjiu3uufie4xxyeky1k9v0viz6w+0yjfoyablvaogt43rdtcmfs5bn1+ptgsv/wdcb4adcg18h0qb0gqb640blxg0yyrluznqyo0wcnokczakhck72shzzeesyizddzwkaoni3fsdf+wzc Random Search for LQR minimize s.t. E h 1 T P T t=1 x t Qx t + u t Ru t i x t+1 = Ax t + Bu t + e t Greedy strategy : Build control ut = Kxt Sample a random perturbation: N (0, 2 I) Collect samples from control u t =(K + )x t : = {x 1,...,x T } Compute cost: J( )= T t=1 x t Qx t + u t Ru t Update: K K t J( )
Reinforce is NOT Magic
What is the variance? It necessarily becomes derivative-free, since you only access the decision variable by sampling. The approximation can be far off. But it's certainly super easy!
Deep Reinforcement Learning
Simply parameterize the Q-function or the policy as a deep net.
Note: ADP is tricky to analyze with function approximation. Policy search is considerably more straightforward: make the log-probability a deep net.
<latexit sha1_base64="mr94ezqth17vzwjopx3thjsdtck=">aaacg3icbvfbaxnred5ztdz6aaqpvhwmqousdktbfsguffshdxwnlsrlmd2zjeppztlntiqu+sw+6o/y33g2jwasbwy+vm/uu5saaqfp71zy5+69nfu7d/yepnr8zl998prbcjvx2fnoo39vqebnfntm <latexit sha1_base64="c3vasflhsyp2zenoal07qvhm1sk=">aaaci3icbvfdsxtbfj1sbwttq7e++ncxoaeqoyrdeaqlgvhbffdbulofzf3utm6swznzzeaujcz5nx1tf5d/xtmyqpp0wsdhnpsx9540v9jrgd7ugmcrz1+8xh219vrn2/wn+ua7ny4r <latexit sha1_base64="oi5ov9kcoehyn9bwcwwjat8txu4=">aaac4hicbvhlihnbfk1ux2p7yujstwfqrkyyxslmufagfhqxixgnm5buqnx1taeyquqm6ryknp0brsstn+xop3fpdrjlknih4hdouy+6n6uudbjhv4lwytvr12/s3ixu3b5z915v9/5nv9zwwfcuqrtngxegpiehslrwxlngolnwll286fszl2cdlm0nnfeqal4yozgco6fgptubn7jpwvqkjhku0jsz5mjlri0yfuiztzioxgiw+t+xy2rppo02k+ikyqfa85fdto7suttu9enbvai6ddgk9mkqtse7qzrkpag1gbskozdicyvpwy1koacnktpbxcufl2dkoeeaxnos1tlsx57j6as0/hmkc/zyrso1c3odeacfdoo2ty78nzaqcxkuntjunyiry0atwlesabdjmkslatxcay6s9lnsmewwc/sxwouyqf2bwptjm6unfguog6zcgvrusqeouttdr5p3uin6krtht2qxxb+ql9vje29lide9o/hnnk+3zp4gbhp922d4fpbywd686b+/xl <latexit sha1_base64="+ojsm2yurvuosb1y5f0dyh5w9ea=">aaaco3icbvhbbtnaen2ywym3fb55wrgbioscjzbahkcvqikhpbroakxyisbritpqem3tjqsek5/b1/akh8hfse4digkjrfbonllpwmlyhia/osgvq9eu39i5uxvr9p2 Simplest Example: LQR minimize TX (x t ) 2 1 t=0 subject to x t+1 = x t = +ru 2 t apple 1 1 0 1 x t + apple zt v t apple 0 1/m u t samples nominal control and ADP with 10 samples
<latexit sha1_base64="mr94ezqth17vzwjopx3thjsdtck=">aaacg3icbvfbaxnred5ztdz6aaqpvhwmqousdktbfsguffshdxwnlsrlmd2zjeppztlntiqu+sw+6o/y33g2jwasbwy+vm/uu5saaqfp71zy5+69nfu7d/yepnr8zl998prbcjvx2fnoo39vqebnfntm <latexit sha1_base64="c3vasflhsyp2zenoal07qvhm1sk=">aaaci3icbvfdsxtbfj1sbwttq7e++ncxoaeqoyrdeaqlgvhbffdbulofzf3utm6swznzzeaujcz5nx1tf5d/xtmyqpp0wsdhnpsx9540v9jrgd7ugmcrz1+8xh219vrn2/wn+ua7ny4r <latexit sha1_base64="oi5ov9kcoehyn9bwcwwjat8txu4=">aaac4hicbvhlihnbfk1ux2p7yujstwfqrkyyxslmufagfhqxixgnm5buqnx1taeyquqm6ryknp0brsstn+xop3fpdrjlknih4hdouy+6n6uudbjhv4lwytvr12/s3ixu3b5z915v9/5nv9zwwfcuqrtngxegpiehslrwxlngolnwll286fszl2cdlm0nnfeqal4yozgco6fgptubn7jpwvqkjhku0jsz5mjlri0yfuiztzioxgiw+t+xy2rppo02k+ikyqfa85fdto7suttu9enbvai6ddgk9mkqtse7qzrkpag1gbskozdicyvpwy1koacnktpbxcufl2dkoeeaxnos1tlsx57j6as0/hmkc/zyrso1c3odeacfdoo2ty78nzaqcxkuntjunyiry0atwlesabdjmkslatxcay6s9lnsmewwc/sxwouyqf2bwptjm6unfguog6zcgvrusqeouttdr5p3uin6krtht2qxxb+ql9vje29lide9o/hnnk+3zp4gbhp922d4fpbywd686b+/xl <latexit sha1_base64="+ojsm2yurvuosb1y5f0dyh5w9ea=">aaaco3icbvhbbtnaen2ywym3fb55wrgbioscjzbahkcvqikhpbroakxyisbritpqem3tjqsek5/b1/akh8hfse4digkjrfbonllpwmlyhia/osgvq9eu39i5uxvr9p2 Simplest Example: LQR minimize TX (x t ) 2 1 t=0 subject to x t+1 = x t = +ru 2 t apple 1 1 0 1 x t + apple zt v t apple 0 1/m u t samples nominal control and ADP with 10 samples
Extraordinary Claims Require Extraordinary Evidence*
* only if your prior is correct
How can we dismiss an entire field which claims such success?
blog.openai.com/openai-baselines-dqn/: Reinforcement learning results are tricky to reproduce: performance is very noisy, algorithms have many moving parts which allow for subtle bugs, and many papers don't report all the required tricks. RL algorithms are challenging to implement correctly; good results typically only come after fixing many seemingly-trivial bugs.
arXiv:1709.06560
[Plots: average return vs. timesteps for TRPO on HalfCheetah-v1, comparing different random seeds and different codebases (Schulman 2015, Schulman 2017, Duan 2016), each against a random-policy average over 5 runs.]
There has to be a better way!
Coarse-ID control
[Block diagram: plant G with input u and output y, in feedback with controller K.]
Coarse-ID control
[Block diagram: estimated plant Ĝ with model-error block Δ (signals w, v), in feedback with controller K.]
High-dimensional stats bounds the error. The coarse-grained model is trivial to fit. Design robust control for the feedback loop.
<latexit sha1_base64="6bzqpf3vhyc8ozgggdlvwhkvsy4=">aaac2nicbvfnbxmxehwwrxk+ujhysyha5aprbouehjcqgashchvbakxsduv1jsmotndlz6ke7z44ia78lp4af4mrhpcmqsini1l+em9mpj6xfqodhegpvndu/iwllzyut69cvxb9rmfz5gexl1bcqoyqt4ezckdqwicqfbwwfotofbxkxy8a/eajwie5eu/zahitjgbhkav5ku2m4wwmacphrzjxlvj1o84lsijya4sgb5vggxo/qr0rf5nlq37n7/hyltqt8hluh73h8um/tje/nkuh3+azfooto512dga0bjx2umevxarfb9esdnky9tpnvhkpcllqmcsvcg4yhqulvh2hvocnlb0uqh6lcqw9bgz1sbvysm3vembex7n1xxbfsp9wvei7n9ezz9scpu6s1pd/04yljz8mfzqijddy9kfxqtjlvnkuh6efswrugzaw/axctouvkrwhk68sehcgv35szuqdmh/bgvbrjkzwpapsak3zq+ovksxfcep4hk6m9ff1brt56yvovfup9rzr5v5asjckorv+dtdy6t3rrw8fd3f7s2c22g12h22xid1hu+w122cdjtl39pp9yr+djpgcfam+nqygrwxnlbyswbc/ks3qpw==</latexit> Coarse-ID control (static case) minimize u x Qx subject to x = Bu + x 0 B unknown! Collect data: {(x i, u i )} x i = Bu i + x 0 + e i Estimate B: minimize B P N i=1 kbu i + x 0 x i k 2 Guarantee: kb ˆBk apple with high probability ˆB Note: x = ˆBu + x 0 + B u Robust optimization problem: minimize sup k u B kapple kq 1/2 (x Bu)k subject to x = ˆBu + x 0
<latexit sha1_base64="6bzqpf3vhyc8ozgggdlvwhkvsy4=">aaac2nicbvfnbxmxehwwrxk+ujhysyha5aprbouehjcqgashchvbakxsduv1jsmotndlz6ke7z44ia78lp4af4mrhpcmqsini1l+em9mpj6xfqodhegpvndu/iwllzyut69cvxb9rmfz5gexl1bcqoyqt4ezckdqwicqfbwwfotofbxkxy8a/eajwie5eu/zahitjgbhkav5ku2m4wwmacphrzjxlvj1o84lsijya4sgb5vggxo/qr0rf5nlq37n7/hyltqt8hluh73h8um/tje/nkuh3+azfooto512dga0bjx2umevxarfb9esdnky9tpnvhkpcllqmcsvcg4yhqulvh2hvocnlb0uqh6lcqw9bgz1sbvysm3vembex7n1xxbfsp9wvei7n9ezz9scpu6s1pd/04yljz8mfzqijddy9kfxqtjlvnkuh6efswrugzaw/axctouvkrwhk68sehcgv35szuqdmh/bgvbrjkzwpapsak3zq+ovksxfcep4hk6m9ff1brt56yvovfup9rzr5v5asjckorv+dtdy6t3rrw8fd3f7s2c22g12h22xid1hu+w122cdjtl39pp9yr+djpgcfam+nqygrwxnlbyswbc/ks3qpw==</latexit> Coarse-ID control (static case) minimize u x Qx subject to x = Bu + x 0 B unknown! Collect data: {(x i, u i )} x i = Bu i + x 0 + e i Estimate B: minimize B P N i=1 kbu i + x 0 x i k 2 Guarantee: kb ˆBk apple with high probability ˆB Solve robust optimization problem: minimize sup k u B kapple kq 1/2 (x Bu)k subject to x = ˆBu + x 0 Relaxation: (Triangle inequality!) minimize kq 1/2 xk + kuk u subject to x = ˆBu + x 0
Coarse-ID control (static case) minimize u x Qx subject to x = Bu + x 0 B unknown! Collect data: {(x i, u i )} x i = Bu i + x 0 + e i Estimate B: Guarantee: P N minimize B i=1 kbu i x i k 2 kb ˆBk apple with high probability ˆB Relaxation: (Triangle inequality!) minimize kq 1/2 xk + kuk u subject to x = ˆBu + x 0 Generalization bound cost(û) apple cost(u? )+4 ku? kkq 1/2 x? k + 4 2 2 ku? k 2
<latexit sha1_base64="6bzqpf3vhyc8ozgggdlvwhkvsy4=">aaac2nicbvfnbxmxehwwrxk+ujhysyha5aprbouehjcqgashchvbakxsduv1jsmotndlz6ke7z44ia78lp4af4mrhpcmqsini1l+em9mpj6xfqodhegpvndu/iwllzyut69cvxb9rmfz5gexl1bcqoyqt4ezckdqwicqfbwwfotofbxkxy8a/eajwie5eu/zahitjgbhkav5ku2m4wwmacphrzjxlvj1o84lsijya4sgb5vggxo/qr0rf5nlq37n7/hyltqt8hluh73h8um/tje/nkuh3+azfooto512dga0bjx2umevxarfb9esdnky9tpnvhkpcllqmcsvcg4yhqulvh2hvocnlb0uqh6lcqw9bgz1sbvysm3vembex7n1xxbfsp9wvei7n9ezz9scpu6s1pd/04yljz8mfzqijddy9kfxqtjlvnkuh6efswrugzaw/axctouvkrwhk68sehcgv35szuqdmh/bgvbrjkzwpapsak3zq+ovksxfcep4hk6m9ff1brt56yvovfup9rzr5v5asjckorv+dtdy6t3rrw8fd3f7s2c22g12h22xid1hu+w122cdjtl39pp9yr+djpgcfam+nqygrwxnlbyswbc/ks3qpw==</latexit> Coarse-ID control (static case) minimize u x Qx subject to x = Bu + x 0 B unknown! Collect data: {(x i, u i )} x i = Bu i + x 0 + e i Estimate B: minimize B P N i=1 kbu i + x 0 x i k 2 Guarantee: kb ˆBk apple with high probability ˆB Relaxation: (Triangle inequality!) minimize kq 1/2 xk + kuk u subject to x = ˆBu + x 0 Generalization bound cost(û) =cost(u? )+O( )
Coarse-ID optimization
minimize_x f(x) subject to g(x; ϑ) ≤ 0, with ϑ unknown!
Collect data: {(x_i, g(x_i; ϑ))}
Estimate ϑ: ϑ̂. Guarantee: dist(ϑ, ϑ̂) ≤ ε with high probability
Relaxation: minimize_x f̂_robust(x) subject to g(x; ϑ̂) ≤ 0
Generalization bound: f(x̂) ≤ f(x*) + err(ε, ||f - f̂_robust||, x*, ϑ)
<latexit sha1_base64="euyqlm8oqoqnwpvqldlbjjbjanm=">aaadnxicbvjlbxmxepyurxietehixskikhrfuwgjlpvkacehhxastfk8jbyon7fqe1f2lcry+7u48jc4cenc+qt400uicsnzm/7m5znpasgfhsj6horxrl67fmprzuvw7tt3t9s794y2lw3ja5bl3jyl1hipnb+aamnpcsopsiu/ts9e1/7tt9xykes+laqekdrvihomgofg7w8k5vohhtwglionzduiks3ntgktlpjck7ylirrq7preiokmfgt+mqidwaiiisistd3bikiewyhkhjixv65fywjlnwqhcxxex/mxnd/bj7xg+7hc3j7u+rjmqkjt1nahw7ec+9t9umih+fwtdfshe83h0cjct5onj9udqbstbw8acwn0ucph450gizoclypryjjao4qjahjfdgst3m9fwl5qdkgnforntrw3ivuuuskppdlbww780ycx6l8zjiprfyr1kfvu7lqvbv/ng5wqvuyc0eujxlplrlkpmes45g1phoem5miblbnh34rzjpp1g2d3pcuydshzyiruxmrb8glfqyxmwvapwg6kelb9vo6dkbj/pnrixs3ox68vw7v33oipapu057+qfrwr7amj19e/aqyfdeoog5887xwendrsoqfoidpdmxqbdtf7diwgiaw7qs8ybmpwa/gj/bn+ugwngybnplqr8pcfxx4jrw==</latexit> <latexit sha1_base64="uo+1skyh9ob/xn5keqmafdwhdva=">aaac03icbvfdixmxfe3hr3x92k4++hisqotsokxqfvtu0ieck253f5pustn3pmgtzgxyr7beerff/vn+ch+dr/pupq1gwy8edufcj9xzp4wsdnu9h43o2vubn2/t3n69c/fe/b3m/omtl5dwwfdkkrdnu+5asqndlkjgrlda9vtb6ftida2ffglrzg6ocv7awppmyfqkjogancfhlinlylcqbpz7iilisc1sy4vntmaan/dpo3ladcrpvoib8ilnmupmaq+lqao2gyxp0wfqoklyc96vklmym2fn0mz1ur1f0g0qr0clrojost8ysyqxpqadqnhnrngvwlhnfqvquo2y0khbxqxpybsg4rrc2c+cqoitwcq0zw14bumc/bfcc+3cxe9dzr2d29rq8n/aqmt05dhlu5qiriwhpawimnpavppicwlvpaaurax/pwlgg4syzf+bsuhdgfjbx Simple Example: LQR minimize s.t. lim T!1 E h Gaussian noise Obvious strategy : Estimate (A,B), build control ut = Kxt 1 T x t+1 = Ax t + Bu t + e t P T t=1 x t Qx t + u t Ru t i Run an experiment for T steps with random input. Then minimize (A,B) P T i=1 kx i+1 Ax i Bu i k 2 If T Õ 2 (d + p) min( c ) 2 where c = A c A + BB controllability Gramian then A Â B ˆB and w.h.p. [Dean, Mania, Matni, R.,Tu, 2017] [Mania, R., Simchowitz, Tu, 2018]
<latexit sha1_base64="qybpfo5ltn9luocv7dk5zoe+lxe=">aaadhhicbvjnbxmxepuuxyvqsohixsjc2qhpuxuq4acoupgk2hykanpkcro5xm9i1fzubs9q5pqvcowpcenckfg3ejnuahjgsvt03njgm8/dgjnt4vhven65e+/+g7whtuep1588rw88o9f5qqjtkpzn6myinevm0q5hhtozqleshpyedi/2kv30g1wa5flytaraf3gkwcyinp4a1h+gtgfidyi0xsyeuubwwqbpg5wznwaiti/hhmqhvg3tyyhwwckbzvgjs7hzxs5mhlhvmt6sbjporagzo0nz3g4lo20hnyg6ppyvrtednljsndbngtkxylg4ew7sbctwettkn4ums8fodeqnedueblwfyrw0wdyobhtbh6u5kqwvhncsds+jc9o3wblgohu1vgpayhkbr7tnocsc6r6dbttbv55jyzyrf6sbu/b2dyuf1hmx9jnvuhpzq8j/ab3szo/6lsminfsswaos5ndkslihpkxryvjea0wu82+fziz9tow3cahlthzbycik9qqujoqpxwk5utike1jtizct1vr2n3eov2kpyafy4eb1zss5+srgzohwx/8u2vxj9oyky+tfbsft7stetr68aex+nfuzbl6alyaccxgldsfncas6gatrwevgffah/b7+dh+fv2epytc/8xwsrpjnh9uaaes=</latexit> Simple Example: LQR Obvious strategy : Estimate (Â, ˆB), build control u t = ˆKx t minimize u P sup lim 1 T k A k 2 apple A, k B k 2 apple B T!1 T t=1 x t Qx t + u t Ru t s.t. x t+1 =(Â + A)x t +(ˆB + B )u t Solving an SDP relaxation of this robust control problem yields J(ˆK) J? apple C cl J? r 2 min( c ) 1/2 (d + p) + kk? k 2 T w.h.p. c = A c A + BB controllability Gramian cl := k(zi A BK? ) 1 k H1 closed loop gain This also tells you when your cost is finite! [Dean, Mania, Matni, R., Tu 2017]
Why robust?
x_{t+1} = [1.01 0.01 0; 0.01 1.01 0.01; 0 0.01 1.01] x_t + [1 0 0; 0 1 0; 0 0 1] u_t + e_t
Slightly unstable system; system ID tends to think some nodes are stable.
The least-squares estimate may yield an unstable controller; robust synthesis yields a stable controller.
Model-free performs worse than model-based
Why has no one done this before?
Our guarantees for least-squares estimation required some heavy machinery; indeed, the best bounds build on papers from the last few years.
Our SDP relaxation uses brand-new techniques in controller parameterization (Matni et al.).
Naive robust synthesis is nonconvex and requires solving very large SDPs.
The Singularity has arrived!
<latexit sha1_base64="eumgloqutsxkdwjk4wwvpysqp+q=">aaadknicbvjnaxsxen3dfqxur5z2miuokayj4+yaqnsqksmk2d6kte4clmnkrwylsnqnnftilp1hvfap9bz67q+p1nygtjsgelw3mthm0ygt3eau3frbg4ephj/zelp59vzfy+3qzqttk+aash5nrarpr8qwwrxraqfbzjpnibwjdja6pcr1sx9mg56q7zdl2ecsiejjtgk4alj9hceaunso8zsa7rt1/fyqgyc6shcaycgu0bfu4eyfhxmpydbiswcqpawikjw8hhal1zs5j0muiile3qvn6xd2pz5ofwgp4zvoojs+gbaw5pmp1cvyxgmwapecbphekotwmoxldszssrgf4iqjipxcurldai1qrvnamybegpq3jjphjj/asupzyrrqqyzpx1ega0s0ccpyucg5yrmhl2tc+g4qipkz2pl+c/twmqkap9odbwjo3r9hitrmjkcusxzergsl+t+tn8p4w8bylexaff00guccqypks1dcnamgzg4qqrl7k6jt4jyeztkvlvpagamrk9jrxhgajmynfxanmjjsmjceq3iqe8yfqn+imqhbonknurklhh7mew6m0xx/rtu3kp0h8fr6n8fpqxlhzfjru9rhx6u1w96u98ylvdh77x16x7wtr+drf9f/5lf9tvaz+b3cbn8wqyg/vppaw4ng7z9wcawm</latexit> where Even LQR is not simple!!! minimize J := t=1 x T t Qx t + u T t Ru t s.t. J(ˆK) J? apple C cl J? c = A c A + BB controllability Gramian x t+1 = Ax t + Bu t + e t Gaussian noise r 2 min( c ) 1/2 (d + p) log(1/ ) + kk? k 2 n cl := k(zi A BK? ) 1 k H1 closed loop gain Hard to estimate Control insensitive to mismatch Easy to estimate Control very sensitive to mismatch 50 papers on Cosma Shalizi s blog say otherwise! Need to fix learning theory for time series.
<latexit sha1_base64="euyqlm8oqoqnwpvqldlbjjbjanm=">aaadnxicbvjlbxmxepyurxietehixskikhrfuwgjlpvkacehhxastfk8jbyon7fqe1f2lcry+7u48jc4cenc+qt400uicsnzm/7m5znpasgfhsj6horxrl67fmprzuvw7tt3t9s794y2lw3ja5bl3jyl1hipnb+aamnpcsopsiu/ts9e1/7tt9xykes+laqekdrvihomgofg7w8k5vohhtwglionzduiks3ntgktlpjck7ylirrq7preiokmfgt+mqidwaiiisistd3bikiewyhkhjixv65fywjlnwqhcxxex/mxnd/bj7xg+7hc3j7u+rjmqkjt1nahw7ec+9t9umih+fwtdfshe83h0cjct5onj9udqbstbw8acwn0ucph450gizoclypryjjao4qjahjfdgst3m9fwl5qdkgnforntrw3ivuuuskppdlbww780ycx6l8zjiprfyr1kfvu7lqvbv/ng5wqvuyc0eujxlplrlkpmes45g1phoem5miblbnh34rzjpp1g2d3pcuydshzyiruxmrb8glfqyxmwvapwg6kelb9vo6dkbj/pnrixs3ox68vw7v33oipapu057+qfrwr7amj19e/aqyfdeoog5887xwendrsoqfoidpdmxqbdtf7diwgiaw7qs8ybmpwa/gj/bn+ugwngybnplqr8pcfxx4jrw==</latexit> Simplest Example: LQR minimize s.t. lim T!1 E How many samples are needed for near optimal control? Lots of asymptotic work in the 80s (adaptive control) Fietcher, 1997: PAC, discounted costs, many assumptions on contractivity, bugs in proof. Abbas-Yadkori & Szepesvári, 2011: Regret, exponential in dimension, no guarantee on parameter convergence, NP-hard subroutine. Ibrahimi et al, 2012: require sparseness in state transitions Ouyang et al, 2017: Bayesian setting, unrealistic assumptions, not implementable Abeille and Lazaric, 2015: Suboptimal bounds h 1 T x t+1 = Ax t + Bu t + e t P T t=1 x t Qx t + u t Ru t i
The Linearization Principle
If a machine learning algorithm does crazy things when restricted to linear models, it's going to do crazy things on complex nonlinear models too.
What happens when we return to nonlinear models?
<latexit sha1_base64="geqtxxms1vozi4hkkogwziugzue=">aaadzxicbvndb9mwfm1apkb42ucrf4ufizdndppm6dtgd0xcqnlwnmmpjsd1vquxexyn28jck/+pn975ithxgnjxpsjh555z73xsrfnccgxhj5vo99bto3dx79n3hzx89hht/cmnpc0kososjqk8ixboeybosdgv0jnmusyjhb5hs706fzynmmepofjxgr1zfc5yzahwmjpbx/kzrvscivlhqeiwrepcrgkhlb25cujejirpjguxzu+nqrimoxi8w5emfxxgxrmfuydpbzatqgjhaiftehq7vprsrxf5hpnzbxspdn1wulh6/2zlq5cygcwlznhb8moweavfgx14wtincmuotndp+x3hvh3n79pbpetupv00y4y2sq7g4lckgragwou1pxbqpu7ivsmlck/bbv6o6zzsd3ng7ptu4/eccgvpmu5mvdyneoofrtr3e54bgb9jqh/g1fjxqoe6agp5zmrph0bkbahmnd7urd1pwbfi2cqu9ye0e3hec4jmjk0errnwafrm/hzq2dog3izngjsatwddamn4tvy9nksk4fqokua8p0uwu+mss30delrzyzhtdjozpv1tdqxmnb+xzxwswavnteccsv3ojtbsoqpepm+veksvhktpvpyryf/ltgsvb+osiaxqvbdtkc4sofjq320wyzislvxpgilkelzaplhiovqfyoupgja3fbom3o3bnvrobuy+br/gqvxmem69tjc1y+1a+9bqglmk87bzufolu3ah3xn3uvvvsdsrreep9u90v/0cyxirpg==</latexit> Random search of linear policies outperforms Deep Reinforcement Learning Larger is better 365 366 365 131 3909 3651 3810 3668 6722 4149 6620 4800 11389 5234 5867 5594 5146 4607 4816 5007 11600 6440 6849 6482
<latexit sha1_base64="geqtxxms1vozi4hkkogwziugzue=">aaadzxicbvndb9mwfm1apkb42ucrf4ufizdndppm6dtgd0xcqnlwnmmpjsd1vquxexyn28jck/+pn975ithxgnjxpsjh555z73xsrfnccgxhj5vo99bto3dx79n3hzx89hht/cmnpc0kososjqk8ixboeybosdgv0jnmusyjhb5hs706fzynmmepofjxgr1zfc5yzahwmjpbx/kzrvscivlhqeiwrepcrgkhlb25cujejirpjguxzu+nqrimoxi8w5emfxxgxrmfuydpbzatqgjhaiftehq7vprsrxf5hpnzbxspdn1wulh6/2zlq5cygcwlznhb8moweavfgx14wtincmuotndp+x3hvh3n79pbpetupv00y4y2sq7g4lckgragwou1pxbqpu7ivsmlck/bbv6o6zzsd3ng7ptu4/eccgvpmu5mvdyneoofrtr3e54bgb9jqh/g1fjxqoe6agp5zmrph0bkbahmnd7urd1pwbfi2cqu9ye0e3hec4jmjk0errnwafrm/hzq2dog3izngjsatwddamn4tvy9nksk4fqokua8p0uwu+mss30delrzyzhtdjozpv1tdqxmnb+xzxwswavnteccsv3ojtbsoqpepm+veksvhktpvpyryf/ltgsvb+osiaxqvbdtkc4sofjq320wyzislvxpgilkelzaplhiovqfyoupgja3fbom3o3bnvrobuy+br/gqvxmem69tjc1y+1a+9bqglmk87bzufolu3ah3xn3uvvvsdsrreep9u90v/0cyxirpg==</latexit> Larger is better 365 366 365 131 3909 3651 3810 3668 6722 4149 6620 4800 11389 5234 5867 5594 5146 4607 4816 5007 11600 6440 6849 6482
[Figure: average reward vs. episodes on Swimmer-v1, Hopper-v1, HalfCheetah-v1, Ant-v1, Walker2d-v1, and Humanoid-v1, with curves labeled by percentile ranges (0-10, 10-20, 20-100, etc.); larger is better.]
<latexit sha1_base64="rli7jlp75guz6h4r9bivhozu6ym=">aaadmhicbvjnj9mwehxc11k+undkyqhytdqqshasxcqtwnby2enxtlsr1svyxke11nyie4jsovcjupjh4is48itwukgwlsnfgr838+yzlzitwkiqfpf8a9dv3ly1c7t15+69+w/auw9pbzobxicslak5j6nlumg+aqgsn2eguxvlfhzfhnb82udurej1gfyznym60cirjikdovzxevof0cu1hq6qusqqrvscfquswijxivd4dxnfyrnh5dsq4ktybkbe5ioqyrhwh8b4mijueue/j6ch990xccdyvb9wpwleygkzqhpvo4bbreh4cdwe4urvc587gf7nigqhhevyww5zfsqtroyarfvbhot589qo3qkgwtrwdhi2sqc1myp2vrmzpyxxxaot1nppggqwc3igmoru9nzyjliluubtl2qquj2v6y1x+jld5jhjjfs04dx6b0djlburfbvkel92k6vb/3hthjjxs1lolaeu2evfss4xpli2dm+f4qzkyiwugeheitmsgsragxvllrv2xtmvscoi14klc76bsijauadadookxu9vhgkp8xuqlt6unfvdotma7r4rcwg2f+z+ht3bknaghjvr305onw/cybcevogcvg6s2ugp0vpursf6iq7qozrce8s8j96rn/jo/c/+n/+h//oy1peankfosvi/fgnhzqxm</latexit> <latexit sha1_base64="pkddr1xni4717umkasm7i9gzej0=">aaac0hicbvhlattafb2rryr9oe0ym6gmybfjpfjon4xqtlsllnkhnyakxgh8jq8ajctmvberontbv+pn9au6bf+gi8eb2u6fgcm59zh33ksswqdn/ew5n27eun1nb//g7r37dx72dx9ntvlrdhneyljfjsyafaomkfdczawbfymeiyq/7fsll6cnknvnxfyqfsxtihwcoaxifhbwis6hibi6zl36iozmz2ehvnzult210ilor7vlj2lymjwnsfo2jrtoqwkpbnqan/mx3w7t68wrug6ortbhko4pvlg3crol/duykhwcx4e9kjyvvc5aizfmmmd3kowaplfwce1bwbuogm9zbogfihvgomblqkufwmzg01lbp5cu2h8rglyysywsm9ltyra1jvyfftsyvowaoaoaqfgrqwktkza0s5tohaaocmkb41ryv1i+z5pxtmzvtfn1robvbnisaiv4oymtvuicnbokasyyun1wztshjf3elkfnncfxqm3bycm3ihnormf2usrdsbyh8bft3wxtz2pfg/sfng9oxq9ps0eoybmyjd55qu7ie3jojosth+qx+u3+ob+dhfpv+xav6vtwny/jrjjf/wjmcoov</latexit> <latexit sha1_base64="wjxwbc8ktns7bqu2kdxmd9r45bw=">aaacxxicbvfdi9nafj3gr3x96uqjl4nfslguzbhcl4xfvdahfahouwtncjppttp0zhjmbpawepxx/hfbv/0dtrovboufgcm592puuwkphcug+nhx7ty9d//bwcpdr4+fph3wpxo+suvloix5iqtzntilumgyo0aj16ubplijv+nivnwvbsbyueivucohvizxihocoaos7igqrrl6yz49prezeasetuqqoefjwl8oqj59qypfcj6m9ccmqagjjgq4pzokdhs/a2sh1qd6/ciifi5x0u0fw2addb+eg9ajmxglr504mhw8uqcrs2btnaxkjgtmuhajzwfuwsgzx7acpg5qpsdg9xr1hr52zixmhxfpi12z/1butfm7uqnlblewu1pl/k+bvpidxlxqzywg+e2grjiuc9r6sgfcaee5cobxi9xfkz8zwzg6t7emrhuxwlc2qzevfryywq4rcymgodickiz0u1v9iaskx5i29ll1+k/q2ray/0hkau3g0p1u9/es3uhcxfv3wer4gabd8ppb3tn7zwkoyevyivgkjo/igflermrmoplofpjf5ld34skpvzvbvk+zqxlbtsl79gfagt6r</latexit> Model Predictive Control h i PT minimize E e t=1 C t(x t, u t )+C f (x T+1 ) s.t. Optimal Policy: x t+1 = f t (x t, u t, e t ), x 1 = x u t = t ( t ) k ( k ) = arg min u C k (x k, u)+e e [V k+1 (f k (x k, u, e))] MPC: use the policy at every time step 1 (x) = arg min u C k (x, u)+e e [V 1 (f 1 (x, u, e))] MPC ethos: plan on short time horizons, use feedback to correct modeling error and disturbance. MPC trades improved computation for control cleverness, requiring significant planning for each action.
Model Predictive Control Videos from Todorov Lab https://homes.cs.washington.edu/~todorov/
Actionable Intelligence Control Theory Reinforcement Learning is the study of how to use past data to enhance the future manipulation of a dynamical system
Actionable Intelligence is the study of how to use past data to enhance the future manipulation of a dynamical system As soon as a machine learning system is unleashed in feedback with humans, that system is an actionable intelligence system, not a machine learning system.