Boosting. Sewoong Oh. CSE/STAT 416 University of Washington


Ensemble classifier: Aggregating weak classifiers

Consider a scenario where we train many weak classifiers f_1(x), f_2(x), f_3(x), f_4(x) on a given data set. By taking a weighted average of those weak classifiers with appropriately chosen weights, we might get a strong classifier:

f(x) = w_1 f_1(x) + w_2 f_2(x) + …

Similarly, we can train many weak regressors f_1(x), f_2(x), f_3(x), f_4(x) on a given data set (X, Y). By taking a weighted average of those weak regressors with appropriately chosen weights, we might get a strong regressor:

f(x) = w_1 f_1(x) + w_2 f_2(x) + …

(The figure plots each weak regressor's fit of Y against X, together with the strong regressor obtained from their weighted average.)

Ensemble classifier
Given data {(x_i, y_i)}:
- Train an ensemble of classifiers f_1(x), f_2(x), …, f_T(x), each with its own model parameters
- Train weight parameters w_1, w_2, …, w_T to predict

f(x) = w_1 f_1(x) + w_2 f_2(x) + …

If we need to make a hard prediction in {−1, +1}, then

ŷ = sign(w_1 f_1(x) + w_2 f_2(x) + … + w_T f_T(x))

But which classifiers do we use, and how do we choose the weights?
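As a concrete illustration (not from the slides), here is a minimal sketch of how such an ensemble prediction could be computed; the names weak_classifiers and weights are assumptions made for this example, and each weak classifier is assumed to return −1 or +1.

```python
import numpy as np

def ensemble_predict(x, weak_classifiers, weights):
    """Weighted-majority prediction of an ensemble of +/-1 classifiers.

    weak_classifiers: list of callables f_t(x) -> -1 or +1
    weights:          list of coefficients w_t
    """
    score = sum(w * f(x) for w, f in zip(weights, weak_classifiers))
    return np.sign(score)  # hard prediction in {-1, +1} (0 only if the vote is exactly tied)
```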

Ensemble classifier
Training a single decision tree often results in either a large tree that is overfitted, or a small tree that is only a weak classifier. Principled averaging methods:
- Bagging: the primary goal is variance reduction. Fit many large trees using bootstrap-resampled versions of the training data and classify using a majority vote.
- Boosting: the primary goal is bias reduction. Fit many small trees using re-weighted versions of the training data and classify using a weighted majority vote.
In general, Boosting > Bagging > single decision tree. AdaBoost has been called the best off-the-shelf classifier and is among the most popular classifiers in Kaggle challenges.

Boosting: How do we find those classifiers to add?

Boosting
Idea: sequentially add a new classifier each round, chosen to work well on the hard examples.
Technique: maintain an appropriately weighted dataset. At the end of each round, each data point is weighted by α_i, such that if data point (x_i, y_i) is hard to classify with the current models, it is weighted high. When training a new model to be added in the next round, data point (x_i, y_i) is counted as α_i data points. This ensures that newly added models are different from existing ones, since each new model can better classify the subset of examples that are hard for the existing models.

Example: round 1
In round 1, start with all weights equal to 1, that is, α_i = 1 for all i ∈ {1, …, N}, and train a simple model, like a one-level decision tree (a.k.a. a decision stump).

Credit | Income | y
A      | $130K  | Safe
B      | $80K   | Risky
C      | $110K  | Risky
A      | $110K  | Safe
A      | $90K   | Safe
B      | $120K  | Safe
C      | $30K   | Risky
C      | $60K   | Risky
B      | $95K   | Safe
A      | $60K   | Safe
A      | $98K   | Safe

The learned stump splits on Income > $100K: the > $100K branch holds 3 Safe and 1 Risky points, the ≤ $100K branch holds 4 Safe and 3 Risky points, so both branches predict ŷ = Safe.

At the end of training in round 1, increase α_i for the misclassified data points, as they are hard (and decrease it for the correctly classified data points, as they are easy). We will learn later how exactly to update the weights (this is AdaBoost).

Credit | Income | y     | Weight α
A      | $130K  | Safe  | 0.5
B      | $80K   | Risky | 1.5
C      | $110K  | Risky | 1.2
A      | $110K  | Safe  | 0.8
A      | $90K   | Safe  | 0.6
B      | $120K  | Safe  | 0.7
C      | $30K   | Risky | 3
C      | $60K   | Risky | 2
B      | $95K   | Safe  | 0.8
A      | $60K   | Safe  | 0.7
A      | $98K   | Safe  | 0.9

Example: round 2
Train a new classifier (typically a decision tree) on the weighted examples, by taking a weighted count of the data points (examples coming soon). If it is not a decision tree, we train by minimizing the weighted loss:

minimize L(w) = (1/N) Σ_{i=1}^N α_i ℓ(ŷ_i, y_i)
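A minimal sketch (not from the slides) of what minimizing such a weighted loss could look like for a logistic-loss linear model; the function names and the plain gradient-descent setup are illustrative assumptions.

```python
import numpy as np

def weighted_logistic_loss(w, X, y, alpha):
    """(1/N) * sum_i alpha_i * log(1 + exp(-y_i <w, x_i>)), with y_i in {-1, +1}."""
    margins = y * (X @ w)
    return np.mean(alpha * np.log1p(np.exp(-margins)))

def fit_weighted(X, y, alpha, lr=0.1, steps=1000):
    """Plain gradient descent on the weighted loss (illustrative only)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        grad = -(alpha * y / (1 + np.exp(margins))) @ X / len(y)  # gradient of the loss above
        w -= lr * grad
    return w
```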

AdaBoost: Adaptive Boosting

AdaBoost
Initialize α_i = 1/N for all i ∈ {1, …, N}
For t = 1, …, T:
- Train a classifier f_t(x) on the weighted training data
- Compute the coefficient ŵ_t for the t-th classifier f_t(x)
- Re-compute the weights {α_i}_{i=1}^N
Final model prediction: ŷ = sign( Σ_{t=1}^T ŵ_t f_t(x) )

Computing ŵ_t
Idea: at round t, we want to give a larger weight ŵ_t to classifier f_t(x) if it is accurate on the hard examples. Formally,

ŵ_t = (1/2) log( (1 − WeightedError(f_t)) / WeightedError(f_t) )

that is, a higher weight for a more accurate classifier, where accuracy is computed on the weighted examples (at round t):

WeightedError(f_t) = (1/N) Σ_{i=1}^N α_i I( ŷ_i ≠ y_i ),  where ŷ_i = f_t(x_i)

WeightedError(f_t) | (1 − WeightedError) / WeightedError | ŵ_t  | interpretation
0.01               | 99                                  | 2.3  | informative & accurate
0.5                | 1                                   | 0.0  | uninformative
0.99               | 0.01                                | −2.3 | informative but opposite

Note that a weighted error of exactly ZERO or ONE can be problematic (the log blows up).
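A small sketch (not from the slides) of how this coefficient could be computed; normalizing the weighted error by the total weight rather than N, and clipping it away from 0 and 1, are illustrative assumptions.

```python
import numpy as np

def classifier_coefficient(alpha, y_true, y_pred, eps=1e-10):
    """Return (weighted_error, w_hat) for one weak classifier; labels in {-1, +1}."""
    weighted_error = np.sum(alpha * (y_pred != y_true)) / np.sum(alpha)  # assumption: normalize by total weight
    weighted_error = np.clip(weighted_error, eps, 1 - eps)               # guard against log(0)
    w_hat = 0.5 * np.log((1 - weighted_error) / weighted_error)
    return weighted_error, w_hat
```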

Justification of the coefficient ŵ_t = (1/2) log( (1 − WeightedError(f_t)) / WeightedError(f_t) )
Consider each classifier f_t(x) as an expert telling you what it thinks the label is for a point x, and consider each expert as giving independent advice, which is correct with probability (1 − WeightedError). Then the log-likelihood ratio of the label at point x is

log( P(experts' advice | y = +1) / P(experts' advice | y = −1) )
  = f_1(x) log( (1 − WeightedError(f_1)) / WeightedError(f_1) ) + … + f_t(x) log( (1 − WeightedError(f_t)) / WeightedError(f_t) )

Hence the maximum likelihood estimate of the label is

sign( ŵ_1 f_1(x) + … + ŵ_t f_t(x) )

which uses exactly the weights (or coefficients) of AdaBoost. But note that in practice the f_t(x)'s are not independent experts.
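The step from the log-likelihood ratio to the AdaBoost score, written out in LaTeX (my reconstruction, with ε_t as shorthand for WeightedError(f_t)); the factor of 2 does not affect the sign, which is all that matters for the hard prediction.

```latex
% Each expert f_t(x) \in \{-1,+1\} is assumed independently correct with probability 1-\epsilon_t.
\log\frac{P(\text{advice}\mid y=+1)}{P(\text{advice}\mid y=-1)}
  = \sum_{t} f_t(x)\,\log\frac{1-\epsilon_t}{\epsilon_t}
  = 2\sum_{t} \hat{w}_t f_t(x),
\qquad
\hat{y} = \operatorname{sign}\Big(\sum_{t} \hat{w}_t f_t(x)\Big).
```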

Re-computing the α_i's
Idea: each data point (x_i, y_i) is weighted by α_i to reflect how difficult it is to correctly classify that point with the current averaged classifier f(x) = ŵ_1 f_1(x) + … + ŵ_t f_t(x).
AdaBoost uses the exponential weight to measure how hard that data point is:

α_i = exp(−y_i f(x_i)) = exp(−ŵ_1 y_i f_1(x_i) − … − ŵ_t y_i f_t(x_i))

This can be computed by iteratively updating the weights as follows:

α_i ← α_i · exp(−ŵ_t)  if correct: f_t(x_i) = y_i
α_i ← α_i · exp(ŵ_t)   if error:   f_t(x_i) ≠ y_i
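A short sketch of this multiplicative update with NumPy (my own illustration, not from the slides; array names are assumptions).

```python
import numpy as np

def update_weights(alpha, w_hat, y_true, y_pred):
    """Multiply alpha_i by exp(-w_hat) if f_t(x_i) == y_i, and by exp(+w_hat) otherwise."""
    correct = (y_pred == y_true)
    alpha = alpha * np.where(correct, np.exp(-w_hat), np.exp(w_hat))
    return alpha / alpha.sum()  # normalize to sum to 1 (see the normalization slide)
```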

Implications of the weight update

α_i ← α_i · exp(−ŵ_t)  if correct: f_t(x_i) = y_i
α_i ← α_i · exp(ŵ_t)   if error:   f_t(x_i) ≠ y_i

f_t(x_i) = y_i? | ŵ_t  | Multiply α_i by | Implication
yes             | 2.3  | e^(−2.3) = 0.1  | decrease importance of easy example
yes             | 0    | e^0 = 1         | no change if predictor is bad
yes             | −2.3 | e^(2.3) = 10    | increase weight for easy examples, for bad predictors
no              | 2.3  | e^(2.3) = 10    | increase importance of hard example
no              | 0    | e^0 = 1         | no change if predictor is bad
no              | −2.3 | e^(−2.3) = 0.1  | decrease weight for hard examples, for bad predictors

Normalizing weights
As we update the weights multiplicatively, AdaBoost can become numerically unstable if some weights grow too large:
- if x_i is often misclassified, its weight α_i gets very large
- if x_i is often classified correctly, its weight α_i gets very small
This can cause numerical instability after many iterations, so normalize the weights to add up to 1 after every iteration:

α_i ← α_i / Σ_{j=1}^N α_j

AdaBoost
Start with the same weight for all points: α_i = 1/N
For t = 1, …, T:
- Learn f_t(x) with data weights α_i
- Compute the coefficient ŵ_t = (1/2) ln( (1 − weighted_error(f_t)) / weighted_error(f_t) )
- Recompute the weights:
  α_i ← α_i · exp(−ŵ_t)  if f_t(x_i) = y_i
  α_i ← α_i · exp(ŵ_t)   if f_t(x_i) ≠ y_i
- Normalize the weights: α_i ← α_i / Σ_{j=1}^N α_j
Final model prediction: ŷ = sign( Σ_{t=1}^T ŵ_t f_t(x) )
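Putting the pieces together, here is a compact end-to-end sketch of the algorithm above (my own illustration, assuming scikit-learn's DecisionTreeClassifier is available as the weak learner; everything else follows the pseudocode).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=30):
    """Train AdaBoost with depth-1 trees (decision stumps); y must be in {-1, +1}."""
    N = len(y)
    alpha = np.full(N, 1.0 / N)                       # start with equal weights
    stumps, w_hats = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=alpha)          # learn f_t on the weighted data
        pred = stump.predict(X)
        err = np.sum(alpha * (pred != y)) / np.sum(alpha)
        err = np.clip(err, 1e-10, 1 - 1e-10)          # guard against log(0)
        w_hat = 0.5 * np.log((1 - err) / err)         # coefficient of f_t
        alpha *= np.exp(-w_hat * y * pred)            # up-weight mistakes, down-weight correct points
        alpha /= alpha.sum()                          # normalize
        stumps.append(stump)
        w_hats.append(w_hat)
    return stumps, w_hats

def adaboost_predict(X, stumps, w_hats):
    score = sum(w * s.predict(X) for w, s in zip(w_hats, stumps))
    return np.sign(score)
```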

Example
At t = 1, this is the standard training of a decision tree, but we train a simple model, in this case a decision stump (another name for a decision tree of depth 1). (The figure shows the original data and the learned decision stump f_1(x).)
Then compute the weight of the classifier, based on its error of 9/30:

ŵ_1 = (1/2) log( (1 − WeightedError(f_1)) / WeightedError(f_1) ) = 0.5 log(21/9) = 0.424

At t = 1, we then update the weights of the data points. The WeightedError was 9/30, so

exp(ŵ_1) = sqrt( (1 − WeightedError) / WeightedError ) = sqrt(7/3) = 1.53

and we apply

α_i ← α_i · exp(−ŵ_1)  if correct: f_1(x_i) = y_i
α_i ← α_i · exp(ŵ_1)   if error:   f_1(x_i) ≠ y_i

(The figure shows the learned decision stump f_1(x) with its boundary, and the new data weights α_i, where the weights of the misclassified points have been increased.)

At t = 2, we train a new classifier on the weighted data. (The figure shows the weighted data, using the α_i chosen in the previous iteration with f_1(x), and the decision stump f_2(x) learned on that weighted data.) Then we compute the weight of the new classifier.

Training a decision stump on weighted samples is exactly like training a decision stump on (unweighted) samples, except that each data point is counted according to its weight. Concretely, for each feature x[1] and x[2], we consider splitting at any possible point in between two data points, and pick the split that minimizes the weighted error. (The figure plots the error of the resulting decision stump against the threshold on x[1] and against the threshold on x[2].)
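A sketch of this threshold search for a single feature (my own illustration, not from the slides); it enumerates midpoints between consecutive sorted values and scores each candidate split by its weighted error.

```python
import numpy as np

def best_stump_threshold(x, y, alpha):
    """Best weighted-error split on one feature.

    x: feature values, y: labels in {-1, +1}, alpha: data weights.
    Returns (threshold, sign, weighted_error) for the stump
    'predict sign if x > threshold, else -sign'.
    """
    xs = np.sort(x)
    candidates = (xs[:-1] + xs[1:]) / 2.0             # midpoints between data points
    best = (None, 1, np.inf)
    for thr in candidates:
        pred = np.where(x > thr, 1, -1)
        for sign in (1, -1):                          # try both orientations of the stump
            err = np.sum(alpha * (sign * pred != y)) / np.sum(alpha)
            if err < best[2]:
                best = (thr, sign, err)
    return best
```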

The weighted sum of the learned predictors gives a new ensemble predictor, ŵ_1 f_1(x) + ŵ_2 f_2(x) (the figure shows the two stumps combined with coefficients 0.61 and 0.53). At t = 2, we then update the weights of the data points.

After 30 iterations
After enough iterations, we reach zero training error (training_error = 0), but we have probably overfitted.

Example: loan default
Find the best decision stump on the weighted data points and compute the weight of the learned decision stump. Consider splitting on each feature:

Income > $100K? (Yes → Safe, No → Risky)       weighted_error = 0.2
Credit history? (Bad → Risky, Good → Safe)     weighted_error = 0.35
Savings > $100K? (Yes → Safe, No → Risky)      weighted_error = 0.3
Market conditions? (Bad → Risky, Good → Safe)  weighted_error = 0.4

The best stump is f_t = Income > $100K? (Yes → Safe, No → Risky). Since WeightedError = 0.2,

ŵ_t = (1/2) ln( (1 − weighted_error(f_t)) / weighted_error(f_t) ) = 0.5 log( (1 − 0.2) / 0.2 ) = 0.69

and exp(ŵ_t) = 2.

Updating the weights of the data points
With the learned stump (Income > $100K? Yes → Safe, No → Risky) and exp(ŵ_t) = 2, apply

α_i ← α_i · exp(−ŵ_t)  if correct: f_t(x_i) = y_i
α_i ← α_i · exp(ŵ_t)   if error:   f_t(x_i) ≠ y_i

Credit | Income | y     | ŷ     | Previous weight | New weight
A      | $130K  | Safe  | Safe  | 0.5             | 0.5 / 2 = 0.25
B      | $80K   | Risky | Risky | 1.5             | 0.75
C      | $110K  | Risky | Safe  | 1.5             | 2 × 1.5 = 3
A      | $110K  | Safe  | Safe  | 2               | 1
A      | $90K   | Safe  | Risky | 1               | 2
B      | $120K  | Safe  | Safe  | 2.5             | 1.25
C      | $30K   | Risky | Risky | 3               | 1.5
C      | $60K   | Risky | Risky | 2               | 1
B      | $95K   | Safe  | Risky | 0.5             | 1
A      | $60K   | Safe  | Risky | 1               | 2
A      | $98K   | Safe  | Risky | 0.5             | 1

Convergence and Overfitting in Boosting

Boosting example
The training error goes to zero after some iterations. (The figure plots training error against iterations of boosting, for boosted decision stumps on a toy dataset: the training error of a single decision stump is 22.5%, while the training error of an ensemble of 30 decision stumps is 0%.)
This implies that
1. a weighted average of decision stumps can represent a complex function, and
2. via boosting, the weighted averaging of simple decision stumps has, in the end, correctly learned the function.

AdaBoost Theorem
Under some technical conditions, the training error of the boosted classifier → 0 as T → ∞. (The figure plots training error against iterations of boosting: it may oscillate a bit, but it will generally decrease and eventually become 0.)

Conditions of the AdaBoost Theorem
Under some technical conditions, the training error of the boosted classifier → 0 as T → ∞. The condition is that at every t, we can find a weak learner with weighted_error(f_t) < 0.5. This is not always possible; as an extreme example, if a +1 point sits exactly on top of a −1 point, no classifier can separate them. Nonetheless, boosting often yields great training error.

Comparisons: single decision tree vs. Boosting
- Single decision tree: best test error is 34%; its training error decreases fast, and it overfits fast.
- AdaBoost: best test error is 32%; its training error decreases more slowly, and it overfits less.
(The figure, boosted decision stumps on the loan data, annotates one curve with 39% test error at 8% training error, labeled overfitting, and the other with 32% test error at 28.5% training error, labeled better fit & lower test error.)

Boosting tends to be robust to overfitting
The test set performance stays about constant for many iterations, so boosting is less sensitive to the choice of T. But it will eventually overfit: in the figure, the best test error is around 31%, and the test error eventually increases to 33% (overfitting). We can use a validation set or cross-validation to choose the number of iterations.
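As an illustration of choosing T on held-out data (my own sketch, assuming scikit-learn is available), AdaBoostClassifier.staged_predict evaluates the partial ensemble after each boosting round:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def choose_T(X, y, max_T=300):
    """Pick the number of boosting rounds with the best validation accuracy."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    model = AdaBoostClassifier(n_estimators=max_T).fit(X_tr, y_tr)
    # staged_predict yields predictions of the ensemble after 1, 2, ..., max_T rounds
    val_acc = [accuracy_score(y_val, pred) for pred in model.staged_predict(X_val)]
    best_T = int(np.argmax(val_acc)) + 1
    return best_T, val_acc[best_T - 1]
```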

Why does boosting work so well in practice?
If the training data can be perfectly classified by some function (i.e., the data are separable), then eventually boosting can achieve ZERO training error. At that point we have usually overfitted the training data, so the test error is bad; yet in many examples, further training with boosting after the training error has reached zero improves the test error (note that the training error stays at zero after this point).
Reasons for boosting's robustness to overfitting:
- further updates only change a small subset of dimensions
- the classifier parameters and the classifier weights are updated separately
Boosting can be thought of as logistic regression over many simple classifiers, trying to minimize a loss of the combined score f(x) = ŵ_1 f_1(x) + … + ŵ_T f_T(x).

Variants of boosting and related algorithms
There are hundreds of variants of boosting; the most important is gradient boosting: like AdaBoost, but useful beyond basic classification.
There are many other approaches to learning ensembles; the most important is random forests, built on bagging: pick random subsets of the data, learn a tree on each subset, and average the predictions. This is simpler than boosting and easier to parallelize, but typically gives higher error than boosting for the same number of trees (number of iterations T).
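For context (my own sketch, not from the slides), here is how these off-the-shelf ensembles could be tried side by side, assuming scikit-learn is available; the cross-validation setup is an illustrative choice.

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

def compare_ensembles(X, y, n_trees=100):
    """Cross-validated accuracy of three standard tree ensembles."""
    models = {
        "adaboost": AdaBoostClassifier(n_estimators=n_trees),
        "gradient_boosting": GradientBoostingClassifier(n_estimators=n_trees),
        "random_forest": RandomForestClassifier(n_estimators=n_trees),
    }
    return {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
```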

Impact of boosting (spoiler alert: HUGE impact)
- Amongst the most useful ML methods ever created
- Extremely useful in computer vision: the standard approach to face detection, for example
- Used by most winners of ML competitions (Kaggle, KDD Cup, …)
- Applications include malware classification, credit fraud detection, ads click-through rate estimation, sales forecasting, ranking webpages for search, Higgs boson detection, …
- Most deployed ML systems use model ensembles, with coefficients chosen manually, with boosting, with bagging, or with other methods