Agnostic Learning and VC dimension

Agnostic Learning and VC dimension. Machine Learning, Spring 2019. The slides are based on Vivek Srikumar's.

This Lecture. Agnostic learning: what if I cannot guarantee zero training error? Can we still get a bound on the generalization error? Shattering and the VC dimension: how do we get a generalization error bound when the hypothesis space is infinite?

So far we have seen PAC learning and Occam's Razor. How good will a classifier that is consistent on a training set be? Let H be any hypothesis space. With probability at least 1 - δ, a hypothesis h ∈ H that is consistent with a training set of size m will have true error < ε if m ≥ (1/ε)(ln|H| + ln(1/δ)). Two assumptions have been made so far: 1. Training and test examples come from the same distribution. 2. For any concept, there is some function in the hypothesis space that is consistent with the training set. Is the second assumption reasonable, and what does it imply?

What is agnostic learning? So far, we have assumed that the learning algorithm could find a consistent hypothesis. What if we are trying to learn a concept f using hypotheses in H, but f ∉ H? That is, the concept class C is not a subset of H. This setting is called agnostic learning, and it is a more realistic setting than before. Can we say something about sample complexity?

Agnostic Learning. Learn a concept f using hypotheses in H, but f ∉ H. Are we guaranteed that the training error will be zero? No: there may be no consistent hypothesis in the hypothesis space! Our goal should instead be to find a classifier h ∈ H that has low training error, i.e., a small fraction of misclassified training examples. What we want: a guarantee that a hypothesis with small training error will have good accuracy on unseen examples.

We will use tail bounds for the analysis: how far can a random variable get from its mean? (The figure on the slide shows the tails of such distributions.)

Bounding probabilities. Markov's inequality bounds the probability that a nonnegative random variable exceeds a fixed value. Chebyshev's inequality bounds the probability that a random variable differs from its expected value by more than a fixed number of standard deviations. What we want is to bound an average of random variables. Why? Because the training error depends on the number of errors on the training set.
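
Stated formally: Markov's inequality says that for a nonnegative random variable X and any a > 0, P(X ≥ a) ≤ E[X]/a. Chebyshev's inequality says that for any k > 0, P(|X - E[X]| ≥ kσ) ≤ 1/k^2, where σ is the standard deviation of X.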

Hoeffding's inequality: upper bounds on how much the average of a set of random variables differs from its expected value. Let X_1, ..., X_m be i.i.d. random variables taking values in [0, 1], let p̂ = (1/m) Σ_{j=1}^m X_j be their average, and let p = E[X_j] (the same for all j) be the true mean. Hoeffding's inequality states that P(|p̂ - p| > ε) ≤ 2 exp(-2mε^2).

In this formula, p is the true mean (e.g., for a coin toss, the probability of seeing heads) and p̂ is the empirical mean computed over m independent trials. What this tells us: the empirical mean will not be too far from the true mean if there are many samples.
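
As a quick sanity check, the following simulation (an illustrative sketch, not part of the original slides; it assumes numpy is available) compares the observed frequency of large deviations of a coin's empirical mean with the two-sided bound 2 exp(-2mε^2) above:

    import numpy as np

    rng = np.random.default_rng(0)
    m, eps, p, trials = 100, 0.1, 0.6, 100_000   # hypothetical values for illustration
    # Empirical mean of m Bernoulli(p) tosses, repeated over many independent trials.
    means = rng.binomial(m, p, size=trials) / m
    observed = np.mean(np.abs(means - p) > eps)
    bound = 2 * np.exp(-2 * m * eps ** 2)        # two-sided Hoeffding bound
    print(f"observed P(|p_hat - p| > eps): {observed:.4f}")
    print(f"Hoeffding upper bound:         {bound:.4f}")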

Back to agnostic learning. Suppose we consider the true error (a.k.a. the generalization error) err_D(h) to be the true mean. The training error over m examples, err_S(h), is the empirical estimate of this true error. Why? Define the binary random variables X_i = I[f(x_i) ≠ h(x_i)] (i.i.d. across the training examples); the mean of each binary variable is p = err_D(h), and their average over the m training examples is err_S(h). Let's apply Hoeffding's inequality.

We can now ask: what is the probability that the true error is more than ε away from the empirical error? Hoeffding's inequality applied to these binary variables answers exactly this question.
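
For a single fixed hypothesis h, Hoeffding's inequality gives P[err_D(h) > err_S(h) + ε] ≤ exp(-2mε^2); the two-sided deviation carries an extra factor of 2.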

Agnostic learning. The probability that a single hypothesis h has a true error that is more than ε away from its training error is bounded above. The learning algorithm, however, looks for the best of the |H| possible hypotheses, and we do not know which one is best before training. So we need to bound the probability that there exists some hypothesis in H whose true error is more than ε away from its training error. Union bound: P({h_1 more than ε away} or {h_2 more than ε away} or ...) ≤ P(h_1 more than ε away) + P(h_2 more than ε away) + ...
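
Summing the single-hypothesis bound over all of H gives P[there exists h ∈ H with err_D(h) > err_S(h) + ε] ≤ |H| exp(-2mε^2).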

Agnostic learning. The probability that there exists a hypothesis in H whose true error is more than ε away from its training error is bounded above. Same game as before: we want this probability to be smaller than δ. Rearranging gives the sample complexity below.
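
Setting |H| exp(-2mε^2) ≤ δ and solving for m gives m ≥ (1/(2ε^2)) (ln|H| + ln(1/δ)).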

Agnostic learning: Interpretations. 1. An agnostic learner makes no commitment to whether f is in H and returns the hypothesis with the least training error over at least m examples. It can guarantee with probability 1 - δ that the true error is not off by more than ε from the training error if m ≥ (1/(2ε^2)) (ln|H| + ln(1/δ)). Two quantities drive this bound. The difference between generalization and training errors: how much worse will the classifier be in the future than it is at training time? And the size of the hypothesis class: again an Occam's razor argument, prefer smaller sets of functions.

2. We also have a generalization bound: a bound on how much the true error will deviate from the training error. If we have m examples, then with high probability the generalization error is bounded in terms of the training error as follows.
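
With probability at least 1 - δ, err_D(h) ≤ err_S(h) + sqrt((ln|H| + ln(1/δ)) / (2m)): the left-hand side is the generalization error, the first term on the right is the training error, and the square-root term is the penalty for the size of the hypothesis class.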

What we have seen so far. Occam's razor: when the hypothesis space contains the true concept. Agnostic learning: when the hypothesis space may not contain the true concept. In both cases, learnability depends on the log of the size of the hypothesis space. Have we solved everything? E.g., what about linear classifiers?
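
As a concrete illustration of the two results, the following sketch computes how many examples each bound asks for (the values of |H|, ε, and δ are made up for illustration):

    import math

    def m_consistent(h_size, eps, delta):
        # Occam's razor bound for a consistent learner: m >= (ln|H| + ln(1/delta)) / eps
        return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

    def m_agnostic(h_size, eps, delta):
        # Agnostic bound: m >= (ln|H| + ln(1/delta)) / (2 * eps^2)
        return math.ceil((math.log(h_size) + math.log(1 / delta)) / (2 * eps ** 2))

    h_size, eps, delta = 2 ** 20, 0.05, 0.01   # hypothetical |H|, accuracy, and confidence
    print(m_consistent(h_size, eps, delta))    # ~370 examples when a consistent hypothesis exists
    print(m_agnostic(h_size, eps, delta))      # ~3700 examples in the agnostic setting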

Infinite Hypothesis Spaces. The previous analysis was restricted to finite hypothesis spaces. Some infinite hypothesis spaces are more expressive than others: e.g., rectangles or 17-sided convex polygons vs. general convex polygons, or a single linear threshold function vs. a combination of linear threshold units (LTUs). We need a measure of the expressiveness of an infinite hypothesis space other than its size; the Vapnik-Chervonenkis dimension (VC dimension) provides such a measure of the expressive capacity of a set of functions. Analogous to the bounds in terms of |H|, there are sample complexity bounds that use VC(H).

Learning Rectangles. Assume the target function is an axis-parallel rectangle in the (X, Y) plane: points inside the rectangle are positive and points outside are negative. As labeled training points accumulate in the plane, will we be able to learn the target rectangle?

Let's think about the expressivity of functions. There are four ways to label two points, and it is possible to draw a line that separates positive and negative points in all four cases, so linear functions are expressive enough to shatter 2 points. What about fourteen points?

Shattering. This particular labeling of the points cannot be separated by any line. Linear functions are not expressive enough to shatter fourteen points, because there is a labeling that cannot be separated by them. Of course, a more complex function could separate them.

Shattering. Definition: a set S of examples is shattered by a set of functions H if, for every partition of the examples in S into positive and negative examples, there is a function in H that gives exactly these labels to the examples. Intuition: a rich set of functions shatters large sets of points.

Example 1: the hypothesis class of left-bounded intervals on the real axis, [0, a) for some real number a > 0. A single point can be shattered, but sets of two points cannot be shattered. That is, given two points 0 < x_1 < x_2, you can label them in such a way (x_1 negative, x_2 positive) that no concept in this class is consistent with their labeling, since any interval [0, a) containing x_2 also contains x_1.

Example 2: the hypothesis class is the set of intervals on the real axis, [a, b] for some real numbers b > a. All sets of one or two points can be shattered, but sets of three points cannot be shattered. Proof? Enumerate the labelings of any three points x_1 < x_2 < x_3: the labeling (+, -, +) cannot be produced, because any interval containing x_1 and x_3 must also contain x_2. (A mechanical version of this check appears below.)
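
This enumeration argument can be checked by a short program. The sketch below (illustrative, not from the slides) tries every labeling of a point set and asks whether some interval [a, b] produces it:

    import itertools

    def interval_consistent(points, labels):
        # Is there an interval [a, b] that labels exactly the '+1' points positive?
        pos = [x for x, y in zip(points, labels) if y == 1]
        if not pos:
            return True  # an interval placed away from all points labels everything negative
        a, b = min(pos), max(pos)
        # The tightest interval covering the positives must exclude every negative point.
        return all(not (a <= x <= b) for x, y in zip(points, labels) if y == -1)

    def shattered(points):
        return all(interval_consistent(points, labels)
                   for labels in itertools.product([-1, 1], repeat=len(points)))

    print(shattered([0.0, 1.0]))        # True: two points can be shattered by intervals
    print(shattered([0.0, 1.0, 2.0]))   # False: the labeling (+, -, +) cannot be produced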

Example 3: half-spaces in the plane. Can one point be shattered? Two points? Three points? Can any three points be shattered?

Half spaces on a plane: 3 points. (Figure credit: Eric Xing @ CMU, 2006-2015.)

Half spaces on a plane: 4 points.
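
The three-point and four-point cases can also be checked computationally. This sketch (illustrative, not from the slides; it assumes numpy and scipy are available) tests every labeling of a point set for linear separability by solving a small feasibility linear program:

    import itertools
    import numpy as np
    from scipy.optimize import linprog

    def separable(points, labels):
        # Feasibility LP over (w1, w2, b): require y * (w . x + b) >= 1 for every point.
        a_ub = [[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)]
        b_ub = [-1.0] * len(points)
        res = linprog(c=[0.0, 0.0, 0.0], A_ub=np.array(a_ub), b_ub=np.array(b_ub),
                      bounds=[(None, None)] * 3, method="highs")
        return res.status == 0  # feasible => some half-space produces this labeling

    def shattered(points):
        return all(separable(points, labels)
                   for labels in itertools.product([-1, 1], repeat=len(points)))

    three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]              # three points in general position
    four = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]   # four points in an XOR configuration
    print(shattered(three))  # True: half-spaces in the plane shatter these three points
    print(shattered(four))   # False: the XOR labeling is not linearly separable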

Shattering: the adversarial game, between you and an adversary. You: "Hypothesis class H can shatter these d points." Adversary: "That's what you think! Here is a labeling that will defeat you." You: "Aha! There is a function h ∈ H that correctly predicts your evil labeling." Adversary: "Argh! You win this round. But I'll be back..."

Some functions can shatter infinitely many points! This happens if arbitrarily large finite subsets of the instance space X can be shattered by the hypothesis space H. An unbiased hypothesis space H shatters the entire instance space X, i.e., it can induce every possible partition on the set of all possible instances. The larger the subset of X that can be shattered, the more expressive the hypothesis space H is, i.e., the less biased it is.

Vapnik-Chervonenkis Dimension. Recall: a set S of examples is shattered by a set of functions H if, for every partition of the examples in S into positive and negative examples, there is a function in H that gives exactly these labels to the examples. Definition: the VC dimension of a hypothesis space H over instance space X is the size of the largest finite subset of X that is shattered by H. If there exists any subset of size d that can be shattered, then VC(H) ≥ d; even one such subset will do. If no subset of size d can be shattered, then VC(H) < d.

What we have managed to prove (hypothesis space: VC dimension, and why):
Half intervals: VC dimension 1. There is a dataset of size 1 that can be shattered; no dataset of size 2 can be shattered.
Intervals: VC dimension 2. There is a dataset of size 2 that can be shattered; no dataset of size 3 can be shattered.
Half-spaces in the plane: VC dimension 3. There is a dataset of size 3 that can be shattered; no dataset of size 4 can be shattered.

More VC dimensions (hypothesis space: VC dimension):
Linear threshold unit in d dimensions: d + 1.
Neural networks: number of parameters.
1-nearest neighbor: infinite.
Intuition: a rich set of functions shatters large sets of points.

Why VC dimension? Remember the sample complexity results for consistent learners and for agnostic learning: in both cases the sample complexity depends on the log of the size of the hypothesis space. For infinite hypothesis spaces, the VC dimension plays the role of log(|H|).

VC dimension and Occam's razor. Using VC(H) as a measure of expressiveness, we have an Occam-style theorem for infinite hypothesis spaces. Given a sample D with m examples, find some h ∈ H that is consistent with all m examples. If m is large enough (see the bound below), then with probability at least 1 - δ, h has error less than ε.
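
One commonly quoted form of this sample complexity bound (due to Blumer et al., as presented in Mitchell's Machine Learning textbook) is m ≥ (1/ε)(4 log_2(2/δ) + 8 VC(H) log_2(13/ε)).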

VC dimension and Agnostic Learning. A similar statement can be made for the agnostic setting as well: if we have m examples, then with probability 1 - δ, the true error of a hypothesis h with training error err_S(h) is bounded as shown below.
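
A commonly quoted form of this bound is err_D(h) ≤ err_S(h) + sqrt((VC(H) (ln(2m/VC(H)) + 1) + ln(4/δ)) / m), so the VC dimension again plays the role that ln|H| played in the finite case.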

Why computational learning theory? It raises interesting theoretical questions. If a concept class is weakly learnable (i.e., there is a learning algorithm that can produce a classifier that does slightly better than chance), does this mean that the concept class is strongly learnable? This question leads to Boosting. We have seen bounds of the form: true error < training error + (a term involving δ and the VC dimension). Can we use this to define a learning algorithm? This leads to the Structural Risk Minimization principle and the Support Vector Machine.