Agnostic Learning and VC Dimension Machine Learning, Spring 2019 The slides are based on Vivek Srikumar's 1
This Lecture Agnostic Learning What if I cannot guarantee zero training error? Can we still get a bound for generalization error? Shattering and the VC dimension How to get the generalization error bound when the hypothesis space is infinite? 2
So far we have seen PAC learning and Occam's Razor How good will a classifier that is consistent on a training set be? Let H be any hypothesis space. With probability at least 1 - δ, a hypothesis h ∈ H that is consistent with a training set of size m will have true error < ε if m ≥ (1/ε)(ln|H| + ln(1/δ)) Is the second assumption reasonable? 3
So far we have seen PAC learning and Occam's Razor How good will a classifier that is consistent on a training set be? Two assumptions so far: 1. Training and test examples come from the same distribution 2. For any concept, there is some function in the hypothesis space that is consistent with the training set What does the second assumption imply? 4
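As a quick numeric illustration of the consistent-learner bound quoted above, here is a minimal Python sketch; the function name and the example hypothesis class (conjunctions over 10 boolean features, |H| = 3^10) are our own choices, not from the slides.

import math

def occam_sample_complexity(epsilon, delta, hypothesis_space_size):
    """Smallest m with m >= (1/epsilon) * (ln|H| + ln(1/delta)), so that a hypothesis
    consistent with m examples has true error < epsilon with probability >= 1 - delta."""
    return math.ceil((math.log(hypothesis_space_size) + math.log(1 / delta)) / epsilon)

# Example: conjunctions over 10 boolean features, |H| = 3**10
print(occam_sample_complexity(epsilon=0.1, delta=0.05, hypothesis_space_size=3**10))  # 140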
What is agnostic learning? So far, we have assumed that the learning algorithm could find a consistent hypothesis What if: We are trying to learn a concept f using hypotheses in H, but f ∉ H That is, C is not a subset of H This setting is called agnostic learning Can we say something about sample complexity? A more realistic setting than before 5
Agnostic Learning Learn a concept f using hypotheses in H, but f ∉ H Are we guaranteed that the training error will be zero? No. There may be no consistent hypothesis in the hypothesis space! Our goal should be to find a classifier h ∈ H that has low training error, i.e., the fraction of training examples that are misclassified 7
Agnostic Learning Learn a concept f using hypotheses in H, but f ∉ H Our goal should be to find a classifier h ∈ H that has low training error What we want: A guarantee that a hypothesis with small training error will have good accuracy on unseen examples 8
We will use tail bounds for the analysis How far can a random variable get from its mean? That is, how much probability mass lies in the tails of its distribution? 9
Bounding probabilities Markov's inequality: Bounds the probability that a nonnegative random variable exceeds a fixed value Chebyshev's inequality: Bounds the probability that a random variable differs from its expected value by more than a fixed number of standard deviations What we want: To bound the average of random variables Why? Because the training error is an average over the training set: it depends on the number of errors on the training examples 10
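For completeness, the two inequalities named above in symbols (standard statements; the slide itself only describes them in words):

\[ \text{Markov:}\quad \Pr[X \ge a] \;\le\; \frac{\mathbb{E}[X]}{a} \qquad (X \ge 0,\; a > 0) \]
\[ \text{Chebyshev:}\quad \Pr\big[\,|X - \mathbb{E}[X]| \ge k\sigma\,\big] \;\le\; \frac{1}{k^2} \qquad (k > 0) \]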
Hoeffding's inequality Upper bounds on how much the average of a set of random variables differs from its expected value. For i.i.d. random variables X_1, ..., X_m, define the empirical mean \hat{p} = \frac{1}{m}\sum_{j=1}^{m} X_j and the true mean p = E[X_j] (j = 1, ..., m) 11
Hoeffding's inequality Upper bounds on how much the average of a set of random variables differs from its expected value p is the true mean (e.g., for a coin toss, the probability of seeing heads); \hat{p} is the empirical mean, computed over m independent trials What this tells us: The empirical mean will not be too far from the true mean if there are many samples. 12
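The inequality itself, in its standard two-sided form for i.i.d. random variables bounded in [0, 1] (the slide shows it only as an image):

\[ \Pr\big[\,|\hat{p} - p| \ge \epsilon\,\big] \;\le\; 2\exp(-2m\epsilon^2) \]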
Back to agnostic learning Suppose we consider the true error (a.k.a. generalization error) err_D(h) to be the true mean The training error over m examples, err_S(h), is the empirical estimate of this true error. Why? Let's apply Hoeffding's inequality to the binary i.i.d. random variables X_i = 1[f(x_i) ≠ h(x_i)], each of which has mean p = err_D(h) 13
Back to agnostic learning Suppose we consider the true error (a.k.a. generalization error) err_D(h) to be the true mean The training error over m examples, err_S(h), is the empirical estimate of this true error We can ask: What is the probability that the true error is more than ε away from the empirical error? 14
Back to agnostic learning Suppose we consider the true error (a.k.a. generalization error) err_D(h) to be the true mean The training error over m examples, err_S(h), is the empirical estimate of this true error Let's apply Hoeffding's inequality 15
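Applying Hoeffding's inequality to the indicator variables above answers that question: for a single, fixed hypothesis h,

\[ \Pr\big[\,|\mathrm{err}_S(h) - \mathrm{err}_D(h)| > \epsilon\,\big] \;\le\; 2\exp(-2m\epsilon^2) \]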
Agnostic learning The probability that a single hypothesis h has a true error that is more than ε away from the training error is bounded above The learning algorithm looks for the best one of the |H| possible hypotheses (we do not know which is best before training) The probability that there exists a hypothesis in H whose true error is ε away from the training error is bounded above by the union bound: P({h_1 more than ε away} or {h_2 more than ε away} or ...) ≤ P(h_1 more than ε away) + P(h_2 more than ε away) + ... 19
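Writing out the union bound together with the single-hypothesis bound above:

\[ \Pr\big[\,\exists h \in H:\ |\mathrm{err}_S(h) - \mathrm{err}_D(h)| > \epsilon\,\big] \;\le\; \sum_{h \in H} \Pr\big[\,|\mathrm{err}_S(h) - \mathrm{err}_D(h)| > \epsilon\,\big] \;\le\; 2|H|\exp(-2m\epsilon^2) \]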
Agnostic learning The probability that there exists a hypothesis in H whose true error is ε away from the training error is bounded above Same game as before: We want this probability to be smaller than δ Rearranging this gives us the sample complexity shown below 20
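Carrying out that rearrangement (the slide shows the result only as an image):

\[ 2|H|\exp(-2m\epsilon^2) \;\le\; \delta \quad\Longleftrightarrow\quad m \;\ge\; \frac{1}{2\epsilon^2}\left(\ln|H| + \ln\frac{2}{\delta}\right) \]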
Agnostic learning: Interpretations 1. An agnostic learner makes no commitment to whether f is in H and returns the hypothesis with the least training error over at least m examples. It can guarantee with probability 1 - δ that the true error is not off by more than ε from the training error if m ≥ (1/(2ε²))(ln|H| + ln(2/δ)). Difference between generalization and training errors: How much worse will the classifier be in the future than it is at training time? Size of the hypothesis class: Again an Occam's razor argument: prefer smaller sets of functions 23
Agnostic learning: Interpretations 1. An agnostic learner makes no commitment to whether f is in H and returns the hypothesis with the least training error over at least m examples. It can guarantee with probability 1 - δ that the true error is not off by more than ε from the training error if m satisfies the bound above. 2. We have a generalization bound: A bound on how much the true error will deviate from the training error. If we have m examples, then with high probability the generalization error exceeds the training error by at most the gap shown below 24
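Spelling out interpretation 2: with probability at least 1 - δ over the draw of the m training examples, every h ∈ H satisfies

\[ \mathrm{err}_D(h) \;\le\; \mathrm{err}_S(h) + \sqrt{\frac{\ln|H| + \ln\frac{2}{\delta}}{2m}} \]

A minimal Python sketch of the gap term (the function name and example numbers are our own, not from the slides):

import math

def generalization_gap(m, hypothesis_space_size, delta=0.05):
    """Hoeffding + union bound gap: sqrt((ln|H| + ln(2/delta)) / (2m))."""
    return math.sqrt((math.log(hypothesis_space_size) + math.log(2 / delta)) / (2 * m))

print(generalization_gap(m=10_000, hypothesis_space_size=2**20))  # ~0.03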
What we have seen so far Occam's razor: When the hypothesis space contains the true concept Agnostic learning: When the hypothesis space may not contain the true concept Learnability depends on the log of the size of the hypothesis space Have we solved everything? E.g., what about linear classifiers? 25
Infinite Hypothesis Space The previous analysis was restricted to finite hypothesis spaces Some infinite hypothesis spaces are more expressive than others E.g., rectangles or 17-sided convex polygons vs. general convex polygons; a single linear threshold function vs. a combination of LTUs We need a measure of the expressiveness of an infinite hypothesis space other than its size The Vapnik-Chervonenkis dimension (VC dimension) provides such a measure: What is the expressive capacity of a set of functions? Analogous to the bounds in terms of |H|, there are sample complexity bounds that use VC(H) 26
Learning Rectangles Assume the target function is an axis parallel rectangle Y X 27
Learning Rectangles Assume the target function is an axis parallel rectangle [Figure: rectangle in the X-Y plane; points inside are positive, points outside are negative] 28
Learning Rectangles Assume the target function is an axis parallel rectangle [Figure sequence: labeled training points plotted in the X-Y plane] 29
Learning Rectangles Assume the target function is an axis parallel rectangle Will we be able to learn the target rectangle? 35
Let's think about the expressivity of functions There are four ways to label two points, and it is possible to draw a line that separates positive and negative points in all four cases Linear functions are expressive enough to shatter 2 points What about fourteen points? 36
Shattering 37
Shattering This particular labeling of the points cannot be separated by any line 39
Shattering Linear functions are not expressive enough to shatter fourteen points, because there is a labeling that cannot be separated by them Of course, a more complex function could separate them 40
Shattering Definition: A set S of examples is shattered by a set of functions H if for every partition of the examples in S into positive and negative examples there is a function in H that gives exactly these labels to the examples Intuition: A rich set of functions shatters large sets of points 41
Shattering Definition: A set S of examples is shattered by a set of functions H if for every partition of the examples in S into positive and negative examples there is a function in H that gives exactly these labels to the examples Intuition: A rich set of functions shatters large sets of points Example 1: Hypothesis class of left-bounded intervals on the real axis: [0, a) for some real number a > 0 42
Shattering Definition: A set S of examples is shattered by a set of functions H if for every partition of the examples in S into positive and negative examples there is a function in H that gives exactly these labels to the examples Intuition: A rich set of functions shatters large sets of points Example 1: Hypothesis class of left-bounded intervals on the real axis: [0, a) for some real number a > 0 Sets of two points cannot be shattered That is: given two points, you can label them in such a way that no concept in this class will be consistent with their labeling (see the brute-force check below) 43
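A quick way to convince yourself of Example 1 is brute-force enumeration. Below is a minimal Python sketch (the helper shatters and the grid of thresholds are our own illustration, not from the slides); it approximates the continuum of hypotheses [0, a) by a finite grid of values of a, which is enough for these example points.

from itertools import product

def shatters(points, hypotheses):
    """True if every +/- labeling of the points is produced exactly by some hypothesis."""
    for labels in product([0, 1], repeat=len(points)):
        if not any(all(h(x) == y for x, y in zip(points, labels)) for h in hypotheses):
            return False
    return True

# Example 1's hypothesis class: h_a(x) = 1 iff x lies in [0, a), for a grid of a > 0.
H = [lambda x, a=a: int(0 <= x < a) for a in [0.5 * k for k in range(1, 21)]]

print(shatters([1.0], H))        # True: a single point can be shattered
print(shatters([1.0, 2.0], H))   # False: no h labels 1.0 negative and 2.0 positive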
Shattering Definition: A set S of examples is shattered by a set of functions H if for every partition of the examples in S into positive and negative examples there is a function in H that gives exactly these labels to the examples Example 2: Hypothesis class is the set of intervals on the real axis: [a, b], for some real numbers b > a All sets of one or two points can be shattered But sets of three points cannot be shattered Proof sketch: for any three points x1 < x2 < x3, no single interval can produce the labeling (+, -, +) 47
Shattering Definition: A set S of examples is shattered by a set of functions H if for every partition of the examples in S into positive and negative examples there is a function in H that gives exactly these labels to the examples Example 3: Half-spaces in a plane Can one point be shattered? Two points? Three points? Can any three points be shattered? 49
Half spaces on a plane: 3 points Eric Xing @ CMU, 2006-2015 50
Half spaces on a plane: 4 points 51
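The figures for the three-point and four-point cases are not reproduced in this text. As an illustrative stand-in (our own sketch, not part of the original slides), the following Python snippet checks linear separability with an LP feasibility test (scipy assumed available): all 8 labelings of three non-collinear points are realizable by a half-space, but the "XOR" labeling of four points is not.

from itertools import product
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """True if some half-space strictly separates the points labeled +1 from those labeled -1."""
    n, d = X.shape
    # Feasibility LP: find (w, b) with y_i * (w . x_i + b) >= 1 for every point.
    A_ub = -(y[:, None] * np.hstack([X, np.ones((n, 1))]))
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0  # 0 = feasible (solved), 2 = infeasible

X3 = np.array([[0., 0.], [1., 0.], [0., 1.]])            # three non-collinear points
print(all(separable(X3, np.array(lab)) for lab in product([-1., 1.], repeat=3)))  # True

X4 = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])  # four points in "XOR" position
print(separable(X4, np.array([1., 1., -1., -1.])))       # False: the XOR labeling fails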
Shattering: The adversarial game You vs. an adversary You: Hypothesis class H can shatter these d points Adversary: That's what you think! Here is a labeling that will defeat you. You: Aha! There is a function h ∈ H that correctly predicts your evil labeling Adversary: Argh! You win this round. But I'll be back... 52
Some function classes can shatter arbitrarily many points! This is the case if arbitrarily large finite subsets of the instance space X can be shattered by the hypothesis space H An unbiased hypothesis space H shatters the entire instance space X, i.e., it can induce every possible partition on the set of all possible instances The larger the subset of X that can be shattered, the more expressive the hypothesis space H is, i.e., the less biased it is 53
Vapnik-Chervonenkis Dimension A set S of examples is shattered by a set of functions H if for every partition of the examples in S into positive and negative examples there is a function in H that gives exactly these labels to the examples Definition: The VC dimension of hypothesis space H over instance space X is the size of the largest finite subset of X that is shattered by H If there exists any subset of size d that can be shattered, then VC(H) ≥ d (even one such subset will do) If no subset of size d can be shattered, then VC(H) < d 54
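The same definition in symbols:

\[ VC(H) \;=\; \max\{\, d \;:\; \text{some } S \subseteq X \text{ with } |S| = d \text{ is shattered by } H \,\} \]

(and VC(H) is infinite if arbitrarily large subsets of X can be shattered).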
What we have managed to prove
Hypothesis space: VC dimension (why?)
Half intervals: 1 (there is a dataset of size 1 that can be shattered; no dataset of size 2 can be shattered)
Intervals: 2 (there is a dataset of size 2 that can be shattered; no dataset of size 3 can be shattered)
Half-spaces in the plane: 3 (there is a dataset of size 3 that can be shattered; no dataset of size 4 can be shattered)
55
More VC dimensions
Hypothesis space: VC dimension
Linear threshold unit in d dimensions: d + 1
Neural networks: roughly the number of parameters
1-nearest neighbor: infinite
Intuition: A rich set of functions shatters large sets of points
56
Why VC dimension? Remember sample complexity for consistent learners and for agnostic learning: in both cases the sample complexity depends on the log of the size of the hypothesis space For infinite hypothesis spaces, the VC dimension plays the role of log(|H|) 57
VC dimension and Occam's razor Using VC(H) as a measure of expressiveness, we have an Occam theorem for infinite hypothesis spaces Given a sample D with m examples, find some h ∈ H that is consistent with all m examples If m satisfies the bound below, then with probability at least 1 - δ, h has error less than ε 58
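The slide's bound is shown only as an image; a commonly quoted form of this Occam-style result (e.g., as stated in Mitchell's textbook, following Blumer et al.) is:

\[ m \;\ge\; \frac{1}{\epsilon}\left( 4\log_2\frac{2}{\delta} + 8\,VC(H)\,\log_2\frac{13}{\epsilon} \right) \]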
VC dimension and Agnostic Learning A similar statement can be made for the agnostic setting as well If we have m examples, then with probability 1 - δ, the true error of a hypothesis h with training error err_S(h) is bounded as shown below 59
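Again the slide's formula is an image; a standard form of the agnostic VC bound (exact constants vary by reference) is:

\[ \mathrm{err}_D(h) \;\le\; \mathrm{err}_S(h) + \sqrt{\frac{VC(H)\left(\ln\frac{2m}{VC(H)} + 1\right) + \ln\frac{4}{\delta}}{m}} \]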
Why computational learning theory? Raises interesting theoretical questions If a concept class is weakly learnable (i.e., there is a learning algorithm that can produce a classifier that does slightly better than chance), does this mean that the concept class is strongly learnable? Boosting! We have seen bounds of the form: true error < training error + (a term involving δ and the VC dimension) Can we use this to define a learning algorithm? The Structural Risk Minimization principle and Support Vector Machines 60