Dealing with continuous-valued attributes
An alternative measure: gain ratio
Handling incomplete training data
Handling attributes with different costs
Ensemble of Classifiers
Why does an ensemble work?
(Figure: several models, Model 1 through Model 6, each approximating some unknown distribution in its own way.)
Ensemble of Classifiers
(Figure: the data is fed to model 1, model 2, ..., model k, which are combined into a single ensemble model.)
Basic idea: learn a set of classifiers (experts) and allow them to vote, i.e. combine multiple models into one.
Advantage: improvement in predictive accuracy.
Disadvantage: an ensemble of classifiers is difficult to understand.
Generating Base Classifiers
- Sampling training examples: train k classifiers on k subsets drawn from the training set.
- Using different learning models: use all the training examples, but apply different learning algorithms.
- Sampling features: train k classifiers on k subsets of features drawn from the feature space.
- Learning randomly: introduce randomness into the learning procedures.
Majority vote
(Figure: from the original training data D, Step 1 builds multiple classifiers C_1, C_2, ..., C_{t-1}, C_t; Step 2 combines them into a single classifier C*.)
Why does majority voting work?
Suppose there are 25 base classifiers, each with error rate ε = 0.35, and assume the errors made by the classifiers are uncorrelated. The ensemble is wrong only when at least 13 of the 25 classifiers are wrong, so the probability that the ensemble classifier makes a wrong prediction is
P(X >= 13) = Σ_{i=13}^{25} C(25, i) ε^i (1 - ε)^{25-i} ≈ 0.06
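A quick numerical check of this figure (a minimal Python sketch; only the values ε = 0.35 and 25 classifiers come from the slide):

```python
from math import comb

# Probability that a majority (>= 13 of 25) of independent base classifiers,
# each with error rate eps = 0.35, are wrong at the same time.
eps, n = 0.35, 25
p_ensemble_error = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
                       for i in range(13, n + 1))
print(round(p_ensemble_error, 3))  # ~0.06
```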
Bagging
- Generate a random sample from the training set.
- Repeat this sampling procedure to obtain a sequence of k independent training sets.
- A corresponding sequence of classifiers C1, C2, ..., Ck is constructed from these training sets, using the same classification algorithm for each.
- To classify an unknown sample X, let each classifier predict; the bagged classifier C* then combines the predictions of the individual classifiers into the final outcome (often by simple voting).
Bagging classifiers
Classifier generation: let n be the size of the training set. For each of t iterations: sample n instances with replacement from the training set (a bootstrap sample), apply the learning algorithm to the sample, and store the resulting classifier.
Classification: for each of the t classifiers, predict the class of the instance; return the class that was predicted most often.
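A minimal sketch of the bagging procedure above. Using scikit-learn's DecisionTreeClassifier as the base learner and NumPy for the bootstrap sampling and voting are assumptions for illustration; any learning algorithm could take its place:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed base learner

def bagging_fit(X, y, t=25, seed=0):
    """Train t classifiers, each on a bootstrap sample of size n."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)  # sample n instances with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Return the class predicted most often (class labels assumed to be non-negative ints)."""
    votes = np.array([m.predict(X) for m in models])  # shape (t, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```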
Bagging classifiers
(Figure: decision regions of bagged classifiers in a two-feature space.)
Boosting
The final prediction is a combination of the predictions of several predictors.
Differences between boosting and the previous methods:
- It is iterative: in boosting, each successive classifier depends on its predecessors, whereas in the previous methods the individual classifiers were independent.
- Training examples may have unequal weights.
- The errors of the previous classifier decide where to focus in the next iteration over the data: weights are set to emphasise the hard examples (the ones misclassified in previous iterations).
AdaBoost (algorithm)
W(x) is the distribution of weights over the N training points, with Σ_i W(x_i) = 1.
Initially assign uniform weights: W_0(x) = 1/N for all x.
At each iteration k:
- Find the best weak classifier C_k(x) using the weights W_k(x).
- Compute the error rate ε_k = Σ_i W_k(x_i) I(y_i ≠ C_k(x_i)) / Σ_i W_k(x_i).
- Give the classifier C_k its weight in the final hypothesis: α_k = log((1 - ε_k) / ε_k).
- For each x_i, set W_{k+1}(x_i) = W_k(x_i) exp[α_k I(y_i ≠ C_k(x_i))] (and renormalise).
Final classifier: C_FINAL(x) = sign[Σ_k α_k C_k(x)].
AdaBoost minimises the exponential loss function L(y, f(x)) = exp(-y f(x)).
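A minimal sketch of the loop above for labels in {-1, +1}. Using scikit-learn decision stumps as the weak learner and the explicit renormalisation step are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed weak learner (decision stump)

def adaboost_fit(X, y, k_rounds=50):
    """Minimal AdaBoost sketch for labels y in {-1, +1}, following the update rule above."""
    n = len(X)
    w = np.full(n, 1.0 / n)                 # W_0(x) = 1/N
    stumps, alphas = [], []
    for _ in range(k_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (stump.predict(X) != y).astype(float)   # I(y_i != C_k(x_i))
        eps = np.dot(w, miss) / w.sum()                # weighted error rate
        if eps <= 0 or eps >= 0.5:                     # stop if the weak learner is useless/perfect
            break
        alpha = np.log((1 - eps) / eps)                # classifier weight alpha_k
        w *= np.exp(alpha * miss)                      # boost the weight of misclassified points
        w /= w.sum()                                   # renormalise the distribution
        stumps.append(stump); alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """C_FINAL(x) = sign(sum_k alpha_k * C_k(x))."""
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)
```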
AdaBoost - example: original training set, with equal weights assigned to all training samples.
AdaBoost - example ROUND 1
AdaBoost - example ROUND 2
AdaBoost - example ROUND 3
AdaBoost - example
Random Forest
(Figure: many trees, each built from N examples and M features; take the majority vote of their predictions.)
Random Forest
Classifier generation: let n be the size of the training set. For each of t iterations: (1) sample n instances with replacement from the training set; (2) learn a decision tree such that the variable chosen for any new node is the best among m randomly selected variables; (3) store the resulting decision tree.
Classification: for each of the t decision trees, predict the class of the instance; return the class that was predicted most often.
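A minimal sketch of the procedure above. Using scikit-learn's DecisionTreeClassifier with max_features=m so that every split considers only m randomly selected variables is an implementation assumption; the slide does not prescribe a library:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed tree learner

def random_forest_fit(X, y, t=100, m="sqrt", seed=0):
    """t trees, each on a bootstrap sample; every split considers only m random variables."""
    rng = np.random.default_rng(seed)
    n = len(X)
    forest = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)               # bootstrap sample of size n
        tree = DecisionTreeClassifier(max_features=m,  # m random candidate variables per node
                                      random_state=int(rng.integers(1_000_000)))
        forest.append(tree.fit(X[idx], y[idx]))
    return forest

def random_forest_predict(forest, X):
    """Majority vote over the t trees (class labels assumed to be non-negative ints)."""
    votes = np.array([tree.predict(X) for tree in forest])
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
```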
Types of decision trees
Division by the kind of question being answered:
- Classification trees: answer questions such as "Will I get the loan?". Each leaf of a classification tree holds a decision (a class).
- Regression trees (e.g., CART, REPTree): answer questions such as "What is my creditworthiness?". Each leaf of a regression tree holds the mean value of the dependent (predicted) variable over all objects that fall into it.
- Model trees (e.g., M5, SMOTI): each leaf contains a linear (or non-linear) regression model; the tree seeks the most accurate parametric representation of the target function.
(Figure: example tree splitting on employment contract (yes / no), annual income (< 100 000 / >= 100 000), debt (< 30 000 / >= 30 000) and age (< 80 / >= 80), with an example decision in each leaf.)
Example target function: creditworthiness = 4 * annual income - 1.5 * debt - 0.7 * age
Regression trees
Building a regression tree: divide the predictor space into J distinct, non-overlapping regions R1, R2, ..., RJ. We make the same prediction for all observations in the same region: the mean of the responses of all training observations that fall into that region.
Regression trees
Regression trees
Regression trees
Recursive binary splitting
Regression tree - example
Overfitting
Regression Trees
Like decision trees, but:
- Splitting criterion: minimise intra-subset variation.
- Termination criterion: the standard deviation becomes small.
- Pruning criterion: based on a numeric error measure.
- Prediction: each leaf predicts the average response value of its instances.
Yields piecewise constant functions and is easy to interpret. A more sophisticated version: model trees.
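A small illustration of the piecewise constant behaviour, assuming scikit-learn's DecisionTreeRegressor and a synthetic noisy sine data set (both are illustrative choices, not part of the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # assumed implementation

# Noisy sine curve: a regression tree approximates it with a piecewise constant function,
# predicting the mean response of the training points that fall into each leaf (region).
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)  # shallow tree to limit overfitting
print(tree.predict([[1.0], [4.0]]))                  # leaf means for the two query points
```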
Model trees
- Build a regression tree; each leaf holds a linear regression function.
- Smoothing: factor in the ancestors' predictions via p' = (n p + k q) / (n + k), where p is the prediction passed up from below, q is the prediction of the model at the current node, n is the number of training instances reaching the node below, and k is a smoothing constant. The same effect can be achieved by incorporating the ancestor models into the leaves, but then a linear regression function is needed at each node.
- At each node, only a subset of the attributes is used to build the linear regression model: those occurring in the subtree (and possibly those occurring on the path to the root).
- Fast: a tree usually uses only a small subset of the attributes.
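A small sketch of the smoothing formula; the default k = 15 is a value commonly quoted for M5-style model trees and should be treated as an assumption:

```python
def smooth(p, q, n, k=15):
    """M5-style smoothing: combine the prediction p coming up from the subtree with the
    prediction q of the linear model at the current node (n = instances below, k = constant)."""
    return (n * p + k * q) / (n + k)

# Example: a leaf built from 8 instances predicts 12.0, the parent's linear model predicts 10.0;
# the smoothed prediction is pulled toward the ancestor's value.
print(smooth(12.0, 10.0, n=8))
```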
Building the tree
- Splitting: standard deviation reduction / squared error reduction.
- Termination of the splitting process: the standard deviation drops below 5% of its value on the full training set, or too few instances remain (e.g., < 4).
- Pruning: use a heuristic estimate of the absolute error of the linear regression models, (n + v) / (n - v) * average_absolute_error, where n is the number of instances and v the number of parameters. Greedily remove terms from the LR models to minimise the estimated error. Proceed bottom-up, comparing the error of the LR model at an internal node with the error of its subtree (this happens before smoothing is applied). Heavy pruning: a single model may replace a whole subtree.
Smoothing methods
Smoothing requires building a linear model for every internal node of the tree. It gives good results when:
- the models along the path predict different values,
- the models are built from a small number of training objects.
Other aspects
- Searching for solutions when the costs of under-estimation and over-estimation differ; different cost functions, e.g., LinLin, QuadQuad, LinEx.
- Missing values: surrogate splits.
Model trees / linear regression
https://www.geogebra.org/m/fue3hfrf
http://www.graphpad.com/quickcalcs/linear1/
Model trees / linear regression: beware when new data fall outside the range of the training data!
Global vs. local induction
Example of an artificial data set described by a function:
- Locally optimal (greedy) solution: the splits minimise the standard deviation; the first split, at the root, is on x1 > -1.2, and all subsequent tests are then consequences of that suboptimal split at the root.
- Globally optimal solution (shown in the figure for comparison).
Evolutionary algorithms and decision trees
Evolutionary algorithms:
- a family of optimisation methods inspired by the natural process of evolution,
- use population-based random variation and selection,
- a blend of interrelated techniques: genetic algorithms, evolution strategies, genetic programming, ...,
- effective at avoiding local minima.
Evolutionary algorithms as a tool for tree induction:
- allow a simultaneous search for the tree structure and all of its tests,
- make it possible to exploit knowledge of the problem.
Where next?
- Evolutionary tree induction
- Fuzzy decision trees
- Parallel and distributed algorithms: MPI/OpenMP/GPGPU, Hadoop, Hive
Soft Computing / Evolutionary Computation
Typical framework of an evolutionary algorithm (EA)
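Since the framework figure itself is not reproduced here, the following is a generic sketch of such a loop (the toy one-max fitness function, the rates and the population size are all illustrative assumptions):

```python
import random

def evolve(pop_size=50, n_bits=20, generations=100, cx_rate=0.9, mut_rate=0.01):
    """Generic EA loop: evaluate -> select parents -> cross over -> mutate -> new population."""
    fitness = lambda bits: sum(bits)  # toy objective: maximise the number of 1s
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        new_pop = []
        while len(new_pop) < pop_size:
            # fitness-proportional selection of two parents (+1 avoids all-zero weights)
            p1, p2 = random.choices(pop, weights=[fitness(c) + 1 for c in pop], k=2)
            if random.random() < cx_rate:               # one-point cross-over
                cut = random.randrange(1, n_bits)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            # bit-flip mutation on each offspring
            new_pop += [[b ^ (random.random() < mut_rate) for b in c] for c in (p1, p2)]
        pop = new_pop[:pop_size]
    return max(pop, key=fitness)

print(evolve())  # should converge towards the all-ones chromosome
```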
Selection
Selection is the procedure of picking parent chromosomes to produce offspring. Types of selection:
- Random selection: parents are selected randomly from the population.
- Proportional selection: the probability of picking each chromosome is P(x_i) = f(x_i) / Σ_j f(x_j).
- Rank-based selection: uses ranks instead of absolute fitness values, e.g. P(x_i) = (1/β)(1 - e^{-r(x_i)}), where r(x_i) is the rank of chromosome x_i.
Roulette Wheel Selection
Let i = 1, where i denotes the chromosome index.
Calculate P(x_i) using proportional selection; sum = P(x_i); choose r ~ U(0, 1).
while sum < r do: i = i + 1 (i.e. move to the next chromosome); sum = sum + P(x_i).
Return x_i as one of the selected parents.
Repeat until all parents are selected.
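A direct Python transcription of this pseudocode (the helper name roulette_wheel_select and the list-based interface are illustrative):

```python
import random

def roulette_wheel_select(population, fitness_values):
    """Pick one parent with probability proportional to fitness (roulette wheel)."""
    total = sum(fitness_values)
    probs = [f / total for f in fitness_values]  # P(x_i) = f(x_i) / sum_j f(x_j)
    r = random.random()                          # r ~ U(0, 1)
    cumulative = 0.0
    for individual, p in zip(population, probs):
        cumulative += p
        if cumulative >= r:                      # stop at the first chromosome whose
            return individual                    # cumulative probability reaches r
    return population[-1]                        # guard against floating-point round-off

# Repeat until all parents are selected:
# parents = [roulette_wheel_select(pop, fits) for _ in range(num_parents)]
```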
Reproduction
Reproduction is the process of creating new chromosomes from chromosomes in the population; parents are put back into the population after reproduction. Cross-over and mutation are the two parts of producing an offspring.
Cross-over: the process of creating one or more new individuals by combining genetic material randomly selected from two or more parents.
Cross-over
- Uniform cross-over: corresponding bit positions are randomly exchanged between the two parents.
- One-point cross-over: a random position is selected and the entire substring after it is swapped.
- Two-point cross-over: two positions are selected and the substring between them is swapped.
Example (Parent1 = 00110110, Parent2 = 11011011):
  Uniform:   Off-spring1 = 01110111, Off-spring2 = 10011010
  One-point: Off-spring1 = 00111011, Off-spring2 = 11010110
  Two-point: Off-spring1 = 01011010, Off-spring2 = 10110111
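A short sketch of the three operators on bit strings (helper names are illustrative; the parents in the usage line are the ones from the example above):

```python
import random

def one_point(p1, p2):
    """Swap the entire substring after a randomly chosen position."""
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def two_point(p1, p2):
    """Swap the substring between two randomly chosen positions."""
    a, b = sorted(random.sample(range(1, len(p1)), 2))
    return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]

def uniform(p1, p2):
    """Exchange corresponding bits between the parents, each with probability 0.5."""
    pairs = [(x, y) if random.random() < 0.5 else (y, x) for x, y in zip(p1, p2)]
    c1, c2 = zip(*pairs)
    return "".join(c1), "".join(c2)

print(one_point("00110110", "11011011"))  # parents from the example above
```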
Mutation
Mutation procedures depend on the representation scheme of the chromosomes. Mutation helps prevent all solutions in the population from falling into a local optimum. For a bit-vector representation:
- Random mutation: randomly negates bits.
- In-order mutation: performs random mutation only between two randomly selected bit positions.
Example (before mutation: 1110010011):
  Random mutation:   1100010111
  In-order mutation: 1110011010
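A matching sketch of the two mutation operators (helper names and mutation rates are illustrative):

```python
import random

def random_mutation(bits, rate=0.2):
    """Randomly negate bits: each position is flipped independently with the given probability."""
    return "".join(b if random.random() > rate else str(1 - int(b)) for b in bits)

def in_order_mutation(bits, rate=0.5):
    """Apply random mutation only between two randomly selected bit positions."""
    a, b = sorted(random.sample(range(len(bits)), 2))
    return bits[:a] + random_mutation(bits[a:b + 1], rate) + bits[b + 1:]

print(random_mutation("1110010011"))  # chromosome from the example above
```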
http://www.puremango.co.uk/genetic-hello-world.html http://boxcar2d.com/