Agenda. WEKA: Basic Concepts. Data Example

WEKA: Basic Concepts
dr inż. Jacek Grekow

Agenda
- Classification
- Evaluation methods
- Using classifiers from the command line, parameters
- Visualization of classification results

Basic concepts

The input dataset is roughly equivalent to a two-dimensional table of data (a spreadsheet). In WEKA it is implemented by the class weka.core.Instances:

    java weka.core.Instances data/soybean.arff

A dataset is a list of examples, each of which is of type weka.core.Instance. Each example consists of a fixed number of attributes (features), which can be nominal (= one of a predefined list of values), numeric (= a real or integer number), or string (= a long list of characters enclosed in quotation marks).

Data example

The external representation of the examples is an ARFF file, which consists of:
- a header describing the attribute types,
- the data: each new line is a new example, made up of comma-separated attribute values.

    % This is a toy example, the UCI weather dataset.
    % Any relation to real weather is purely coincidental.

Comments, usually placed at the beginning of the file, describe the source of the data, its content, and its meaning.

    @relation golfweathermichigan_1988/02/10_14days

This line sets the name of the dataset.

    @attribute outlook {sunny, overcast, rainy}
    @attribute windy {TRUE, FALSE}

These lines define two nominal attributes, outlook and windy. The first has three values (sunny, overcast, and rainy); the second has two (TRUE and FALSE). Nominal values with special characters, commas, or spaces are enclosed in single quotes.

    @attribute temperature real
    @attribute humidity real

These lines define two numeric attributes. Instead of real, integer or numeric may be used. Only seven decimal digits are taken into account.

    @attribute play {yes, no}

By default, the last attribute is treated as the class (decision) attribute. In our case it is a nominal attribute with two values, which makes this a binary classification problem.

    @data
    sunny,FALSE,85,85,no
    sunny,TRUE,80,90,no
    overcast,FALSE,83,86,yes
    rainy,FALSE,70,96,yes
    rainy,FALSE,68,80,yes

The rest of the file starts with @data, followed by one line per example, with values separated by commas. In our case there are 5 examples.
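To make the header/data split concrete, here is a minimal sketch of reading such a file in Python. It is not WEKA's own loader: `parse_arff` is a hypothetical helper that handles only the simple constructs shown above (comments, @relation, @attribute, @data) and ignores quoted values, sparse data, and date attributes. The relation name `weather` is shortened for illustration.

```python
WEATHER_ARFF = """\
% This is a toy example, the UCI weather dataset.
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute windy {TRUE, FALSE}
@attribute temperature real
@attribute humidity real
@attribute play {yes, no}
@data
sunny,FALSE,85,85,no
sunny,TRUE,80,90,no
overcast,FALSE,83,86,yes
rainy,FALSE,70,96,yes
rainy,FALSE,68,80,yes
"""


def parse_arff(text):
    """Parse a simple ARFF string (hypothetical helper, not part of WEKA).

    Returns (relation, attributes, rows). Each attribute is (name, kind),
    where kind is "numeric", "string", or a list of nominal values.
    """
    relation = None
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                row[name] = float(value) if kind == "numeric" else value
            rows.append(row)
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, spec = rest.strip().partition(" ")
            spec = spec.strip()
            if spec.startswith("{"):
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            elif spec.lower() in ("real", "integer", "numeric"):
                kind = "numeric"
            else:
                kind = "string"
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True  # everything after this line is an example
    return relation, attributes, rows
```

Calling `parse_arff(WEATHER_ARFF)` returns the relation name, the five attribute definitions, and one dictionary per example; the last attribute in the list is the one WEKA would treat as the class by default.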

EXAMPLE

    java weka.core.Instances data/soybean.arff
    java weka.core.converters.CSVLoader data.csv > data.arff

Classification

Using a classifier from the command line: weka.classifiers.Classifier

EXAMPLE

    java weka.classifiers.rules.ZeroR -t weather.arff
    java weka.classifiers.trees.J48 -t weather.arff
    java weka.classifiers.trees.J48 -t data/weather.arff -i

Classification

Which algorithm should be used?
- We need a systematic way to evaluate an algorithm (classifier).
- We need a method that lets us compare algorithms and decide which one to use.

Classifier's performance: error rate

For classification problems, it is natural to measure a classifier's performance in terms of the error rate. The classifier predicts the class of each instance: if it is correct, that is counted as a success; if not, it is an error. The error rate is just the proportion of errors made over a whole set of instances, and it measures the overall performance of the classifier.

What we are interested in is the likely future performance on new data, not the past performance on old data. Is the error rate on old data likely to be a good indicator of the error rate on new data? The answer is a resounding no, not if the old data was used during the learning process to train the classifier.

The error rate on the training set is not likely to be a good indicator of future performance. Why? Because the classifier has been learned from the very same training data, any estimate of performance based on that data will be optimistic, and may be hopelessly optimistic.
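The optimism of the training-set error can be illustrated with a small Python sketch. `MemorizingClassifier` is a hypothetical, deliberately extreme learner that simply remembers its training examples; the training rows reuse the outlook and windy values of the weather examples, while the three test rows are made up for illustration.

```python
from collections import Counter


def error_rate(predicted, actual):
    """Proportion of instances whose predicted class differs from the actual one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    errors = sum(1 for p, a in zip(predicted, actual) if p != a)
    return errors / len(actual)


class MemorizingClassifier:
    """Hypothetical learner: looks up seen instances, else predicts the majority class."""

    def fit(self, instances, labels):
        self.table = dict(zip(instances, labels))
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, instances):
        return [self.table.get(x, self.default) for x in instances]


# (outlook, windy) pairs and play labels taken from the weather examples.
train_x = [("sunny", "FALSE"), ("sunny", "TRUE"), ("overcast", "FALSE"),
           ("rainy", "FALSE"), ("rainy", "FALSE")]
train_y = ["no", "no", "yes", "yes", "yes"]

# Made-up test instances, for illustration only.
test_x = [("rainy", "TRUE"), ("overcast", "TRUE"), ("sunny", "FALSE")]
test_y = ["no", "yes", "no"]

model = MemorizingClassifier().fit(train_x, train_y)
training_error = error_rate(model.predict(train_x), train_y)
test_error = error_rate(model.predict(test_x), test_y)
```

Because the model has seen every training instance, `training_error` says nothing about how it behaves on instances it has not seen; only `test_error` reflects that.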

Test set

To predict the performance of a classifier on new data, we need to assess its error rate on a dataset that played no part in the formation of the classifier. This independent dataset is called the test set. We assume that both the training data and the test data are representative samples of the underlying problem.

Large datasets

If lots of data is available, there is no problem: we take a large sample and use it for training, then take another, independent large sample of different data and use it for testing. Generally, the larger the training sample, the better the classifier, and the larger the test sample, the more accurate the error estimate.

Small datasets

In many situations the training data must be classified manually, and so must the test data, of course, to obtain error estimates. This limits the amount of data that can be used for training and testing, and the problem becomes how to make the most of a limited dataset. From this dataset, a certain amount is held over for testing and the remainder is used for training.

There's a dilemma here: to find a good classifier, we want to use as much of the data as possible for training; to obtain a good error estimate, we want to use as much of it as possible for testing.

Classification: evaluating a classifier, accuracy

There are various approaches to determining the performance of classifiers. The performance can most simply be measured by counting the proportion of correctly predicted examples in an unseen test dataset. This value is the accuracy, which is also 1 - ErrorRate.

Hold-out estimate

The simplest case is using a training set and a test set which are mutually independent. This is referred to as a hold-out estimate. Hold-out estimates may be computed by repeatedly resampling the same dataset, i.e. randomly reordering it and then splitting it into training and test sets with a specific proportion of the examples, collecting all estimates on the test data, and computing the average and standard deviation of the accuracy.
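The repeated hold-out procedure just described can be sketched in Python as follows. This is an illustration under stated assumptions, not WEKA code: `repeated_holdout` and `majority_learner` are hypothetical names, the two-thirds training proportion and the seed are arbitrary defaults, and the learner is a trivial majority-class baseline.

```python
import random
import statistics
from collections import Counter


def majority_learner(train_instances, train_labels):
    """Trivial baseline (assumption): always predict the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda instance: majority


def repeated_holdout(instances, labels, learner,
                     train_fraction=2 / 3, repeats=10, seed=1):
    """Return (mean, standard deviation) of accuracy over repeated random splits."""
    rng = random.Random(seed)
    indices = list(range(len(instances)))
    # Keep at least one instance on each side of the split.
    n_train = max(1, min(len(indices) - 1, round(train_fraction * len(indices))))
    accuracies = []
    for _ in range(repeats):
        rng.shuffle(indices)  # randomly reorder the dataset
        train, test = indices[:n_train], indices[n_train:]
        classify = learner([instances[i] for i in train],
                           [labels[i] for i in train])
        correct = sum(classify(instances[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Any function with the same `(train_instances, train_labels)` signature that returns a classifier function can be passed in place of `majority_learner`; `repeats` must be at least 2 for the standard deviation to be defined.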

Cross-validation

Here, a number of folds n is specified. The dataset is randomly reordered and then split into n folds of equal size. In each iteration, one fold is used for testing and the other n-1 folds are used for training the classifier. The test results are collected and averaged over all folds. This gives the cross-validation estimate of the accuracy.

Cross-validation (cont.): stratification

Each class in the full dataset should be represented in about the right proportion in the training and testing sets. You should ensure that the random sampling is done in such a way as to guarantee that each class is properly represented in both training and test sets. This procedure is called stratification, and we might speak of stratified holdout.

The folds can be purely random or slightly modified to create the same class distributions in each fold as in the complete dataset.

Stratified: layered, composed of strata.

Cross-validation (cont.)

A more general way to mitigate any bias caused by the particular sample chosen for holdout is to repeat the whole process, training and testing, several times with different random samples. In each iteration a certain proportion, say two-thirds, of the data is randomly selected for training, possibly with stratification, and the remainder is used for testing. The error rates on the different iterations are averaged to yield an overall error rate. This is the repeated holdout method of error rate estimation.

Stratified threefold cross-validation

In cross-validation, you decide on a fixed number of folds, or partitions of the data. Suppose we use three. Then the data is split into three approximately equal partitions, and each in turn is used for testing while the remainder is used for training. That is, use two-thirds for training and one-third for testing, and repeat the procedure three times so that, in the end, every instance has been used exactly once for testing. This is called threefold cross-validation, and if stratification is adopted as well (which it often is), it is stratified threefold cross-validation.

CV-10

The standard way of predicting the error rate of a learning technique given a single, fixed sample of data is to use stratified 10-fold cross-validation. The data is divided randomly into 10 parts in which the class is represented in approximately the same proportions as in the full dataset. Each part is held out in turn and the learning scheme is trained on the remaining nine-tenths; then its error rate is calculated on the holdout set.
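A minimal Python sketch of stratified k-fold cross-validation, assuming nothing about how WEKA implements it internally: `stratified_folds`, `cross_validated_error`, and `majority_learner` are hypothetical names, and the round-robin dealing is just one simple way to keep class proportions roughly equal across folds.

```python
import random
from collections import Counter, defaultdict


def majority_learner(train_instances, train_labels):
    """Trivial baseline (assumption): always predict the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda instance: majority


def stratified_folds(labels, k, seed=1):
    """Split instance indices into k folds with similar class proportions."""
    if not 1 < k <= len(labels):
        raise ValueError("k must be between 2 and the number of instances")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    ordered = []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # random order within each class
        ordered.extend(members)
    folds = [[] for _ in range(k)]
    for position, index in enumerate(ordered):
        folds[position % k].append(index)  # deal class by class, round-robin
    return folds


def cross_validated_error(instances, labels, learner, k=10, seed=1):
    """Average the error rate over k held-out folds."""
    fold_errors = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        classify = learner([instances[i] for i in train],
                           [labels[i] for i in train])
        wrong = sum(classify(instances[i]) != labels[i] for i in held_out)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / k
```

Every index lands in exactly one fold, so each instance is used for testing exactly once, and per-class counts differ by at most one between folds.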

CV-10 (cont.)

Thus the learning procedure is executed a total of 10 times on different training sets (which have a lot in common). Finally, the 10 error estimates are averaged to yield an overall error estimate.

Leave-one-out (= loo)

Leave-one-out (= loo) cross-validation signifies that n is equal to the number of examples. Out of necessity, loo cv has to be non-stratified, i.e. the class distributions in the test set are not related to those in the training data. Therefore loo cv tends to give less reliable results. However, it is still quite useful in dealing with small datasets.

Each instance in turn is left out, and the learning method is trained on all the remaining instances. It is judged by its correctness on the left-out instance: one or zero for success or failure, respectively. The results of all n judgments, one for each member of the dataset, are averaged, and that average represents the final error estimate.

Classifiers: command-line parameters

Classifiers are at the core of WEKA. There are a lot of common options for classifiers, most of which are related to evaluation purposes.

-t  Specifies the training file (ARFF format).

-T  Specifies the test file (ARFF format). If this parameter is missing, a cross-validation will be performed (default: ten-fold cv).

-x  Determines the number of folds for the cross-validation. A cv will only be performed if -T is missing.

-c  Sets the class variable with a one-based index.

-d  Saves the model after training. Each classifier has a different binary format for the model, so it can only be read back by the exact same classifier on a compatible dataset. Only the model on the training set is saved, not the multiple models generated via cross-validation.

-l  Loads a previously saved model, usually for testing on new, previously unseen data. In that case, a compatible test file should be specified, i.e. one with the same attributes in the same order.

-p #  If a test file is specified, this parameter shows you the predictions and one attribute (0 for none) for all test instances.

-i  Additionally outputs a more detailed performance description via precision, recall, and true and false positive rate. All these values can also be computed from the confusion matrix.

    java weka.classifiers.trees.J48 -t data/weather.arff -i

-o  Switches the human-readable output of the model description off. In case of support vector machines or NaiveBayes, this makes some sense unless you want to parse and visualize a lot of information.

    java weka.classifiers.trees.J48 -p 5 -l test.model -T nowe.arff
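As noted for the -i option, precision, recall, and the true and false positive rates can all be derived from the confusion matrix. The Python sketch below shows one way to do that; `class_metrics` is a hypothetical helper, the matrix values are made up, and the layout (rows = actual class, columns = predicted class) is an assumption chosen to match how WEKA prints its confusion matrix.

```python
def class_metrics(matrix, class_index):
    """Per-class metrics from a confusion matrix (rows = actual, columns = predicted)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = matrix[class_index][class_index]
    fn = sum(matrix[class_index]) - tp  # actual class, predicted as something else
    fp = sum(matrix[row][class_index] for row in range(size)) - tp
    tn = total - tp - fn - fp

    def ratio(numerator, denominator):
        # Report 0.0 instead of failing when a class never occurs.
        return numerator / denominator if denominator else 0.0

    return {
        "tp_rate": ratio(tp, tp + fn),
        "fp_rate": ratio(fp, fp + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),  # recall is the same quantity as the TP rate
    }


# Made-up two-class matrix for illustration: rows and columns ordered (yes, no).
example_matrix = [
    [7, 2],  # actual yes: 7 predicted yes, 2 predicted no
    [1, 4],  # actual no:  1 predicted yes, 4 predicted no
]
yes_metrics = class_metrics(example_matrix, 0)
no_metrics = class_metrics(example_matrix, 1)
```

The same function works for more than two classes, since each class is treated in turn as the positive class against all others.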

Visualization of classification results

Saving classification results

Example in WEKA Explorer

Thank you

Questions...