Discovering interdependent features in high-dimensional data



Discovering interdependent features in high-dimensional data Joint work with Michał Dramiński Poznań, 22 April 2016

MCFS-ID Algorithm of Dramiński et al.: the Monte Carlo Feature Selection (or MCFS) part Let us begin with a brief description of an effective method for ranking features according to their importance for classification, regardless of the classifier to be used later. Our procedure is conceptually very simple, albeit computer-intensive. We consider a particular feature to be important, or informative, if it is likely to take part in the process of classifying samples into classes more often than not. This readiness of a feature to take part in the classification process, termed the relative importance (RI) of the feature, is measured via intensive use of classification trees. When assessing the relative importance of a feature, the aforementioned readiness of the feature to appear in a given tree is suitably moderated by the (weighted) accuracy of this tree.

MCFS-ID Algorithm: the MCFS part, contd. In the main step of the procedure, we estimate the relative importance of features by constructing thousands of trees for randomly selected subsets of features. More precisely, out of all d features, s subsets of m features are selected, m being fixed and m << d, and for each subset of features, t trees are constructed and their performance is assessed. Each of the t trees in the inner loop is trained and evaluated on a different, randomly selected pair of training and test sets obtained by splitting the full training data in two: each time, out of all n samples, about 66% are drawn at random for training (in such a way as to preserve the class proportions of the full training set) and the remaining samples are used for testing.
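The two-level sampling scheme above can be sketched in a few lines of plain Python. This is only an illustration of the loop structure, not the authors' implementation: all function names are hypothetical, and the comment marks where actual tree induction and accuracy-weighted RI accumulation would take place.

```python
import random

def stratified_split(labels, train_frac=0.66, rng=random):
    """Split sample indices into train/test sets, preserving class proportions."""
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, test = [], []
    for members in by_class.values():
        members = members[:]
        rng.shuffle(members)
        k = max(1, round(train_frac * len(members)))
        train += members[:k]
        test += members[k:]
    return train, test

def mcfs_sampling(d, labels, s=10, m=3, t=5, rng=random):
    """Outer loop: s random subsets of m << d features.
    Inner loop: t trees per subset, each on a fresh stratified 66/34 split.
    Returns, per feature, how many trees it was available to (a stand-in
    for accumulating relative importance)."""
    exposure = [0] * d
    for _ in range(s):
        subset = rng.sample(range(d), m)
        for _ in range(t):
            train, test = stratified_split(labels, rng=rng)
            # Here a tree would be trained on (train, subset) and scored
            # on test; its weighted accuracy would moderate each used
            # feature's contribution to RI.
            for f in subset:
                exposure[f] += 1
    return exposure
```

In expectation each feature is exposed to s·t·m/d trees; in the real algorithm each such exposure contributes to RI only to the extent the feature is actually used in the tree, weighted by the tree's accuracy.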


Interdependency Discovery, i.e., the ID part of the MCFS-ID Algorithm In the MCFS part of the algorithm, a cutoff between informative and non-informative features is provided. From now on, our interest is confined to the set of informative features. Our approach to interdependency discovery differs significantly from known approaches, which consist in finding correlations between features or finding groups of features that behave similarly in some sense across samples (e.g., as in finding co-regulated features). Our focus is instead on identifying features that cooperate in determining that a sample belongs to a particular class. A directed graph of such cooperating features is constructed.

The ID part of MCFS-ID Algorithm, contd. To be more specific: for a given training set of samples, an ensemble of decision trees has been constructed within the MCFS part of the algorithm. Each decision rule provided by each tree has the form of an ordered conjunction of conditions imposed on particular, separate features. (Note that trees are flexible classifiers, where flexibility amounts to the classifier's ability to produce rules as complex as needed.) Clearly then, each decision rule points to some interdependencies between the features appearing in its conditions. Indeed, the information included in such decision rules, when properly aggregated, reveals interdependencies (however complex they may prove) between features which are best correlated with, or, as has been said, cooperate in determining, the samples' classes.

The ID part of MCFS-ID Algorithm, contd. To see how an ID-Graph is built, let us recall again that each node in each of the multitude of classification trees represents a feature on which a split is made. Now, for each node in each classification tree, all of its antecedent nodes along the path to which the node belongs can be taken into account. For each pair [antecedent node → given node] we add one directed edge to our ID-Graph, from the antecedent node to the given node. The edges are found along the paths in all the s·t MCFS trees. Clearly, the same edge can appear more than once, even in a single tree.

The ID part of MCFS-ID Algorithm, contd. The strength of the interdependence between two nodes (actually, two features) connected by a directed edge, termed the ID weight of the edge (ID weight for short), is defined in the following way. For node n_k(τ) in the τ-th tree, τ = 1, ..., s·t, and its antecedent node n_i(τ), the ID weight of the directed edge from n_i(τ) to n_k(τ), denoted w[n_i(τ) → n_k(τ)], is equal to

w[n_i(τ) → n_k(τ)] = GR(n_k(τ)) · (no. of samples in n_k(τ)) / (no. of samples in n_i(τ)),   (1)

where GR(n_k(τ)) stands for the gain ratio of node n_k(τ), and (no. of samples in n_k(τ)) and (no. of samples in n_i(τ)) denote the number of samples in the respective nodes.
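Formula (1) is straightforward to evaluate; a one-line helper (the function name is ours, for illustration) makes its semantics explicit:

```python
def id_weight(gain_ratio_k, n_samples_k, n_samples_i):
    """ID weight of the directed edge n_i -> n_k, per formula (1):
    the child node's gain ratio, scaled by the fraction of the
    antecedent node's samples that reach the child."""
    return gain_ratio_k * (n_samples_k / n_samples_i)
```

Since n_k lies on the path below n_i, we have n_samples_k ≤ n_samples_i, so the scaling factor lies in (0, 1]: edges far below an antecedent, reached by few of its samples, are down-weighted accordingly.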

The ID part of MCFS-ID Algorithm, contd. The final ID-Graph is based on the sums of all ID weights for each pair [antecedent node given node]. That is, for each directed edge found, its ID weights are summed over all occurrences of this edge in all paths of all MCFS classification trees. For a given edge, it is this sum of ID weights which becomes the ID weight of this edge in the final ID-Graph.
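The aggregation step above can be sketched over a toy tree encoding. The nested-dict representation of a tree (one dict per node, holding its split feature, sample count and gain ratio) is ours, chosen for illustration only; summing the returned weights over all s·t MCFS trees yields the final ID-Graph.

```python
from collections import defaultdict

def accumulate_id_weights(tree, graph=None, ancestors=None):
    """Walk one decision tree; for every pair [antecedent node -> given node]
    add the edge's ID weight (formula (1)) into `graph`, keyed by the
    (antecedent feature, given feature) pair."""
    if graph is None:
        graph = defaultdict(float)
    if ancestors is None:
        ancestors = []
    for anc in ancestors:
        # Gain ratio of the given node, scaled by the share of the
        # antecedent's samples reaching it.
        w = tree["gain_ratio"] * (tree["n_samples"] / anc["n_samples"])
        graph[(anc["feature"], tree["feature"])] += w
    for child in tree.get("children", []):
        if "feature" in child:  # leaves carry no split feature
            accumulate_id_weights(child, graph, ancestors + [tree])
    return graph
```

Passing the same `graph` into successive calls accumulates ID weights across trees, which is exactly the summation that produces the final ID-Graph; repeated occurrences of an edge, even within a single tree, are summed as well.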

The ID part of MCFS-ID Algorithm, contd. In sum, an ID-Graph provides a general roadmap that not only shows the most discriminative attributes, which allow for efficient classification of the objects, but also points to possible interdependencies between the attributes and, in particular, to a hierarchy between pairs of attributes. High differentiation of the ID-weight values in the ID-Graph gives strong evidence that some interdependencies between features are much stronger than others and that they form patterns/paths calling for interpretation based on background knowledge.

The ID part of MCFS-ID Algorithm - a toy example Consider objects from 3 classes, A, B and C, that contain 40, 20 and 10 objects, respectively (70 objects altogether). For each object, create 6 binary features (A1, A2, B1, B2, C1 and C2) that are ideally or almost ideally correlated with the class feature. If an object's class equals A, then its features A1 and A2 are set to the class value "A"; otherwise A1 = A2 = 0. If an object's class is B or C, we proceed analogously, but we introduce some random corruption to 2 observations from class B and to 4 observations from class C: in the former case, for each of the two observations, the value "B" of both attributes B1 and B2 is replaced by 0, and in the latter case, for each of the four observations, the value "C" of both attributes C1 and C2 is replaced by 0. The data also contain an additional 500 random numerical features with values uniformly distributed on [0,1]. Thus we end up with 6 nominal important features (3 pairs with different levels of importance for classification) and 500 random, uninformative ones.
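The toy data set described above can be generated in a few lines; the sketch below (pure Python, function name and row encoding are ours) follows the text exactly, including the corruption counts of 2 for class B and 4 for class C.

```python
import random

def make_toy_data(n_random=500, seed=0):
    """Build the 70-object toy set: 40 A, 20 B, 10 C objects; six
    class-linked features A1..C2 plus n_random uniform [0,1) features."""
    rng = random.Random(seed)
    labels = ["A"] * 40 + ["B"] * 20 + ["C"] * 10
    rows = []
    for y in labels:
        # Class-linked features carry the class value only for their class.
        row = {f: (y if f.startswith(y) else "0")
               for f in ("A1", "A2", "B1", "B2", "C1", "C2")}
        row.update({f"X{i}": rng.random() for i in range(1, n_random + 1)})
        row["class"] = y
        rows.append(row)
    # Corrupt 2 objects of class B and 4 of class C: replace the class
    # value of both class-specific attributes by 0.
    for cls, k in (("B", 2), ("C", 4)):
        for r in rng.sample([r for r in rows if r["class"] == cls], k):
            r[cls + "1"] = r[cls + "2"] = "0"
    return rows
```

With these counts, B1/B2 agree with the class on 18 of 20 class-B objects and C1/C2 on only 6 of 10 class-C objects, which (together with class sizes) produces the three different levels of importance seen in the ranking.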

The ID part of MCFS-ID Algorithm - a toy example Figure: Top features selected by MCFS-ID, ranked by normalized relative importance (RI_norm): A1, A2, B2, B1, C2 and C1 fall above the importance cutoff, while random features such as X418, X290, X236 and X356 fall below it.

The ID part of MCFS-ID Algorithm - a toy example Figure: ID-Graph for the artificial data (nodes A1, A2, B1, B2, C1, C2).

The ID part of MCFS-ID Algorithm - a toy example In ID-Graphs, as seen in the Figure, some additional information is conveyed by graphical means: the color intensity of a node is proportional to the corresponding feature's RI; the size of a node is proportional to the number of edges incident to it; and the width and darkness of an edge are proportional to the edge's ID weight. Since we would like to review only the strongest ID weights, let us plot the ID-Graph with only the top 12 edges.

The ID part of MCFS-ID Algorithm - a toy example Figure: ID-Graph for the artificial data, limited to the top 6 features and the top 12 ID weights.

Discovering interactions on a finer level The ID-Graph does not tell the classes apart: it does show which interdependencies make the samples belong to different classes, but it does not give rules that determine any given class. Accordingly, and separately, a way to construct rule networks is also provided; the networks are built from IF-THEN rules, with one network per decision class. Please see Bornelöv, Marillet and Komorowski (2014) and Dramiński et al. (2016) for our proposal. In lieu of a conclusion, let us state only that while the current version of MCFS-ID is new, it is already included in CRAN (The Comprehensive R Archive Network). Moreover, along with a module to discover rule networks, its explanatory power has been verified on a number of molecular and medical examples.

Acknowledgements We thank our close collaborators, Jan Komorowski, Michal J. Dabrowski, Klev Diamanti, Marcin Kierczak, Marcin Kruczyk and Susanne Bornelöv, who have built on the MCFS-ID, most notably by providing a host of new insights and results within the area of bioinformatics.

Bibliography:

Bornelöv S., Marillet S., Komorowski J.: Ciruvis: a web-based tool for rule networks and interaction detection using rule-based classifiers. BMC Bioinformatics. 2014; 15:139.

Dramiński M., Koronacki J., Komorowski J.: A study on Monte Carlo Gene Screening. In: Intelligent Information Processing and Web Mining. 2005; Springer, 349-356.

Dramiński M., Rada-Iglesias A., Enroth S., Wadelius C., Koronacki J., Komorowski J.: Monte Carlo feature selection for supervised classification. Bioinformatics. 2008; 24, 110-117.

Dramiński M., Kierczak M., Koronacki J., Komorowski J.: Monte Carlo Feature Selection and Interdependency Discovery in Supervised Classification. In: Koronacki J., et al. (eds): Advances in Machine Learning II. 2010; Springer series: Studies in Computational Intelligence, Vol. 263, 371-385.

Dramiński M., Dąbrowski M.J., Diamanti K., Koronacki J., Komorowski J.: Discovering networks of interdependent features in high-dimensional problems. In: N. Japkowicz and J. Stefanowski (eds.), Big Data Analysis: New Algorithms for a New Society. 2016; Springer, 285-304.

Please consult the given references for a vast literature on the topic discussed.