Data Warehouses and Data Mining Izabela Szczęch Szymon Wilk Jerzy Stefanowski Institute of Computing Science Laboratory of Intelligent Decision Support Systems Poznań University of Technology Software Development Technologies Master studies, third semester Academic year 2008/09 (winter course)
1 Data Warehouses and Data Mining 2 Information about the Course
Goal: supporting decision makers.
Main Tasks:
Information processing: querying, basic statistical analysis, reporting using cross-tabs, tables, charts, or graphs, low-cost Web-based access tools integrated with Web browsers.
Analytical processing: OLAP operations for multidimensional data view and analysis.
Knowledge discovery: finding hidden patterns and associations, analytical models for prediction and clustering, visualization.
Applications: business data analysis (sales prediction, stock market prediction, direct marketing, CRM); computer vision and pattern recognition; Web mining (personalization, text categorization, recommender systems); forecast prediction; computer-aided diagnosis; prediction of gene structure...
Main Players on the Market: IBM, Oracle, Microsoft, Sybase, SAS, Cognos, Informatica, Business Objects, SPSS, Statistica, Insightful (S-Plus), R, Weka :) but also: Google and Yahoo!
Data mining is predicted to be one of the most revolutionary developments of the next decade. Data mining is one of 10 emerging technologies that will change the world. Life after ERP. What now? Your ERP system is in place. Now it's time for intelligence. It's often more important to creatively invent new data sources than to implement the latest academic variations on an algorithm. Those who ignore Statistics are condemned to reinvent it.
To be learned in the coming semester...
The aim of the course is to learn how to store, process and analyze large volumes of data.
Two perspectives are presented: design, implementation and use of data warehouse and data mining systems; design of algorithms for storing, processing and analyzing data.
Basic skills for: designing, implementing and using data warehouse and data mining systems; implementing efficient data analysis tools for dedicated applications; solving data analysis problems.
Data Models and Evolution of Database Systems Data models: hierarchical, network, relational, object-oriented, multidimensional. Database systems: operational (OLTP), analytical (OLAP).
Modeling of Data Warehouses Complex entity-relationship diagrams (ERD) for OLTP. Simple star schema for OLAP. Specific approach to data warehouse modeling. Example: mobile phone operator.
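The star schema idea can be sketched for the mobile phone operator example as one fact table of calls surrounded by dimension tables. This is a minimal illustration using Python's built-in sqlite3 module; every table name, column name and data row below is a hypothetical choice, not the schema used in the course.

```python
import sqlite3

# Hypothetical star schema: fact_call in the middle, three dimensions around it.
SCHEMA = """
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE dim_tariff   (tariff_id   INTEGER PRIMARY KEY, tariff_name TEXT);
CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE fact_call (
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    tariff_id   INTEGER REFERENCES dim_tariff(tariff_id),
    date_id     INTEGER REFERENCES dim_date(date_id),
    duration_s  INTEGER,  -- measure: call duration in seconds
    cost        REAL      -- measure: call cost
);
"""

def build_demo_warehouse():
    """Create an in-memory database with the schema and a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                     [(1, "Anna", "Poznan"), (2, "Piotr", "Warsaw")])
    conn.executemany("INSERT INTO dim_tariff VALUES (?, ?)",
                     [(1, "prepaid"), (2, "postpaid")])
    conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?)",
                     [(1, "2009-10-01", "2009-10", 2009),
                      (2, "2009-11-05", "2009-11", 2009)])
    conn.executemany("INSERT INTO fact_call VALUES (?, ?, ?, ?, ?)",
                     [(1, 1, 1, 60, 0.50), (1, 1, 2, 120, 1.00), (2, 2, 1, 300, 1.50)])
    return conn

def total_duration_by_city(conn):
    """Typical analytical query: one join per dimension, then aggregation."""
    return conn.execute("""
        SELECT c.city, SUM(f.duration_s)
        FROM fact_call f
        JOIN dim_customer c ON f.customer_id = c.customer_id
        GROUP BY c.city
        ORDER BY c.city
    """).fetchall()
```

Calling `total_duration_by_city(build_demo_warehouse())` sums call durations per customer city. Each analytical query needs only one join from the fact table to each dimension it uses, which is what keeps the star schema simple compared with a normalized OLTP diagram.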
OLAP Systems and MDX Language OLAP provides an effective solution for accessing and processing large volumes of high dimensional data: parallel access to data, sophisticated data structures, optimization. Access through multidimensional reports and query languages like MDX.
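Two basic OLAP operations, slice and roll-up, can be sketched in plain Python over a toy cube. This is not MDX and not how an OLAP server stores data; the dimension names, the `sales` measure and the function names are assumptions made only for illustration.

```python
from collections import defaultdict

# Toy fact rows with three dimensions (year, region, product) and one measure.
FACTS = [
    {"year": 2008, "region": "North", "product": "phone", "sales": 10},
    {"year": 2008, "region": "South", "product": "phone", "sales": 5},
    {"year": 2009, "region": "North", "product": "modem", "sales": 7},
    {"year": 2009, "region": "North", "product": "phone", "sales": 3},
]

def slice_cube(facts, dimension, value):
    """Slice: keep only the cells where one dimension has a fixed value."""
    return [row for row in facts if row[dimension] == value]

def roll_up(facts, dimensions, measure="sales"):
    """Roll-up: aggregate the measure over all dimensions not listed."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)
```

For example, `roll_up(FACTS, ["year"])` totals sales per year, and `roll_up(slice_cube(FACTS, "region", "North"), ["product"])` first fixes the region and then aggregates per product. A real OLAP server answers such requests from precomputed, indexed structures rather than by scanning rows.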
Processing of Very Large Data Data denormalization. Data aggregation. Materialized views. Query rewrite. Partitioning. Joins. Indexes. Optimization of query processing.
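Data aggregation and query rewrite can be illustrated with a precomputed summary table that stands in for a materialized view. SQLite has no materialized views, so the sketch below emulates one with `CREATE TABLE ... AS SELECT`; the table names, the data and the `use_summary` switch are hypothetical, and a real database would do the rewrite automatically in its optimizer.

```python
import sqlite3

def build():
    """Create a detail table and a precomputed monthly summary of it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (day TEXT, month TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
        ("2009-10-01", "2009-10", 100),
        ("2009-10-15", "2009-10", 50),
        ("2009-11-05", "2009-11", 70),
    ])
    # Summary table emulating a materialized view (assumption: it is
    # refreshed whenever the detail table changes).
    conn.execute("""
        CREATE TABLE sales_by_month AS
        SELECT month, SUM(amount) AS total FROM sales GROUP BY month
    """)
    return conn

def monthly_total(conn, month, use_summary=True):
    """Answer the same question from the summary or from the detail rows."""
    if use_summary:
        # "Rewritten" query: read one precomputed row.
        row = conn.execute(
            "SELECT total FROM sales_by_month WHERE month = ?", (month,)
        ).fetchone()
        return row[0] if row else 0
    # Original query: scan and aggregate the detail rows.
    row = conn.execute(
        "SELECT SUM(amount) FROM sales WHERE month = ?", (month,)
    ).fetchone()
    return row[0] or 0
```

Both paths return the same total. The summary path touches one row per month instead of every sale, which is where the gain comes from on very large tables.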
ETL Process Extraction, transformation and loading of data. Heterogeneous data sources: database systems, WWW, services, specific databases, .txt, .doc and .xls files. Data is integrated, transformed and cleansed. Data is loaded and the data warehouse is refreshed.
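The three ETL steps can be sketched end to end with Python's standard library. This is a minimal illustration under stated assumptions: the source is a small CSV text, the cleansing rules (trim names, normalize city capitalization, drop rows with a non-numeric amount) are invented for the example, and the target is an in-memory SQLite table named `sales`.

```python
import csv
import io
import sqlite3

# Made-up source data with typical defects: stray spaces, inconsistent
# capitalization and a missing numeric value.
RAW_CSV = """customer,city,amount
 Anna ,poznan,10.5
Piotr,WARSAW,n/a
Ewa,Poznan,4
"""

def extract(text):
    """Extract: read the source into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean the fields and drop rows that cannot be repaired."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # cleansing rule: skip rows whose amount is not a number
        clean.append((row["customer"].strip(), row["city"].strip().title(), amount))
    return clean

def load(conn, rows):
    """Load: append the cleaned rows to the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, city TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

def run_etl(text):
    """Run extract, transform and load against a fresh in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(text)))
    return conn
```

After `run_etl(RAW_CSV)`, the `sales` table holds only the two rows that passed cleansing. Keeping the three steps as separate functions makes it easy to swap the source (for example a database or a spreadsheet export) without touching the loading code.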
Time and Place Lecture: Thursday 13:30, room no. 6. Labs: Wednesday 9:45 and 11:45, room no. 44. Project: Tuesday 9:45, room no. 45 and Thursday 16:50, room no. 44.
Instructors
dr inż. Izabela Szczęch, izabela.szczech@cs.put.poznan.pl
dr inż. Szymon Wilk, szymon.wilk@cs.put.poznan.pl
prof. Jerzy Stefanowski, jerzy.stefanowski@cs.put.poznan.pl
Web site: ophelia.cs.put.poznan.pl/webdav/dm/students/.../winter_2009/
Schedule of the Lectures
10-01-2009 Data Warehouses and Data Mining
10-08-2009 Data Models and Evolution of Database Systems
10-15-2009 Modeling of Data Warehouses
10-22-2009 OLAP Systems and MDX Language
10-29-2009 Processing of Very Large Data
11-05-2009 ETL Process
...
Exam: to be announced
Schedule of the Laboratories
10-07-2008 Introduction to Data Warehouses (MS SQL2008)
10-14-2008 Modeling of Data Warehouses
10-21-2008 Modeling of Data Warehouses (Case Study)
10-28-2008 OLAP Systems: Multidimensional Reports
11-04-2008 MDX Language
11-11-2008 Holiday
11-18-2008 ETL Process
11-25-2008 OLAP, MDX, ETL (Case Study)
...
Schedule of the Laboratories Send me an email (before next Monday, 12:00) with a list of students in each lab group in the following format: Family_name \t First_name \t student_id \t email
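A line in the requested tab-separated format can be produced as in the short sketch below. The function name and the sample student are hypothetical, and the sketch assumes one student per line, which the slide does not state explicitly.

```python
def format_student(family_name, first_name, student_id, email):
    """Join the four fields with tab characters, in the requested order."""
    return "\t".join([family_name, first_name, str(student_id), email])

def format_group(students):
    """Format a whole lab group, one student per line (assumed layout)."""
    return "\n".join(format_student(*student) for student in students)
```

For instance, `format_student("Kowalski", "Jan", 12345, "jan.kowalski@example.com")` yields the four fields separated by single tabs.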
Project ophelia.cs.put.poznan.pl/webdav/dm/students/.../winter_2009/projects/projects.html Work in groups of 2 persons. 3 presentations: preliminary, middle, final. ophelia.cs.put.poznan.pl/webdav/dbdw/students/.../dbdw-summer_2008/projects/projects.html
Project Topics
Algorithms for generating association rules
Classification of text documents
Automatic data cleaning
A language and application for data processing and data mining
Comparison of OLAP servers
Queries over data streams
Your own topic proposals should be described and sent to the project instructor by Monday.
Final Evaluation
Lectures: exam/test (min. 50%)
Labs:
Case study: modeling data warehouses, 10 points (min. 50%)
Case study: OLAP, MDX, ETL, 10 points (min. 50%)
Evaluation of labs about Data Mining, 20 points (min. 50%)
Scale:
90% of points: 5.0
80% of points: 4.5
70% of points: 4.0
60% of points: 3.5
50% of points: 3.0
otherwise: 2.0
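The grading scale can be read as a threshold lookup. The sketch below assumes that each percentage is an inclusive lower bound (for example, exactly 90% earns 5.0), which the slide does not state explicitly; the names `GRADE_SCALE` and `grade` are illustrative.

```python
# (minimum percent of points, grade), ordered from the highest threshold down.
GRADE_SCALE = [(90, 5.0), (80, 4.5), (70, 4.0), (60, 3.5), (50, 3.0)]

def grade(percent):
    """Return the grade for a given percentage of points."""
    for threshold, mark in GRADE_SCALE:
        if percent >= threshold:  # assumption: thresholds are inclusive
            return mark
    return 2.0  # below 50%: failing grade
```

Under this reading, `grade(75)` gives 4.0 and anything below 50 gives 2.0.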
Bibliography
C. J. Date, Wprowadzenie do systemów baz danych, Wydawnictwa Naukowo-Techniczne 1999.
Z. Królikowski, Hurtownie danych: logiczne i fizyczne struktury danych, Wydawnictwo Politechniki Poznańskiej 2007.
Ch. Todman, Projektowanie hurtowni danych. Zarządzanie kontaktami z klientami (CRM), Wydawnictwa Naukowo-Techniczne 2003.
M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis, Hurtownie danych. Podstawy organizacji i funkcjonowania, Wydawnictwa Szkolne i Pedagogiczne 2003.
V. Poe, P. Klauer, S. Brobst, Tworzenie hurtowni danych, wspomaganie podejmowania decyzji, Wydawnictwa Naukowo-Techniczne 2000.
R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses, John Wiley & Sons 1998.
R. Kimball, M. Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, John Wiley & Sons 2002.
Bibliography
J. Koronacki, J. Ćwik, Statystyczne systemy uczące się, Wydawnictwa Naukowo-Techniczne 2005.
P. Cichosz, Systemy uczące się, Wydawnictwa Naukowo-Techniczne 2000.
D. Hand, H. Mannila, P. Smyth, Eksploracja danych, Wydawnictwa Naukowo-Techniczne 2006.
J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann 2000.
T. Hastie, R. Tibshirani, J. H. Friedman, Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer 2003.
R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, 2nd Edition, Wiley-Interscience 2000.
A. R. Webb, Statistical Pattern Recognition, 2nd Edition, Wiley 2002.
Ch. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press 2008, http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html