Data Warehouses and Business Intelligence: A Technology Overview
Robert Wrembel
Poznan University of Technology, Institute of Computing Science
Robert.Wrembel@cs.put.poznan.pl
www.cs.put.poznan.pl/rwrembel

Topics
- Data warehouse system architectures
- Business Intelligence
- OLTP vs. OLAP processing
- Introduction to Big Data technologies
Goals of a Data Warehouse
1. Provide uniform access to all data collected across an enterprise
2. Provide a technology (platform) for analytical processing - OLAP/BI

Business Intelligence
OLAP - On-Line Analytical Processing
- classical data analysis (historical data, prediction - what-if analysis)
  - sales trend analysis
  - analysis of advertising spending vs. profits
  - phone traffic analysis
  - credit scoring
  - churn analysis
  - customer profiling
- most often expressed in SQL
Business Intelligence
BI = OLAP + data mining
- association rules, behavior profiles
- text analysis (Facebook, Twitter, ...): hot topics, national security
- analysis of social networks: leaders, dependencies
- browser log analysis

OLTP vs. OLAP

criterion           | OLTP                                       | OLAP
user                | "regular" user                             | analyst
function            | day-to-day operations, mission-critical    | decision support
data                | current, detailed                          | detailed, aggregated, historical
applications        | repetitive tasks                           | ad hoc
access              | read/write                                 | read
transaction         | short                                      | long (hours)
# processed records | a few to tens                              | millions or more
# users             | tens, thousands, hundreds of thousands     | a few to a dozen or so
DB size             | a few to hundreds of TB                    | > hundreds of TB
metric              | throughput (transactions per unit of time) | response time
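Since the slides note that OLAP analyses are most often expressed in SQL, here is a minimal sketch of such an aggregation query, run through Python's sqlite3; the sales table and its columns are invented for illustration.

```python
import sqlite3

# In-memory database with a tiny, invented sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", "2024-01", 100.0), ("north", "2024-02", 150.0),
     ("south", "2024-01", 80.0), ("south", "2024-02", 120.0)],
)

# A typical OLAP-style query: aggregate a measure over a dimension.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 250.0), ('south', 200.0)]
```

A real OLAP workload would run such read-only aggregations over historical data in a warehouse rather than over an in-memory toy table.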
BI Applications
Ad-hoc queries (about 10% of corporate applications)
- simple interface for presenting results
- ad-hoc computations
- drill-down, drill-across
Company reports (about 90% of corporate applications)
- advanced graphical layout
- library of predefined reports
- report subscriptions; schedules for refreshing and distributing reports
- user permissions to reports

BI Applications
Dedicated analytical applications
- analysis of revenues and promotions
- trend prediction, simulations
- include algorithms specialized for the application domain
Dashboards, scorecards, management cockpits
- interactive interface
- summary presentation of the most important data
- enterprise quality measures (KPIs - key performance indicators)
- alerts
BI Applications
Data mining
- computationally complex algorithms
- algorithms dedicated to the application domain
- visualization of results

Users
- active: 10% of all BI system users
- working concurrently: 1% of BI system users
System size
Small DW system
- DW: a few hundred MB
- tens of tables
- a few million rows in the fact table
- 300 users
- tens of reports, a few cubes
Large DW system
- DW: hundreds of TB
- hundreds of tables
- hundreds of millions of rows in the fact table
- a few thousand users
- over 1000 reports, hundreds of cubes

Business Intelligence
Two categories of data
- internal (within the company)
- external (the Internet)
Two different data-analysis architectures/technologies
- classical
- Big Data
Architecture 1 (basic)
Layers: data sources → middle layer (ETL software: extraction, transformation, cleaning, aggregation) → data warehouse (multidimensional model; detailed and aggregated data) → analytical layer
Advantages
- integrated data (consistent structure and values)
- fast data access
- independence from source failures
Disadvantages
- data redundancy
- data refreshing

Architecture 2
Layers: data sources → middle layer (ETL software: extraction, transformation, cleaning, aggregation) → operational data store (normalized data (3NF), detailed data, available for searching/analysis) → data warehouse (multidimensional model; detailed and aggregated data) → analytical layer
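The ETL steps of the middle layer above (extraction, transformation with cleaning and aggregation, loading) can be sketched as a toy pipeline; the record shapes and helper names are invented for illustration.

```python
# Minimal ETL sketch: extract -> transform (clean, aggregate) -> load.
# Source records and the target structure are invented for illustration.

def extract():
    # Extraction: pull raw records from a source system.
    return [
        {"product": "A", "qty": "2", "price": "10.0"},
        {"product": "a", "qty": "3", "price": "10.0"},
        {"product": "B", "qty": None, "price": "5.0"},  # dirty record
    ]

def transform(records):
    # Cleaning: skip incomplete rows, normalize product names.
    # Aggregation: sum revenue per product.
    totals = {}
    for r in records:
        if r["qty"] is None:
            continue  # cleaning step: drop dirty data
        product = r["product"].upper()
        totals[product] = totals.get(product, 0.0) + int(r["qty"]) * float(r["price"])
    return totals

def load(warehouse, totals):
    # Loading: write the aggregates into the warehouse-side store.
    warehouse.update(totals)

warehouse = {}
load(warehouse, transform(extract()))
print(warehouse)  # {'A': 50.0}
```

In the basic architecture these steps run outside the warehouse, in dedicated ETL software, before the data reach the multidimensional store.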
Architecture 3
As Architecture 2, extended with data marts (subject-oriented warehouses): data sources → middle layer (ETL software: extraction, transformation, cleaning, aggregation) → operational data store (normalized data (3NF), detailed data, available for searching/analysis) → data warehouse (multidimensional model; detailed and aggregated data) → data marts → analytical layer

The Allegro DW
C. Maar, R. Kudliński: Allegro on the way from XLS based controlling to a modern BI environment. National conference on Data Warehousing and Business Intelligence, Warsaw, 2008
ELT architecture
data sources → E+L → ODS → T+L → DW → analytical layer

ELT architecture
Efficiency
- the data already reside in the database
- they can be processed with dedicated database languages (PL/SQL, SQL PL, Transact-SQL)
- one server for both the ODS and the DW → higher load on that server
Data provenance
Drill through
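A minimal ELT sketch, assuming a SQL engine stands in for the ODS/DW server: raw data are loaded first, then the transformation is done by the database engine itself, in the spirit of the PL/SQL-style in-database processing mentioned above. sqlite3 and all table names are illustrative stand-ins.

```python
import sqlite3

# ELT: load raw data first, transform *inside* the database with SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE staging (customer TEXT, amount REAL)")

# E + L: extract and load raw rows straight into a staging table (ODS role).
db.executemany("INSERT INTO staging VALUES (?, ?)",
               [("alice", 10.0), ("alice", 5.0), ("bob", 7.0)])

# T + L: transform with the database engine and load the result (DW role).
db.execute("""
    CREATE TABLE dw_customer_totals AS
    SELECT customer, SUM(amount) AS total
    FROM staging GROUP BY customer
""")
rows = db.execute("SELECT * FROM dw_customer_totals ORDER BY customer").fetchall()
print(rows)  # [('alice', 15.0), ('bob', 7.0)]
```

The contrast with ETL is where the transformation runs: here no data leave the database server, which is the efficiency argument (and the load concern) from the slide.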
Experiment I
P. Wróblewski, M. Wojdowski: Implementacja i porównanie wydajności architektur ETL i ELT. Master thesis, Poznan University of Technology, 2014
Data sources
- Internet auctions
- Oracle11g (object-relational model), MySQL, PostgreSQL, XML
- a collection/table composed of 11 attributes
Data warehouse: Oracle11g

Experiment I
DW schema (figure)
Experiment I
Transformations
- dimensions
- fact table
Tools and architectures
- ETL: Oracle Data Integrator (ODI); ETL in a staging area on a separate server
- ELT: ODI; TL in a staging area on the same server as the DW
- ELT: ODI + materialized views (MVs); TL in a staging area on the same server as the DW
- ELT: stored packages (SPs); TL in a staging area on the same server as the DW
- ELT: SPs + MVs; TL in a staging area on the same server as the DW

Experiment I
Results: elapsed time of ETL (+ MV creation) [sec] vs. number of rows (figure)
Experiment II
K. Prałat, T. Skrzypczak, G. Stolarek: Efektywność ETL i ELT. Postgraduate studies, term project, Poznan University of Technology, 2014
Data source
- flight and weather data in the US, from 1986 until 2008
- 6 tables in Oracle11g
Data warehouse: Oracle11g

Experiment II
Data source schema (figure)
Experiment II
DW schema (figure)

Experiment II
Architecture (figure)
Experiment II
- ETL: Informatica
- ELT: Informatica (load), DB views (transform)

Commercial systems
Traditional
- Oracle11g, Hyperion Essbase - Oracle Corporation
- DB2 UDB - IBM
- Sybase IQ - Sybase
- MS SQL Server - Microsoft
- SAP Business Warehouse - SAP
- Teradata - Teradata
Main memory (in-memory)
- Netezza - IBM
- Exadata - Oracle
- SAP HANA - SAP
- XBone Server - Targit
- Teradata DW Appliance - Teradata
Gartner Report (figure)

Gartner Report
http://www.gartner.com/technology/reprints.do?id=1-1DZLPEP&ct=130207&st=sb
Assessment criteria
Integration
- BI infrastructure
- metadata management
- development tools
- collaboration
Information delivery
- reporting
- dashboards
- ad hoc query
- Microsoft Office integration
- mobile BI
Analysis
- online analytical processing (OLAP) - multidimensional analysis, what-if
- interactive visualization
- predictive modeling and data mining
- scorecards - aligning KPIs with strategic objectives
- prescriptive modeling, simulation, and optimization
OLAP/BI - technologies
Models
- ROLAP, MOLAP, HOLAP
Data storage
- indexes
- materialized views
- partitioning
- column storage / row storage
- compression of data and indexes
Query processing
- top-n queries
- star queries
Parallel and distributed processing
Data quality and ETL/ELT

OLAP/BI trends
- Big Data
- mobile BI
- in-memory BI
- real-time / right-time / active BI
- cloud computing
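The star queries listed above run against a star schema: a fact table joined to its dimension tables, filtered on dimension attributes and aggregated over a measure. A minimal sketch, with all table and column names invented for illustration:

```python
import sqlite3

# A toy star schema: one fact table and two dimension tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales (product_id INTEGER, time_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'laptop'), (2, 'phone');
    INSERT INTO dim_time VALUES (1, 2013), (2, 2014);
    INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 200.0), (2, 2, 50.0);
""")

# A star query: join the fact table to the dimensions, restrict on a
# dimension attribute, aggregate the fact measure.
rows = db.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_time t ON f.time_id = t.time_id
    WHERE t.year = 2014
    GROUP BY p.name ORDER BY p.name
""").fetchall()
print(rows)  # [('laptop', 200.0), ('phone', 50.0)]
```

The storage techniques from the slide (indexes, materialized views, partitioning, column storage) all exist to make exactly this join-and-aggregate pattern fast.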
Big Data
- Big Data storage
- internal company BI system
- Big Data analytics

Big Data
Huge Volume - every minute:
- 48 hours of video are uploaded onto YouTube
- 204 million e-mail messages are sent
- 600 new websites are created
- 600,000 pieces of content are created
- over 100,000 tweets are sent (~80 GB daily)
Sources
- social data
- web logs
- machine-generated data
Big Data
Sensors
- mechanical installations (refineries, jet engines, crude oil platforms, traffic monitoring, utility installations, irrigation systems)
  - one sensor on a turbine blade generates 520 GB daily
  - a single jet engine can generate 10 TB of data in 30 minutes
- telemedicine
- telecommunication

Big Data
High Velocity
- of data volume growth
- of uploading the data into an analytical system
Variety (heterogeneity) of data formats
- structured - relational data and multidimensional cube data
- unstructured or semistructured - text data, semantic Web XML/RDF/OWL data, geo-related data, sensor data
Veracity (Value) - the quality or reliability of data
Big Data - problems
Storage
- volume
- fast data access
- fast processing
Real-time analysis
- analyzing fast-arriving streams of data

Types of processing
Batch processing
- standard DW refreshing
Real-time / near real-time data analytics
- answers reflect the most up-to-date data as of the moment the query was sent
- the analytical results are updated after a query has been executed
Streaming analytics
- the system automatically updates analysis results as new pieces of data flow in
- as-it-occurs signals from incoming data, without the need to manually query for anything
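The streaming-analytics mode above can be sketched as an aggregate that is updated incrementally on every arriving event, with no manual re-querying; the event shape and class name are invented for illustration.

```python
# Streaming-analytics sketch: results are refreshed as each new piece of
# data arrives, instead of being recomputed on demand by a query.

class RunningAggregate:
    """Keeps a per-key count and sum, updated on every incoming event."""

    def __init__(self):
        self.counts = {}
        self.sums = {}

    def on_event(self, key, value):
        # Incremental update: no batch recomputation is needed.
        self.counts[key] = self.counts.get(key, 0) + 1
        self.sums[key] = self.sums.get(key, 0.0) + value

    def average(self, key):
        return self.sums[key] / self.counts[key]

agg = RunningAggregate()
for key, value in [("sensor-1", 10.0), ("sensor-1", 20.0), ("sensor-2", 5.0)]:
    agg.on_event(key, value)   # each event refreshes the result as it occurs
print(agg.average("sensor-1"))  # 15.0
```

Batch processing would instead collect the events and recompute the aggregate during a periodic DW refresh; the streaming variant trades that latency for per-event update cost.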
Real-time / near real-time architecture
- data stream → active component → main-memory engine (OLTP + OLAP)

Real-time / near real-time refreshing
- traditional DW vs. real-time DW: how refreshing interleaves with users' queries (figure)
Big Data architecture
- massive data processing server (MDPS): aggregation, filtering of clicks, tweets, Facebook likes, location information, ...
- real-time decision engine (RTDE)
- analytics server (AS)
- complex event processor (CEP)
- reporting server (RS)

Big Data architecture
Scalability
- RTDE - number of events handled
- MDPS - volume of data and frequency of data processing
- AS - complexity of computation, frequency of queries
- RS - types of queries, number of users
- CEP - number of events handled
Type of data
- RTDE - unstructured, semistructured (texts, tweets)
- MDPS - structured
- AS - structured
- RS - structured
- CEP - unstructured and structured
Big Data architecture
Workload
- RTDE - high write throughput
- MDPS - long-running data processing (I/O and CPU intensive): data transformations, ...
- AS - compute intensive (I/O and CPU intensive)
- RS - various types of queries
Technologies
- RTDE - key-value stores, in-memory
- MDPS - Hadoop
- AS - analytic appliances
- RS - in-memory, columnar DBs
Conclusion
- a very complex architecture with multiple components
- the need for integration

IBM architecture
Data warehouse augmentation: the queryable data store. IBM software solution brief.
Big Data architecture (figure)

Big Data architecture - Hadoop ecosystem component roles
- columnar storage and query
- coordinating and scheduling workflows
- a high-level language for expressing MapReduce processing
- data ingest
- coordinating and managing all the components
http://www.cloudera.com/content/cloudera/en/resources/library/training/apache-hadoop-ecosystem.html
Big Data architecture - component roles
- user interface to Hive
- SQL-like language for data analysis; supports selection, join, group by, ...
- high-level language for expressing MapReduce processing
- columnar storage and query, based on BigTable; manages TBs of data
- web log loader (log scraper); periodic loading into Hadoop; aggregating log entries (for offline analysis)
- coordinating and managing all the components; service discovery, distributed locking, ...

Big Data architecture - component roles
- coordinating and scheduling workflows; scheduling and managing Pig, Hive, Java, and HDFS actions
- RDB-like interface to data stored in Hadoop
- high-level languages for expressing MapReduce processing
- distributed web log loader (log scraper); periodic loading into Hadoop (for offline analysis)
Big Data architecture - component roles
- workflow (batch job) scheduler (e.g., data extraction, loading into Hadoop)
- SQL to Hadoop: a command-line tool for importing any JDBC data source into Hadoop
- distributed web log loader and aggregator (in real time)
- table-like data storage + in-memory caching
- managing the services, e.g., detecting the addition or removal of Kafka's brokers and consumers, load balancing

Big Data architecture - component roles
- high-level languages for expressing MapReduce processing
- UI for Hadoop (e.g., HDFS file browser, MapReduce job designer and browser, query interfaces for Hive, Pig, Impala, and Oozie, an application for creating workflows, Hadoop API)
- workflow coordination and scheduling
- distributed web log loader (log scraper); periodic loading into Hadoop (for offline analysis)
- coordinating and managing all the components
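The high-level languages mentioned above (Pig, Hive) compile scripts down to MapReduce jobs. A minimal word-count sketch of the underlying map/shuffle/reduce model (a generic illustration, not any specific engine's API):

```python
from collections import defaultdict

# Minimal MapReduce model: map emits (key, value) pairs, shuffle groups
# the pairs by key, reduce aggregates each group.

def map_phase(line):
    # Map: emit (word, 1) for every word in an input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group, here by summing the counts.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big", "data warehouse"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'warehouse': 1}
```

In a real cluster the map and reduce phases run in parallel across many nodes, and the shuffle moves data between them; that distribution is exactly what Hadoop provides underneath Pig and Hive.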
Big Data architecture - Windows Azure
- Java OM, Streaming OM, HiveQL, Pig Latin, .NET/C#/F#, (T)SQL, NoSQL, ETL
Tomasz Kopacz - Microsoft Polska: Windows Azure presentation, Poznan University of Technology, June 2013

Data stores - NoSQL
Key-value DB
- a collection of data structures, each represented as a pair: a key and a value
- data have no defined internal structure; the interpretation of complex values must be done by the application processing them
- operations: create, read, update (modify), and delete (remove) individual data items - CRUD
- the operations process only a single data item, selected by the value of its key
- Voldemort, Riak, Redis, Scalaris, Tokyo Cabinet, MemcacheDB, DynamoDB
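The key-value CRUD interface described above can be mimicked with a toy in-memory store; real systems (Redis, Riak, ...) add persistence, replication, and distribution on top of essentially this interface.

```python
# Toy in-memory key-value store with the CRUD interface from the slide.
# Values are opaque to the store: interpreting them is the application's job.

class KeyValueStore:
    def __init__(self):
        self._data = {}

    def create(self, key, value):
        self._data[key] = value

    def read(self, key):
        # Returns None for a missing key, like many key-value APIs.
        return self._data.get(key)

    def update(self, key, value):
        if key in self._data:
            self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

kv = KeyValueStore()
kv.create("user:1", '{"name": "Ann"}')   # the value is just an opaque string
kv.update("user:1", '{"name": "Anna"}')
print(kv.read("user:1"))
kv.delete("user:1")
print(kv.read("user:1"))  # None
```

Note that every operation touches exactly one item selected by its key; there is no query language and no cross-item operation, which is what keeps these stores simple and scalable.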
Data stores
Column family (column-oriented, extensible record, wide column)
- the definition of a data structure includes: a key definition, column definitions, column family definitions
- a column family is stored separately and is common to all data items (~ a shared schema)
- a column is stored with a data item and is specific to that data item
- CRUD interface
- HBase, Hypertable, Cassandra, BigTable, Accumulo, SimpleDB

Data stores
Document DB
- typically a JSON-based structure of documents
- SimpleDB, MongoDB, CouchDB, Terrastore, RavenDB, Cloudant
Graph DB
- nodes, edges, and properties represent and store data
- every node contains a direct pointer to its adjacent elements
- Neo4j, FlockDB, GraphBase, Meronymy SPARQL (RDF)
GFS
Google's implementation of a DFS (cf. The Google File System - whitepaper)
- a distributed file system for distributed, data-intensive applications
- storage for Google's data
Installation
- hundreds of TBs of storage, thousands of disks, over a thousand cheap commodity machines
The architecture treats component failures as the norm, hence
- fault tolerance
- error detection
- automatic recovery
- constant monitoring is required

GFS
Typical file size: multiple GB
Operations on files
- mostly appending new data: multiple large sequential writes; no updates of already appended data
- mostly large sequential reads; small random reads occur rarely
- file size at least 100 MB
- millions of files
GFS
- files are organized hierarchically in directories
- files are identified by their pathnames
- operations on files: create, delete, open, close, read, write, snapshot (creates a copy of a file or a directory tree), record append (multiple clients append data to the same file concurrently)
A GFS cluster includes
- a single master
- multiple chunk servers

GFS read protocol (figure)
1. client → master: file name, chunk index
2. master → client: chunk handle, chunk replica locations
3. client → one chunk server (replica): chunk handle, byte range
4. chunk server → client: data
The master manages the chunk servers and exchanges heartbeat messages with them.
S. Ghemawat, H. Gobioff, S-T. Leung. The Google File System. http://research.google.com/archive/gfs.html
Hadoop
Apache implementation of a DFS
http://hadoop.apache.org/docs/stable/hdfs_design.html

Example
In 2010 Facebook stored over 30 PB in Hadoop. Assuming
- 30,000 1 TB drives for storage
- a typical drive has a mean time between failures of 300,000 hours
then on average 2.4 disk drives fail daily.
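A quick check of the failure arithmetic above: with 30,000 drives and a 300,000-hour MTBF, the expected failure rate is 0.1 drives per hour, i.e., 2.4 per day.

```python
# Back-of-the-envelope check of the drive-failure estimate from the slide.
drives = 30_000
mtbf_hours = 300_000  # mean time between failures per drive

failures_per_hour = drives / mtbf_hours   # expected failures per hour
failures_per_day = failures_per_hour * 24
print(failures_per_day)  # 2.4 drive failures per day, as the slide states
```

This is why HDFS replicates every block by default: at this scale, disk failure is a routine event, not an exception.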
Integration with Hadoop
- IBM: BigInsights - Cloudera distribution + an IBM custom version of Hadoop called GPFS
- Oracle: BigData appliance, based on Cloudera, for storing unstructured content
- Informatica: HParser, to launch an Informatica process in MapReduce mode, distributed over the Hadoop servers
- Microsoft: a dedicated Hadoop version, supported by Apache, for Microsoft Windows and for Azure
- EMC Greenplum, HP Vertica, Teradata Aster Data, SAP Sybase IQ: provide connectors directly to HDFS

Technology application: % of organizations surveyed (figure)
Programming languages
Top languages for analytics, data mining, and data science (Sept 2013, source: http://www.datasciencecentral.com/profiles/blogs/top-languages-for-analytics-data-mining-data-science)
The most popular languages continue to be
- R (61%)
- Python (39%)
- SQL (37%)
- SAS (20%)

Programming languages
Growth from 2012 to 2013
- Pig Latin/Hive/other Hadoop-based languages: 19%
- R: 16%
- SQL: 14% (a result of the increasing number of SQL interfaces to Hadoop and other Big Data systems?)
Decline from 2012 to 2013
- Lisp/Clojure: 77%
- Perl: 50%
- Ruby: 41%
- C/C++: 35%
- Unix shell/awk/sed: 25%
- Java: 22%
Big Data
Questionnaire: 339 data management experts (December 2012)
Question: what are the plans for using Big Data in their organizations?
Answers
- 14% - highly probable
- 19% - no plans
Problems
- 21% - not enough knowledge about Big Data
- 15% - no clear profits from using Big Data
- 9% - poor data quality

Data Scientist (figure)
RDBMS vs. NoSQL: the future?
TechTarget: Relational database management system guide: RDBMS still on top
http://searchdatamanagement.techtarget.com/essentialguide/relational-database-management-system-guide-RDBMS-still-on-top
"While NoSQL databases are getting a lot of attention, relational database management systems remain the technology of choice for most applications"

RDBMS vs. NoSQL: the future?
R. Zicari: Big Data Management at American Express. Interview with Sastry Durvasula and Kevin Murray. ODBMS Industry Watch. Trends and Information on Big Data, New Data Management Technologies, and Innovation. Oct 2014, available at: http://www.odbms.org/blog/2014/10/big-data-management-american-express-interview-sastry-durvasula-kevin-murray/
"The Hadoop platform indeed provides the ability to efficiently process large-scale data at a price point we haven't been able to justify with traditional technology. That said, not every technology process requires Hadoop; therefore, we have to be smart about which processes we deploy on Hadoop and which are a better fit for traditional technology (for example, RDBMS)." - Kevin Murray
RDBMS
- conceptual and logical modeling: methodologies and tools
- rich SQL functionality
- query optimization
- concurrency control
- data integrity management
- backup and recovery
- performance optimization: buffer tuning, storage tuning, advanced indexing, in-memory processing
- application development tools

NoSQL
- flexible "schema", suitable for unstructured data
- massively parallel processing
- cheap hardware + open-source software
Some other trends
- Apache Derby: a Java-based ANSI SQL database
- Splice Machine: Derby (with a query optimizer redesigned to support parallel processing) on HBase (parallel processing) + Hadoop (parallel storage and processing)