Jacek Kitowski, Department of Computer Science AGH, Computer Systems Group (CSG), C2 building, 4th floor, tel. 617-35-20, email: kito@agh.edu.pl, http://www.icsr.agh.edu.pl, http://www.icsr.agh.edu.pl/~kito/arch2007
Lecture plan: Introduction and problem scope; High Performance Computing (HPC); High Throughput Computing (HTC); basic concepts; methods of assessing system performance; processor development trends; open-systems computers and programming models; computer taxonomies (control, address-space organization, granularity, communication layer); multiprocessor systems; parallel programming; high-availability systems; Grid Computing; examples
Aim of the lecture: development trends in HPC hardware (and software); practical use, i.e. the synergy of: the computational problem and its data structures; the means and tools of computing and their programming models; computer architecture. Donald E. Knuth, Niklaus Wirth, John von Neumann
Literature: David E. Culler, Jaswinder Pal Singh, Parallel Computer Architecture, Morgan Kaufmann, 1999; R.W. Hockney, C.R. Jesshope, Parallel Computers 2: Architecture, Programming, Environments, Adam Hilger, 1992; V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing, Benjamin/Cummings, 1994; S. Kozielski, Z. Szczerbiński, Komputery równoległe: architektura i elementy oprogramowania, WNT, 1993; Linda Null, Julia Lobur, Struktura organizacyjna i architektura systemów komputerowych, Helion, 2004; D.A. Patterson, J.L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann/Elsevier, 2009; I. Foster, C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, 1998
Literature (cont.): William Stallings, Organizacja i architektura systemu komputerowego. Projektowanie systemu a jego wydajność, WNT, 2004; Jacek Kitowski, Współczesne Systemy Komputerowe, Wyd. CCNS, 2000; D. Loshin, Superkomputery bez tajemnic, MIKOM, Warszawa, 1997; Zbigniew Weiss, Tadeusz Gruźlewski, Programowanie współbieżne i rozproszone, WNT, 1993; M. Ben-Ari, Podstawy programowania współbieżnego i rozproszonego, WNT, 1996; K. Zieliński (ed.), Środowiska programowania rozproszonego w sieciach komputerowych, Księgarnia Akademicka, Kraków, 1994; T.H. Cormen, C.E. Leiserson, R.L. Rivest, Wprowadzenie do algorytmów, WNT, Warszawa, 1998
A democratic, dynamic hit list (time horizon: 2007): 1. Wireless and mobility (WLAN, WiFi, Bluetooth, Palm); 2. Increasing network throughput (compression, content delivery); 3. Security (protecting data from loss; protection against unauthorized access); 4. E-commerce, e-banking, e-science, e-learning, e-government, e-...; 5. Multimedia (growth of streaming, entertainment); 6. Improving CRM systems (interfaces for PDAs, interactive TV, UMTS, speech synthesis); 7. Growth of IT in companies (intranet, VPN, WWW, virtual organizations); 8. Outsourcing. Source: Marek Hołyński, ATM Warszawa: Nowe technologie teleinformatyki, Sieci Rozległe, Zakopane, 2003
A democratic, dynamic hit list (time horizon: 2007). Did not make the list: solving the last-mile problem; grid computing; consolidation of the software market; molecular computers; VoIP; most of AI (expert systems, virtual reality, neural networks, agent systems)
Introduction: architectures in context. Applications: scientific and engineering, industrial, commercial; then and now. Issues: technical parameters and software functionality
What are we talking about? Program = algorithm + data. Processing = user + computer + program. Processing model = processing + management + communication. Processing organization = language + level + virtual machine + implementation of the processing model. Architecture = hardware + software (many processors)
Quality of processing: a multifaceted notion, dependent on the goal, the implementation... Traditionally: easier, more reliably, cheaper, faster, more. High Performance Computing (HPC) versus High Throughput Computing (HTC): HPC concerns a single job (execution time, speedup, efficiency...); HTC concerns the throughput of an installation, i.e. performance over a long horizon, e.g. a month
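The HPC metrics named above (execution time, speedup, efficiency) can be made concrete with a small sketch; the run times in the example are invented for illustration:

```python
# Standard HPC metrics: speedup S(p) = T(1)/T(p), efficiency E(p) = S(p)/p.

def speedup(t_serial, t_parallel):
    """Speedup of a parallel run over the serial run."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """Fraction of ideal (linear) speedup achieved on p processors."""
    return speedup(t_serial, t_parallel) / p

# Example: a job taking 100 s serially runs in 20 s on 8 processors.
s = speedup(100.0, 20.0)        # 5.0
e = efficiency(100.0, 20.0, 8)  # 0.625
```

An efficiency well below 1 is typical: communication and load imbalance keep real codes away from linear speedup.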
Performance growth (Intel; Moore's Law, 1965). [Chart, source: IEEE Spectrum (1989):] 1976: 2.7 MHz, 0.045 IPC. 1993: 60 MHz, 0.92 IPC, i.e. roughly 21x architectural (IPC) improvement and 22x clock-cycle improvement, about 460x overall. 1997: 300 MHz (110x), 2.0 IPC (42x), about 4600x versus 1976.
Operating systems: proprietary; commercial Unix variants; Linux; MS Windows; dedicated (e.g. BLRTS on IBM BlueGene/L). Gartner Group OS rankings, 12/97: HP-UX 45, Solaris on SPARC 42, IBM AS/400 40, IBM AIX 39, NT on Intel 36, Digital UNIX 35, SCO on Intel 32, Digital Alpha NT 30, Solaris on Intel 27, Digital OVMS 26. Blue Gene/L compute nodes run BLRTS: flat memory space (no paging, static TLB); no fork() or threads; limited exec(); no dynamic libraries; no stack/heap overwrite protection; no Python, no Java. Programming modes: communication coprocessor; virtual node ("heater"). Ack: Kamil Iskra, Argonne (Cracow Grid Seminar)
Growth of processing speed depends on: technology; computer organization; the size and number of tasks; the processing style; the management style (processes mapped to processors); the problem type (NP, real-time systems...). [Chart: clock rates (MHz, log scale 10-1000) of microprocessors vs. DRAM, 1990-2000]
Dependencies: programming model, programming environment, algorithm, data, programming language, management
Supercomputing goes personal.
1991: Cray Y-MP C916; 16 x vector, 4 GB, bus; UNICOS; ~10 GFlops; Top500 #1; $40,000,000; customers: government labs; applications: classified, climate, physics research.
1998: Sun HPC10000; 24 x 333 MHz UltraSPARC II, 24 GB, SBus; Solaris 2.5.1; ~10 GFlops; Top500 #500; $1,000,000 (40x drop); customers: large enterprises; applications: manufacturing, energy, finance, telecom.
2005: small form factor PCs; 4 x 2.2 GHz Athlon64, 4 GB, GigE; Windows Server 2003 SP1; ~10 GFlops; Top500 N/A; under $4,000 (250x drop); customers: every engineer and scientist; applications: bioinformatics, materials sciences, digital media.
Ack: Fab. Gagliardi
Data access time. CPU cycle: 2 GHz (~0.5 ns); memory access: 60 ns (10^-9 s); disk access: 10 ms (10^-3 s). Rescaling so that one CPU cycle takes ~1 second: memory access ~120 seconds (~2 minutes); disk access ~20,000,000 seconds (roughly 8 months). Ack: Fab. Gagliardi
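The slide's rescaling can be reproduced with a few lines of arithmetic (a sketch; the constants are the ones quoted above):

```python
# Express each latency in 2 GHz CPU cycles, then reread "cycles" as
# seconds, so that one CPU cycle corresponds to ~1 second of human time.
CLOCK_HZ = 2e9        # 2 GHz CPU
MEM_S = 60e-9         # 60 ns memory access
DISK_S = 10e-3        # 10 ms disk access

mem_cycles = MEM_S * CLOCK_HZ     # 120 cycles -> "~2 minutes"
disk_cycles = DISK_S * CLOCK_HZ   # 2e7 cycles -> months of "human" time

print(mem_cycles, disk_cycles)
```

The seven-orders-of-magnitude gap between register speed and disk speed is exactly why memory hierarchies and caches dominate architecture design.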
The future: supercomputing on a chip. IBM Cell processor: 256 GFlops today; a 4-node personal cluster gives 1 TFlops; a 32-node personal cluster reaches the Top100. MS Xbox: 3 custom PowerPCs plus an ATI graphics processor, 1 TFlops today, $300; an 8-node personal cluster reaches the Top100 for $2,500 (ignoring all that you don't get for $300). Intel many-core chips: 100s of cores on a chip in 2015 (Justin Rattner, Intel: http://www.hpcwire.com/hpc/629783.html); at 4 cores per TFlop that is 25 TFlops per chip. Ack: Fab. Gagliardi
Architecture/system continuum: custom processors with custom interconnection (Cray X1, NEC SX-8, IBM Regatta, Blue Gene); commodity processors with custom interconnection (SGI Altix, Cray XT3, XD1 (Opteron)); commodity processors with commodity interconnection (Beowulf clusters, ...). SMPs, clusters, constellations, ... DSM, ... Programming models
Large-Scale Computing: Computational Science and High Performance Computing. K. Wilson* defined computational science as: "... a precise mathematical statement, being intractable by traditional methods, with a significant scope, requiring in-depth knowledge of science, engineering and the arts..." Computational science is about using computers to analyze scientific problems. It is distinct from computer science, which is the study of computers and computation, and it is different from theory and experiment (...) in that it seeks to gain understanding principally through the analysis of mathematical models (on) high performance computers. *K.G. Wilson, Basic Issues for Computational Science, ICTP, 1986
Grand Challenges (computational problems)*: fundamental problems in science and engineering with potentially broad social, political and scientific impact, which could be advanced by applying high performance computing. Problems: simulation of X-ray clusters; genome sequencing and structural biology; global climate modeling; fluid turbulence; biotechnology; pollution and dispersion; QCD; semi-/superconductor modeling; ocean circulation; vision and cognition... *K.G. Wilson, Grand Challenges to Computational Science, 1987. Modelling, simulation, analysis
E-Science: science inside the computer. Synergy between theory, simulation and experiment: theory and data-intensive computing (mining); experiment and numerically intensive computing; simulation and data-intensive computing (assimilation). Essential goal: to run larger and more complicated applications faster over time
Background Complexity Computing Data E-Infrastructure Separation/relation of concerns Complementarity Coherency Simplicity Efficiency Synergy - coherence Problem Algorithm & implementation Computer Architecture, environments 28
Computing resources: petascale is coming (TOP500, Nov. 2008). Earth Simulator: 35 TFlops peak, vector processing. Jaguar / Franklin: 120 / 100 TFlops peak. BlueGene/L: 596 TFlops peak, 212,992 low-power, cheap processor cores. Roadrunner: 1457 TFlops peak, 129,600 cores, heterogeneous: AMD Opteron DC 1.8 GHz / PowerXCell 8i 3.2 GHz (12.8 GFlops). ACK: Hank Childs, Lawrence Livermore National Laboratory, ICCS 2008
Explosion of data: experiments, simulations, archives, literature. Petabytes, doubling every 2 years. ACK: Fabrizio Gagliardi, Microsoft, ICCS 2008
Research e-infrastructures e-infrastructures in Europe: Research Network infrastructure: GEANT pan-european network interconnecting National Research and Education Networks Computing Grid Infrastructure: Enabling Grids for E-SciencE (EGEE project) Transition to the sustainable European Grid Initiative (EGI) currently worked out through EGI_DS project Data & Knowledge Infrastructure: Digital Libraries (DILIGENT) and repositories (DRIVER-II) A series of other projects : Middleware interoperation, applications, policy and support actions, etc. Cyber-Infrastructures around the world: Similar in US and Asia Pacific ACK: Fabrizio Gagliardi, Microsoft, ICCS 2008 31
Defining e-science E-Science: collaborative research supported by advanced distributed computations Multi-disciplinary, Multi-Site and Multi-National Building with and demanding advances in Computing/Computer Sciences Goal: to enable better research in all disciplines System-level Science: beyond individual phenomena, components interact and interrelate to generate, interpret and analyse rich data resources From experiments, observations and simulations Quality management, preservation and reliable evidence to develop and explore models and simulations Computation and data at all scales Trustworthy, economic, timely and relevant results to enable dynamic distributed collaboration Facilitating collaboration with information and resource sharing Security, trust, reliability, accountability, manageability and agility I. Foster, System Level Science and System Level Models, Snowmass, August 1-2, 2007 M. Atkinson, e-science (...), Grid2006 & 2-nd Int.Conf.e-Social Science 2006, National e-science Centre UK 32
Example: Seismic Hazard Analysis (T. Jordan et al., SCEC) Seismicity Paleoseismology Local site effects Geologic structure Faults Seismic Hazard Model InSAR Image of the Hector Mine Earthquake A satellite generated Interferometric Synthetic Radar (InSAR) image of the 1999 Hector Mine earthquake. Shows the displacement field in the direction of radar imaging Stress transfer Each fringe (e.g., from red to red) corresponds to a few centimeters of displacement. I. Foster, ibid 33
Beyond Models An Integrated View of Simulation, Experiment, & (Bio)informatics Synergy between Problem Algorithm & implementation Computer Architecture Problem Specification Simulation Browsing & Visualization SIMS* Analysis Tools Database LIMS + Experimental Design Experiment Browsing & Visualization *Simulation Information Management System + Laboratory Information Management System I. Foster, ibid 34
Example problems
High-fidelity numerical simulation (sources: Ansys, MSC.Software). Impact on science research: simulation joins theory and experiment as a key method of scientific discovery. Impact on industry: virtual testing during the component design process; virtual trial-and-error of design concepts; optimization of component performance, quality and manufacturability
Virtual prototyping allows: product testing in realistic conditions; appropriate design decisions. Example applications: vehicle performance (aerodynamics, acoustics) and safety (crash tests, occupant safety); consumer goods (shock resistance, drop tests, electromagnetic compliance); medical (knee-joint prostheses, heart-valve implants, vascular surgery). Healthcare implications in the future: virtual patients. Sources: National Crash Analysis Center, ESI Group
Capability computing example. Applications are complex and dynamically constructed from services; current solutions rely on a human as a source of knowledge. Flood forecast simulation complexity: time O(n^4), memory O(n^3). Workflow construction: user portal, workflow service, workflow knowledge storage service; meteorology, hydrology and hydraulics services, each with its visualization. Thanks are due: EU K-WfGrid Project (Knowledge-based Workflow System for Grid Applications); L. Hluchy (II SAS) for the flood application
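The quoted complexity is worth a quick feel: with time O(n^4) and memory O(n^3), doubling the grid resolution makes the forecast 16x slower and 8x larger. A sketch, with the exponents taken from the slide:

```python
# Relative cost of a simulation when the resolution n grows by n_ratio,
# assuming the flood-forecast scaling from the slide: time O(n^4),
# memory O(n^3).

def scaled_cost(n_ratio, time_exp=4, mem_exp=3):
    """Return (time factor, memory factor) for a resolution increase."""
    return n_ratio ** time_exp, n_ratio ** mem_exp

t, m = scaled_cost(2)   # doubling resolution: (16, 8)
```

This kind of scaling is why such applications are "capability" workloads: modest gains in fidelity demand order-of-magnitude gains in machine size.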
Application Architecture Services Meteorology services Watershed integration services Hydrology/hydraulics services Visualization services Asynchronous data delivery Models Meteorological simulation Hydrological simulation models Hydraulic simulation models Visualization tools for simulation outputs 49
Useful Results Thanks are due: L. Hluchy (II SAS) 50
High Performance Computing: "computing resources which provide more than an order of magnitude more computing power than is normally available on one's desktop" (JISC New Technology Initiative). Development of computational science to complement theoretical and experimental sciences. New computer architectures to run larger applications faster over time. Simulation and modelling problems (more calculations, more precision); problems with large amounts of data (data mining, seismic). HPC for the Grand Challenges (K. Wilson, 1987): climate modelling, fluid turbulence, pollution dispersion, human genome, ocean circulation, QCD, semi-/superconductor modelling, biology, combustion systems, vision and cognition, etc.
e-Science: experiments in silico
Metacomputing ("grid computing"): transparent access to a variety of services: computing services; graphical, multimedia and scientific visualization services; storage (including HSM) services; high-bandwidth network protocols. Larry Smarr, 1987, University of Illinois. Objectives: commodity systems to solve a target class of problems; novel solution methods; flexibility and extensibility; site autonomy; scalable architectures; a single name space; ease of use
The idea of metacomputing: GRID computing, SOA. Larry Smarr; Ian Foster, Carl Kesselman. "Heterogeneous computing in a homogeneous environment." The metacomputer joins SM-parallel/vector, RISC, DM-MPP, clusters, SM-MPP and real-time systems into services for the users: compute services; storage services (RAID, disk farms, file migration, robotic tape archiving); visualization services (3D graphics, image processing, scientific visualization). SOA: Service-Oriented Architecture
TOP500 List. Compiled by Hans Meuer (University of Mannheim, Germany), Erich Strohmaier and Horst Simon (NERSC/Lawrence Berkeley National Lab.) and Jack Dongarra (University of Tennessee, Knoxville). http://www.top500.org "To provide this new statistical foundation, we decided in 1993 to assemble and maintain a list of the 500 most powerful computer systems. Our list has been compiled twice a year since June 1993 with the help of high-performance computer experts, computational scientists, manufacturers, and the Internet community in general, who responded to a questionnaire we sent out."
A host of parallel machines There are (have been) many kinds of parallel machines For the last 12+ years their performance has been measured and recorded with the LINPACK benchmark, as part of Top500 It is a good source of information about what machines are (were) and how they have evolved http://www.top500.org 68
What is the LINPACK benchmark? LINPACK: LINear algebra PACKage, a Fortran library: matrix multiply, LU/QR/Cholesky factorizations, eigensolvers, SVD, etc. The LINPACK benchmark: a dense linear system solved with LU factorization; number of operations: 2n^3/3 + O(n^2). Measures, R [Mflop/s]: DP Mflop/s (100x100), TPP Mflop/s (1000x1000), Rpeak Mflop/s. In the parallel (Top500) version the problem size can be chosen: you report the best performance Rmax for the best size Nmax, and the size N_1/2 that achieves half of the best performance, alongside Rpeak.
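The benchmark's reported rate follows directly from the LU operation count 2n^3/3 + O(n^2) and the measured wall-clock time; a sketch (the sample n and time below are invented):

```python
# Top500-style metric: flop rate of a dense LU solve, counting only the
# leading 2n^3/3 term of the operation count.

def linpack_gflops(n, seconds):
    """Gflop/s for solving an n-by-n dense system via LU factorization."""
    flops = 2.0 * n**3 / 3.0
    return flops / seconds / 1e9

# Hypothetical run: n = 100000 solved in 1000 s, ~667 Gflop/s.
r = linpack_gflops(100_000, 1000.0)
```

Because the flop count grows as n^3 while memory grows as n^2, larger problems amortize communication better, which is why sites tune Nmax to the biggest system that fits in memory.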
Polish supercomputer centres: ACK CYFRONET AGH, AGH University of Science and Technology, Cracow (1973); Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warsaw (1993); Poznan Supercomputing and Networking Center, Institute of Bioorganic Chemistry PAS, Poznan (1993); TASK Academic Computer Centre, Gdansk University of Technology, Gdansk (1994); Wroclaw Centre for Networking and Supercomputing, Wroclaw University of Technology, Wroclaw (1995)
PIONIER: the Polish optical network. ENPG Meeting, Cracow, Sept. 22, 2008
ACK CYFRONET-AGH Resources
ACK CYFRONET-AGH aggregated Rpeak vs. the world's aggregated Rpeak [chart]
Computing capabilities of the centres: peak performance Rpeak (GFlops) and storage systems (TB).
CYFRONET: SMP 1849 plus vector 640; clusters ~30000; total ~32000; storage 1029 TB.
ICM UW: SMP 845; clusters ~20000; total ~20600; storage 200 TB.
PCSS: clusters ~22000; total ~23000; storage 660 TB.
CI TASK: SMP 768; clusters 53330; total 54098; storage 150 TB.
WCSS: SMP 780; clusters 16000; total 16780; storage 200 TB.
TOP500, Nov. 2008: Polish sites (6 entries found). Rank; site; system; cores; Rmax / Rpeak (TFlops):
68; Gdansk University of Technology, CI TASK, Poland; ACTION Cluster, Xeon E5345, Infiniband (ACTION); 5336; 38.17 / 49.73.
221; Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Poland; BladeCenter QS22 Cluster, PowerXCell 8i 4.0 GHz, Infiniband (IBM); 2016; 18.57 / 30.46.
311; Cyfronet, Poland; Cluster Platform 3000 BL2x220, L54xx 2.5 GHz, Infiniband (Hewlett-Packard); 2048; 16.18 / 20.48.
337; PCSS Poznan, Poland; Cluster Platform 3000 BL460c, Xeon 54xx 2.5 GHz, Infiniband (Hewlett-Packard); 2048; 15.95 / 20.48.
373; Nasza Klasa, Poland; Cluster Platform 3000 BL460c/BL2x220, L54xx 2.5 GHz, GigE (Hewlett-Packard); 2848; 15.3 / 28.48.
495; Communications Company (P1), Poland; Cluster Platform 3000 BL460c, Xeon L54xx 2.5 GHz, GigEthernet (Hewlett-Packard); 2304; 12.82 / 23.04.
Data Storage and Management Requirements Heterogeneity Resources, management, users requirements Efficient data access High Availability and fault tolerance Cost/performance optimization SAN infrastructure 77
Data storage overview. Disk systems (raw) capacity: 529 TB (6 TB high-performance FC disks, 211 TB FATA disks, 312 TB SATA disks). Total cache memory of the arrays: 32 GB. Automatic tape library (raw) capacity: 500 TB (for LTO-4 tapes).
SAN infrastructure [diagram]
The (3 + 1) eras of computers [chart, source: NCSA, University of Illinois at Urbana-Champaign; timeline 1985-2000]: shared-memory vector processors (Cray X-MP, Cray Y-MP, Cray-2, Cray T90, Convex C2, Convex C3, Alliant FX/80, NEC SX4); distributed-memory systems (TMC CM-2, TMC CM-5, IBM SP2, HP cluster, Intel Paragon, Cray T3D/E); scalable parallel computing (SPP1000, SPP1600, S/X Class, SGI PC XL, Origin2k, IBM RS/6k SP); the ASCI Initiative (Los Alamos, Sandia, Lawrence Livermore)
DOE Accelerated Strategic Computing Initiative (1995): to accelerate the development of massively parallel computers in order to ensure confidence in the safety, performance and reliability of the US nuclear stockpile. A 10-year, $1 billion program. Goal: provide balanced tera-scale computing platforms by 2003/04. Options: ASCI Red (Sandia/Intel), ASCI Blue Pacific (LLNL/IBM), ASCI Blue Mountain (LANL/SGI-Cray), plus ASCI Compaq. ASCI facts: signed in 1992 by Bush
ASCI (cont.)
The ASCI initiative: ASCI option Red; ASCI option Blue Mountain
[Chart, source: Intel; transistor counts on a log scale, 10^3 to 10^9; Gordon Moore's forecast.] "Eventually one billion transistors, or electronic switches, may crowd a single chip, 1,000 times more than possible today." (National Geographic, 1982)
Moore's Law meets supercomputing: 1980s, $10M/GFlop; 1990s, $50K/GFlop; today, $1K/GFlop. The evolving value proposition: performance (1960s-1980s); price/performance (1990s); price/performance/watt (2000s). NASA video
[Chart, forecast: performance relative to the initial Pentium 4, 2000-2008+. Dual/multi-core performance through parallelism reaches about 3x by 2004 and 10x by 2008+, pulling away from single-core performance.]
Sample results for the LINPACK suite
LINPACK performance (cont.)
Clusters, constellations, MPPs: the only 3 categories today in the Top500. They all belong to the distributed-memory model (MIMD), with many twists. Each processor/node has its own memory and cache but cannot directly access another processor's memory; nodes may be SMPs. Each node has a network interface (NI) used for all communication and synchronization. [Diagram: processors P0...Pn, each with local memory and an NI, attached to an interconnect.] So what are these 3 categories?
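The defining property of this model (no shared memory; everything goes through the network interface) can be mimicked on one machine with OS processes and a pipe standing in for the NI. A small sketch, not tied to any particular system on the slides:

```python
# Distributed-memory model in miniature: two processes with disjoint
# address spaces; the only way to share data is an explicit message.
from multiprocessing import Process, Pipe

def worker(conn):
    # The "remote node": it can only see data that arrives as a message.
    local = conn.recv()
    conn.send(sum(local))   # results travel back the same way
    conn.close()

def main():
    parent, child = Pipe()  # the pipe plays the role of the NI/interconnect
    p = Process(target=worker, args=(child,))
    p.start()
    parent.send([1, 2, 3, 4])   # explicit send/recv, no shared variables
    result = parent.recv()
    p.join()
    return result

if __name__ == "__main__":
    print(main())   # 10
```

In real machines the pipe is an interconnect and the send/recv pairs are MPI calls, but the programming discipline (partition the data, communicate explicitly) is the same.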
Ack: J. Dongarra TOP500 list, Nov. 2007 92
Ack: Strohmaier 93
94
Ack: J. Dongarra TOP500 list, Nov. 2007 95
Ack: J. Dongarra TOP500 list, Nov. 2007 96
Ack: Strohmaier 97
Ack: Strohmaier 98
Ack: Strohmaier 99
Ack: Strohmaier TOP500 analysis 100
Performance development and projections (December 9, 2009). [Chart: aggregate (SUM), #1 (N=1) and #500 (N=500) Top500 performance, from 1 Mflop/s to 10 Eflop/s, 1980-2020.] Milestones: 1 Gflop/s with O(1) thread; 1 Tflop/s with O(10^3) threads; 1 Pflop/s with O(10^6) threads; 1 Eflop/s with O(10^9) threads.
Something's happening here (from K. Olukotun, L. Hammond, H. Sutter, and B. Smith). In the old days, each year processors would become faster; today the cycle time is fixed or decreasing. Things are still doubling every 18 months: Moore's Law reinterpreted.
Moore's Law reinterpreted: the number of cores per chip doubles every two years, while clock speed decreases (not increases). We need to deal with systems with millions of concurrent threads; future generations will have billions of threads! We need to be able to easily replace inter-chip parallelism with intra-chip parallelism.
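The reinterpreted law is easy to project; a back-of-the-envelope sketch (the 4-core 2008 baseline echoes the multicore examples nearby; the extrapolation itself is an assumption, not data):

```python
# Project core counts under "cores per chip double every two years",
# starting from a hypothetical 4-core chip in 2008.

def cores(year, base_year=2008, base_cores=4, doubling_years=2):
    """Projected cores per chip in a given year."""
    return base_cores * 2 ** ((year - base_year) // doubling_years)

print(cores(2008), cores(2014), cores(2018))   # 4 32 128
```

At fixed clock speed, all of that factor has to come from parallel software, which is the slide's point about millions and then billions of threads.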
Today's multicores: 90% of Top500 systems are based on multicore processors. IBM Cell (9), Intel Clovertown (4), Sun Niagara 2 (8), SiCortex (6), Intel Polaris (80), AMD Opteron (4), IBM BG/P (4)
How a computer works
The instruction cycle algorithm. One instruction cycle can be described as follows:
1. The contents of the internal memory location pointed to by the instruction counter (LR) are sent to the processor's control unit.
2. The control unit splits the received word into two fields: the operation field and the argument field. The operation field holds the code of the operation to perform; the argument field holds the addresses of the operands and the destination address for the result. (Running example: for (i = 0; i < n; i++) x[i] = y[i] + z[i];)
3. Based on these addresses, the operands are transferred from internal memory into the appropriate registers, and based on the operation code the ALU performs the corresponding arithmetic or logical operation on the register contents.
4. The result of the operation is written back to internal memory at the destination address.
5. The instruction counter LR is updated so that it points to the next instruction for the processor.
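The five steps above can be acted out by a toy interpreter; the three-address instruction format and the opcode names here are invented purely for illustration:

```python
# Toy fetch-decode-execute loop over a flat word memory.
# Instruction = (opcode, src1, src2, dest), mirroring the operation
# field and argument field described in the algorithm.

def run(program, memory):
    pc = 0                                   # instruction counter (LR)
    while pc < len(program):
        op, a, b, dest = program[pc]         # steps 1-2: fetch and decode
        x, y = memory[a], memory[b]          # step 3: fetch operands
        result = x + y if op == "ADD" else x * y   # step 3: execute
        memory[dest] = result                # step 4: store the result
        pc += 1                              # step 5: advance the counter
    return memory

mem = [2, 3, 0, 0]
run([("ADD", 0, 1, 2), ("MUL", 2, 2, 3)], mem)
print(mem)   # [2, 3, 5, 25]
```

Real processors pipeline and overlap these steps, but the programmer-visible contract is still this sequential cycle.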
In lieu of an introduction: the need for quantitative results... since forever. Inclinations of planetary orbits (10th century) (thanks to Dr. W. Alda). Babbage, 1832-1870; his engine was finally built in 1991.
The needs of 1939-1950. 1936: the Turing machine (Alan Turing, 1912-1954). The von Neumann architecture (Report on the EDVAC, 1945; John von Neumann, 1903-1957). 1946: ENIAC, P. Eckert and J. Mauchly (Electronic Numerical Integrator and Computer). Los Alamos; 1952: the hydrogen bomb. "I think there is a world market for maybe five computers" (Thomas Watson Sr., Chairman of IBM, 1943). Harvard Mark I / IBM ASCC
The von Neumann architecture, in other words: sequential. The beginnings: 1936, the Turing machine; the von Neumann architecture (Report on the EDVAC, 1945); Los Alamos; 1952, the hydrogen bomb. Alan Turing, 1912-1954; John von Neumann, 1903-1957. The current view: a functional diagram with a control unit; the classical (mainframe) architecture versus today's bus architecture: processor with L1 and L2 cache, bus, operational memory, I/O subsystem. A Central Processing Unit (CPU), a linearly addressed address space (operational memory) and a control unit: a sequence of instructions operates on a sequence of data, hence sequential computers.
"The scientific market is still about that size: 3 computers." When scientific processing was 100% of the industry, that was a good predictor. $3 billion: 6 vendors, 7 architectures. DOE buys 3 very big ($100-$200 M) machines every 3-4 years; the worldwide market is perhaps 5 of the largest computers. "During the review, someone said: von Neumann was right, 30,000 words was too much. IF all the users were as skilled as von Neumann... for ordinary people, 30,000 was barely enough!" (Edward Teller, 1995). The memory was approved. Memory solves many problems!
December 1947: the transistor. William Bradford Shockley, Walter H. Brattain, John Bardeen, Bell Telephone Labs.
The race... GAM I, XYZ, UMC, ZAM... 1951: UNIVAC I, P. Eckert and J. Mauchly. 1951: MESM, S.A. Lebedev
1952: the IBM 701, the first US machine built for defense purposes
1955: the IBM 704, designed by Gene Amdahl. 5 kflops; ferrite-core memory of 32,768 36-bit words (3x the IBM 701). FORTRAN. Lawrence Livermore National Laboratory
Mercury, 1958 (CERN). FERRANTI Mercury at CERN [1958-1965]: a first-generation vacuum-tube machine (60-microsecond clock cycle; 2 cycles to load or store, 3 cycles to add and 5 cycles to multiply 40-bit longwords; no hardware division) with magnetic-core storage (1024 40-bit words, 120-microsecond access time). Mercury's processor had floating-point arithmetic and a B-register (index register). Magnetic-drum auxiliary storage (16 Kwords of 40 bits, 8.75 ms average latency, 64 longwords transferred per revolution). Paper-tape I/O; two Ampex magtape units added in 1962. Autocode compiler. At the end of its career it was connected on-line to an experiment (Missing Mass Spectrometer). In 1966 the Mercury was shipped to Poland as a gift to the Academy of Mining and Metallurgy in Cracow. Manchester University, 1958
The CERN connection... Ferranti Mercury, 1958 (CERN): a first-generation vacuum-tube machine (60-microsecond clock cycle; 2 cycles to load or store, 3 cycles to add and 5 cycles to multiply 40-bit longwords; no hardware division) with magnetic-core storage. Autocode compiler. In 1966 the Mercury was shipped to Poland as a gift to the Academy of Mining and Metallurgy in Cracow.
1962 market shares (rank; company; units produced; share): 1. IBM, 4806, 65.8%; 2. Rand, 635, 8.7%; 3. Burroughs, 161, 2.2%; 4. CDC, 147, 2.0%; 5. NCR, 126, 1.7%; 6. RCA, 120, 1.6%; 7. General Electric, 83, 1.1%; 8. Honeywell, 41, 0.6%; others, 1186, 16.3%; total, 7305, 100%
March 1964: the IBM 360/67. An IBM 360/67 as first delivered, with just 500 kbytes of RAM; the DAT (Dynamic Address Translation) electronics gate is slightly ajar.
1964: the CDC 6600, Seymour Cray. 3 MIPS.
1965: the BESM-6, Sergei Alexeevich Lebedev. 1 MIPS
1965: the Digital PDP-8. Technical characteristics of the PDP-8 minicomputer: 12-bit processor with a 1.5-microsecond cycle; memory of 4K 12-bit words (ferrite cores); Teletype ASR33 terminal plus punched cards; power consumption: 780 watts; price: $18,000
1967 market shares (rank; company; units produced; share): 1. IBM, 19,773, 50.0%; 2. Rand, 4778, 12.1%; 3. NCR, 4265, 10.8%; 4. CDC, 1868, 4.7%; 5. Honeywell, 1800, 4.6%; 6. Burroughs, 1675, 4.2%; 7. RCA, 977, 2.5%; 8. General Electric, 960, 2.4%; others, 3420, 8.7%; total, 39,516, 100%
1969: the CDC 7600 supercomputer: pipelining! Seymour Cray, Los Alamos
1976: Seymour Cray's Cray-1 supercomputer, $700,000. Los Alamos. One 64-bit processor at 83 MHz; 166 MFLOPS; Freon cooling.
1981: the CYBER 205, the most powerful computer of its day. 32 MB of main memory and a peak of 200 MFLOPS.
1982 : Cray X-MP, multiprocessor supercomputer. 2 or 4 processors, 105 MHz, 235 Mflops 127
March 1984: the IBM PCjr. 64 KB RAM, a 5.25-inch drive, a monitor; $1,300.
1985: the Cray-2, the first to exceed 1 Gflops (1.9 Gflops). 4 processors at 250 MHz, 488 Mflops each. Unix System V: UNICOS. Matrix multiplication (on all 4 processors): 1.7 Gflops. Liquid cooling; 2048 MB RAM
1986: Thinking Machines' massively parallel Connection Machine CM-1: 65,536 processors (SIMD!)
ACK: Sterling 131
Terminology. Processor technology (type): the instruction set and its implementation style: CISC (Complex Instruction Set Computer), RISC (Reduced Instruction Set Computer), EPIC (Explicitly Parallel Instruction Computing). Architecture: a theoretical implementation using a given technology: POWER, IA-64, PA-RISC. Processor: a hardware realization of an architecture: POWER5, Itanium 2, PA-8700. Semiconductor technology: ECL, CMOS, GaAs. "Processor" is used here for the microprocessor chip; where a distinction is needed, both names are used.
What have we covered so far? The need for the development of computer architectures. The synergy of scientific research: theory, experiment, computing. The synergy of problem, algorithm, architecture. A historical overview.