Internet Resource Mining (Eksploracja Zasobów Internetu)


The most recent approaches, designed for limited-memory settings (e.g. mobile devices):
- WTBC (Wavelet Trees on Bytecodes)
  R. Grossi, A. Gupta, J.S. Vitter, High-order entropy-compressed text indexes, Proc. of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA), 2003
  N.R. Brisaboa, A. Cerdeira-Pena, G. Navarro, O. Pedreira, Ranked Document Retrieval in (Almost) No Space, Proc. of the 19th Int. Symp. on String Processing and Information Retrieval (SPIRE 2012), LNCS 7608, Springer, 2012

Skip pointers: use sqrt(n) evenly spaced skip pointers for a posting list of size n.
Source: http://nlp.stanford.edu/ir-book/

Tests (sqrt(n) evenly spaced skip pointers for a posting list of size n):
- When both lists are sufficiently small, we never skip, but the time difference is negligible because the lists are so short.
- For larger lists of similar size (e.g. both lists around 30,000 values), skip pointers are followed rarely and are therefore actually detrimental to the performance of the intersection function (the skip values are checked but can seldom be followed); intersection is 10% to 40% slower for lists with skips.
- When posting list sizes differ by orders of magnitude (e.g. a list of size 200 vs. a list of size 20,000), skip pointers begin to come into their own, since the larger list can skip much more often; the improvement is around 10% to 40% (10% is more likely than 40%).
Source: http://www.skorks.com/2010/03/faster-list-intersection-using-skip-pointers/
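To make the skip-pointer idea concrete, here is a minimal Python sketch of intersecting two sorted posting lists with sqrt(n) evenly spaced skips. It is not the code from either source cited above. The function name `intersect_with_skips` and the choice to model a skip pointer as "every position that is a multiple of the skip length" are assumptions made for illustration.

```python
import math

def intersect_with_skips(a, b):
    """Intersect two sorted lists of distinct doc IDs using sqrt(n) skips."""
    # Skip length for each list; at least 1 so that empty lists are handled.
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # A skip pointer exists only at multiples of skip_a. Follow it
            # only if the skip target does not overshoot b[j].
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result

# A long list against a very short one: the case where skips pay off.
long_list = list(range(0, 20000, 3))
short_list = [0, 300, 301, 9999, 19998]
print(intersect_with_skips(long_list, short_list))
```

The extra comparison at each skip position is the overhead mentioned in the tests: when the lists are of similar size the skip target usually overshoots, so the check is paid for without the jump being taken.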

http://idss.cs.put.poznan.pl/site/fileadmin/seminaria/2009/jsuffixarrays.pdf

Distribution refers to the fact that the document collection and its index are split across multiple machines and that answers to the query as a whole must be synthesized from the various collection components. Replication (or mirroring) involves making enough identical copies of the system so that the required query load can be handled. (Google claims: "we keep each piece of data stored on at least two servers".)

DISTRIBUTED INFORMATION RETRIEVAL
- Document-Distributed Architectures: the simplest distribution regime is to partition the collection and allocate one subcollection to each of the processors.
- Term-Distributed Architectures: the index is split into components by partitioning the vocabulary.

Distributed indexing
In practice, partitioning indexes by vocabulary terms turns out to be nontrivial. Multi-word queries require the sending of long postings lists between sets of nodes for merging, and the cost of this can outweigh the greater concurrency. Load balancing the partition is governed not by an a priori analysis of relative term frequencies, but rather by the distribution of query terms and their co-occurrences, which can drift with time or exhibit sudden bursts. Achieving good partitions is a function of the co-occurrences of query terms and entails the clustering of terms to optimize objectives that are not easy to quantify. Finally, this strategy makes implementation of dynamic indexing more difficult.

A more common implementation is to partition by documents: each node contains the index for a subset of all documents. Each query is distributed to all nodes, with the results from various nodes being merged before presentation to the user. This strategy trades more local disk seeks for less inter-node communication. One difficulty in this approach is that global statistics used in scoring, such as idf, must be computed across the entire document collection even though the index at any single node only contains a subset of the documents. These are computed by distributed background processes that periodically refresh the node indexes with fresh global statistics.
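The document-partitioned scheme can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production design: the names `Node`, `global_idf` and `distributed_search` are hypothetical, scoring is plain tf × idf, and the "background process" that refreshes global statistics is reduced to a single function call.

```python
import math
from collections import Counter

class Node:
    """One node holding the inverted index for a subset of the documents."""
    def __init__(self, docs):
        # docs: {doc_id: list of tokens}
        self.num_docs = len(docs)
        self.postings = {}  # term -> {doc_id: term frequency}
        for doc_id, tokens in docs.items():
            for term, tf in Counter(tokens).items():
                self.postings.setdefault(term, {})[doc_id] = tf

    def local_df(self):
        # Document frequencies as seen by this node only.
        return {term: len(p) for term, p in self.postings.items()}

    def search(self, query_terms, idf, k):
        # Score local documents with the *global* idf values.
        scores = Counter()
        for term in query_terms:
            for doc_id, tf in self.postings.get(term, {}).items():
                scores[doc_id] += tf * idf.get(term, 0.0)
        return scores.most_common(k)

def global_idf(nodes):
    # Stand-in for the background process: sum local statistics across nodes.
    total_docs = sum(n.num_docs for n in nodes)
    df = Counter()
    for n in nodes:
        df.update(n.local_df())
    return {t: math.log(total_docs / d) for t, d in df.items()}

def distributed_search(nodes, query_terms, idf, k=3):
    # Send the query to every node, then merge the local top-k lists.
    merged = []
    for n in nodes:
        merged.extend(n.search(query_terms, idf, k))
    merged.sort(key=lambda pair: (-pair[1], pair[0]))
    return merged[:k]

nodes = [
    Node({1: ["cat", "dog"], 2: ["cat"]}),
    Node({3: ["dog", "dog", "fish"], 4: ["bird"]}),
]
idf = global_idf(nodes)
print(distributed_search(nodes, ["dog"], idf))
```

Scoring with a node-local idf would make scores from different nodes incomparable at merge time, which is why the idf table is computed once over all nodes and passed to each of them.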

Term partitioning has the advantage of requiring fewer disk seek and transfer operations during query evaluation than document partitioning, because each term's inverted list is still stored contiguously on a single machine rather than in fragments across multiple machines. Each processor has full information about a subset of the terms, meaning that to handle a query, only the relevant subset of the processors need to respond. The majority of the processing load still falls to the coordinating machine. In a term-distributed system, having one machine offline is likely to be immediately noticeable.
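A small Python sketch of term partitioning may help here. It assumes, purely for illustration, that terms are assigned to machines by a CRC32 hash and that the coordinator intersects the postings itself; the function names `owner`, `build_term_partitions` and `and_query` are hypothetical.

```python
import zlib

def owner(term, num_machines):
    # Stable hash, so a term always maps to the same machine.
    return zlib.crc32(term.encode("utf-8")) % num_machines

def build_term_partitions(index, num_machines):
    # index: {term: sorted list of doc IDs}. Each term's full postings list
    # is stored on exactly one machine.
    partitions = [dict() for _ in range(num_machines)]
    for term, postings in index.items():
        partitions[owner(term, num_machines)][term] = postings
    return partitions

def and_query(partitions, terms):
    # Only the machines that own a query term are contacted; the
    # coordinator then does the merging work.
    contacted = set()
    lists = []
    for term in terms:
        machine = owner(term, len(partitions))
        contacted.add(machine)
        lists.append(set(partitions[machine].get(term, [])))
    result = set.intersection(*lists) if lists else set()
    return sorted(result), contacted

index = {"web": [1, 2, 5], "mining": [2, 5, 7], "index": [3]}
partitions = build_term_partitions(index, 4)
docs, contacted = and_query(partitions, ["web", "mining"])
print(docs, "machines contacted:", len(contacted))
```

The sketch also shows the weaknesses noted above: whole postings lists travel to the coordinator, and if the machine owning a query term is offline, that term cannot be answered at all.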

Document distribution typically results in a better balance of workload than does term partitioning and achieves superior query throughput. It also allows more naturally for index construction and for document insertion. Document distribution also has the pragmatic advantage of still allowing a search service to be provided even when one of the hosts is offline for some reason since any answers not resident on that machine remain available to the system.

How do we decide the partition of documents to nodes? A common implementation heuristic is to partition the document collection into indexes of documents that are more likely to score highly on most queries and low-scoring indexes with the remaining documents. We only search the low-scoring indexes when there are too few matches in the high-scoring indexes.
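The tiered heuristic amounts to a simple fallback rule, sketched below in Python. The function name `tiered_search`, the single-term query, and the threshold "fewer than k matches" as the meaning of "too few" are all assumptions made for illustration.

```python
def tiered_search(high_tier, low_tier, term, k):
    """Search the high-scoring tier first; fall back to the low-scoring
    tier only if it yields fewer than k matches.

    high_tier, low_tier: {term: list of doc IDs}
    Returns (up to k doc IDs, whether the low tier was searched).
    """
    hits = list(high_tier.get(term, []))
    searched_low = False
    if len(hits) < k:
        searched_low = True
        hits.extend(low_tier.get(term, []))
    return hits[:k], searched_low

high = {"a": [1, 2, 3], "b": [4]}
low = {"a": [9], "b": [5, 6]}
print(tiered_search(high, low, "a", 2))  # enough matches in the high tier
print(tiered_search(high, low, "b", 3))  # too few, so the low tier is used
```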

The Google implementation uses a document-partitioned index with massive replication and redundancy at all levels: the machine, the processor cluster, and the site. Note: Document partitioning remains effective even if the collaborating systems are independent and unable to exchange their index data. A distributed system in which a final result answer list is synthesized from the possibly overlapping answer sets provided by a range of different services is called a metasearch engine (e.g. http://www.dogpile.com/).
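The slides do not say how a metasearch engine combines the answer lists. One well-known option, used here only as an illustration, is reciprocal rank fusion: since independent services do not expose comparable scores, each result is credited 1/(k + rank) per list in which it appears. The function name and the constant k = 60 are assumptions for this sketch, and nothing here describes how any particular metasearch engine actually works.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked lists from independent services into one ranking.

    result_lists: list of ranked lists of result identifiers (best first).
    A result appearing in several lists accumulates credit from each one.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by identifier.
    return sorted(scores, key=lambda item: (-scores[item], item))

lists = [["a", "b", "c"], ["b", "a", "d"], ["b", "e"]]
print(reciprocal_rank_fusion(lists))
```

Overlap between the services' answer sets is what drives the ranking: a result returned by several engines rises above one returned by a single engine at a similar rank.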