The most recent approaches, designed for limited-memory settings (e.g., mobile devices):
- WTBC (Wavelet Trees on Bytecodes)
R. Grossi, A. Gupta, J.S. Vitter, High-order entropy-compressed text indexes, Proc. of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2003
N.R. Brisaboa, A. Cerdeira-Pena, G. Navarro, O. Pedreira, Ranked Document Retrieval in (Almost) No Space, Proc. of the 19th Int. Symp. on String Processing and Information Retrieval (SPIRE 2012), LNCS 7608, Springer, 2012
A common heuristic: place sqrt(n) evenly spaced skip pointers on a posting list of size n (http://nlp.stanford.edu/ir-book/).
Tests (sqrt(n) evenly spaced skip pointers for a posting list of size n):
- When both lists are sufficiently small, we never skip, but the time difference is negligible due to the size of the lists.
- For larger lists of similar size (i.e. both lists around 30000 values), skip pointers are followed rarely and are therefore actually detrimental to the performance of the intersection function (due to the overhead of checking the skip values while being unable to follow them), with the times being 10% to 40% slower for the lists with skips.
- When posting list sizes are orders of magnitude different (i.e. a list of size 200 vs a list of size 20000), skip pointers come into their own, since the larger list is able to skip much more often; the improvement is around 10% to 40% (10% being more likely than 40%).
http://www.skorks.com/2010/03/faster-list-intersection-using-skip-pointers/
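A minimal Python sketch of such a skip-pointer intersection (the benchmark code from the blog post above is not reproduced here); the helper names build_skips and intersect_with_skips, and the exact placement details, are illustrative assumptions built around the sqrt(n) spacing heuristic:

    import math

    def build_skips(postings):
        """Place roughly sqrt(n) evenly spaced skip pointers over a sorted posting list.

        Returns {position: position it skips to}. Hypothetical helper; the notes
        only state the sqrt(n) spacing heuristic.
        """
        n = len(postings)
        step = int(math.sqrt(n)) or 1
        return {i: i + step for i in range(0, n - step, step)}

    def intersect_with_skips(p1, p2):
        """Intersect two sorted posting lists of doc IDs, following skips when safe."""
        skips1, skips2 = build_skips(p1), build_skips(p2)
        answer = []
        i = j = 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                # Follow the skip on list 1 only if it does not overshoot p2[j].
                if i in skips1 and p1[skips1[i]] <= p2[j]:
                    i = skips1[i]
                else:
                    i += 1
            else:
                # Symmetric case: try to skip ahead on list 2.
                if j in skips2 and p2[skips2[j]] <= p1[i]:
                    j = skips2[j]
                else:
                    j += 1
        return answer

    # Toy demo mirroring the third case above: a 200-element list vs a
    # 20000-element list, where the longer list can skip often.
    short = list(range(0, 2000, 10))   # 200 doc IDs
    long_ = list(range(20000))         # 20000 doc IDs
    print(len(intersect_with_skips(short, long_)))  # 200 -- every ID in `short` is in `long_`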
http://idss.cs.put.poznan.pl/site/fileadmin/seminaria/2009/jsuffixarrays.pdf
Distribution refers to the fact that the document collection and its index are split across multiple machines and that answers to the query as a whole must be synthesized from the various collection components. Replication (or mirroring) involves making enough identical copies of the system so that the required query load can be handled. (Google claims: "we keep each piece of data stored on at least two servers".)
DISTRIBUTED INFORMATION RETRIEVAL
Document-Distributed Architectures: the simplest distribution regime is to partition the collection and allocate one subcollection to each of the processors.
Term-Distributed Architectures: the index is split into components by partitioning the vocabulary.
Distributed indexing
In practice, partitioning indexes by vocabulary terms turns out to be nontrivial. Multi-word queries require the sending of long postings lists between sets of nodes for merging, and the cost of this can outweigh the greater concurrency. Load balancing the partition is governed not by an a priori analysis of relative term frequencies, but rather by the distribution of query terms and their co-occurrences, which can drift with time or exhibit sudden bursts. Achieving good partitions is a function of the co-occurrences of query terms and entails the clustering of terms to optimize objectives that are not easy to quantify. Finally, this strategy makes the implementation of dynamic indexing more difficult.
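To make that cost concrete, here is a minimal toy sketch of conjunctive query evaluation over a term-partitioned index; the hash-based term-to-node assignment and all names (node_for, index_document, query_and) are illustrative assumptions, not the design of any cited system:

    NUM_NODES = 3

    def node_for(term):
        """Assign a term to a node, e.g. by hashing the vocabulary."""
        return hash(term) % NUM_NODES

    # node_id -> {term: posting list of doc IDs}; an in-memory stand-in for a cluster.
    nodes = {i: {} for i in range(NUM_NODES)}

    def index_document(doc_id, terms):
        for term in set(terms):
            nodes[node_for(term)].setdefault(term, []).append(doc_id)

    def query_and(terms):
        """Coordinator: fetch each term's full postings from its node, then intersect."""
        postings = []
        for term in terms:
            # In a real deployment this is a network transfer of an entire postings
            # list, which is exactly the cost that can outweigh the gained concurrency.
            postings.append(set(nodes[node_for(term)].get(term, [])))
        return sorted(set.intersection(*postings)) if postings else []

    index_document(1, ["distributed", "index"])
    index_document(2, ["distributed", "retrieval"])
    index_document(3, ["distributed", "index", "retrieval"])
    print(query_and(["distributed", "index"]))  # [1, 3]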
Distributed indexing
A more common implementation is to partition by documents: each node contains the index for a subset of all documents. Each query is distributed to all nodes, with the results from the various nodes being merged before presentation to the user. This strategy trades more local disk seeks for less inter-node communication. One difficulty in this approach is that global statistics used in scoring, such as idf, must be computed across the entire document collection even though the index at any single node only contains a subset of the documents. These are computed by distributed background processes that periodically refresh the node indexes with fresh global statistics.
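A minimal sketch of document-partitioned query evaluation under stated assumptions: disjoint document subsets per node, a plain tf-idf score, and an on-the-fly recomputation of global document frequencies standing in for the periodic background refresh; all class and function names are hypothetical:

    import math
    from collections import Counter

    class Node:
        def __init__(self):
            self.docs = {}  # doc_id -> Counter of term frequencies

        def add(self, doc_id, terms):
            self.docs[doc_id] = Counter(terms)

        def doc_freqs(self, terms):
            """Local document frequencies, later summed into global statistics."""
            return {t: sum(1 for tf in self.docs.values() if t in tf) for t in terms}

        def search(self, terms, idf):
            """Score local documents using globally computed idf values."""
            scores = {}
            for doc_id, tf in self.docs.items():
                s = sum(tf[t] * idf[t] for t in terms if t in tf)
                if s > 0:
                    scores[doc_id] = s
            return scores

    def distributed_search(nodes, terms, total_docs):
        # 1) Gather global statistics (normally a periodic background job).
        df = Counter()
        for node in nodes:
            df.update(node.doc_freqs(terms))
        idf = {t: math.log(total_docs / df[t]) if df[t] else 0.0 for t in terms}
        # 2) Fan the query out to every node and merge the partial result lists.
        merged = {}
        for node in nodes:
            merged.update(node.search(terms, idf))
        return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

    nodes = [Node(), Node()]
    nodes[0].add(1, ["distributed", "index", "index"])
    nodes[1].add(2, ["distributed", "retrieval"])
    print(distributed_search(nodes, ["index"], total_docs=2))  # doc 1 only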
Term partitioning has the advantage of requiring fewer disk-seek and transfer operations during query evaluation than document partitioning, because each term's inverted list is still stored contiguously on a single machine rather than in fragments across multiple machines. Each processor has full information about a subset of the terms, meaning that to handle a query, only the relevant subset of the processors needs to respond. The majority of the processing load still falls to the coordinating machine. In a term-distributed system, having one machine offline is likely to be immediately noticeable.
Document distribution typically results in a better balance of workload than does term partitioning and achieves superior query throughput. It also allows more naturally for index construction and for document insertion. Document distribution also has the pragmatic advantage of still allowing a search service to be provided even when one of the hosts is offline for some reason since any answers not resident on that machine remain available to the system.
Distributed indexing
How do we decide the partition of documents to nodes? A common implementation heuristic is to partition the document collection into high-scoring indexes of documents that are more likely to score highly on most queries, and low-scoring indexes with the remaining documents. We only search the low-scoring indexes when there are too few matches in the high-scoring indexes.
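A minimal sketch of that fallback heuristic, assuming two in-memory tiers and a result-count threshold k; the names and the way a "query" is represented here are illustrative only:

    def tiered_search(query, high_tier, low_tier, k=10):
        """Return up to k matching doc IDs, consulting the low tier only if needed."""
        results = [doc for doc in high_tier if query(doc)]
        if len(results) < k:
            # Too few matches in the high-scoring tier: fall back to the low tier.
            results += [doc for doc in low_tier if query(doc)]
        return results[:k]

    # Toy usage: documents are plain strings, the "query" is a containment test.
    high = ["distributed index", "term partitioning", "document partitioning"]
    low = ["unrelated note", "another distributed system remark"]
    print(tiered_search(lambda d: "distributed" in d, high, low, k=3))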
The Google implementation uses a document-partitioned index with massive replication and redundancy at all levels: the machine, the processor cluster, and the site. Note: document partitioning remains effective even if the collaborating systems are independent and unable to exchange their index data. A distributed system in which a final answer list is synthesized from the possibly overlapping answer sets provided by a range of different services is called a metasearch engine (e.g. http://www.dogpile.com/).
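As a rough illustration of metasearch-style synthesis, the sketch below merges possibly overlapping ranked lists with a simple reciprocal-rank sum; this aggregation rule is an assumption for illustration and not how Dogpile actually combines results:

    from collections import defaultdict

    def metasearch(result_lists, k=10):
        """Merge ranked URL lists from independent engines into one ranking."""
        scores = defaultdict(float)
        for ranking in result_lists:
            for rank, url in enumerate(ranking, start=1):
                scores[url] += 1.0 / rank  # reciprocal-rank contribution
        return [url for url, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:k]

    engine_a = ["http://a.example", "http://b.example", "http://c.example"]
    engine_b = ["http://b.example", "http://d.example"]
    print(metasearch([engine_a, engine_b]))  # b ranks first: it appears in both lists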