
-
Overview
An inverted index is a data structure that supports very fast full-text searches. It lists every unique word that appears in any document and identifies all of the documents each word occurs in.
-
Notions
-
Full-text database vs bibliographic database
A full-text database or a complete-text database is a database that contains the complete text of books, dissertations, journals, magazines, newspapers or other kinds of textual documents.
A bibliographic database is a database of bibliographic records, an organized digital collection of references to published literature, including journal and newspaper articles, conference proceedings, reports, government and legal publications, books, etc.
-
In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database.
Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references).
-
In computer science, an inverted index (also referred to as a postings file or inverted file) is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of documents (named in contrast to a forward index, which maps from documents to content).
The purpose of an inverted index is to allow fast full-text searches, at a cost of increased processing when a document is added to the database.
The inverted file may be the database file itself, rather than its index.
It is the most popular data structure used in document retrieval systems.
The inverted index data structure is a central component of a typical search engine indexing algorithm.
-
The forward index stores a list of words for each document.
Document    Words
Document1   the, cow, says, moo
Document2   the, cat, and, the, hat
Document3   the, dish, ran, away, with, the, spoon
-
Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.
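To make the contrast between the two structures concrete, here is a minimal Python sketch that inverts the forward index from the table above. The function name `invert` and the use of plain dictionaries are illustrative assumptions, not how any particular search engine stores its index.

```python
from collections import defaultdict

# Forward index from the table above: document -> list of words.
forward_index = {
    "Document1": ["the", "cow", "says", "moo"],
    "Document2": ["the", "cat", "and", "the", "hat"],
    "Document3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}


def invert(forward):
    """Turn a document -> words mapping into a word -> documents mapping."""
    inverted = defaultdict(set)
    for doc_id, words in forward.items():
        for word in words:
            # A set absorbs repeated words within the same document.
            inverted[word].add(doc_id)
    # Sort each document list so the output is deterministic.
    return {word: sorted(docs) for word, docs in inverted.items()}


inverted_index = invert(forward_index)
print(inverted_index["the"])  # ['Document1', 'Document2', 'Document3']
print(inverted_index["cat"])  # ['Document2']
```

Finding every document that contains a word is now a single dictionary lookup, whereas the forward index would require scanning every document's word list.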
-
The Awesome Power of the Inverted Index
The inverted index is a wonder that helps find and make sense of information buried in mounds of data, text and binaries.
An Inverted Index is a simple but powerful way to search documents, images, media, and even data. Unlike just a keyword search, an inverted index allows you to search the inherent structure of any document.
There’s no need to use a table name or special query language to get the information you want. You just type it into a search box and the search engine figures out the rest.
Inverted indexes were invented decades ago, in the same era in which many of the first AI and machine learning algorithms were born. But the vast increase in computing power in recent years has made it possible to use the inverted index structure to generate fast search results from huge stores of indexed data and information.
One of the reasons they’ve become so popular is the Apache Solr open source project, which created a basic infrastructure for building inverted indexes and running searches over them.
Inverted indexes should become an integral tool for IT innovators because they help companies make sense of the exploding landscape of data, especially data spread across many different forms and locations.
-
Traditional Database (Forward Indexes) vs Search Engines (Inverted Index)
In traditional SQL DB the data will look something like this:
Doc ID   Doc Content
1        Welcome to the Hotel California Such a lovely place
2        she’s buying a stairway to Heaven
3        Hey Jude, don’t make it bad
4        Welcome to the heaven
Performance in traditional SQL DBs is gained by querying over primary key or by building efficient “indexes” for traversing these db tables.
You can use inverted indexes in SQL DBs like PostgreSQL, but they are not as efficient as they are in search engines like Elasticsearch/Lucene.
The indexes used in SQL DBs, such as the B-Tree index (the default one) and hash indexes, are a kind of forward index, where the mapping generally goes from the Document (aka Doc ID) to the whole data row.
In Reverse (inverted) Indexes the mapping goes from “terms” to the Documents, as shown in the table below:
Term         Doc Id
buying       Doc2
california   Doc1
Heaven       Doc2, Doc4
hotel        Doc1
Jude         Doc3
lovely       Doc1
stairway     Doc2
welcome      Doc1, Doc4
and so on…   …
If you just search “welcome lovely”, there is no exact match in the database, but using the inverted index we can see that the user is looking for Doc1 and Doc4 (Doc1 has the highest rank score, since it appears in the document lists for both welcome and lovely).
-
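The “welcome lovely” lookup described above can be sketched in a few lines of Python. This is a simplified illustration, not how Elasticsearch or Lucene actually score results: the tokenizer, the helper names `build_index` and `search`, and the ranking rule (count of distinct query terms matched) are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# The four documents from the table above.
docs = {
    "Doc1": "Welcome to the Hotel California Such a lovely place",
    "Doc2": "she's buying a stairway to Heaven",
    "Doc3": "Hey Jude, don't make it bad",
    "Doc4": "Welcome to the heaven",
}


def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many distinct query terms they contain."""
    scores = Counter()
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = build_index(docs)
print(search(index, "welcome lovely"))  # [('Doc1', 2), ('Doc4', 1)]
```

Doc1 matches both terms and Doc4 matches only one, which reproduces the ranking described in the text. Because the tokenizer lowercases everything, “Heaven” and “heaven” end up in the same document list.
-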
Components of Inverted Indexes
The two main components of an inverted index are the Dictionary and the Posting Lists.
-
Dictionary
The dictionary works as a lookup data structure on top of the posting lists.
There are two broad classes of solutions for implementing it: hashing and search trees.
Given an inverted index and a query, our first task is to determine whether each query term exists in the vocabulary.
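As a rough illustration of the two approaches, the Python sketch below checks query terms against a small vocabulary twice: once with a hash-based `set`, and once with a sorted list searched by binary search. The sorted list is only a stand-in for a real search tree (Python has no built-in one), and the vocabulary and helper names are assumptions for the example.

```python
from bisect import bisect_left

vocabulary = ["buying", "california", "heaven", "hotel", "jude",
              "lovely", "stairway", "welcome"]

# Hashing: a Python set gives average constant-time exact-match lookups.
hashed_vocab = set(vocabulary)

# Search-tree stand-in: a sorted list searched with binary search.
sorted_vocab = sorted(vocabulary)


def in_sorted_vocab(term):
    """Exact-match lookup via binary search."""
    i = bisect_left(sorted_vocab, term)
    return i < len(sorted_vocab) and sorted_vocab[i] == term


def terms_with_prefix(prefix):
    """Ordered structures can also answer prefix queries; plain hashing cannot."""
    start = bisect_left(sorted_vocab, prefix)
    matches = []
    for term in sorted_vocab[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


query = ["welcome", "lovely", "zebra"]
print([(t, t in hashed_vocab) for t in query])
print([(t, in_sorted_vocab(t)) for t in query])
print(terms_with_prefix("h"))  # ['heaven', 'hotel']
```

Both structures answer the membership question; the ordered one is slower per lookup but keeps terms in sorted order, which is what makes the prefix query possible.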
-
Posting Lists
The actual index data is stored in the posting lists.
It is accessed through the search engine’s dictionary. Each term has its own posting list assigned to it.
Since posting lists are usually too large to hold in memory, it is better to keep them on disk to reduce cost. Only during query processing are the query terms’ posting lists loaded into memory, as required by the query processing routines.
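The split between an in-memory dictionary and on-disk posting lists can be sketched as follows. This is a toy layout under stated assumptions: document IDs are stored as uncompressed 4-byte integers, the dictionary keeps an (offset, count) pair per term, and a temporary file stands in for the index file. Real engines use compressed and far more elaborate formats, and the function names here are hypothetical.

```python
import struct
import tempfile

postings = {
    "heaven": [2, 4],
    "lovely": [1],
    "welcome": [1, 4],
}


def write_postings(file, postings):
    """Write each posting list to `file`; return term -> (offset, count)."""
    dictionary = {}
    for term in sorted(postings):
        doc_ids = postings[term]
        dictionary[term] = (file.tell(), len(doc_ids))
        # Each doc ID is stored as a 4-byte little-endian unsigned int.
        file.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))
    file.flush()
    return dictionary


def load_postings(file, dictionary, term):
    """Read only the posting list for `term` from the file."""
    if term not in dictionary:
        return []
    offset, count = dictionary[term]
    file.seek(offset)
    data = file.read(4 * count)
    return list(struct.unpack(f"<{count}I", data))


with tempfile.TemporaryFile() as f:
    dictionary = write_postings(f, postings)   # small, stays in memory
    welcome = load_postings(f, dictionary, "welcome")
    lovely = load_postings(f, dictionary, "lovely")
    print(welcome, lovely)                     # [1, 4] [1]
    print(sorted(set(welcome) & set(lovely)))  # [1]
```

Only the two posting lists named in the query are read; the list for “heaven” never leaves the file.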
-
Stop words
Some extremely common words that appear to be of little value in helping select documents matching a query are excluded from the vocabulary entirely. Examples include a, an, and, are, and as.
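A minimal sketch of stop-word filtering at indexing time, assuming whitespace tokenization and a stop list containing only the five examples listed above (real stop lists are longer and language-specific):

```python
# The example stop words listed above; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "as"}


def index_terms(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


print(index_terms("A cat and an owl are as quiet as mice"))
# ['cat', 'owl', 'quiet', 'mice']
```

The same filter would normally be applied to queries as well, so that a stop word in a query does not trigger a lookup for a term that was never indexed.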
