Understanding inverted index || full-text search || Search Engines - 白红宇's personal blog

Published: 2021-05-07 14:23:47 Category: Featured articles

Overview

An inverted index is a data structure that supports very fast full-text searches. It lists every unique word that appears in any document and identifies all of the documents each word occurs in.

Notions

Full-text database vs bibliographic database

A full-text database or a complete-text database is a database that contains the complete text of books, dissertations, journals, magazines, newspapers or other kinds of textual documents.

A bibliographic database is a database of bibliographic records, an organized digital collection of references to published literature, including journal and newspaper articles, conference proceedings, reports, government and legal publications, books, etc.

In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database.

Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references).

In computer science, an inverted index (also referred to as a postings file or inverted file) is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of documents (named in contrast to a forward index, which maps from documents to content).

The purpose of an inverted index is to allow fast full-text searches, at a cost of increased processing when a document is added to the database.

The inverted file may be the database file itself, rather than its index.

It is the most popular data structure used in document retrieval systems.

The inverted index data structure is a central component of a typical search engine indexing algorithm.

The forward index stores a list of words for each document.

Document	Words
Document1	the, cow, says, moo
Document2	the, cat, and, the, hat
Document3	the, dish, ran, away, with, the, spoon
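
The relationship between the two index types can be sketched in a few lines of Python. This is only a minimal illustration using the table above as input; the function name `invert` and the choice of plain dicts and sets are assumptions made for the example, not how any particular search engine stores its index.

```python
from collections import defaultdict

# Forward index from the table above: document -> list of words.
forward_index = {
    "Document1": ["the", "cow", "says", "moo"],
    "Document2": ["the", "cat", "and", "the", "hat"],
    "Document3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}


def invert(forward):
    """Turn a document -> words mapping into a word -> documents mapping."""
    inverted = defaultdict(set)
    for doc_id, words in forward.items():
        for word in words:
            inverted[word].add(doc_id)
    # Sort the document ids so each entry has a stable order.
    return {word: sorted(doc_ids) for word, doc_ids in inverted.items()}


inverted_index = invert(forward_index)
print(inverted_index["the"])  # ['Document1', 'Document2', 'Document3']
print(inverted_index["cat"])  # ['Document2']
```

Building the inverted mapping requires a pass over every word of every document, which is the extra processing cost paid at indexing time in exchange for fast lookups.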

Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.

The Awesome Power of the Inverted Index

The inverted index is a wonder that helps find and make sense of information buried in mounds of data, text and binaries.

An Inverted Index is a simple but powerful way to search documents, images, media, and even data. Unlike just a keyword search, an inverted index allows you to search the inherent structure of any document.

There’s no need to use a table name or special query language to get the information you want. You just type it into a search box and the search engine figures out the rest.

Inverted indexes were invented decades ago, in the same era in which many of the first AI and machine learning algorithms were born. But the vast increase in computing power in recent years has made it possible to use the inverted index structure to generate fast search results from huge stores of indexed data and information.

One of the reasons they've become so popular is the Apache Solr open source project, which created a basic infrastructure for building inverted indexes and running searches over them.

Inverted indexes should become an integral tool for IT innovators because they help companies make sense of the exploding landscape of data, especially data spread across many different forms and locations.

Traditional Database (Forward Indexes) vs Search Engines (Inverted Index)

In a traditional SQL database, the data looks something like this:

Doc ID	Doc Content
1	Welcome to the Hotel California Such a lovely place
2	she’s buying a stairway to Heaven
3	Hey Jude, don’t make it bad
4	Welcome to the heaven

Performance in traditional SQL databases is gained by querying on the primary key or by building efficient “indexes” for traversing the database tables.

You can use inverted indexes in SQL databases such as PostgreSQL, but they are not as efficient there as they are in search engines such as Elasticsearch or Lucene.

The indexes used in SQL databases, such as the B-tree index (the default one) and hash indexes, are a kind of forward index: the mapping generally goes from the document (that is, the doc ID) to the whole data row.

In inverted indexes the mapping is from “terms” to the documents (as shown in the table below):

Term	Doc IDs
buying	Doc2
california	Doc1
heaven	Doc2, Doc4
hotel	Doc1
jude	Doc3
lovely	Doc1
stairway	Doc2
welcome	Doc1, Doc4
and so on…	…

A search for “welcome lovely” has no exact match in the database, but the inverted index shows that the likely results are Doc1 and Doc4. Doc1 gets the highest rank score because it appears in the document lists for both welcome and lovely, while Doc4 appears only in the list for welcome.
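
The “welcome lovely” lookup can be reproduced with a small sketch. The scoring used here (count how many distinct query terms a document contains) is an assumption chosen to match the explanation above; production search engines use more elaborate ranking functions. The helper names `tokenize`, `build_index`, and `search` are hypothetical.

```python
import re
from collections import Counter, defaultdict

docs = {
    "Doc1": "Welcome to the Hotel California Such a lovely place",
    "Doc2": "she's buying a stairway to Heaven",
    "Doc3": "Hey Jude, don't make it bad",
    "Doc4": "Welcome to the heaven",
}


def tokenize(text):
    # Lowercase and keep runs of letters/apostrophes; a deliberately naive tokenizer.
    return re.findall(r"[a-z']+", text.lower())


def build_index(documents):
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many distinct query terms they contain."""
    scores = Counter()
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = build_index(docs)
print(search(index, "welcome lovely"))  # [('Doc1', 2), ('Doc4', 1)]
```

Note that the query never scans the document texts: it only reads the two posting lists for welcome and lovely.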

Components of Inverted Indexes

The two main components of an inverted index are the dictionary and the posting lists.

Dictionary

The dictionary works as a lookup data structure on top of the posting lists.

There are two broad classes of solutions for implementing it: hashing and search trees.

Given an inverted index and a query, our first task is to determine whether each query term exists in the vocabulary.
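
As a rough illustration of the two approaches, the sketch below uses a Python dict for the hashing variant and a sorted list searched with `bisect` as a stand-in for a search tree. The vocabulary and the function names are made up for the example; a real engine would typically use a more specialized ordered structure.

```python
from bisect import bisect_left

# Hypothetical vocabulary; each term maps to its posting list.
postings = {
    "buying": ["Doc2"],
    "california": ["Doc1"],
    "heaven": ["Doc2", "Doc4"],
    "hotel": ["Doc1"],
    "welcome": ["Doc1", "Doc4"],
}


def lookup_hash(term):
    """Hashing: a dict gives fast exact-match lookups, but no ordering."""
    return postings.get(term)


# Ordered alternative: a sorted term list searched by bisection,
# standing in here for a search tree.
sorted_terms = sorted(postings)


def lookup_sorted(term):
    """Exact-match lookup by binary search over the sorted terms."""
    i = bisect_left(sorted_terms, term)
    if i < len(sorted_terms) and sorted_terms[i] == term:
        return postings[term]
    return None


def terms_with_prefix(prefix):
    """Ordered structures can also enumerate all terms sharing a prefix."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


print(lookup_hash("hotel"))      # ['Doc1']
print(lookup_sorted("missing"))  # None
print(terms_with_prefix("h"))    # ['heaven', 'hotel']
```

The trade-off shown here is the usual one: hashing answers “does this exact term exist?” very quickly, while an ordered structure also supports prefix-style lookups.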

Posting Lists

The actual index data is stored in posting lists.

They are accessed through the search engine's dictionary. Each term has its own posting list assigned to it.

Posting lists are usually too large to keep in memory, so it is better to store them on disk to reduce cost. Only during query processing are the query terms' posting lists loaded into memory, as required by the query processing routines.
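
One way to picture this split is a small in-memory dictionary that records where each posting list lives in a file on disk. The sketch below is an assumption-laden toy: the file layout (4-byte little-endian document ids), the integer ids, and the names `write_postings` and `read_postings` are all invented for illustration and do not reflect the on-disk format of any real engine.

```python
import os
import struct
import tempfile

# Hypothetical postings: term -> sorted list of integer document ids.
postings = {
    "heaven": [2, 4],
    "hotel": [1],
    "welcome": [1, 4],
}


def write_postings(path, postings):
    """Write all posting lists to one file; return term -> (offset, count)."""
    dictionary = {}
    with open(path, "wb") as f:
        for term in sorted(postings):
            doc_ids = postings[term]
            dictionary[term] = (f.tell(), len(doc_ids))
            # Each document id is stored as a 4-byte little-endian unsigned int.
            f.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))
    return dictionary


def read_postings(path, dictionary, term):
    """Load a single term's posting list from disk, or [] if the term is unknown."""
    if term not in dictionary:
        return []
    offset, count = dictionary[term]
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(4 * count)
    return list(struct.unpack(f"<{count}I", data))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "postings.bin")
    dictionary = write_postings(path, postings)  # small, stays in memory
    welcome_entry = dictionary["welcome"]
    welcome_docs = read_postings(path, dictionary, "welcome")
    unknown_docs = read_postings(path, dictionary, "jude")

print(welcome_entry)  # (12, 2)
print(welcome_docs)   # [1, 4]
```

Only the posting list for the queried term is read; the lists for heaven and hotel stay on disk.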

Stop words

Some extremely common words that would appear to be of little value in helping select documents matching a user's need are excluded from the vocabulary entirely. These are called stop words; examples include a, an, and, are, and as.
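
In practice this filtering happens while documents are tokenized, before terms are added to the index. A minimal sketch follows; the stop list is illustrative only (the entries "the" and "to" are additions beyond the examples given above), and `index_terms` is a hypothetical helper name.

```python
# A tiny illustrative stop list; "the" and "to" are assumptions added for the example.
STOP_WORDS = {"a", "an", "and", "are", "as", "the", "to"}


def index_terms(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


print(index_terms("Welcome to the Hotel California"))
# ['welcome', 'hotel', 'california']
```

The same filter has to be applied to queries, otherwise a query term such as "the" would be looked up in a dictionary that never contains it.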
