Understanding inverted index||full-text search||Search Engines
Published: 2021-05-07 14:23:47


  • Overview

    Full-text search engines rely on a data structure called an inverted index, which supports very fast full-text searches. An inverted index lists every unique word that appears in any document and identifies all of the documents each word occurs in.

  • Notions

  • Full-text database vs bibliographic database

    A full-text database or a complete-text database is a database that contains the complete text of books, dissertations, journals, magazines, newspapers or other kinds of textual documents.

    A bibliographic database is a database of bibliographic records, an organized digital collection of references to published literature, including journal and newspaper articles, conference proceedings, reports, government and legal publications, books, etc.

  • In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database.

    Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references).

  • In computer science, an inverted index (also referred to as a postings file or inverted file) is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of documents (named in contrast to a forward index, which maps from documents to content).

    The purpose of an inverted index is to allow fast full-text searches, at a cost of increased processing when a document is added to the database.

    The inverted file may be the database file itself, rather than its index.

    It is the most popular data structure used in document retrieval systems.

    The inverted index data structure is a central component of a typical search engine indexing algorithm.

  • The forward index stores a list of words for each document.

    Document     Words
    Document1    the, cow, says, moo
    Document2    the, cat, and, the, hat
    Document3    the, dish, ran, away, with, the, spoon
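
    As a minimal sketch of how the forward index in the table above relates to an inverted index, the snippet below flips the document-to-words mapping into a word-to-documents mapping. The function name `invert` and the use of a plain Python dict are illustrative assumptions, not how any particular search engine stores its index.

```python
from collections import defaultdict

# Forward index from the table above: document -> list of words.
forward_index = {
    "Document1": ["the", "cow", "says", "moo"],
    "Document2": ["the", "cat", "and", "the", "hat"],
    "Document3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}


def invert(forward):
    """Turn a document -> words mapping into a word -> documents mapping."""
    inverted = defaultdict(set)
    for doc_id, words in forward.items():
        for word in words:
            # A set removes duplicates such as the repeated "the".
            inverted[word].add(doc_id)
    # Sort the document ids so the output is deterministic.
    return {word: sorted(doc_ids) for word, doc_ids in inverted.items()}


inverted_index = invert(forward_index)
print(inverted_index["the"])  # ['Document1', 'Document2', 'Document3']
print(inverted_index["cat"])  # ['Document2']
```

    Building the inverted mapping touches every word of every document, which is the extra indexing cost mentioned earlier; the payoff is that a lookup by word no longer has to scan the documents.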
  • Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.

  • The Awesome Power of the Inverted Index

    The inverted index is a wonder that helps find and make sense of information buried in mounds of data, text and binaries.

    An inverted index is a simple but powerful way to search documents, images, media, and even data. Unlike a plain keyword search, an inverted index allows you to search the inherent structure of any document.

    There’s no need to use a table name or special query language to get the information you want. You just type it into a search box and the search engine figures out the rest.

    Inverted indexes were invented decades ago, in the same era in which many of the first AI and machine learning algorithms were born. But the vast increase in computing power in recent years has made it possible to use the inverted index structure to generate fast search results from huge stores of indexed data and information.

    One of the reasons they’ve become so popular is the Apache Solr open source project, which provides a basic infrastructure for building inverted indexes and searching over them.

    Inverted indexes should become an integral tool for IT innovators because they help companies make sense of the exploding landscape of data, especially data spread across many different forms and locations.

  • Traditional Database (Forward Indexes) vs Search Engines (Inverted Index)


    In a traditional SQL database, the data looks something like this:

    Doc ID    Doc Content
    1         Welcome to the Hotel California Such a lovely place
    2         she’s buying a stairway to Heaven
    3         Hey Jude, don’t make it bad
    4         Welcome to the heaven

    Traditional SQL databases get their performance from querying by primary key or from building efficient “indexes” for traversing these tables.

    You can use inverted indexes in SQL databases such as PostgreSQL, but they are generally not as efficient there as in search engines such as Elasticsearch, which is built on Lucene.

    The indexes commonly used in SQL databases, such as the B-tree index (the default) and hash indexes, are a kind of forward index: the mapping generally goes from a document (its doc ID) to the whole data row.

    In an inverted index, the mapping goes from “terms” to documents, as shown in the table below:

    Term          Doc ID
    buying        Doc2
    california    Doc1
    heaven        Doc2, Doc4
    hotel         Doc1
    jude          Doc3
    lovely        Doc1
    stairway      Doc2
    welcome       Doc1, Doc4
    and so on…

    If you search for “welcome lovely”, no document in the database contains that exact phrase, but the inverted index shows that the user is probably looking for Doc1 or Doc4. Doc1 gets the highest rank score because it appears in the document lists for both “welcome” and “lovely”.
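
    The “welcome lovely” lookup can be sketched in a few lines of Python over the four documents from the table above. The helper names (`tokenize`, `build_index`, `search`) are hypothetical, and the score is simply the number of distinct query terms a document contains, which is an assumption made for illustration; production engines typically use more elaborate scoring.

```python
import re
from collections import Counter, defaultdict

docs = {
    1: "Welcome to the Hotel California Such a lovely place",
    2: "she's buying a stairway to Heaven",
    3: "Hey Jude, don't make it bad",
    4: "Welcome to the heaven",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Rank documents by the number of distinct query terms they contain."""
    hits = Counter()
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    # Highest count first; ties broken by document id.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


index = build_index(docs)
print(search(index, "welcome lovely"))  # [(1, 2), (4, 1)]
```

    Doc 1 matches both terms and Doc 4 matches one, so Doc 1 ranks first, as described above.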

  • Components of Inverted Indexes

    The two main components of an inverted index are the dictionary and the posting lists.

  • Dictionary

    The dictionary works as a lookup data structure on top of the posting lists.

    Dictionary implementations fall into two broad classes: hashing and search trees.

    Given an inverted index and a query, our first task is to determine whether each query term exists in the vocabulary.
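
    The two dictionary styles can be contrasted with a small sketch. Python's built-in dict plays the role of the hashing approach; a sorted list searched with `bisect` stands in for a search tree here (it is not a tree, but it gives the same ordered, logarithmic lookup). The toy `postings` data and the function names are assumptions for illustration.

```python
from bisect import bisect_left

# Toy posting lists keyed by term (values are document ids).
postings = {
    "heaven": [2, 4],
    "hotel": [1],
    "lovely": [1],
    "stairway": [2],
    "welcome": [1, 4],
}


def lookup_hash(term):
    """Hashing: exact-match lookup in a hash table, O(1) on average."""
    return postings.get(term)


# Ordered vocabulary: binary search over a sorted list, O(log n) per lookup.
sorted_terms = sorted(postings)


def lookup_sorted(term):
    """Find a term by binary search; return None if it is not in the vocabulary."""
    i = bisect_left(sorted_terms, term)
    if i < len(sorted_terms) and sorted_terms[i] == term:
        return postings[term]
    return None


def terms_with_prefix(prefix):
    """Terms sharing a prefix are adjacent in sorted order, so scan from the first match."""
    i = bisect_left(sorted_terms, prefix)
    matches = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches


print(lookup_hash("welcome"))    # [1, 4]
print(lookup_sorted("missing"))  # None
print(terms_with_prefix("h"))    # ['heaven', 'hotel']
```

    The trade-off this illustrates: hashing is the faster way to answer "does this exact term exist?", while an ordered structure also supports prefix and range queries over the vocabulary.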

  • Posting Lists

    The actual index data is stored in posting lists.

    They are accessed through the search engine’s dictionary. Each term has its own posting list assigned to it.

    Because posting lists are usually too large to hold in memory, it is cheaper to keep them on disk. A query term’s posting list is loaded into memory only during query processing, as required by the query processing routines.
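
    A rough sketch of this split, assuming a made-up binary layout: all posting lists are written to one file, while the in-memory dictionary keeps only each term's byte offset and length. The file name `postings.bin`, the 4-byte integer encoding, and both helper functions are hypothetical choices for the example, not the on-disk format of any real engine.

```python
import os
import struct
import tempfile

postings = {
    "heaven": [2, 4],
    "hotel": [1],
    "welcome": [1, 4],
}


def write_postings(path, postings):
    """Write every posting list to one file; return term -> (offset, count)."""
    dictionary = {}
    with open(path, "wb") as f:
        for term in sorted(postings):
            doc_ids = postings[term]
            dictionary[term] = (f.tell(), len(doc_ids))
            # Each document id is stored as a 4-byte little-endian unsigned int.
            f.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))
    return dictionary


def read_postings(path, dictionary, term):
    """Load a single term's posting list from disk, or [] if the term is unknown."""
    entry = dictionary.get(term)
    if entry is None:
        return []
    offset, count = entry
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(4 * count)
    return list(struct.unpack(f"<{count}I", data))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "postings.bin")
    dictionary = write_postings(path, postings)  # small, stays in memory
    welcome_docs = read_postings(path, dictionary, "welcome")
    missing_docs = read_postings(path, dictionary, "missing")

print(welcome_docs)  # [1, 4]
print(missing_docs)  # []
```

    Only the bytes for the requested term are read, so memory use during a query depends on the query terms rather than on the size of the whole index.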

  • Stop words

    Some extremely common words appear to be of little value in helping select documents that match a query, so they are excluded from the vocabulary entirely. These words are called stop words. Examples include a, an, and, are, and as.
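
    Stop-word removal usually happens while the text is tokenized, before terms reach the index. The sketch below uses only the five example words above as its stop list; the `STOP_WORDS` set and the `index_terms` helper are illustrative assumptions, and real stop lists are longer and language-specific.

```python
import re

# Illustrative stop list taken from the examples in the text.
STOP_WORDS = {"a", "an", "and", "are", "as"}


def index_terms(text):
    """Lowercase, tokenize, and drop stop words before terms enter the vocabulary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(index_terms("A cat and a hat are as lovely as an owl"))
# ['cat', 'hat', 'lovely', 'owl']
```

    The same filter has to be applied to queries, otherwise a query containing a stop word would look up a term that was never indexed.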

