A Survey of Automatic Query Expansion in Information Retrieval


Carpineto, C. and Romano, G. 2012. A survey of automatic query expansion in information retrieval. ACM Comput. Surv. 44, 1, Article 1 (January 2012), 50 pages.
DOI = 10.1145/2071389.2071390 http://doi.acm.org/10.1145/2071389.2071390

Motivation

The relative ineffectiveness of information retrieval systems is largely caused by the inaccuracy with which a query formed by a few keywords models the actual user information need.
One well-known method to overcome this limitation is automatic query expansion (AQE), whereby the user's original query is augmented by new features with a similar meaning.
The following questions need to be addressed. Why is query expansion so important to improve search effectiveness? What are the main steps involved in the design and implementation of an AQE component? What approaches to AQE are available and how do they compare? Which issues must still be resolved before AQE becomes a standard component of large operational information retrieval systems (e.g., search engines)?

Keywords

Query expansion, query refinement, search, word associations, pseudo-relevance feedback, document ranking

Introduction

Given that user queries are usually short and that natural language is inherently ambiguous, simple retrieval models are in general prone to errors and omissions.

  • problem
    1. term mismatch problem: the indexers and the users often do not use the same words.
      Problem | Description | Result
      --- | --- | ---
      synonymy | different words with the same or similar meanings, such as “tv” and “television” | a decrease in recall
      word inflections | such as with plural forms, “television” versus “televisions” | a decrease in recall
      polysemy | the same word with different meanings, such as “java” | a decrease in precision
      To deal with the vocabulary problem, several approaches have been proposed including interactive query refinement, relevance feedback, word sense disambiguation, and search results clustering. One of the most natural and successful techniques is to expand the original query with other words that best capture the actual user intent, or that simply produce a more useful query—a query that is more likely to retrieve relevant documents.
      While there has been a slight increase in the number of long queries (of five or more words), the most prevalent queries are still those of one, two, and three words. In this situation, the vocabulary problem has become even more serious because the paucity of query terms reduces the possibility of handling synonymy, while the heterogeneity and size of the data make the effects of polysemy more severe. The need for and the scope of AQE have thus increased.
    2. DOCUMENT RANKING WITH AQE The similarity sim(q, d) between query q and document d can usually be expressed as
      sim(q, d) = \sum_{t \in q \cap d} w_{t,q} \cdot w_{t,d}

      The basic input to AQE consists of the original query q and a source of data from which to compute and weight the expansion terms. The output of AQE is a query q′ formed by an expanded set of terms with their associated weights w′.
      The weight of a term is typically proportional to its frequency in the query or document and inversely proportional to the frequency and length of the documents containing it; this principle is shared by the main ranking models, including the vector space model [Salton and McGill 1983], the probabilistic relevance model [Robertson et al. 1998], statistical language modeling [Zhai and Lafferty 2001b], and deviation from randomness [Amati et al. 2001].
      The new weighted query terms are used to compute the similarity between query q′ and document d
      sim(q', d) = \sum_{t \in q' \cap d} w'_{t,q'} \cdot w_{t,d}

      The most typical data source for generating new terms is the collection itself being searched, and the simplest way of weighting the query expansion terms is to reuse the weighting function of the ranking system (a toy sketch of this ranking step appears after this list).
    3. WHY AND WHEN AQE WORKS In recall-oriented applications such as legal, financial, medical, or scientific search, the main goal is to retrieve all documents that are relevant to an issue, so query expansion is particularly valuable there.
      The additional terms, however, may cause query drift, that is, the alteration of the focus of a search topic caused by improper expansion [Mitra et al. 1998], thus hurting precision.
      When an expansion term is correlated with a single term of the original query rather than with the entire query it may easily match unrelated concepts.
      It is also possible that the set of candidate expansion terms as a whole is not relevant to the original query. This may happen, for instance, when AQE is based on the top documents retrieved in response to the original query and such documents are mostly not relevant.
      The relevant documents that match just the original query terms may move lower down in the ranking after query expansion, even if the additional terms are relevant to the query concept.
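To make the ranking step above concrete, here is a minimal Python sketch, with made-up term weights, of scoring a document against the original query q and against an expanded query q′ obtained by merging in weighted expansion features. The helper names (sim, expand_query), the interpolation parameter alpha, and all the numbers are illustrative assumptions, not details from the survey.

```python
# Toy illustration of sim(q, d) = sum over shared terms t of w_{t,q} * w_{t,d},
# before and after query expansion. All weights below are invented.

def sim(query_weights, doc_weights):
    """Score a document against a (possibly expanded) weighted query."""
    shared = set(query_weights) & set(doc_weights)
    return sum(query_weights[t] * doc_weights[t] for t in shared)

def expand_query(query_weights, expansion_weights, alpha=0.5):
    """Build q' by adding expansion terms, down-weighted by alpha.

    The interpolation scheme is an assumption for illustration; real systems
    weight expansion terms with the ranking model's own weighting function.
    """
    expanded = dict(query_weights)
    for term, weight in expansion_weights.items():
        expanded[term] = expanded.get(term, 0.0) + alpha * weight
    return expanded

q = {"tiger": 1.0, "operating": 0.8, "system": 0.8}
q_prime = expand_query(q, {"mac": 0.6, "os": 0.6, "x": 0.4})

doc = {"tiger": 0.4, "mac": 0.5, "os": 0.7, "x": 0.3, "release": 0.2}
print(sim(q, doc))        # score with the original query q
print(sim(q_prime, doc))  # higher score with the expanded query q'
```

In this toy example the document about the Mac OS X release of “Tiger” gains score after expansion, mirroring the disambiguation effect discussed below.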

On the other hand, AQE can also improve precision, for the reasons discussed next.

There may be several types of out-of-context matches causing false drops. In Bodoff and Kambil [1998], for instance, five types were identified: polysemy, ordered relationships among terms (e.g., “wars due to crises” versus “crises due to wars”), out of phrase terms (when a query or document phrase is not treated as a single unit), secondary topic keyword (e.g., “siamese cats” versus “cats”), and noncategorical terms (e.g., “tiger” is simultaneously an instance of “mammal” and of “operating system”).

The additional terms favor a more univocal interpretation of the original query. For example, if the query “tiger, operating system” is expanded with “Mac OS X,” the score of the documents about the computer meaning of “tiger” will increase while the score of the documents about different meanings of “tiger” or different operating systems will decrease.

  1. Alternative strategies
    • Interactive Query Refinement (IQE). Its main difference from automatic methods is that the system provides several suggestions for query (re)formulation, but the decision is made by the user.
      From a computational point of view, IQE and AQE share the first two computational steps, namely data acquisition and candidate feature generation, whereas IQE does not address the subsequent problems of feature selection and query reformulation.
  • Relevance Feedback Relevance feedback takes the results that are initially returned from a given query and uses information provided by the user about whether or not those results are relevant to perform a new query. The content of the assessed documents is used to adjust the weights of terms in the original query and/or to add words to the query.

  • Word Sense Disambiguation in IR Early work focused on representing words by the text of their dictionary definitions, or by their WordNet synsets. On the whole, however, the application of WSD to IR presents both computational and effectiveness limitations. Furthermore, a typical query context, as in Web searches, may be too short for sense disambiguation.

  • Search Results Clustering Search results clustering (SRC) organizes search results by topic, thus allowing, in principle, direct access to the documents pertaining to distinct aspects of the given query. In contrast to conventional clustering, SRC algorithms try to optimize not only the clustering structure, but also the quality of cluster labels, because a cluster with a poor description is very likely to be entirely omitted by the user even if it points to a group of strongly related and relevant documents.

Solution: HOW AQE WORKS

  1. Preprocessing of Data Source Many query expansion techniques are based on the information contained in the top-ranked items retrieved in response to the original user query from a collection of documents. Indexing usually comprises the following steps (a toy sketch follows the list):
    (1) text extraction from documents like HTML, PDF, MS Word, and so on (if the collection is made of such documents);
    (2) tokenization (extraction of individual words, ignoring punctuation and case);
    (3) stop word removal (removal of common words such as articles and prepositions);
    (4) word stemming (reduction of inflected or derivational words to their root form);
    (5) word weighting (assignment of a score that reflects the importance of the word, usually in each document).
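
As a rough, hedged illustration of steps (2) through (5), the sketch below implements a toy version of this indexing pipeline with only the Python standard library. The tiny stop word list, the crude suffix-stripping “stemmer”, and the tf-idf-style weighting are simplifications standing in for real components (full stop lists, Porter-style stemming, the ranking model's own weighting scheme).

```python
# Toy indexing pipeline: tokenization, stop word removal, crude stemming,
# and tf-idf-style word weighting. Deliberately simplified for illustration.
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "to", "is"}

def tokenize(text):
    """Extract lowercase word tokens, ignoring punctuation and case."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Very crude suffix stripping; a real system would use e.g. Porter stemming."""
    for suffix in ("ing", "ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [stem(t) for t in remove_stop_words(tokenize(text))]

def tf_idf_weights(tokenized_docs):
    """Assign each word in each document a tf * idf score."""
    n_docs = len(tokenized_docs)
    df = Counter(t for tokens in tokenized_docs for t in set(tokens))
    return [
        {t: tf * math.log(n_docs / df[t]) for t, tf in Counter(tokens).items()}
        for tokens in tokenized_docs
    ]

docs = ["Televisions and TV sets.", "The Java programming language."]
print(tf_idf_weights([preprocess(d) for d in docs]))
```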

  2. Generation and Ranking of Candidate Expansion Features

    • One-to-One Associations. Two simple measures of similarity are the Dice coefficient and the Jaccard index (a toy computation of these measures, and of the one-to-many combination discussed below, appears after this list of approaches). Given terms u and v, the Dice coefficient (D) is defined as
      D_{u,v} = \frac{2 \, n_{u \wedge v}}{n_u + n_v},
      where n_u and n_v are the numbers of documents containing u and v, respectively, and n_{u \wedge v} is the number of documents containing both.

      The Jaccard index (J) is defined as
      J_{u,v} = \frac{n_{u \wedge v}}{n_u + n_v - n_{u \wedge v}}

      Computing co-occurrence over whole documents is simple, but it has the disadvantage that word position is not taken into account: two terms that co-occur in the same sentence are usually more strongly related than two terms that merely occur far apart in the same document.
      A more comprehensive measure of word association that incorporates term dependency is mutual information [Church and Hanks 1990; van Rijsbergen 1979], defined as
      I_{u,v} = \log \frac{P(u, v)}{P(u) \, P(v)},
      where P(u, v) is the probability of observing u and v together and P(u), P(v) are their individual probabilities of occurrence.

      Alternatively, we could consider the classical definition of conditional probability to measure the strength of the association of term v to term u
      P(v \mid u) = \frac{P(u, v)}{P(u)}

      When query logs are used as the data source, several additional contextual clues extracted from the logs have been used to help identify useful associations.
  • One-to-Many Associations. One-to-one associations tend to add a term when it is strongly related to one of the query terms. However, this may not accurately reflect the relationships of the expansion term to the query as a whole.
    For example, while the word “program” may well be highly associated with the word “computer,” an automatic expansion of all queries containing “program” with “computer” might work well for some queries (e.g., “Java program,” “application program”), but not for others (e.g., “TV program,” “government program,” “space program”).
    If we use term-to-term correlations, we might compute the correlation factors of a given candidate expansion term v to every query term, and then combine the resulting scores to find the correlation to the global query q, for example by
    c_{q,v} = \prod_{u \in q} \left( \delta + c_{u,v} \right),
    where c_{u,v} is the term-to-term correlation factor and \delta is a small smoothing constant.

    A second, perhaps more principled, approach to finding one-to-many associations is based on combining multiple relationships between term pairs through a Markov chain framework [Collins-Thompson and Callan 2005]. For each query, a term network is constructed that contains pairs of words linked by several types of relations, such as synonyms, stems, and co-occurrence, together with transition probabilities.
  • Analysis of Feature Distribution in Top-Ranked Documents. The idea is to use the first documents retrieved in response to the original query as a more detailed description of the underlying query topic, from which to extract the most important terms to be used as expansion features.

  • Query Language Modeling A statistical language model of the query topic is estimated, typically from the top-ranked documents, and the best terms for query expansion are those with the highest probabilities under this model. These techniques are usually referred to as model-based.
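
To make the association measures of this section concrete, here is a toy Python computation of the Dice coefficient, the Jaccard index, and mutual information from document-level co-occurrence counts, followed by one possible one-to-many combination, the product form c_{q,v} = \prod_{u \in q}(\delta + c_{u,v}) shown above, using the Dice coefficient as the term-to-term correlation. The four-document “corpus”, the candidate terms, and the value of delta are invented for illustration and are not data from the survey.

```python
# Toy one-to-one association measures (Dice, Jaccard, mutual information)
# computed from document-level co-occurrence, plus a simple one-to-many
# combination of the per-term correlations for a whole query.
import math

# A tiny made-up corpus: each document is reduced to its set of terms.
docs = [
    {"java", "program", "computer"},
    {"java", "coffee", "espresso"},
    {"computer", "program", "software"},
    {"tv", "program", "guide"},
]
N = len(docs)

def n(*terms):
    """Number of documents containing all of the given terms."""
    return sum(1 for d in docs if all(t in d for t in terms))

def dice(u, v):
    return 2 * n(u, v) / (n(u) + n(v))

def jaccard(u, v):
    return n(u, v) / (n(u) + n(v) - n(u, v))

def mutual_information(u, v):
    p_uv = n(u, v) / N
    return math.log(p_uv / ((n(u) / N) * (n(v) / N))) if p_uv else float("-inf")

def query_correlation(query_terms, v, delta=0.1):
    """One-to-many combination: product over u in q of (delta + c_{u,v}),
    using the Dice coefficient as the term-to-term correlation c_{u,v}."""
    return math.prod(delta + dice(u, v) for u in query_terms)

print(dice("java", "program"), jaccard("java", "program"),
      round(mutual_information("java", "program"), 3))
for candidate in ("computer", "coffee", "tv"):
    print(candidate, round(query_correlation(["java", "program"], candidate), 3))
```

With these toy counts, “computer”, which co-occurs with both query terms, scores higher than “coffee” or “tv”, each of which is associated with only one of them; that is precisely the behavior the one-to-many combination is meant to encourage.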

Conclusion

The End!
Reference: http://doi.acm.org/10.1145/2071389.2071390

