Leveraging semantic resources in diversified query expansion
Motivation
Search engines hedge their bets by diversifying top results to cover multiple possible interpretations of the query, so that the user is likely to be satisfied whatever her intended interpretation is.
Diversified Query Expansion
is the problem of diversifying query expansion suggestions, so that the user can specialize the query to better suit her intent, even before perusing search results.
Keywords
Query expansion
· Diversification
· Semantic search
· Wikipedia
· Entity ranking
Introduction
Search Result Diversification (SRD)
refers to the task of selecting and/or re-ranking search results so that many aspects of the query are covered in the top results.
Diversified Query Expansion (DQE)
is the task of identifying a (small) set of terms (i.e., words) to extend the search query with, wherein the extended search query could be used in the search system to retrieve results covering a diverse set of aspects.
Diversified Entity Recommendations (DER)
is the analogous problem where the output of interest is a ranked list of entities from a knowledge base such that diverse query aspects are covered among the top entities.
This paper considers the diversified query expansion problem and develops a three-phase framework to exploit semantic resources for it. The authors use the framework to develop methods focusing on Wikipedia and pre-learned word embeddings respectively, leading to techniques called Select-Link-Rank (SLR) and Select-Embed-Rank (SER).
Problem statement and solution framework
problem
Given a document corpus D and a query phrase Q, the diversified query expansion (DQE) problem requires that we generate an ordered (i.e., ranked) list of expansion terms E. Each of the terms in E may be appended to Q to create an extended query phrase that could be processed by a search engine operating over D using a relevance function such as BM25 or PageRank.
framework
The three-phase skeletal framework:
Selection
: This phase selects information of relevance to the query from the document corpus used in the retrieval system. Across our methods, we select a subset of terms that are deemed relevant to the query.
Correlation
: The information selected in the first phase is now correlated with external semantic resources. We propose separate methods for correlating with Wikipedia and pre-learned word embeddings, as we will illustrate in the next section.
Ranking
: This phase involves ranking candidate expansion terms in order to arrive at a final result set, E. In both our methods, we make use of diversity-conscious graph node ranking using the vertex-reinforced random walk technique to rank the expansion terms. However, differences in the previous phase across the methods entail consequent differences in this phase as well.
Select-Link-Rank: Wikipedia for diversified query expansion
This part is omitted because the knowledge-graph-based approach lacks a theoretical basis for my setting, and the following method better fits my problem.
Select-Embed-Rank: Word embeddings for diversified query expansion
Word embeddings
are word-specific vectors learnt by making use of word co-occurrence information. Unlike Wikipedia, which is an encyclopaedic semantic resource, word embeddings can be generated even for specialized corpora.
Figure: Pipeline of the SER algorithm
- Select: Selecting candidate expansion terms
The select phase in SER involves selecting the top-T terms from across the top-K documents retrieved in response to the search query, Q. The Bo1 measure is used to score terms, resulting in a candidate set Cand(Q, D). Due to the usage of the initial result set from a relevance-only search, SER is also amenable to being used within an IR re-ranking framework.
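The Bo1-based Select phase described above can be sketched as follows; the function and parameter names are illustrative (not the paper's code), and the corpus statistics are assumed to be supplied as a simple term-frequency map:

```python
import math
from collections import Counter

def bo1_scores(top_k_docs, corpus_term_freq, corpus_size, T=10):
    """Score terms from the top-K retrieved documents with the Bo1
    (Bose-Einstein 1) measure and return the top-T candidates.
    top_k_docs: list of tokenised documents, i.e. ResK(Q, D).
    corpus_term_freq: maps a term to its frequency f(t, D) in the
    whole collection; corpus_size is |D|."""
    tf = Counter(t for doc in top_k_docs for t in doc)  # freq in ResK(Q, D)
    scores = {}
    for t, freq in tf.items():
        lam = corpus_term_freq.get(t, 0) / corpus_size  # f(t, D) / |D|
        if lam == 0:
            continue  # no collection statistics for this term
        scores[t] = freq * math.log2((1 + lam) / lam) + math.log2(1 + lam)
    # Cand(Q, D): the T highest-scoring terms
    return sorted(scores, key=scores.get, reverse=True)[:T]
```

Note how rare-in-collection terms (small f(t, D)/|D|) that occur often in the top-K results score highest, which is exactly the behaviour wanted from a parameter-free expansion-term selector.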
Terms are scored with the standard Bo1 formula, Bo1(t) = tf(t) · log2((1 + f(t, D)/|D|) / (f(t, D)/|D|)) + log2(1 + f(t, D)/|D|), where tf(t) is the frequency of t in the top-K result set and f(a, B) denotes the frequency of the term a in the document collection represented by B. Thus, f(t, D)/|D| denotes the normalized frequency of t in D. Notably, Bo1 scoring does not involve any parameter that requires tuning. To ensure all aspects of Q have a representation in ResK(Q, D), K needs to be set to a large value; we set both K and T to 1000 in our method. The selected candidate terms are denoted as Cand(Q, D).
- Embed: Using word embeddings for term graph construction
SER was designed to be able to leverage pre-learned word embeddings, such as the Google News word2vec vectors or the Wikipedia/Twitter/Gigaword GloVe vectors, in diversified query expansion. This phase involves the construction of an initial term graph using word embedding similarities, followed by refining it by heuristically filtering out edges and vertices.
- Term graph construction
Let the word embedding for a term t be represented by t.V; a word embedding is a numeric vector of fixed dimensionality, usually 100–300. We now construct a graph G0(Q) = {V0(Q), E0(Q)}. V0(Q) simply comprises all terms in Cand(Q, D). The edge set E0(Q) consists of triplets (t, t′, s), each denoting a directed edge from t to t′ bearing a weight s that quantifies the similarity between t.V and t′.V. Any similarity measure that quantifies similarity in [0, 1] could be used; we consistently use cosine similarity, a popular similarity measure for numeric vector data.
- Term graph refinement
We employ some general heuristics to refine the graph G0(Q) by filtering out nodes and edges, in order to arrive at our final term graph G(Q) = {V(Q), E(Q)}.
1. General Word Filtering Heuristic
Word vectors are built to be good at explaining the context in which a word appears in the corpus. This causes words denoting different instances of the same type to map to similar vectors. The general word filtering heuristic filters the node set as V(Q) = { t ∈ V0(Q) : |Neighbors(t, E0(Q))| ≤ (μ/100) × |V0(Q)| }, where Neighbors(t, E0(Q)) denotes the set of nodes that are connected to t through edges in E0(Q). Thus, all nodes in V0(Q) that are linked to more than μ% of terms under G0(Q) are eliminated, leading to a refined set of nodes.
2. Edge Limit Heuristic
Frequently occurring terms within Cand(Q, D) would typically be placed in dense neighborhoods in the embedding space due to their co-occurrence with a large variety of terms. In order to limit the influence of such common terms, we limit the maximum number of edges that can originate from a node in the term graph by choosing the top-ρ edges with the highest weights: E(Q) = ∪_{t ∈ V(Q)} Top-ρ(t, E0(Q)), where Top-ρ(t, E0(Q)) denotes the top-ρ edges originating from t within E0(Q) when the edges are sorted based on their scores.
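The Embed phase (graph construction plus the two refinement heuristics) can be sketched as follows. Two assumptions of this sketch, since the blog omits the exact definitions: an edge is created only when cosine similarity reaches a threshold theta, and μ is read as a percentage of all candidate terms.

```python
import numpy as np

def build_term_graph(embeddings, theta=0.5, mu=30.0, rho=5):
    """Build and refine the SER term graph.
    embeddings: dict mapping candidate term -> numpy vector (t.V).
    theta: minimum cosine similarity for an edge (sketch assumption).
    mu:    eliminate nodes linked to more than mu% of all terms.
    rho:   keep only the top-rho outgoing edges per node.
    Returns an adjacency dict {term: [(neighbour, weight), ...]}."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    terms = list(embeddings)
    n = len(terms)
    # G0(Q): directed edge (t, t', s) whenever cosine similarity >= theta
    g0 = {t: [(u, cos(embeddings[t], embeddings[u]))
              for u in terms
              if u != t and cos(embeddings[t], embeddings[u]) >= theta]
          for t in terms}
    # General word filtering: drop nodes with too many neighbours
    kept = {t for t in terms if len(g0[t]) <= (mu / 100.0) * n}
    # Edge limit: keep the top-rho outgoing edges by weight, among kept nodes
    return {t: sorted((e for e in g0[t] if e[0] in kept),
                      key=lambda e: e[1], reverse=True)[:rho]
            for t in kept}
```

With a toy embedding space, a term sitting close to many others (a "general" word) ends up with a large neighbourhood and gets filtered out, while the surviving nodes keep only their strongest ρ links.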
- Rank: Ranking candidate terms
An important distinction between SLR and SER is that the former employs VRRW on the entity graph, whereas SER uses VRRW on the term graph directly. Unlike the SLR version, we do not use node importance weights in the SER graph, and thus the transition probability is uniform across all the edges. Once the VRRW stabilizes, we are left with a score for each term in the term graph, which we denote as S(t). The DQE output E is then the set of terms in V(Q) ordered in decreasing (or non-increasing, to be precise) order of the scores S(.).
Vertex reinforced random walk
(VRRW) is similar to PageRank, but it is a time-variant random walk process.
p_T(e, e′) is the transition probability from e to e′ at timestamp T; unlike in PageRank, it is reinforced by the number of times e′ has already been visited. The details are not included in this blog.
- Overall algorithms
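A simulation-style sketch of the VRRW-based Rank phase, assuming the uniform edge transition probabilities stated above; the 10% uniform restart that keeps the walk moving is a choice of this sketch, not from the paper:

```python
import random

def vrrw_rank(graph, steps=20000, seed=0):
    """Rank term-graph nodes with a simulated vertex-reinforced random
    walk: the probability of moving to a neighbour is proportional to
    the number of prior visits to that neighbour (uniform over edges
    otherwise), so the walk self-reinforces and concentrates on a few
    central nodes -- a diversity-conscious ranking. graph is an
    adjacency dict {node: [(neighbour, weight), ...]}.
    Returns nodes sorted by decreasing visit count, i.e. S(t)."""
    rng = random.Random(seed)
    visits = {v: 1 for v in graph}          # N_T(v), initialised to 1
    node = rng.choice(list(graph))
    for _ in range(steps):
        nbrs = graph.get(node, [])
        if not nbrs:                        # dead end: restart uniformly
            node = rng.choice(list(graph))
            continue
        # reinforcement: weight each neighbour by its visit count
        weights = [visits.get(u, 1) for u, _ in nbrs]
        node = rng.choices([u for u, _ in nbrs], weights=weights)[0]
        visits[node] = visits.get(node, 1) + 1
        if rng.random() < 0.1:              # occasional uniform restart
            node = rng.choice(list(graph))
    return sorted(graph, key=lambda v: visits[v], reverse=True)
```

On a tiny graph where one term bridges the others, that bridging term accumulates the most visits and is ranked first, mirroring how VRRW surfaces a small set of dominant, well-connected expansion terms.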
Evaluation
They compare DQE results against LDA-based tsxQuAD, where the number of topics is set to 5. SLR's DER results are compared against those of BHN, using both user studies and automatic evaluations to assess the empirical performance of SLR and SER.
For automated diversity evaluation, they compare all three methods, SLR, SER-Wiki and SER-News, against the baseline methods tsxQuAD, RM-CombSum-Wiki and RM-CombSum-News. The last two are MMR-based extensions of the method of Kuzi et al. over the Wiki and Google News word embeddings respectively. The results follow:
Figure: Jaccard Similarity based Diversity Analysis
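A sketch of a Jaccard-based diversity measure in this spirit (the paper's exact evaluation protocol may differ): average pairwise Jaccard similarity between the compared term sets, where lower means more diverse.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def avg_pairwise_jaccard(result_sets):
    """Mean Jaccard similarity over all unordered pairs of result
    sets; a lower value indicates a more diverse set of results."""
    pairs = [(i, j) for i in range(len(result_sets))
             for j in range(i + 1, len(result_sets))]
    if not pairs:
        return 0.0
    return sum(jaccard(result_sets[i], result_sets[j])
               for i, j in pairs) / len(pairs)
```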
Conclusion
The paper considered the task of leveraging external semantic resources for Diversified Query Expansion. It developed a three-phase skeletal framework that first identifies important terms, then correlates them with external resources, and finally ranks terms to form the DQE output.
Considering my use case of weighting terms in a query to improve relevance, this is one method to do query expansion. In a blog system where users search for query-related tutorials, diversity is not so important, but the framework is still valuable to refer to.
The End!
Reference: https://link.springer.com/article/10.1007%2Fs11280-017-0468-7