Improving query spelling correction using web search results

Abstract
String transformation is an essential problem in language processing tasks such as pronunciation generation, spelling error correction, and word transliteration. We propose a probabilistic approach to the task. Our method is novel in the following aspects. It employs (1) a log-linear (discriminative) model for string transformation, (2) an effective and accurate algorithm for model learning, and (3) an efficient algorithm for string generation. String transformation can be conducted in two different settings, depending on whether or not a dictionary is used. Given an input string and a set of operators, we are able to transform the input string to the k most likely output strings by applying a number of operators, and our string generation algorithm produces these top k output strings efficiently. When a dictionary is used, string transformation becomes approximate string search, which is the problem of identifying strings in a given dictionary that are similar to an input string. Our work aims to learn a model for string transformation which can achieve both high accuracy and efficiency.
1.Introduction:
String transformation can be defined in the following way. Given an input string and a set of operators, we are able to transform the input string to the k most likely output strings by applying a number of operators. Here the strings can be strings of words, characters, or any type of tokens. Each operator is a transformation rule that defines the replacement of a substring with another substring. The likelihood of transformation can represent similarity, relevance, and association between two strings in a specific application. Although certain progress has been made, further investigation of the task is still necessary, particularly from the viewpoint of enhancing both accuracy and efficiency, which is precisely the goal of this work. In natural language processing, pronunciation generation, spelling error correction, word transliteration, and word stemming can all be formalized as string transformation. String transformation can also be used in query reformulation and query suggestion in search.
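The notion of an operator as a substring replacement rule can be illustrated with a minimal sketch. The rule set and the helper names below (`apply_rule_once`, `one_step_candidates`) are hypothetical and chosen only for illustration; the sketch applies each rule at a single position and ignores rule weights.

```python
from typing import List, Set, Tuple

Rule = Tuple[str, str]  # (substring to replace, replacement); illustrative only


def apply_rule_once(s: str, rule: Rule) -> Set[str]:
    """Return every string obtained by applying `rule` at exactly one position of `s`."""
    old, new = rule
    results = set()
    start = s.find(old)
    while start != -1:
        results.add(s[:start] + new + s[start + len(old):])
        start = s.find(old, start + 1)
    return results


def one_step_candidates(s: str, rules: List[Rule]) -> Set[str]:
    """Union of all single-rule transformations of `s`."""
    out: Set[str] = set()
    for rule in rules:
        out |= apply_rule_once(s, rule)
    return out


# Hypothetical rules for the record linkage example in the text.
rules = [("bu", "be"), ("ie", "ei")]
candidates = sorted(one_step_candidates("spielburg", rules))
print(candidates)  # ['speilburg', 'spielberg']
```

A full system would attach a weight to each rule and allow several rules to be applied to the same input, which is what makes efficient generation non-trivial.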
Many information systems need to support approximate string queries: given a collection of textual strings, such as person names, telephone numbers, and addresses, find the strings in the collection that are similar to a given query string. The following are a few applications. In record linkage, we often need to find from a table those records that are similar to a given query string that could represent the same real-world entity, even though they have slightly different representations, such as Spielberg versus spielburg. In Web search, many search engines provide the "Did you mean" feature, which can benefit from the capability of finding keywords similar to a keyword in a search query. Other information systems such as Oracle and Lucene also support approximate string queries on relational tables or documents.
Record matching is the well-known problem of matching records that represent the same real-world entity and is an important step in the data cleaning process. An example of record matching is identifying the customer record in a data warehouse from the customer information such as name and address in a sales record. Due to various reasons such as erroneous data entry and different formatting conventions, the name and address information could be represented differently in the two records, making the task of matching them challenging. For example, consider the task of compiling transformations for matching citations. A number of transformations are relevant to this matching task. These include conference and journal abbreviations (VLDB → Very Large Data Bases), subject related abbreviations (Soft → Software), date related variations (Nov → November, and '76 → 1976), number related abbreviations (8th → Eighth), and a large number of variations which do not fall into any particular class (pp → pages, eds → editors).
2.Existing System:
In existing work, string transformation can be categorized into two groups. Some work mainly considered efficient generation of strings, assuming that the model is given. Other work tried to learn the model with different approaches, such as a generative model, a logistic regression model, and a discriminative model. However, efficiency is not an important factor taken into consideration in these methods. Most existing methods manage to mine transformation rules from pairs of queries in the search logs.
Disadvantages:
Existing methods do not consider efficiency and accuracy together.
It is usually assumed that the model (similarity or distance) is fixed and the objective is to efficiently find all the strings in the dictionary.
3.Proposed System:
String transformation is used in different specific tasks such as database record matching, spelling error correction, query reformulation and synonym mining. Our work aims to learn a model for string transformation which can achieve both high accuracy and efficiency. There are three fundamental problems with string transformation: (1) how to define a model which can achieve both high accuracy and efficiency, (2) how to accurately and efficiently train the model from training instances, (3) how to efficiently generate the top k output strings given the input string, with or without using a dictionary. When a dictionary is used in the transformation, a trie is used to efficiently retrieve the strings in the dictionary.
Advantages:
We can achieve both high accuracy and efficiency.
It is well suited to large-scale settings and can efficiently search for strings in the dictionary.
Fig 1: Architecture Diagram
Fig 2: Data Flow Diagram
4.Algorithm:
String Generation Algorithm:
We introduce how to efficiently generate the top k output strings. We employ top k pruning, which can guarantee to find the optimal k output strings. We also exploit two special data structures to facilitate efficient generation. We index the rules with an Aho-Corasick tree. When a dictionary is utilized, we index the dictionary with a trie.
Fig 3: String generation
1)Rule Index
The rule index stores all the rules and their weights using an Aho-Corasick tree (AC tree), which makes looking up rules very efficient. In string generation, given an input string, we first retrieve all the applicable rules and their weights from the AC tree in time linear in the length of the input string plus the number of matched entries.
Fig 4: Rule Index
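The rule lookup described above can be sketched as a small Aho-Corasick automaton keyed on the left-hand sides of the rules. This is an illustrative sketch, not the authors' implementation: the class name `RuleIndex`, its methods, and the example rules and weights are all assumptions made for the example.

```python
from collections import deque
from typing import Dict, List, Tuple

Match = Tuple[int, str, str, float]  # (start position, lhs, rhs, weight)


class RuleIndex:
    """Aho-Corasick automaton over rule left-hand sides (hypothetical API)."""

    def __init__(self) -> None:
        self.goto: List[Dict[str, int]] = [{}]  # child edges per node
        self.fail: List[int] = [0]              # failure link per node
        self.out: List[List[Tuple[str, str, float]]] = [[]]  # rules ending here

    def add_rule(self, lhs: str, rhs: str, weight: float) -> None:
        node = 0
        for ch in lhs:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append((lhs, rhs, weight))

    def build(self) -> None:
        """Compute failure links breadth-first; call once after adding all rules."""
        queue = deque(self.goto[0].values())  # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit rules whose lhs is a proper suffix of this node's path.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def applicable_rules(self, text: str) -> List[Match]:
        """Scan `text` once and report every rule occurrence."""
        node, matches = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for lhs, rhs, w in self.out[node]:
                matches.append((i - len(lhs) + 1, lhs, rhs, w))
        return matches


index = RuleIndex()
index.add_rule("o", "u", -1.0)     # weights are made-up values
index.add_rule("o", "ro", -2.0)
index.add_rule("ic", "ick", -1.5)
index.build()
matches = index.applicable_rules("mico")
print(matches)
```

The scan visits each input character once and follows failure links instead of restarting, which is where the "input length plus number of matches" behaviour comes from.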
2)Top k Pruning
The string generation problem amounts to that of finding the top k output strings given the input string. Fig 5 illustrates the lattice structure representing the input and output strings. We assume that the input string s_i consists of tokens ^ t_1 t_2 ... t_n $. In the figure, for example, c^{1}_1 is the token transformed from t_1, and c^{456}_1 is the token transformed from t_4 t_5 t_6. In this case, the top k candidates correspond to the top k paths in the lattice. Note that the candidate string can be shorter or longer than the input string.
Fig 5: Top K Pruning
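The lattice search can be sketched as a left-to-right pass that keeps only the k best partial outputs at each input position. The sketch below is a simplification under stated assumptions, not the paper's algorithm: a candidate's score is assumed to be the sum of the weights of the rules applied (copying a token costs nothing), the boundary markers ^ and $ are omitted, and the function names, rules, and weights are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Rule = Tuple[str, str, float]  # (lhs, rhs, weight); all weights here are made up


def _relax(table: Dict[str, float], out: str, score: float) -> None:
    """Keep the best score seen so far for each partial output string."""
    if score > table.get(out, float("-inf")):
        table[out] = score


def top_k_outputs(s: str, rules: List[Rule], k: int) -> List[Tuple[float, str]]:
    """Return up to k (score, output) pairs, best first.

    best[i] maps each partial output that has consumed s[:i] to its score.
    """
    n = len(s)
    best: List[Dict[str, float]] = [dict() for _ in range(n + 1)]
    best[0][""] = 0.0
    for i in range(n + 1):
        # Top-k pruning: drop every partial output outside the k best at position i.
        kept = heapq.nlargest(k, best[i].items(), key=lambda kv: kv[1])
        best[i] = dict(kept)
        if i == n:
            break
        for prefix, score in kept:
            # Lattice edge 1: copy the input character unchanged.
            _relax(best[i + 1], prefix + s[i], score)
            # Lattice edge 2: apply any rule whose lhs starts at position i.
            for lhs, rhs, w in rules:
                if lhs and s.startswith(lhs, i):
                    _relax(best[i + len(lhs)], prefix + rhs, score + w)
    return sorted(((sc, out) for out, sc in best[n].items()),
                  key=lambda t: (-t[0], t[1]))


rules = [("o", "ro", -1.0), ("o", "u", -2.0)]
top3 = top_k_outputs("micosoft", rules, k=3)
print(top3)
# [(0.0, 'micosoft'), (-1.0, 'micosroft'), (-1.0, 'microsoft')]
```

Because every edge into position i starts at an earlier position, the table for position i is complete by the time it is pruned. With purely additive scores, a partial output outside the k best at its position cannot be extended into one of the k best full outputs, so the pruning does not lose any of them in this simplified setting.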
3) Efficient Dictionary Matching Algorithm:
Sometimes a dictionary is utilized in string transformation, and the output strings must exist in the dictionary. This is the case in spelling error correction, database record matching, and synonym mining. In the setting of using a dictionary, we can further enhance the efficiency. Specifically, we index the dictionary in a trie, such that each string in the dictionary corresponds to the path from the root node to a leaf node. When we expand a path (substring) in candidate generation, we match it against the trie and see whether the expansions from it are legitimate paths. If not, we discard the expansions and avoid generating unlikely candidates. In other words, candidate generation is guided by the traversal of the trie.
Fig 6: Efficient Dictionary Matching
An example: suppose that the current path represents the string ^mic. There are three possible ways to expand it: continuing to match o, or applying one of the transformation rules o → u and o → ro. However, node c in the dictionary trie does not have node u as a child node, which means that no string in the dictionary has ^micu as a prefix. In such a case, the path will not be considered in candidate generation.
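The trie-guided expansion in this example can be sketched as follows. The dictionary contents, rule weights, and function names are hypothetical, the boundary markers are omitted, and the search is an unpruned depth-first walk intended only to show how the trie rejects expansions such as micu.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str, float]  # (lhs, rhs, weight); weights are illustrative
END = object()                 # sentinel key marking the end of a dictionary word


def build_trie(words: List[str]) -> dict:
    root: dict = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def walk(node: dict, text: str):
    """Follow `text` from `node`; return None if the path leaves the trie."""
    for ch in text:
        node = node.get(ch)
        if node is None:
            return None
    return node


def dictionary_candidates(s: str, rules: List[Rule], trie: dict) -> Dict[str, float]:
    """Generate only outputs that are dictionary words, with their best score."""
    results: Dict[str, float] = {}
    stack = [(0, "", 0.0, trie)]  # (input position, output so far, score, trie node)
    while stack:
        i, out, score, node = stack.pop()
        if i == len(s):
            if END in node:  # the output is a complete dictionary entry
                results[out] = max(score, results.get(out, float("-inf")))
            continue
        # Either copy s[i] unchanged or apply a rule whose lhs starts at i.
        options = [(s[i], s[i], 0.0)]
        options += [r for r in rules if r[0] and s.startswith(r[0], i)]
        for lhs, rhs, w in options:
            nxt = walk(node, rhs)
            if nxt is not None:  # discard expansions that are not trie paths
                stack.append((i + len(lhs), out + rhs, score + w, nxt))
    return results


trie = build_trie(["microsoft", "microwave", "mice"])
rules = [("o", "u", -2.0), ("o", "ro", -1.0)]
found = dictionary_candidates("micosoft", rules, trie)
print(found)  # {'microsoft': -1.0}
```

In this toy dictionary, node c has no child o or u, so after the path mic only the expansion through o → ro survives, and the only candidate produced is microsoft.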
5.Performance Evaluation:
Fig 7: String transformation output
Fig 8: String transformation output with username and password
Fig 9: String transformation output
Fig 10: String transformation output
Fig 11: String transformation output
Fig 12: String transformation output with search tab
Fig 13: String transformation output
Fig 14: String transformation output with searched word's meaning
Fig 15: String transformation spellcheck user info
Fig 16: String transformation spellcheck keystring info
6.Conclusion:
We have proposed a new statistical learning approach to string transformation. Our method is novel in its model, learning algorithm, and string generation algorithm. Two specific applications are addressed with our method, namely spelling error correction of queries and query reformulation in web search. Experimental results on two large data sets and the Microsoft Speller Challenge show that our method improves upon the baselines in terms of accuracy and efficiency. Our method is particularly useful when the problem occurs on a large scale.
