Latent Semantic Indexing for Patent Information
James Ryley, Ph.D.
http://www.FreePatentsOnline.com / http://www.SumoBrain.com
Summary
Latent Semantic Indexing (LSI) promises more accurate information retrieval by incorporating statistical information on term meaning and frequency when retrieving documents in response to a search. LSI’s precision and accuracy have been demonstrated many times on test corpora, but the world’s patent literature poses a significant challenge to effectively implementing an LSI search engine due to the size and heterogeneity of the patent corpus. Some of the factors that must be addressed to realize the goal of a more accurate patent search engine are discussed herein.
Introduction to LSI
Latent Semantic Indexing (LSI) is an advanced document searching technique created by combining the Vector Space Model (VSM) of document retrieval and Singular Value Decomposition (SVD)[1]. In the Vector Space Model, each document is plotted as a point in a multi-dimensional space. The location of that point is determined by the terms contained in the document and the frequency with which those terms appear. Each individual term represents a dimension in this space. Much as x, y, and z may be taken to represent the dimensions of traditional three-dimensional space, “car,” “boat,” “big,” “red,” etc. could be 4 dimensions in some higher-order space used to describe documents containing these terms.
Effectively then, the documents are represented by vectors in a potentially very high dimensional space. A query submitted to the system is then assigned a position in the same space using the same technique as that used to assign documents. The distance from the point representing the query to the point representing each document is calculated; the smaller the distance, the higher the relevancy. This distance may be an actual Euclidean distance or an angle between the query vector and the document vectors.
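The Vector Space Model described above can be sketched in a few lines of code. This is a minimal illustration using cosine similarity (the angle-based measure); the tiny corpus and vocabulary are invented for the example, not real patent data.

```python
# Minimal Vector Space Model sketch: documents and queries become
# term-frequency vectors, and relevancy is the cosine of the angle
# between the query vector and each document vector.
import math
from collections import Counter

def to_vector(text, vocabulary):
    """Map a document to term-frequency coordinates, one axis per term."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

docs = ["big red car", "big red boat", "small blue boat"]
vocabulary = sorted({w for d in docs for w in d.split()})
doc_vectors = [to_vector(d, vocabulary) for d in docs]

query = to_vector("red car", vocabulary)
ranked = sorted(range(len(docs)),
                key=lambda i: cosine_similarity(query, doc_vectors[i]),
                reverse=True)
print(docs[ranked[0]])  # prints "big red car"
```

Note that this literal term matching is exactly the weakness discussed next: a document about automobiles that never uses the word "car" scores zero.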
The weakness of VSM is its reliance upon literal vocabulary for word matching. This is also a weakness of Boolean search engines. For example, if a user searches for the term “car,” documents that do not contain the word car will not be retrieved, even if they do contain words such as “auto,” “automobile,” or “vehicle.” This is a serious drawback in light of the fact that there are typically many ways to describe the same concept (often referred to as the problem of synonymy). In patent literature in particular, the legalese, perhaps purposefully obfuscated, magnifies the problem.
Singular Value Decomposition is used to enhance VSM by using statistical techniques to automatically ascertain word relationships that may have a bearing on document relevancy. For example, due to co-occurrence, SVD would likely “learn” that “car,” “auto,” “automobile,” and “vehicle” are somehow related.
This relationship is not necessarily one of synonymy (though it may be), but rather it is a correlation that allows the system to tell the user, in effect, “If you are interested in documents containing ‘car’ you may also want to consider documents containing the related words ‘auto,’ ‘automobile’ or ‘vehicle’.” The end result is a search engine that can find documents on relevant topics even if the document contains none of the search terms. We refer to the ability to find related words or concepts that were not specified in the query as “conceptual search.”
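The co-occurrence effect can be demonstrated numerically. In the toy matrix below (invented for the example), "car" and "auto" never appear in the same document, but both co-occur with "engine"; after rank reduction, a query for "car" nonetheless matches the document containing only "auto" and "engine."

```python
# Toy LSI sketch: SVD rank reduction lets a query match documents
# that share no literal terms with it, via co-occurrence.
import numpy as np

vocabulary = ["car", "auto", "engine", "boat"]
# Rows are terms, columns are documents (term counts).
A = np.array([
    [1, 0, 1, 0],   # car
    [0, 1, 1, 0],   # auto
    [1, 1, 1, 0],   # engine
    [0, 0, 0, 2],   # boat ("boat" appears twice in the last document)
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # keep only the k strongest "concepts"
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

def fold_in(term_vector):
    """Project a query into the k-dimensional concept space."""
    return np.diag(1 / sk) @ Uk.T @ term_vector

query = np.array([1, 0, 0, 0], dtype=float)   # just the word "car"
q_k = fold_in(query)
doc_k = Vtk.T                                  # document coordinates in concept space

sims = doc_k @ q_k / (np.linalg.norm(doc_k, axis=1) * np.linalg.norm(q_k))
# Document 1 contains "auto" and "engine" but not "car", yet its
# similarity to the query is now high; the "boat" document stays near zero.
print(sims)
```

A plain VSM comparison of the same query against document 1 would give a similarity of exactly zero, since they share no terms.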
LSI Complexities
The increase in accuracy of LSI over simple term matching is potentially substantial. On standard test corpora, LSI consistently outperforms VSM and other models by up to 30% [1], and has repeatedly set precision and recall benchmarks at TREC [2-5]. However, the design of an LSI-based search engine is far more complex than that of a Vector Space Model or Boolean search engine.
Many design choices, and the nature of the document collection, can affect the performance of an LSI-based search engine. For example, some groups have found that LSI does not perform well on large document collections [6-8] (and the world patent literature is far larger than the traditional test databases used in academia). Others have found that poorly chosen parameters, e.g., “k,” the rank chosen for SVD calculations, can substantially reduce LSI recall and precision [9-11]. Decisions on how to handle pre-processing of document terms, clustering, stemming, stop words, and other issues also have the potential to affect the precision and recall of an LSI-based search engine.
With respect to one example in the literature of applying LSI-based search to patent documents, Moldovan et al. found only minimal improvement over VSM when using LSI on subsets of the US patent corpus, and in some cases found that LSI actually performed slightly worse than VSM [11]. However, while Moldovan et al. tested multiple values of k, k seems to be the only parameter that was thoroughly investigated. The pre-processing parameters were seemingly chosen arbitrarily: the Porter stemming algorithm was used exclusively (with no mention of performance without stemming, or of other algorithms such as Snowball), the SMART stop word list was used (which is almost certainly not optimal for patent documents), and no clustering was performed.
Given the number of factors involved in optimizing an LSI-based search engine, it is unsurprising that some groups, without comprehensively addressing all of these factors, cannot consistently replicate the performance advantages of LSI over VSM or Boolean engines. However, we believe that LSI is a powerful tool for patent document retrieval, and below we discuss several factors that must be addressed when creating an LSI-based search engine for patents.
Clustering
Large document collections can pose a problem for LSI [6-8] if not treated appropriately. Lack of homogeneity between document topics exacerbates the problem. This has been attributed to the elevated noise level naturally occurring in very large, heterogeneous collections. The world’s patent documents are certainly both large and heterogeneous.
The main problem with large, heterogeneous document collections is that Singular Value Decomposition becomes less able to discriminate noise from information as the corpus size increases. One reason for this is the problem of polysemy (one word having multiple meanings). For example, consider the word “cell,” which can be used in numerous contexts, including “cell phone,” “stem cell,” “jail cell,” “fuel cell,” and “electrolytic cell.” Polysemy blurs the definition of a word, correlating it with multiple disparate topics and thereby weakening its connection to any one topic.
Large document collections also pose computational problems, as SVD calculations on them can require a prohibitive amount of memory.
However, clustering can address this problem [12]. Clustering breaks a document collection into sub-collections with improved homogeneity, which are then treated separately for the purposes of SVD. Numerous clustering algorithms are available [13], and they can roughly be divided into agglomerative and divisive techniques.
Agglomerative techniques start with each document as its own cluster, and at each step join the two nearest clusters; if carried to conclusion, the process repeats until all documents are part of a single cluster. Agglomerative techniques are also called “hierarchical,” because they form a tree structure that shows the relatedness of the various documents and clusters by the iteration in which they were joined (the earlier in the process two documents fall into the same cluster, the more related they are). In practice, unless one wished to examine the hierarchical structure in its entirety, the process need not be carried to the point where only a single cluster remains; in fact, that would defeat the purpose. Instead, the process is terminated when the average cluster size, or another metric, meets user-defined parameters.
Divisive techniques work in the opposite manner. The starting point is the entire corpus. The collection is then split repeatedly, assigning documents to the clusters most similar to them (often as measured by traditional Vector Space Model relevancy ranking). The process is terminated when clusters are obtained with the appropriate characteristics for processing with SVD.
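The agglomerative procedure above can be sketched briefly. This is a naive illustration using average pairwise cosine similarity between clusters and a toy term-frequency matrix; a production system would use an optimized toolkit such as CLUTO rather than this O(n³) loop.

```python
# Naive agglomerative clustering sketch: start with one cluster per
# document, repeatedly merge the two most similar clusters, and stop
# at a user-defined cluster count.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def agglomerate(vectors, target_clusters):
    clusters = [[i] for i in range(len(vectors))]   # one document per cluster
    while len(clusters) > target_clusters:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average pairwise similarity between the two clusters
                sim = np.mean([cosine(vectors[a], vectors[b])
                               for a in clusters[i] for b in clusters[j]])
                if sim > best:
                    best, pair = sim, (i, j)
        i, j = pair
        clusters[i] += clusters.pop(j)              # merge the closest pair
    return clusters

docs = np.array([
    [2, 1, 0, 0],   # automotive topic
    [1, 2, 0, 0],
    [0, 0, 2, 1],   # marine topic
    [0, 0, 1, 2],
], dtype=float)

print(agglomerate(docs, 2))  # prints [[0, 1], [2, 3]]
```

Stopping at a target cluster count stands in for the user-defined termination criteria (average cluster size, internal similarity) discussed here.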
The most important characteristics of a cluster with respect to LSI and SVD are the number of documents and the internal cluster similarity (a measure reported by commonly available clustering software, such as CLUTO [14]). Reducing the number of documents in a cluster reduces the rank of the SVD matrices, which reduces the memory and time required for computation. High internal similarity in a cluster is indicative of topic homogeneity, which reduces the problem of polysemy. However, one must be careful not to over-cluster: since each cluster is treated separately during the SVD calculations, overly small clusters reduce SVD’s ability to create the proper correlations between conceptually related terms.
Singular Value Decomposition Parameters
Once the database is clustered, Singular Value Decomposition must be performed. Singular Value Decomposition is a linear algebra technique applied to a document/term matrix (that is, a table containing an entry for each document/term pair, indicating how many times a given term occurs in a given document) to generate a second matrix of reduced rank.
“Rank,” when used with respect to SVD, refers to the shortest dimension of the matrix. For example, a matrix containing information on 10,000 documents with a total of 50,000 terms would be said to have rank 10,000. “k” is used to refer to the rank of the output, and the choice of what k will be output from the SVD operation is arbitrary; it can be anything less than the input rank. In practice, values between approximately 100 and 300 have often been found to give the best performance.
The rank of the matrix can be interpreted as the number of concepts present in the documents. If too low a k is chosen, multiple concepts may be inappropriately merged into a single concept. If k is too high, noise is being modeled, and the appropriate relationships between vocabulary and concepts may not be determined. In fact, if the output rank were chosen equal to the input rank, no concept merging would occur and the results would be equivalent to those of VSM.
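The effect of the choice of k can be seen directly on a small example. In the toy 3×3 count matrix below (invented for illustration), the first two terms are strongly correlated; keeping the full rank reproduces the matrix exactly (the VSM-equivalent case), while k = 1 merges the two correlated terms into a single concept.

```python
# Rank reduction at work: truncating the SVD to k singular values
# merges correlated terms; keeping full rank changes nothing.
import numpy as np

# Rows are terms, columns are documents. Terms 0 and 1 co-occur heavily.
A = np.array([
    [3, 1, 0],
    [1, 3, 0],
    [0, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A)

def rank_k(k):
    """Reconstruct the matrix keeping only the k largest singular values."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Full rank reproduces A exactly -- no concepts are merged away.
assert np.allclose(rank_k(3), A)
# At k = 1, the two correlated terms collapse into one concept:
# both now have the identical profile [2, 2, 0].
print(np.round(rank_k(1), 2))  # prints [[2. 2. 0.] [2. 2. 0.] [0. 0. 0.]]
```

The third term, which shares no documents with the others, vanishes entirely at k = 1, illustrating how too low a k discards legitimate concepts along with noise.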
The choice of k has been shown to have a substantial effect on accuracy [9-11], and no satisfactory method exists for determining the optimum k from theory; empirical testing must be done. This is challenging in a corpus as large and diverse as the patent literature, but cannot be ignored if the best accuracy is to be achieved.
Conclusion
Creating an accurate LSI-based patent search engine involves many choices, some of which are still active areas of research with no known optimal solution. One of the main issues is that large collections must be clustered carefully, both to reduce the problem of polysemy and to increase computational tractability, while not over-clustering. Additionally, an appropriate k must be chosen during SVD so as not to force false associations between terms with too low a k, while at the same time not modeling noise with too high a k, thereby losing the advantage that LSI has over the Vector Space Model.
This is not an exhaustive list of the issues that must be addressed when creating a robust LSI-based search engine. Other topics to consider include treatment of stop words, rare words, word stemming, term frequency normalization, and field weighting.
Hopefully, an awareness of some of the details involved in constructing an LSI-based search will help patent information professionals interpret the apparent discrepancies in the literature with regard to the usefulness of LSI for patent searching (which can be difficult given that not all papers report k, internal similarity of clusters, number of documents and terms, and other important parameters). We are confident that LSI is a valuable tool for document retrieval, both in the patent industry and other fields. But, for best performance, great care must be taken to tailor the specific LSI implementation to the document collection at hand.
References
1. Deerwester, S., Dumais, S., Landauer, T., Furnas, G., Harshman, R., Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 1990. 41(6): p. 391-407.
2. Text REtrieval Conference (TREC).
3. Dumais, S., LSI meets TREC: A Status Report. The First Text REtrieval Conference (TREC1), National Institute of Standards and Technology Special Publication 1993: p. 137-152.
4. Dumais, S., Latent Semantic Indexing (LSI) and TREC-2. The Second Text REtrieval Conference (TREC2), National Institute of Standards and Technology Special Publication, 1994: p. 105-116.
5. Dumais, S., Latent Semantic Indexing (LSI): TREC-3 Report. The Third Text REtrieval Conference (TREC-3), D. Harman, Ed., NIST Special Publication, 1995: p. 219-230.
6. Chen, C., Stoffel, N., Post, M., Basu, C., Bassu, D., Behrens, C., Telcordia LSI Engine: Implementation and Scalability Issues. Proceedings of the Eleventh International Workshop on Research Issues in Data Engineering, 2001: p. 51-58.
7. Bassu, D., Behrens, C., Distributed LSI: Scalable Concept-based Information Retrieval with High Semantic Resolution. Proceedings of the 3rd SIAM International Conference on Data Mining (Text Mining Workshop), 2003.
8. Husbands, P., Simon, H., Ding, C., Term norm distribution and its effects on latent semantic indexing. Information Processing and Management, 2005. 41(4): p. 777-787.
9. Ding, C., A Similarity-based Probability Model for Latent Semantic Indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999: p. 58-65.
10. Kontostathis, A., Pottenger, W., A framework for understanding LSI performance. Proceedings of ACM SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval, 2003.
11. Moldovan, A., Bot, R., Wanka, G., Latent Semantic Indexing for Patent Documents. Technische Universität Chemnitz, Fakultät für Mathematik (Germany). Preprint, 2004.
12. Gao, J., Zhang, J., Clustered SVD strategies in latent semantic indexing. Information Processing and Management, 2004. 41: p. 1051-1063.
13. Jain, A., Murty, M., Flynn, P., Data Clustering: A Review. ACM Computing Surveys, 1999. 31(3): p. 264-323.
14. Karypis, G., CLUTO - A Clustering Toolkit. University of Minnesota - Computer Science and Engineering Technical Report Abstract, 2002.