I outline here just a few of the most basic information retrieval techniques, in order to provide a context for the techniques we will study in more detail. Specifics of some of these techniques can easily become very complicated and are, in general, hard to come by, since many vendors refuse to share in this competitive environment.

The Boolean model of information retrieval, one of the earliest and simplest retrieval methods, uses exact matching to match documents to a user "query" or information request by finding documents that are "relevant" in terms of matching the words in the query. More refined descendants of this model are still used by most libraries. The adjective "Boolean" refers to the use of Boolean algebra, whereby words are logically combined with the Boolean operators AND, OR, and NOT. For example, the Boolean AND of two logical statements x and y means that both x AND y must be satisfied, while the Boolean OR of these same two statements means that at least one of these statements must be satisfied. Any number of logical statements can be combined using the three Boolean operators.

The Boolean information retrieval model considers which keywords are present or absent in a document or title. Thus, a document is judged as relevant or irrelevant - there is no concept of a "partial match" between documents and queries. The inability to identify partial matches can lead to poor performance (Baeza-Yates & Ribeiro-Neto, 1999). Other more advanced set theoretic techniques, such as the so-called "fuzzy sets", try to remedy this black-and-white Boolean logic by introducing shades of gray. For example, a title search for car AND maintenance on a Boolean engine causes the virtual machine to return all documents that have both words in the title. As a result, an apparently relevant document entitled "Automobile Maintenance" will not be returned. Fuzzy Boolean engines use fuzzy logic to categorize this document as somewhat relevant and return it to the user. The car maintenance query example illustrates the main drawbacks of Boolean search engines - they fall prey to two of the most common information retrieval problems, synonymy and polysemy.

Boolean models scale well to very large document collections. Accommodating a growing collection is easy - the programming remains simple, and only the storage and parallel processing capabilities need to grow.

Baeza-Yates & Ribeiro-Neto (1999), Frakes & Baeza-Yates (1992), and Korfhage (1997) all contain chapters with excellent introductions to the Boolean model and its extensions.

Vector Space Model Search Engines

Another information retrieval technique uses the vector space model (Salton, 1971), developed by Gerard Salton in the early 1960s, to sidestep some of the information retrieval problems of the Boolean model. Vector space models transform textual data into numeric vectors and matrices, then employ matrix analysis techniques to discern key features and connections in the document collection. Some advanced vector space models address the common text analysis problems of synonymy and polysemy. Models such as Latent Semantic Indexing (LSI) (Dumais, 1991) can access the hidden semantic structure in a document collection.
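
As a rough illustration of the two models discussed in this post, the Python sketch below runs the car AND maintenance title search as an exact Boolean match, then scores the same titles as term-count vectors. The toy collection `TITLES`, the helper names, and the choice of cosine similarity as the vector comparison are all assumptions made for this example; the text above does not prescribe any particular scoring function, and real engines are far more elaborate.

```python
import math
from collections import Counter

# Hypothetical toy collection of document titles (not from the original text).
TITLES = {
    1: "Car Maintenance Basics",
    2: "Automobile Maintenance",
    3: "Car Rental Guide",
    4: "Gardening Tips",
}

def tokenize(text):
    """Lowercase a title and split it on whitespace."""
    return text.lower().split()

def boolean_and(titles, *terms):
    """Boolean AND: ids of titles that contain every query term exactly."""
    wanted = {t.lower() for t in terms}
    return sorted(
        doc_id for doc_id, title in titles.items()
        if wanted <= set(tokenize(title))
    )

def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def vector_rank(titles, query):
    """Rank titles by similarity to the query; titles scoring zero are dropped."""
    q = Counter(tokenize(query))
    scored = [
        (cosine(q, Counter(tokenize(title))), doc_id)
        for doc_id, title in titles.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]

if __name__ == "__main__":
    print(boolean_and(TITLES, "car", "maintenance"))
    print(vector_rank(TITLES, "car maintenance"))
```

In this sketch the Boolean search returns only title 1, so "Automobile Maintenance" is missed, which is the synonymy problem in miniature. The vector ranking still lists title 2 because it shares the word "maintenance" with the query, but it also lists "Car Rental Guide". Plain term counting gives partial matches without recognizing that "car" and "automobile" mean the same thing, which is the gap that models such as LSI aim to close.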