
Information Retrieval in Distributed Environments

The use of the Internet has grown dramatically over the past few years. Information sources are available everywhere, both within the internal networks of organizations and on the Internet. This growth represents an incredible wealth of information. The increasing amount of information requires information retrieval (IR) systems in order for users to access information effectively.

The ultimate goal of information retrieval in distributed environments is to develop systems that let users access all information sources available on the network while giving them the impression of a single large IR database.

Centralized Information Retrieval



The problem of locating relevant information in distributed information sources is partially solved by large-scale centralized retrieval systems such as AltaVista and Google. In a centralized system, documents from around the network are copied to a central database, where they are indexed and made searchable. Centralized systems suffer from a number of limitations, including limited coverage, outdated data, and documents that cannot be indexed because access to them is restricted.

Distributed Information Retrieval

Distributed IR is based on a multi-database model in which the existence of multiple sources is modeled explicitly. In distributed IR, a broker receives a user's query and sends it to the appropriate sources. The query is processed at each source. Distributed IR enables retrieval in environments where source contents are proprietary or carefully controlled, or where access is limited.

The best-known applications of distributed information retrieval are the meta-search engines on the WWW, such as SavvySearch and MetaCrawler. Because these systems send queries to centralized retrieval systems and receive their results, they do not deliver the full benefits of distributed IR.

Distributed IR consists of three major steps:

Source Selection: Given a set of information sources, the system determines which sources are most likely to contain relevant documents for a given query.
Query Processing: The user's query is translated into the query language of each selected source and processed there, producing one list of results per source.
Result Merging: These lists are merged into a single list of documents to be presented to the user.
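
The three steps above can be sketched as a small broker. This is only an illustrative sketch under stated assumptions, not the method described on this page: the toy sources, the function names, and the term-counting scores are all hypothetical, query translation is skipped because every toy source accepts plain keywords, and the raw scores are merged as if they were comparable across sources, which a real system could not assume.

```python
from collections import Counter

# Hypothetical toy sources: each maps a document id to its text.
SOURCES = {
    "medical": {"m1": "heart disease treatment", "m2": "virus infection treatment"},
    "computing": {"c1": "computer virus removal", "c2": "network security"},
}

def select_sources(query, sources, k=1):
    """Step 1: rank sources by how often the query terms occur in them."""
    terms = query.lower().split()
    scores = {}
    for name, docs in sources.items():
        counts = Counter(w for text in docs.values() for w in text.lower().split())
        scores[name] = sum(counts[t] for t in terms)
    # Highest score first; ties broken by source name for determinism.
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    return ranked[:k]

def process_query(query, docs):
    """Step 2: run the query at one source; returns (doc_id, score) pairs."""
    terms = set(query.lower().split())
    results = []
    for doc_id, text in docs.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap:
            results.append((doc_id, overlap))
    return results

def merge_results(result_lists):
    """Step 3: merge the per-source lists into one list ordered by score."""
    merged = [
        (score, source, doc_id)
        for source, results in result_lists.items()
        for doc_id, score in results
    ]
    merged.sort(key=lambda r: (-r[0], r[1], r[2]))
    return [(source, doc_id) for _, source, doc_id in merged]

def search(query, sources, k=1):
    """Broker: select sources, query each selected source, merge the lists."""
    selected = select_sources(query, sources, k)
    result_lists = {name: process_query(query, sources[name]) for name in selected}
    return merge_results(result_lists)

print(search("computer virus", SOURCES, k=1))
print(search("computer virus", SOURCES, k=2))
```

With k=1 only the computing source is queried, so the medical document that mentions "virus" never reaches the user; with k=2 it appears lower in the merged list.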

Of these three steps, Source Selection is the most important for search effectiveness, because the later steps can only return documents from the sources that were selected. Some papers have reported that good source selection can result in higher retrieval effectiveness than that achieved in a centralized system.

In Source Selection, the first task is to represent what each source contains. This representation is called the Source Description. A simple source description consists of the words that occur in the source and their frequencies of occurrence. Such representations are not effective if a user's query or a document contains polysemous words, because the same word is used to describe different things, both in queries and in documents. Semantic knowledge about each source is therefore needed in its source description. Next, a method is required for selecting sources based on a user's query and the source descriptions.
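
The simple source description and its weakness with polysemous words can be illustrated as follows. This is a minimal sketch: the two tiny sources, the function names, and the relative-frequency score are assumptions made for illustration only.

```python
from collections import Counter

def describe_source(documents):
    """Build a simple source description: term -> frequency of occurrence."""
    description = Counter()
    for text in documents:
        description.update(text.lower().split())
    return description

def score_source(query, description):
    """Score a source by the relative frequency of the query terms (assumed)."""
    total = sum(description.values())
    if total == 0:
        return 0.0
    return sum(description[t] / total for t in query.lower().split())

# Hypothetical sources that use "virus" in two different senses.
medical = describe_source(["virus infection", "virus vaccine"])
computing = describe_source(["virus software", "virus scanner"])

print(score_source("virus", medical))
print(score_source("virus", computing))
```

Both sources receive the same score for the query "virus", so word frequencies alone cannot tell which sense of the word each source covers.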

In our method, we use a thesaurus as the source description. The thesaurus is constructed automatically from the documents in a source. Our source description requires no manual effort, and all the data used for selection are either constructed automatically or already available. Thus, our method is more scalable than other existing methods.

The form of a co-occurrence-based thesaurus depends on the documents used for its construction. The terms used in documents differ from source to source, and the meaning of a term differs depending on the context in which it is used. This difference is a characteristic of the source and is used for selection. The terms used in each source can be distinguished by the words that occur in the source and their frequencies of occurrence, but methods that use only such statistical data face the problems caused by polysemous words. In our method, the meanings of a term are distinguished by the relationships between that term and other terms. These relationships appear in the co-occurrence-based thesaurus. By using a co-occurrence-based thesaurus, our method overcomes the problems caused by polysemous words.
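
The idea of a co-occurrence-based thesaurus as a source description can be sketched as follows. The text above does not give the actual construction or selection formula, so everything here is an assumption: co-occurrence is counted per document, the score simply sums the co-occurrence counts of query term pairs, and the sources and names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_thesaurus(documents):
    """Map each term to counts of the terms that share a document with it."""
    thesaurus = defaultdict(Counter)
    for text in documents:
        terms = sorted(set(text.lower().split()))
        for a, b in combinations(terms, 2):
            thesaurus[a][b] += 1
            thesaurus[b][a] += 1
    return thesaurus

def cooccurrence_score(query, thesaurus):
    """Sum co-occurrence counts over all pairs of query terms (assumed scoring)."""
    terms = sorted(set(query.lower().split()))
    return sum(
        thesaurus.get(a, Counter())[b]  # 0 when the pair never co-occurs
        for a, b in combinations(terms, 2)
    )

# Hypothetical sources: both contain "virus" and "scan" exactly once,
# so their word frequencies are identical for these terms.
medical_thesaurus = build_thesaurus(["virus infection", "brain scan"])
computing_thesaurus = build_thesaurus(["virus scan", "disk backup"])

print(cooccurrence_score("virus scan", medical_thesaurus))
print(cooccurrence_score("virus scan", computing_thesaurus))
```

Only the computing source relates "virus" to "scan", so the relationships between terms separate the two sources where word frequencies could not.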