PaperFinder: A tool for scalable search of digital libraries

Athanasios E. Papathanasiou, Evangelos P. Markatos, Stavros A. Papadakis
Institute of Computer Science (ICS)
Foundation for Research & Technology - Hellas (FORTH), Crete
P.O.Box 1385
Heraklion, Crete, GR-711-10 GREECE
papathan@ics.forth.gr
http://www.ics.forth.gr/proj/arch-vlsi/OS/usewebnet.html

1. Introduction
2. Design - User Interface
3. Resource-Discovery Mode of Operation
4. Conclusion
5. Acknowledgments
References

1. Introduction

The invention and spread of the World-Wide Web made publishing papers significantly easier than before and added a large repository of on-line (electronic) papers to our body of knowledge. To make information gathering easier for scientists, Digital Libraries usually maintain Search Engines, which can be used to find papers about a specified topic. However, these Search Engines suffer from two significant disadvantages. First, the user cannot direct a query to the Search Engines of several Digital Libraries at once: the same query has to be sent to each Search Engine separately in order to retrieve papers from conferences and journals organized by different institutions. Second, Search Engines do not keep track of the documents retrieved by a previous issue of the same query. Thus, every time a user reissues a query, (almost) the same results are returned.

In this project, we use the basic concepts behind the implementation of USEwebNET [1] to address the problems described above. In the rest of this poster paper, we describe PaperFinder, a tool that continually searches digital libraries of scientific publications, keeps only the relevant papers, and delivers them to interested scientists through a friendly user-interface.

2. Design - User Interface

The purpose of PaperFinder is to provide the user with a flexible tool that simplifies the time-consuming task of information filtering. To achieve its purpose, PaperFinder has been designed as an added-value service on top of some well-known Digital Libraries, like those maintained by ACM and USENIX. Its most important characteristics are the maintenance of a personal profile for each user of the system and an innovative sorting algorithm for the results of a query.

The basic idea behind PaperFinder is to maintain information about a user's interests and to query several digital libraries for new articles at regular intervals. A query may be specified by a set of keywords that characterize the user's topic of interest, the names of authors whose articles the user is interested in finding, a date (the oldest publication date the articles of interest could have), and the Digital Libraries the query should be directed to. PaperFinder keeps track of the articles found for every topic of interest, so that a paper that has already been viewed and approved or rejected by the user is never presented a second time. While examining the papers retrieved by a query, the user can read their abstract or full text, if these are available, and save those that are especially interesting in separate folders. In addition, the system maintains information about the current status of each paper retrieved. Specifically, an article may be marked as Rejected, Read (if it has been viewed at least once), and/or Saved (if it has been saved to at least one folder). After a paper has been marked as Rejected, it is never shown to the user again.
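The query fields and paper states described above could be modeled as follows. This is only an illustrative Python sketch, not PaperFinder's actual implementation: the names `Query`, `Status`, `mark`, and `pending`, and the use of plain string identifiers for papers, are assumptions made for the example. It also assumes that any paper the user has already read, saved, or rejected is hidden from later result lists.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Flag
from typing import Dict, List, Optional


class Status(Flag):
    """Paper status; Read and Saved can be combined, as in the text."""
    NEW = 0
    REJECTED = 1
    READ = 2
    SAVED = 4


@dataclass
class Query:
    """One registered topic of interest (field names are hypothetical)."""
    keywords: List[str]
    authors: List[str] = field(default_factory=list)
    oldest_date: Optional[date] = None
    libraries: List[str] = field(default_factory=list)
    # Status of every paper the user has acted on, keyed by paper id.
    status: Dict[str, Status] = field(default_factory=dict)

    def mark(self, paper_id: str, new_status: Status) -> None:
        """Add a status flag to a paper, keeping the flags it already has."""
        current = self.status.get(paper_id, Status.NEW)
        self.status[paper_id] = current | new_status

    def pending(self, results: List[str]) -> List[str]:
        """Return the papers the user has not read, saved, or rejected yet."""
        return [p for p in results if not self.status.get(p, Status.NEW)]
```

For example, after `q.mark("p2", Status.REJECTED)`, a later call to `q.pending(["p1", "p2"])` no longer includes `"p2"`.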

The interface of PaperFinder consists of three basic options:

Setup: View existing queries, create a new query, or modify/delete an old one (figure 1).
Results: View new papers found for each query, and perform several operations on them, like reading, saving and rejecting (figure 2).
Folders: View the contents of selected folders (figure 3).

Figure 1: Setup screen of PaperFinder. The user has registered several queries about ``distributed systems'', ``loop scheduling'' and ``scheduling'' in several digital libraries, such as those of the ACM and USENIX.

Figure 2: Results of a query about ``distributed systems'' in the on-line digital library of the USENIX Association.

Figure 3: A folder that contains papers related to the area of loop scheduling.

The back-end of PaperFinder consists of three cooperating modules. The first module is responsible for contacting the Digital Libraries and retrieving the results related to each query. These results are saved in separate internal files. The second module updates the profile of each user. Specifically, it compares the new articles with those already found in the user's profile. If a paper does not appear in the profile, it is added as a new one. Finally, the third module is responsible for sorting the new papers. The new papers are sorted according to one or more seed papers (or authors), which the user may optionally specify when creating a query.
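The division of labor among the three modules can be sketched in Python as three small functions. This is a hedged illustration rather than the actual back-end: the function names, the `Paper` alias, and the idea of passing each library in as a callable (instead of contacting real Digital Libraries and writing internal files) are assumptions made to keep the example self-contained.

```python
from typing import Callable, Iterable, List, Set

Paper = str  # hypothetical: a paper is identified by a unique string id


def retrieve(fetchers: Iterable[Callable[[], List[Paper]]]) -> List[Paper]:
    """Module 1: collect raw results from each library (fetchers are stand-ins)."""
    results: List[Paper] = []
    for fetch in fetchers:
        results.extend(fetch())
    return results


def update_profile(profile: Set[Paper], results: List[Paper]) -> List[Paper]:
    """Module 2: add papers missing from the profile; return them in order."""
    new: List[Paper] = []
    for paper in results:
        if paper not in profile:
            profile.add(paper)
            new.append(paper)
    return new


def sort_new(new: List[Paper], distance: Callable[[Paper], float]) -> List[Paper]:
    """Module 3: order the new papers by distance to the seed, closest first."""
    return sorted(new, key=distance)
```

The `distance` argument stands for whichever seed-based metric is in use; any function that maps a paper to a number will do for the sketch.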

3. Resource-Discovery Mode of Operation

PaperFinder works in two modes: the keyword-based mode and the resource-discovery mode. In the keyword-based mode, users simply supply PaperFinder with a few keywords that describe their field of interest, like ``digital libraries'' or ``process scheduling''.

In the resource-discovery mode, the user presents one or more ``seed papers'' and expects PaperFinder to discover new papers that are related to the seed papers. PaperFinder uses query generalization and filtering to discover papers related to the seed paper(s):

Query Generalization: The goal of this step is to find several papers that are (more or less) related to the seed paper. To do so, PaperFinder forms queries whose results should include the seed paper as well as several other papers. Such queries can be formed by taking one keyword from the seed paper's title and searching for it, or by taking each of the co-authors in turn and searching for papers co-authored by him or her. Once all the queries have been sent to the digital library, the results are merged, duplicates are removed, and the output is passed to the second stage.
Filtering: The goal of this step is to filter the papers found in the previous stage and return the most relevant ones to the user. Deciding which papers are relevant can be tricky, and a poor choice results in a flood of irrelevant publications. To find relevant papers, PaperFinder applies several similarity metrics to the papers found, measuring how similar they are to the seed paper.
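The query generalization step described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not PaperFinder's code: the `search` callable stands in for a digital library's search engine, papers are plain string ids, and the small stop-word list (used to skip words such as "the") is an addition of this example that the text does not mention.

```python
from typing import Callable, Dict, List, Tuple

# Assumption of this sketch: very common words make poor one-keyword queries.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def generalize(title: str, authors: List[str]) -> List[Tuple[str, str]]:
    """Form one query per significant title word and one per co-author."""
    words = [w.lower().strip(".,:;") for w in title.split()]
    queries = [("keyword", w) for w in words if w and w not in STOPWORDS]
    queries += [("author", a) for a in authors]
    return queries


def candidates(queries: List[Tuple[str, str]],
               search: Callable[[str, str], List[str]]) -> List[str]:
    """Send every query, merge the results, and drop duplicates (order kept)."""
    merged: Dict[str, None] = {}
    for kind, term in queries:
        for paper in search(kind, term):
            merged.setdefault(paper, None)
    return list(merged)
```

The merged list returned by `candidates` is what the filtering stage would then rank with a similarity metric.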

One of the similarity metrics we currently use is a sorting algorithm based on author-distance, a notion derived from the Erdos number (see footnotes). Two authors have an author-distance of one if they have co-authored at least one paper. Their author-distance is two if they are not co-authors but there exists at least one third author who has written a paper with both of them, and so on. We are currently experimenting with several other distance metrics between papers. Such metrics include:

Weighted-author distance: the distance between two co-authors is defined as the number of papers they have co-authored over the total number of papers they have authored. The distance between two authors that are not co-authors is found by the transitive closure of the distances among co-authors.
Keyword distance: given two papers and a set of keywords, the distance between the papers is defined as the number of keywords that appear in (the title/abstract/text of) both papers.
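The basic author-distance defined earlier in this section is the length of the shortest path between two authors in the co-authorship graph, so it can be computed with a breadth-first search. The Python sketch below illustrates this; the function names and the representation of each paper as a list of author names are assumptions of the example, and the choice to return `None` for authors with no connecting chain is not specified in the text.

```python
from collections import defaultdict, deque
from itertools import combinations
from typing import Dict, List, Optional, Set


def coauthor_graph(papers: List[List[str]]) -> Dict[str, Set[str]]:
    """Link every pair of authors that appear together on some paper."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for authors in papers:
        for a in authors:
            graph[a]  # make sure single-author papers still create a node
        for a, b in combinations(authors, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def author_distance(graph: Dict[str, Set[str]],
                    source: str, target: str) -> Optional[int]:
    """Breadth-first search; None means no chain of co-authors exists."""
    if source not in graph or target not in graph:
        return None
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        author, dist = queue.popleft()
        for nxt in graph[author]:
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

With papers written by (A, B) and (B, C), authors A and B are at distance one, while A and C are at distance two through B.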

Our intention is to define (and experiment with) several search metrics and present the users with a choice of the most promising ones.
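As a rough illustration of the two experimental metrics listed above, the following Python sketch computes the keyword distance and the direct weight between two co-authors. Several details are assumptions of the example rather than statements from the text: keywords are matched as case-insensitive substrings, "the total number of papers they have authored" is read as the number of papers written by either author, and the transitive closure over non-co-authors is left out because the text does not say how distances along a chain are combined. Note that, as defined, both values grow as the papers or authors become more closely related.

```python
from typing import List, Set


def keyword_distance(text_a: str, text_b: str, keywords: List[str]) -> int:
    """Count the keywords that occur in both texts (case-insensitive)."""
    a, b = text_a.lower(), text_b.lower()
    return sum(1 for k in keywords if k.lower() in a and k.lower() in b)


def coauthor_weight(papers: List[Set[str]], x: str, y: str) -> float:
    """Papers co-authored by x and y over papers authored by either of them."""
    joint = sum(1 for p in papers if x in p and y in p)
    total = sum(1 for p in papers if x in p or y in p)
    return joint / total if total else 0.0
```

Here each element of `papers` is the set of author names of one paper, and the texts passed to `keyword_distance` may be titles, abstracts, or full texts.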

4. Conclusion

To conclude, PaperFinder is a useful information filtering tool because it

capitalizes on the familiar and effective user-interface of USENET news,
maintains user-profiles with information about what papers users already know and what topics they would like to learn about,
exploits the search engines of Digital Libraries to find new information, helping users in their research,
reduces information pollution by not repeating previously read papers, and
reduces network congestion and server load by running and updating databases periodically (preferably at night).

5. Acknowledgments

This work was supported in part by the USENIX Association. We gratefully acknowledge this financial support.

References

[1] Evangelos P. Markatos, Christina Tziviskou, and Athanasios E. Papathanasiou. Effective Resource Discovery on the World Wide Web. In WebNet 98: World Conference of the WWW, Internet, and Intranet, Orlando, Florida, USA, 1998.

Footnotes ...

...Papadakis

The authors are also with the University of Crete.

...number

Paul Erdos, the late widely-traveled and incredibly prolific Hungarian mathematician of the highest caliber, wrote hundreds of mathematical research papers in many different areas, many in collaboration with others. His Erdos number is 0. His co-authors have Erdos number 1. People other than Erdos who have written a joint paper with someone with Erdos number 1 but not with Erdos have Erdos number 2, and so on.