PaperFinder: A tool for scalable search of digital libraries
Athanasios E. Papathanasiou
Evangelos P. Markatos
Stavros A. Papadakis
Institute of Computer Science (ICS)
Foundation for Research & Technology - Hellas (FORTH), Crete
P.O.Box 1385
Heraklion, Crete, GR-711-10 GREECE
papathan@ics.forth.gr
http://www.ics.forth.gr/proj/arch-vlsi/OS/usewebnet.html
1. Introduction
The invention and spread of the World-Wide Web made the process
of paper publication significantly easier than before, and added
a large repository of on-line (electronic) papers to our body of
knowledge.
To make the process of information gathering
easier for scientists, Digital Libraries usually maintain Search
Engines, which can be used to find papers about a specified topic.
However, the use of these Search Engines suffers from two significant
disadvantages.
First, the user cannot direct a query to the Search Engines of several Digital
Libraries at once. The same query has to be sent to each
Search Engine separately in order to retrieve papers from conferences/journals
organized by different institutions.
Second, Search Engines do not keep track of the documents retrieved
when the same query was issued previously. Thus, every time a user
issues a query, (almost) the same results are returned.
In this project, we use the basic concepts behind the implementation
of USEwebNET [1] to address the problems described above.
In the rest of this poster paper, we will describe PaperFinder, a tool that
continually searches digital libraries of scientific
publications, filters only the relevant papers,
and delivers them to interested scientists through a friendly user-interface.
2. Design - User Interface
The purpose of PaperFinder is to provide the user with a flexible tool that simplifies
the time-consuming task of information filtering.
To achieve its purpose, PaperFinder has been
designed as a value-added service
on top of some well-known Digital Libraries, like those maintained by ACM and USENIX.
Among its characteristics, the most important are the maintenance of a personal
profile for each user of the system and an innovative sorting algorithm
for the results of a query.
The basic idea behind PaperFinder is the ability to maintain information about a
user's interests, and query several digital
libraries for new articles at regular
intervals. A query may be specified by a set of keywords, which characterize
the user's topic of interest, the names of authors, whose articles the user is
interested in finding, a date
(the oldest publication date the articles of interest
could have), and the Digital Libraries the query should be directed to. PaperFinder
keeps track of the articles found for every topic of interest, so that a paper that
has already been viewed and approved or rejected by the user will never be presented
again. While examining the papers retrieved by a query, the user can
read their abstract or full text, if these are available, and save those
that are especially interesting in separate folders. In addition, the system
maintains information
about the current status of each paper retrieved. Specifically, an article may be
marked as Rejected, Read (if it has been viewed at least once) and/or
Saved (if it has been saved to at least one folder). After a paper has been
marked as rejected it will never be shown to the user again.
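The query description and status tracking above can be illustrated with a small data structure. This is only a sketch under our own assumptions: the names Query, Status, unseen and mark are hypothetical and do not describe the actual PaperFinder implementation.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Flag
from typing import Dict, List, Optional


class Status(Flag):
    """Status flags for a retrieved paper; Read and Saved may be combined."""
    READ = 1
    SAVED = 2
    REJECTED = 4


@dataclass
class Query:
    """Hypothetical record for one registered topic of interest."""
    keywords: List[str]
    authors: List[str] = field(default_factory=list)
    oldest_date: Optional[date] = None   # oldest acceptable publication date
    libraries: List[str] = field(default_factory=list)
    # Paper identifier -> status flags for every paper the user has acted on.
    seen: Dict[str, Status] = field(default_factory=dict)

    def unseen(self, paper_ids):
        """Keep only the papers that have no recorded status for this query."""
        return [p for p in paper_ids if p not in self.seen]

    def mark(self, paper_id, status):
        """Add a status flag to a paper; flags accumulate across calls."""
        self.seen[paper_id] = self.seen.get(paper_id, Status(0)) | status
```

For example, after marking one paper as Read and Saved and another as Rejected, unseen() would return only the papers that have never been marked, which is what the Results screen would present as new.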
The interface of PaperFinder consists of three basic options:
- Setup: View existing queries, create a new query, or modify/delete an old one
(figure 1).
- Results: View new papers found for each query, and perform several operations on them,
like reading, saving and rejecting (figure 2).
- Folders: View the contents of selected folders
(figure 3).
Figure 1: Setup screen of PaperFinder. The user has registered several
queries about ``distributed systems'', ``loop scheduling'' and ``scheduling''
in several digital libraries like that of the ACM, USENIX, etc.
Figure 2: Results of a query about ``distributed systems''
in the on-line digital library of the USENIX Association.
Figure 3: A folder that contains papers related to the area
of loop scheduling.
The back-end of PaperFinder consists of three cooperating modules.
The first module is responsible for contacting the Digital Libraries
and retrieving the results related to each query. These results
are saved in separate internal files. The second
one updates the profile of each user. Specifically, it compares the
new articles with those already found in the user's profile. If a
paper does not appear in the profile, it is added as a new one.
Finally, the third module is responsible for sorting the new
papers. The new papers are sorted according to one or more seed papers (or authors),
which may be specified optionally by the user when creating a query.
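The cooperation of the three modules can be sketched as a single pass over one query. The function run_query, the fetcher callables and the rank parameter are hypothetical stand-ins chosen for illustration; the sketch keeps results in memory rather than in internal files, and it does not reproduce the actual seed-based sorting.

```python
def run_query(fetchers, profile, rank=None):
    """One back-end pass: retrieve, update the profile, sort the new papers.

    fetchers: callables, one per digital library, each returning a list of
              (paper_id, title) pairs (hypothetical stand-ins for module 1).
    profile:  dict mapping paper_id -> title for papers already known.
    rank:     optional key function for a (paper_id, title) pair
              (hypothetical stand-in for the seed-based sort of module 3).
    """
    # Module 1: contact each library and collect its results.
    results = [paper for fetch in fetchers for paper in fetch()]

    # Module 2: keep only papers missing from the profile, then record them.
    new_papers = []
    for paper_id, title in results:
        if paper_id not in profile:
            profile[paper_id] = title
            new_papers.append((paper_id, title))

    # Module 3: order the new papers; without a ranking, keep arrival order.
    if rank is not None:
        new_papers.sort(key=rank)
    return new_papers
```

Because the profile is updated while the results are scanned, a paper returned by two libraries is added only once, and a second pass over unchanged libraries yields no new papers.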
3. Resource-Discovery Mode of Operation
PaperFinder works in two modes: the keyword-based mode and the
resource-discovery mode.
In the keyword-based mode, users simply supply PaperFinder with a few
keywords that describe their field of interest, like ``digital libraries''
or ``process scheduling''.
In the resource-discovery mode, the user presents one or more
``seed papers'' and expects PaperFinder to discover new papers
that are related to the seed papers.
PaperFinder uses query generalization and filtering to discover
papers related to the seed paper(s):
- Query Generalization: The goal of this step is to find several papers
that are (more or less) related to the mentioned seed paper.
To do so, PaperFinder forms queries whose results should include
the seed paper as well as several other papers.
Such queries can be formed by taking one keyword from
the seed paper's title and searching for it,
or by taking each of the co-authors in turn and searching for papers
co-authored by that author. Once all the queries have been sent to the digital
library, the results are merged, duplicates are removed and the output is given to the second stage.
- Filtering: The goal of this step is to filter the
papers found in the previous stage and return the most relevant to the
user. Finding which papers are relevant can be tricky and can result in
a flood of irrelevant publications. To find relevant papers PaperFinder
applies several similarity metrics to the papers found to measure
how similar they are to the seed paper.
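The query generalization step can be sketched as follows. The function generalize and the search callable are hypothetical: search(kind, term) stands in for a request to a digital library and is assumed to return a list of paper identifiers. The stopwords parameter is our own addition for skipping uninformative title words.

```python
def generalize(seed_title, seed_authors, search, stopwords=frozenset()):
    """Build one query per title keyword and per co-author, then merge.

    search(kind, term) is a hypothetical callable that returns a list of
    paper identifiers for a "keyword" or an "author" query.
    """
    queries = [("keyword", w) for w in seed_title.lower().split()
               if w not in stopwords]
    queries += [("author", a) for a in seed_authors]

    merged, seen = [], set()
    for kind, term in queries:
        for paper_id in search(kind, term):
            if paper_id not in seen:      # drop duplicates across queries
                seen.add(paper_id)
                merged.append(paper_id)
    return merged
```

The merged list, which should contain the seed paper itself, would then be handed to the filtering stage.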
One of the similarity metrics we currently use
is author-distance, a notion derived
from the Erdos number.
Two authors have an author-distance of one
if they have co-authored at least one paper.
Their author-distance is two if they are not co-authors but there exists
at least one third author who has written a paper with both of them,
and so on.
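Under this definition, author-distance is the length of the shortest path between two authors in the co-authorship graph, so it can be computed with a breadth-first search. The sketch below assumes that papers are given as plain lists of author names; the function name and the choice to return None for unconnected authors are ours.

```python
from collections import deque


def author_distance(papers, source, target):
    """Shortest co-authorship path length between two authors, or None.

    papers: iterable of author lists, one list per paper.
    """
    # Build the co-authorship graph: authors are linked if they share a paper.
    coauthors = {}
    for authors in papers:
        for a in authors:
            coauthors.setdefault(a, set()).update(x for x in authors if x != a)

    if source not in coauthors or target not in coauthors:
        return None
    # Breadth-first search visits authors in order of increasing distance.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            return dist[current]
        for nxt in coauthors[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return None  # no chain of co-authors connects the two
```

Papers could then be ordered by the smallest author-distance between any of their authors and the authors of the seed paper, although the exact sorting rule is not specified here.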
We are currently experimenting with several other distance metrics
between papers. Such metrics include:
- Weighted-author distance: the distance between two co-authors is defined
as the number of papers they have co-authored over the total number of
papers they have authored. The distance between two authors that are
not co-authors is found by the transitive closure of the distances among
co-authors.
- Keyword distance: given two papers and a set of keywords, the distance between
the papers is defined as the number of keywords that appear in
(the title/abstract/text) of both papers.
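The keyword metric can be illustrated with a few lines of code. This sketch assumes case-insensitive substring matching on whatever text is available for each paper; the function name keyword_overlap is hypothetical. Note that, as defined above, a larger count indicates more closely related papers.

```python
def keyword_overlap(text_a, text_b, keywords):
    """Count the keywords that occur in both texts (case-insensitive)."""
    a, b = text_a.lower(), text_b.lower()
    # A keyword counts only if it is found in both papers.
    return sum(1 for k in keywords if k.lower() in a and k.lower() in b)
```

For instance, with the keywords "loop", "scheduling" and "cache", two titles that both mention loop scheduling but not caches would score 2.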
Our intention is to define (and experiment with) several search metrics
and present the users with a choice of the most promising ones.
4. Conclusions
To conclude, PaperFinder is a useful information filtering tool because it
- capitalizes on the familiar and effective user-interface of USENET news,
- maintains user-profiles with information about what papers users already know
and what topics they would like to learn,
- exploits search engines of Digital Libraries in order
to find out new information, helping users in their research,
- reduces information pollution by not repeating previously read papers, and
- reduces network congestion and server load by running and updating databases
periodically (preferably at night).
5. Acknowledgments
This work was supported in part by the USENIX Association. We gratefully acknowledge this financial
support.
References
[1] Evangelos P. Markatos, Christina Tziviskou, and Athanasios E. Papathanasiou: Effective Resource
Discovery on the World Wide Web.
In WebNet 98--World Conference of the WWW, Internet, and Intranet,
Orlando, Florida, USA, 1998.
Footnotes
- Authors: The authors are also with the University of Crete.
- Erdos number: Paul Erdos, the late, widely traveled and
incredibly prolific Hungarian mathematician of the highest caliber, wrote
hundreds of mathematical research papers in many different areas, many in
collaboration with others. His Erdos number is 0. His co-authors have
Erdos number 1. People other than Erdos who have written a joint paper
with someone with Erdos number 1 but not with Erdos have Erdos number 2,
and so on.