About the Course
The field of
Information Retrieval has become extremely important in recent
years due to the intriguing challenges presented in tapping the Internet
and the Web as an inexhaustible source of information. The success of Web
search engines is a testimony to this fact. The technical challenges of
searching and browsing for information in unstructured domains such
as the Web and other document collections are vast. Querying is often a
gray and fuzzy process; e.g., when many results match a query, IR systems
attempt to return the top hits ranked by “relevance”. Several ingenious query
paradigms as well as search algorithms have been developed for these
purposes.
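As a rough illustration of ranking by “relevance”, the sketch below scores a handful of documents against a keyword query with a simple TF-IDF weighting and returns the top hits. The function name `rank_by_tfidf`, the smoothed IDF formula, and the three sample documents are assumptions made for this example only; real IR systems use considerably more refined scoring.

```python
import math
from collections import Counter


def rank_by_tfidf(query, docs, top_k=3):
    """Return (score, doc_index) pairs for the top_k matching documents, best first."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scored = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term]:
                # Smoothed IDF (an assumed variant; several are in common use).
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += (tf[term] / len(tokens)) * idf
        if score > 0:
            scored.append((score, idx))
    # Highest score first; ties are broken by document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Hypothetical mini-collection, for illustration only.
docs = [
    "web search engines rank web pages",
    "databases store structured tuples",
    "search over structured databases",
]
ranking = rank_by_tfidf("web search", docs)
```

Here the first document matches both query terms and ranks first, the third matches only "search" and ranks second, and the second document matches nothing and is left out of the ranking.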
However, while
the Web is indeed vast as an information source, it is estimated that much
larger amounts of recorded data are locked up in more structured sources
such as databases, which are often the proprietary information of private
corporations and government agencies. Searching for information within
databases is currently accomplished in completely different ways from
searching for information in unstructured data sources. Often
the data explorer has to know comprehensive query languages (such as SQL),
as well as important information on how the data is structured into
different tables and columns (the database schema). Database searching is
all black and white; there is currently no room for fuzzy concepts such as
relevance retrieval and ranking. A SQL query returns a precise set of
results, i.e., those tuples that satisfy its selection criterion.
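A minimal sketch of this black-and-white behavior, using Python's built-in `sqlite3` module. The `homes` table, its columns, and the sample rows are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database with a hypothetical table of home listings.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE homes (id INTEGER PRIMARY KEY, city TEXT, price INTEGER)"
)
conn.executemany(
    "INSERT INTO homes (city, price) VALUES (?, ?)",
    [("Austin", 295000), ("Austin", 301000), ("Dallas", 250000)],
)

# The selection condition is a strict yes/no test on each tuple.
rows = conn.execute(
    "SELECT id, city, price FROM homes "
    "WHERE city = ? AND price <= ? ORDER BY id",
    ("Austin", 300000),
).fetchall()
conn.close()
```

Only the first Austin listing is returned. The second one misses the price limit by a small margin and is dropped entirely; the query has no notion of a "close" match and imposes no relevance ordering on the tuples that do qualify.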
In recent
years, researchers have pondered the problem of bringing the two areas
together, i.e., of searching within structured as well as unstructured
data. Can some of the techniques of Information Retrieval, such as fuzzy
lookup, keyword search, relevance retrieval, etc., be transplanted into the
world of databases? Will this complicate the performance and reliability
issues that database researchers have grappled with for years? Will the
classical DBMS, a software system that is the result of several person-years
of engineering, need to be radically changed, or can these new searching
techniques be accomplished by adding a thin layer on top? Likewise, can
some of the lessons learned in database systems be transported back to the
field of Information Retrieval, such as schema-aware search and retrieval,
query algebra and languages, as well as performance and reliability
developments?
This class
will explore the recent efforts by researchers in these extremely
important and challenging fields. We will read and discuss the latest research
literature gleaned from premier conferences in databases and information
retrieval. It is hoped that this class will spur students to pursue
further research in these areas.
The following
is a tentative list of topics which we will attempt to cover:
- Structured versus Unstructured Search: Introduction
- Ranking in IR: Basics
- TF-IDF Ranking
- Probabilistic IR
- Link Analysis
- Keyword Queries in Databases
- DBXplorer
- Discover
- Banks
- Ranking of Database Query Results
- Empty Answers Problem
- Many Answers Problem
- Data Summarization
- DB and IR integration
- Top-K algorithms
We will cover
various topics in breadth, understand the central contributions of these
efforts, and try to predict future research directions.
Prerequisites
Advanced
Algorithms and Database II are the prerequisite courses. However,
exceptions will be made on a case-by-case basis, especially if the student
has prior exposure to these concepts or demonstrates the initiative to
learn them quickly on their own.
Presentations
The actual
reading list, consisting of recent research papers, will be selected and
finalized by the first week of classes. Each student will present one or
more papers (depending on the enrollment) during the semester. Students
will participate in class discussions during and after each
presentation. Attendance is required.
Project
In addition
to reading papers, students will have the option of attempting a
programming project during the semester. The projects will involve
developing portions of information retrieval systems for structured
databases based on the techniques suggested in the papers. The projects
will also be tested using real data that the students should obtain
access to. A long-term objective is that the more promising projects will
serve as infrastructure/test-beds for students to continue with their
research in these areas beyond the course.
Evaluation
The grade will
be based on the paper presentations, class attendance and participation,
and performance in the projects.
Course Schedule