Dr. Gautam Das


 

DATABASES AND INFORMATION RETRIEVAL

 
Spring 2005
 
 

Dr. Gautam Das
Office: 302 Nedderman Hall
Phone: 817 272 7595
Email: gdas@cse.uta.edu

Office Hours: Tue-Thu 12:00-1:00pm (or by appt)

 
   

 

About the Course

The field of Information Retrieval has become extremely important in recent years due to the intriguing challenges presented in tapping the Internet and the Web as an inexhaustible source of information. The success of Web search engines is a testimony to this fact. The technical challenges of searching and browsing for information in unstructured domains such as the Web and other document collections are vast. Querying is often a gray and fuzzy process; e.g., when many results match a query, IR systems attempt to return the top hits ranked by “relevance”. Several ingenious query paradigms as well as search algorithms have been developed for these purposes.
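As a rough illustration of relevance ranking, the following minimal sketch scores documents against a keyword query with a simple TF-IDF weighting. The function names, the whitespace tokenizer, and the particular weighting formula are assumptions chosen for brevity; real IR systems use more refined variants.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def rank_by_tfidf(documents, query):
    """Return (score, doc_index) pairs sorted by decreasing score."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scored = []
    for index, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in term_freq:
                # Rare terms get a larger inverse document frequency.
                idf = math.log(n_docs / doc_freq[term])
                score += (term_freq[term] / len(tokens)) * idf
        scored.append((score, index))

    # Highest score first; ties are broken by document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


docs = [
    "database query ranking",
    "web search engines rank web pages",
    "cooking recipes",
]
ranked = rank_by_tfidf(docs, "web search")
print(ranked[0][1])  # index of the best-matching document
```

Every document receives a score, so the result is an ordering of the whole collection rather than a yes/no membership test.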

However, while the Web is indeed vast as an information source, it is estimated that much larger amounts of recorded data are locked up in more structured sources such as databases, which are often the proprietary information of private corporations and government agencies. Searching for information within databases is currently accomplished in completely different ways from searching for information in unstructured data sources. Often the data explorer has to know comprehensive query languages (such as SQL), as well as important information on how the data is structured into different tables and columns (the database schema). Database searching is all black and white; there is currently no room for fuzzy concepts such as relevance retrieval and ranking. A SQL query returns a precise set of results, i.e., those tuples that satisfy its selection criterion.
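The black-and-white behavior of a SQL selection can be seen in a small sketch using Python's built-in sqlite3 module. The homes table, its columns, and the sample rows are hypothetical and serve only as an illustration.

```python
import sqlite3

# Hypothetical sample data: (id, city, price).
HOMES = [
    (1, "Arlington", 200000),
    (2, "Arlington", 210000),
    (3, "Dallas", 150000),
]


def exact_match(rows, city, max_price):
    """Return the ids of the tuples that satisfy the selection exactly."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE homes (id INTEGER, city TEXT, price INTEGER)")
    conn.executemany("INSERT INTO homes VALUES (?, ?, ?)", rows)
    cursor = conn.execute(
        "SELECT id FROM homes WHERE city = ? AND price <= ? ORDER BY id",
        (city, max_price),
    )
    ids = [row[0] for row in cursor.fetchall()]
    conn.close()
    return ids


# Home 2 misses the price limit by a small margin and is dropped entirely.
print(exact_match(HOMES, "Arlington", 200000))
# An overly strict condition yields no tuples at all.
print(exact_match(HOMES, "Arlington", 100000))
```

A tuple either satisfies the predicate or it does not; there is no notion of a near miss and no ordering by how well a tuple matches.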

In recent years, researchers have pondered the problem of bringing the two areas together, i.e., of searching within structured as well as unstructured data. Can some of the techniques of Information Retrieval, such as fuzzy lookup, keyword search, and relevance retrieval, be transplanted into the world of databases? Will this complicate the performance and reliability issues that database researchers have grappled with for years? Will the classical DBMS, a software system that is the result of several man-years of engineering, need to be radically changed, or can these new searching techniques be accomplished by adding a thin layer on top? Likewise, can some of the lessons learned in database systems be transported back to the field of Information Retrieval, such as schema-aware search and retrieval, query algebra and languages, as well as performance and reliability developments?

This class will explore the recent efforts by researchers in these extremely important and challenging fields. We will read and discuss the latest research literature gleaned from premier conferences in databases and information retrieval. It is hoped that this class will spur students to pursue further research in these areas.

The following is a tentative list of topics which we will attempt to cover:

  • Structured versus Unstructured Search: Introduction

  • Ranking in IR: Basics
      • TF-IDF Ranking
      • Probabilistic IR
      • Link Analysis

  • Keyword Queries in Databases
      • DBXplorer
      • Discover
      • BANKS

  • Ranking of Database Query Results
      • Empty Answers Problem
      • Many Answers Problem
      • Data Summarization

  • DB and IR integration
      • Top-K algorithms

We will cover various topics in breadth, understand the central contributions of these efforts, and try to predict future research directions.

Prerequisites

Advanced Algorithms and Database II are the prerequisite courses. However, exceptions will be made on a case-by-case basis, especially if the student has prior exposure or demonstrates the initiative to quickly learn these concepts on his/her own.

Presentations

The actual reading list, consisting of recent research papers, will be selected and finalized by the first week of classes. Each student will present one or more papers (depending on the enrollment) during the semester. Students will participate in class discussions during and after each presentation. Attendance is required.

Project

In addition to reading papers, students will have the option of attempting a programming project during the semester. The projects will involve developing portions of information retrieval systems for structured databases based on the techniques suggested in the papers. The projects will also be tested using real data that the students should get access to. A long-term objective is that the more promising projects will serve as infrastructure/test-beds for students to continue with their research in these areas beyond the course.

Evaluation

The grade will be based on the paper presentations, class attendance and participation, and performance in the projects.

 

Course Schedule

Date | Topics/Papers | Presenter

1/18 - 2/08
  • Introduction to DB and IR
  • Presenter: Gautam Das

2/10 (Thu)
  • No class (makeup will be announced)

2/15 (Tue)
  • Amit Singhal: Modern Information Retrieval: A Brief Overview. IEEE Bulletin 2001
  • Presenter: Parin Sangoi

2/17 (Thu)
  • No class (makeup will be announced)

2/22 (Tue)
  • L. Page, S. Brin, R. Motwani, T. Winograd: The PageRank Citation Ranking: Bringing Order to the Web. 1999
  • Presenter: Soumya Sanyal

2/24 (Thu)
  • J. Kleinberg: Authoritative sources in a hyperlinked environment. Journal of the ACM 46 (1999)
  • Presenter: Raman Adaikkalavan

3/01 (Tue)
  • Sanjay Agrawal, Surajit Chaudhuri, Gautam Das: DBXplorer: A System For Keyword-Based Search Over Relational Databases. ICDE 2002
  • Presenter: Bhushan Chaudhari

3/03 (Thu)
  • Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE 2002
  • Presenter: Ahn Seoyoung

3/08 (Tue)
  • V. Hristidis, L. Gravano, Y. Papakonstantinou: Efficient IR-Style Keyword Search over Relational Databases. VLDB 2003
  • Presenter: Jing Chen

3/10 (Thu)
  • R. Goldman, N. Shivakumar, S. Venkatasubramanian, H. Garcia-Molina: Proximity Search in Databases. VLDB 1998
  • Presenter: Arjun Saraswat

3/14 - 3/20
  • Spring Vacation

3/22 (Tue)
  • Sanjay Agrawal, Surajit Chaudhuri, Gautam Das, Aristides Gionis: Automated Ranking of Database Query Results. CIDR 2003
  • Presenter: Parin Sangoi

3/24 (Thu)
  • Surajit Chaudhuri, Gautam Das, Vagelis Hristidis, Gerhard Weikum: Probabilistic Ranking of Database Query Results. VLDB 2004
  • Presenter: Weimin Hi

3/29 (Tue)
  • F. Geerts, H. Mannila, E. Terzi: Relational link-based ranking. The 30th International Conference on Very Large Data Bases (VLDB'04), 2004
  • Presenter: Nishant Kapoor

3/31 (Thu)
  • Kaushik Chakrabarti, Surajit Chaudhuri, Seung-won Hwang: Automatic Categorization of Query Results. SIGMOD Conference 2004
  • Presenter: Arjun Saraswat

4/05 (Tue)
  • Ronald Fagin, Amnon Lotem, Moni Naor: Optimal Aggregation Algorithms for Middleware. PODS 2001
  • Presenter: Nishant Kapoor

4/07 (Thu)
  • Nicolas Bruno, Luis Gravano, Amelie Marian: Evaluating Top-k Queries over Web-Accessible Databases. ICDE 2002
  • Presenter: Bhushan Chaudhari

4/12 (Tue)
  • R. Fagin, Ravi Kumar, D. Sivakumar: Comparing top k lists. SIAM J. Discrete Mathematics 17, 1 (2003)
  • Presenter: Soumya Sanyal

4/14 (Thu)
  • Vagelis Hristidis, Nick Koudas, Yannis Papakonstantinou: PREFER: A System for the Efficient Execution of Multi-parametric Ranked Queries. SIGMOD 2001
  • Presenter: Ahn Seoyoung

4/19 (Tue)
  • Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid: Supporting Top-k Join Queries in Relational Databases. VLDB 2003
  • Presenter: Jing Chen

4/21 (Thu)
  • Class projects

4/26 (Tue)
  • Class projects

4/28 (Thu)
  • Class projects

5/03 (Tue)
  • Class projects

5/05 (Thu)
  • Class projects

 

       
