Gautam Das

DATABASES AND INFORMATION RETRIEVAL

Spring 2005

Dr. Gautam Das
Office: 302 Nedderman Hall
Phone: 817 272 7595
Email: gdas@cse.uta.edu

Office Hours: Tue-Thu 12:00-1:00pm (or by appt)


About the Course

The field of Information Retrieval has become extremely important in recent years due to the intriguing challenges presented in tapping the Internet and the Web as an inexhaustible source of information. The success of Web search engines is testimony to this fact. The technical challenges of searching and browsing for information in unstructured domains such as the Web and other document collections are vast. Querying is often a gray and fuzzy process; e.g., when many results match a query, IR systems attempt to return the top hits ranked by “relevance”. Several ingenious query paradigms and search algorithms have been developed for these purposes.

However, while the Web is indeed vast as an information source, it is estimated that much larger amounts of recorded data are locked up in more structured sources such as databases, which are often the proprietary information of private corporations and government agencies. Searching for information within databases is currently accomplished in completely different ways from searching unstructured data sources. Often the data explorer has to know comprehensive query languages (such as SQL), as well as important information on how the data is structured into different tables and columns (the database schema). Database searching is all black and white; there is currently no room for fuzzy concepts such as relevance retrieval and ranking. A SQL query returns a precise set of results, i.e., those tuples that satisfy its selection criterion.

In recent years, researchers have pondered the problem of bringing the two areas together, i.e., of searching within structured as well as unstructured data. Can some of the techniques of Information Retrieval, such as fuzzy lookup, keyword search, relevance retrieval, etc., be transplanted into the world of databases? Will this complicate the performance and reliability issues that database researchers have grappled with for years?
Will the classical DBMS, a software system that is the result of several man-years of engineering, need to be radically changed, or can these new searching techniques be accomplished by adding a thin layer on top? Likewise, can some of the lessons learned in database systems be transported back to the field of Information Retrieval, such as schema-aware search and retrieval, query algebra and languages, as well as performance and reliability developments?

This class will explore the recent efforts by researchers in these extremely important and challenging fields. We will read and discuss the latest research literature gleaned from premier conferences in databases and information retrieval. It is hoped that this class will spur students to pursue further research in these areas.

The following is a tentative list of topics which we will attempt to cover:

Structured versus Unstructured Search: Introduction
Ranking in IR: Basics
  TF-IDF Ranking
  Probabilistic IR
  Link Analysis
Keyword Queries in Databases
  DBXplorer
  Discover
  BANKS
Ranking of Database Query Results
  Empty Answers Problem
  Many Answers Problem
Data Summarization
DB and IR integration
Top-K algorithms

We will cover various topics in breadth, understand the central contributions of these efforts, and try to predict future research directions.

Prerequisites

Advanced Algorithms and Database II are the prerequisite courses. However, exceptions will be made on a case-by-case basis, especially if the student has prior exposure or demonstrates initiative to quickly learn these concepts on his/her own.

Presentations

The actual reading list, consisting of recent research papers, will be selected and finalized by the first week of classes. Each student will present one or more papers (depending on the enrollment) during the semester. Students will participate in class discussions during and after each presentation. Attendance is required.
Project

In addition to reading papers, students will have the option of attempting a programming project during the semester. The projects will involve developing portions of information retrieval systems for structured databases based on the techniques suggested in the papers. The projects will also be tested on real data that the students should get access to. A long-term objective is that the more promising projects will serve as infrastructure/test-beds for students to continue their research in these areas beyond the course.

Evaluation

The grade will be based on the paper presentations, class attendance and participation, and performance in the projects.

Course Schedule
Date	Topics/Papers	Presenter	Slides

1/18 - 2/08	Introduction to DB and IR	Gautam Das	slides slides slides
2/10 (Thu)	No class (makeup will be announced)
2/15 (Tue)	Amit Singhal: Modern Information Retrieval: A Brief Overview. IEEE Data Engineering Bulletin 2001	Parin Sangoi	slides
2/17 (Thu)	No class (makeup will be announced)
2/22 (Tue)	L. Page, S. Brin, R. Motwani, T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. 1999	Soumya Sanyal	slides
2/24 (Thu)	J. Kleinberg: Authoritative Sources in a Hyperlinked Environment. Journal of the ACM 46 (1999)	Raman Adaikkalavan	slides
3/01 (Tue)	Sanjay Agrawal, Surajit Chaudhuri, Gautam Das: DBXplorer: A System For Keyword-Based Search Over Relational Databases. ICDE 2002	Bhushan Chaudhari	slides
3/03 (Thu)	Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan: Keyword Searching and Browsing in Databases using BANKS. ICDE 2002	Ahn Seoyoung	slides
3/08 (Tue)	V. Hristidis, L. Gravano, Y. Papakonstantinou: Efficient IR-Style Keyword Search over Relational Databases. VLDB 2003	Jing Chen
3/10 (Thu)	R. Goldman, N. Shivakumar, S. Venkatasubramanian, H. Garcia-Molina: Proximity Search in Databases. VLDB 1998	Arjun Saraswat	slides
3/14 - 3/20	*Spring Vacation*
3/22 (Tue)	Sanjay Agrawal, Surajit Chaudhuri, Gautam Das, Aristides Gionis: Automated Ranking of Database Query Results. CIDR 2003	Parin Sangoi
3/24 (Thu)	Surajit Chaudhuri, Gautam Das, Vagelis Hristidis, Gerhard Weikum: Probabilistic Ranking of Database Query Results. VLDB 2004	Weimin Hi	slides
3/29 (Tue)	F. Geerts, H. Mannila, E. Terzi: Relational Link-Based Ranking. VLDB 2004	Nishant Kapoor
3/31 (Thu)	Kaushik Chakrabarti, Surajit Chaudhuri, Seung-won Hwang: Automatic Categorization of Query Results. SIGMOD Conference 2004	Arjun Saraswat	slides
4/05 (Tue)	Ronald Fagin, Amnon Lotem, Moni Naor: Optimal Aggregation Algorithms for Middleware. PODS 2001	Nishant Kapoor
4/07 (Thu)	Nicolas Bruno, Luis Gravano, Amelie Marian: Evaluating Top-k Queries over Web-Accessible Databases. ICDE 2002	Bhushan Chaudhari	slides
4/12 (Tue)	R. Fagin, Ravi Kumar and D. Sivakumar: Comparing top k lists. SIAM J. Discrete Mathematics 17, 1 (2003)	Soumya Sanyal
4/14 (Thu)	Vagelis Hristidis, Nick Koudas, Yannis Papakonstantinou: PREFER: A System for the Efficient Execution of Multi-parametric Ranked Queries. SIGMOD 2001	Ahn Seoyoung
4/19 (Tue)	Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid: Supporting Top-k Join Queries in Relational Databases. VLDB 2003	Jing Chen	slides
4/21 (Thu)	Class projects
4/26 (Tue)	Class projects
4/28 (Thu)	Class projects
5/03 (Tue)	Class projects
5/05 (Thu)	Class projects