The heterogeneous, autonomous, and highly dynamic nature of databases on the Web renders most traditional data management techniques ineffective. Consequently, keeping abreast of relevant information as it evolves in this environment requires a large amount of human effort. The goal of this research is to develop, implement, and evaluate methods for data management in a heterogeneous, autonomous, and dynamic environment. The history of changes made to data is a major focus. Methods for efficiently discovering, storing, browsing, and querying changes are developed. In sharp contrast to traditional database techniques, these methods try to make the best possible use of the structure in data without requiring data to conform to a fixed structure or schema. The educational component includes redesigning the current database curriculum; development of new courses on XML, related Web technologies, and semistructured data; and experimenting with pedagogical tools such as teaching by program dissection (the use of the source code of a large program as an anchor for studying related topics). In addition to graduate students, undergraduate students are exposed to research by their work on the system prototype, which in turn benefits from their efforts. In summary, this research will bring the benefits of database technology (e.g., faster, easier, and more precise information access) to the Web by developing methods that cope with the heterogeneous, autonomous, and dynamic nature of Web databases.
In our effort to efficiently manage changes in semistructured data, we are working on methods for indexing changes to XML data. In the process, we have developed a method for context-sensitive indexing of XML (or similar) data, based on extending inverted files. This method is implemented in the Cextor system, which we have made freely available. Cextor is implemented as a context-sensitive search engine for XML. We have extended the standard boolean queries used by many Information Retrieval systems to include context specifiers. These context specifiers are similar in spirit to those used by some current search engines (e.g., title:databases). However, in contrast to earlier work on indexing such data, the context is data dependent and not known in advance. Further, Cextor does not stop at listing documents matching a given user query. Instead, it uses the contexts of search term occurrences in the documents forming a query result to group those documents into a tree structure. This tree structure is a convenient method for displaying very large search results, which occur frequently when querying the Web. Building on our earlier work in differencing hierarchical data, we have also developed a method for computing the difference between two streams of hierarchical data.
Our recent work has focused on streaming XML data. In this environment, each data item is presented to the system only once, in an order determined by the source. Revisiting a data item is possible only if it is buffered by the system. However, naive buffering methods rarely work well because of the limited buffer space relative to the size of the stream (which may be unbounded). We have implemented the XSQ query engine for evaluating XPath queries on such data. We are now focusing on performance improvements for XSQ and on generalizations to a wider class of queries.
Sudarshan S. Chawathe, "Comparing hierarchical data in external memory", (1999). In Proceedings of the Twenty-fifth International Conference on Very Large Data Bases. September 1999. Edinburgh, Scotland, U.K.
Feng Peng and Sudarshan S. Chawathe, "XPath Queries on Streaming Data", (2003). In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). June 2003. San Diego, California.
Feng Peng and Sudarshan S. Chawathe, "XSQ: Streaming XPath Queries", (2003). Demonstration description. In Proceedings of the International Conference on Data Engineering. March 2003. Bangalore, India.
Feng Peng and Sudarshan S. Chawathe, "XSQ: A Streaming XPath Engine", (2003). Technical report. University of Maryland
Sudarshan S. Chawathe, "Managing Historical XML Data", (2003). Advances in Computers Bibliography: Volume 57, Chapter 3. Elsevier Science.
Sudarshan S. Chawathe, "Semistructured Data in Relational Databases", (2003). In Practical Handbook of Internet Computing. CRC Press.
Thomas Baby and Sudarshan S. Chawathe, "Context-Sensitive Search and Exploration of XML text", (2001). Technical Report. University of Maryland.
Sudarshan S. Chawathe, "Differencing Very Large Data Streams", (2002). Technical Report, University of Maryland.
Sudarshan S. Chawathe, "Tracking Hidden Groups Using Communications", (2003). In Proceedings of the NSF/NIJ Symposium on Intelligence and Security Informatics. June 2003. Lecture Notes in Computer Science (LNCS) 2665. Tucson, Arizona.
Akhil Gupta and Sudarshan S. Chawathe, "XML Stream-Skipping Using XHints", (2003). Technical report. University of Maryland.
Abheek Anand and Sudarshan S. Chawathe, "Disseminating Data in a Peer-to-Peer Network", (2003). Technical report. University of Maryland
Feng Peng and Sudarshan S. Chawathe, "Streaming Evaluation of XPath Queries with Subqueries", (2003). Technical report. University of Maryland
This project has generated new methods for detecting and managing changes in semistructured data, especially XML. The Cextor system is our implementation of a context-aware XML search engine. Another area of focus has been interactive exploration of semistructured data. Here, we have developed methods that provide very fast initial summaries of unknown datasets as well as permitting gradual increase in level of detail. The VQBD system (link below) implements some of these methods. The XSQ XPath query engine for streaming XML data is based on new automaton-based methods for XPath query evaluation. Work on a sister project has resulted in data mining methods implemented in the Seus system (link below).