Publications by Julia Imhof

2008

Evaluation Strategies for XQuery Full-Text

Systems Group Master's Thesis, no. ETH Zürich; Department of Computer Science, January 2008
Supervised by: Prof. Donald Kossmann

More and more repositories, such as the IEEE INEX collection [1], LexisNexis [2], and the Library of Congress collection [3], store their documents in XML format. Searching these XML documents requires efficient and accurate full-text search features. Two approaches can be considered. On the one hand, one could use traditional full-text approaches, but these are not suitable for XML documents because they do not take the structure of the document into account. On the other hand, one could use XQuery and XPath. These languages can express a variety of queries, but queries over XML text content are limited. Full-text search in XQuery is mainly handled by the function fn:contains($e, keyword), which checks whether keyword is contained in the element denoted by $e. This function is too limited for complex queries. For example, it cannot express the following query, which combines phrase matching, distance predicates, and stemming:

Example: The sample document contains books (see Appendix A). Find the books that contain the phrases “Beating the Dealer” with stemming and “WIN STRATEGY” with stemming and case insensitivity, at a distance of at most 10 words from each other.

Before this Master's thesis, MXQuery Full-Text only supported keyword and phrase queries [4]. In this thesis, we extended MXQuery Full-Text with the features of the XPath 2.0 and XQuery 1.0 Full-Text W3C specification [5]. This extension includes an improved store and indexes, namely a stem index, an n-gram index, a next-word index, and a B+ tree index. Additionally, we implemented the MatchOptions that support queries including stemming, the use of thesauri, diacritics sensitivity, case options, and wildcards. To support logical operators, we extended MXQuery Full-Text with an FTAndIterator, an FTOrIterator, and an FTUnaryNotIterator. For queries including positional filters, we added an FTSelectionIterator that tests the positional predicates.

In addition, we reviewed several scoring models in order to design and implement our own scoring method. To test our implementation, we ran the XPath 2.0 and XQuery 1.0 Use Cases [6] and benchmarked performance and memory usage.
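
The example query in the abstract combines three ideas: phrase matching, stemming with case insensitivity, and a distance predicate. The Python sketch below illustrates only the intended semantics on a plain string. It is not MXQuery code and does not use XQuery Full-Text syntax. The names naive_stem, tokenize, phrase_positions, and within_distance are hypothetical, the stemmer is a deliberately crude stand-in for a real one, and counting distance as the number of words between two phrase matches is an assumption made for this illustration.

```python
import re


def naive_stem(word):
    """Toy stemmer (assumption): lowercase, strip a few English suffixes."""
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # "strategies" -> "strategy"
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]  # "beating" -> "beat"
            break
    # Undouble a trailing consonant left by suffix stripping: "winn" -> "win".
    if len(word) >= 3 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]
    return word


def tokenize(text):
    """Split text into alphanumeric words and stem each one."""
    return [naive_stem(t) for t in re.findall(r"[A-Za-z0-9]+", text)]


def phrase_positions(tokens, phrase):
    """Return (start, end) token indexes of every stemmed phrase match."""
    wanted = tokenize(phrase)
    n = len(wanted)
    return [
        (i, i + n - 1)
        for i in range(len(tokens) - n + 1)
        if tokens[i : i + n] == wanted
    ]


def within_distance(text, phrase_a, phrase_b, max_words):
    """True if both phrases occur with at most max_words words between them."""
    tokens = tokenize(text)
    for a_start, a_end in phrase_positions(tokens, phrase_a):
        for b_start, b_end in phrase_positions(tokens, phrase_b):
            if a_end < b_start:
                gap = b_start - a_end - 1
            elif b_end < a_start:
                gap = a_start - b_end - 1
            else:
                continue  # overlapping matches are ignored in this sketch
            if gap <= max_words:
                return True
    return False


if __name__ == "__main__":
    book = "Beating the Dealer presents winning strategies for blackjack."
    print(within_distance(book, "Beating the Dealer", "WIN STRATEGY", 10))
```

A sketch like this scans every document on every query. The indexes listed in the abstract (stem, n-gram, next-word, B+ tree) exist to avoid that kind of scan.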

@mastersthesis{abc,
	abstract = {More and more repositories like the IEEE INEX collection [1], LexisNexis [2]
or the Library of Congress collection [3] store their documents in XML format.
To be able to search over these XML documents, there is a need for efficient
and accurate full-text search features. There are two different approaches that
could be taken into consideration: On one hand, one could use traditional fulltext
approaches, but these are not suitable for XML documents as they do
not take the structure of the document into account. On the other hand, one
could use XQuery and XPath. These languages can express a variety of queries,
but queries over XML text content are limited. Full-text search in XQuery is
mainly handled by the function fn:contains($e,keyword) that checks whether the
keyword keyword is contained in the element denoted by $e. This function is
too limited for complex queries. It cannot express queries e.g. like the following
query that includes phrase matching, distance predicates and stemming:
Example The sample document contains books (see Appendix A). Find the
books that contain the phrases ``Beating the Dealer'' with stemming and ``WIN
STRATEGY'' with stemming and case insensitivity, at a distance of at most 10 words
from each other.
Before this Master's thesis, MXQuery Full-Text only supported keyword and
phrase queries [4]. In this Master's thesis, we extended MXQuery Full-Text with
the features of the XPath 2.0 and XQuery 1.0 Full-Text W3C specification [5].
This extension includes an improved store and indexes, i.e. a stem index, an
n-gram index, a next-word index and a B+ tree index. Additionally, we implemented
the MatchOptions that support queries including stemming, the use of
thesauri, diacritics sensitivity, case options and wildcards. To support logical
operators, we extended MXQuery Full-Text with an FTAndIterator, an FTOrIterator
and an FTUnaryNotIterator. For queries including positional filters, we
added an FTSelectionIterator that tests the positional predicates. In addition,
we had a look at several scoring models to design and implement our own scoring
method. To test our implementation, we ran the XPath 2.0 and XQuery
1.0 Use Cases [6] and did some benchmarking to evaluate the performance and
memory usage.},
	author = {Julia Imhof},
	school = {ETH Z{\"u}rich},
	title = {Evaluation Strategies for XQuery Full-Text},
	year = {2008}
}