Legacy Range index
(2Q19)
This article describes eXist-db's old range index. This has been replaced by a redesigned range index (since eXist 2.2). The old index is still available for compatibility reasons. Its use is discouraged.
Introduction
Range indexes provide a shortcut for the database to directly select nodes based on
their typed values. They are used when matching or comparing nodes by way of standard
XPath operators and functions. Without a range index, comparison operators like
=
, >
or <
will default to a "brute-force"
inspection of the DOM, which can be extremely slow if eXist-db has to search through
maybe millions of nodes: each node has to be loaded and cast to the target type.
To see how range indexes work, consider the following fragment:
<items>
<item n="1">
<name>
Tall Bookcase
</name>
<price>
299.99
</price>
</item>
<item n="2">
<name>
Low Bookcase
</name>
<price>
199.99
</price>
</item>
</items>
With this short inventory, the text nodes of the <price>
elements have dollar
values expressed as a floating-point number, (e.g. "299.99"), which has an XML Schema Definition (XSD) data
type of xs:double
. Using this built-in type to define a range index,
we can improve the efficiency of searches for <price>
values. During indexing,
eXist-db will apply this data type selection by attempting to cast all <price>
values as double floating point numbers and add appropriate values to the index. Values
that cannot be cast as double are ignored. This range index will be used by any
expression that compares <price>
to an xs:double
value. For
instance:
//item[price > 100.0]
For non-string data types, the range index provides the query engine with a more
efficient method of data conversion. Instead of retrieving the value of each selected
element and casting it as a xs:double
type, the engine can evaluate
the expression by using the range index as a form of lookup index. Without an index,
eXist-db has to do a full scan over all price <price>
elements, retrieve the
string values of their text node and cast them to a double number. This is a
time-consuming process which also scales very badly with growing data sets. With a
proper index, eXist-db needs just a single index lookup to evaluate price =
100.0
. The range expression price > 100.0
is processed
with an index scan starting at 100.
For string data, the index will also be used by the standard functions
fn:contains()
, fn:starts-with()
,
fn:ends-with()
and fn:matches()
.
To illustrate this functionality, let's return to the previous example. If you define
a range index of type xs:string
for element <name>
, a query
on this element to select tall bookcases using fn:matches()
will be
supported by the following index:
//item[fn:matches(name, '[Tt]all\s[Bb]')]
Note that fn:matches
will by default try to match the regular
expression anywhere in the string. We can thus speed up the query
dramatically by using "^" to restrict the match to the start of the string:
//item[fn:matches(name, '^[Tt]all\s[Bb]')]
If you really need to search for an exact substring in a longer text sequence, it
is often better to use the NGram index instead of the range index, i.e. use
ngram:contains()
instead of fn:contains()
.
Unfortunately, there's no equivalent NGram function for
fn:matches()
.
In general, three conditions must be met in order to optimize a search using a range index:
-
The range index must be defined on all items in the input sequence.
For example, suppose you have two collections in the database: C1 and C2. If you have a range index defined for collection C1, but your query happens to operate on both C1 and C2, then the range index would not be used. The query optimizer selects an optimization strategy based on the entire input sequence of the query. Since, in this example, since only nodes in C1 have a range index, no range index optimization would be applied.
-
The index data type (first argument type) must match the test data type (second argument type).
In other words, with range indexes, there is no promotion of data types (i.e. no data type precedes or replaces another data type). For example, if you defined an index of type
xs:double
on<price>
, a query that compares this element's value with a string literal would not use a range index, for instance://item[price = '1000.0']
In order to apply the range index, you would need to cast the value as a type
xs:double
, i.e.://item[price = xs:double($price)] (where $price is any test value)
Similarly, when we compare
xs:double
values withxs:integer
values, as in, for instance://item[price = 1000]
the range index would again not be used since the
<price>
data type differs from the test value type, although this conflict might not seem as obvious as it is with string values. -
The right-hand argument has no dependencies on the current context item.
That is, the test or conditional value must not depend on the value against which it is being tested. For example, range indexes will not be applied given the following expression:
//item[price = self]
Concerning range indexes on strings there's another restriction to be considered: up to version 1.3, range indexes on strings can only be used with the default Unicode collation. Also, string indexes will always be case sensitive (while n-gram and full text indexes are not). It is not (yet) possible to define a string index on a different collation (e.g. for German or French) or to make it case-insensitive.
Range index configuration
Range index configuration is done in collection.xconf
documents (see
indexing):
<!-- Range indexes -->
<create qname="title" type="xs:string"/>
<create qname="author" type="xs:string"/>
<create qname="year" type="xs:integer"/>
<!-- "old" context-dependant configuration using the path attribute: -->
<create path="//booktitle" type="xs:string"/>
A range index is configured by adding a <create>
element directly below the
root <index>
element. As explained above, the node to be indexed is either
specified through a path
or a qname
attribute.
Unlike the new range index, the
<create>
elements of the old range index are not
wrapped inside a <range>
tag.
As range indexes are type specific, the type
attribute is always
required. The type should be one of the atomic XML schema types, currently including
xs:string
, xs:integer
and its derived types
xs:double
and xs:float
,
xs:boolean
and xs:dateTime
. If the name of the
type is unknown, the index configuration will be ignored and you will get a warning
written into the logs.
Please note that the index configuration will only apply to the node specified via the
path
or qname
attribute, not to descendants of
that node. Consider a mixed content element like:
<mixed>
<s>
un
</s>
<s>
even
</s>
</mixed>
If an index is defined on <mixed>
, the key for the index is built from the
concatenated text nodes of element <mixed>
and all its descendants, so
uneven
in this case. The created index will only be used to evaluate
queries on <mixed>
, but not for queries on <s>
. However, you can
create an additional index on <s>
without getting into conflict with the
existing index on <mixed>
.
Configuration by path vs. configuration by qname
It is important to note the difference between the path
and
qname
attributes used in above examples. Both attributes are used to
define the elements or attributes to which the index should apply. However, the
path
attribute creates context-dependant
indexes, while the qname
attribute does not.
The path
attribute takes a simple path expression: <create path="//book/title" type="xs:string"/>
The path
expression looks like XPath, but it's not. Index path syntax
uses the following components to construct paths:
-
Elements are specified by their qname
-
Attributes are specified by
@attributeName
, so if the attribute is calledattrib1
, one uses@attrib1
in the index specification. -
Child nodes are selected using the forward-slash (
/
) -
All descendant nodes in a tree are selected using the double forward-slash (
//
)
The example above creates a range index of type string on all <title>
elements which are children of <book>
elements, which may occur at an arbitrary
position in the document tree. All other <title>
elements, e.g. those being
children of <section>
nodes, are not indexed. The path expression thus defines
a selective index, which is also
context-dependant: we always need look at the context of each
<title>
node before we can determine if this particular title is to be indexed
or not.
This kind of context-dependant index definition helps to keep the index small, but unfortunately it makes it hard for the query optimizer to properly rewrite the expression tree without missing some nodes. The optimizer needs to make an optimization decision at compile time, where the context of an expression is unknown or at least not exactly known (read the blog article to get the whole picture). This means that some of the optimization techniques can not be applied to context-dependant indexes!
We therefore had to introduce an alternative configuration method which is not
context-dependant. To keep things simple, we decided to define the index on the
qname of the element or attribute alone and to ignore the context
altogether: <create qname="title" type="xs:string"/>
This results in an index being created on every <title>
element found in the
document node tree. Section titles will be indexed as well as chapter or book titles.
Indexes on attributes are defined as above by prepending "@" to the attribute's
name: <create qname="@type" type="xs:string"/>
This defines an index on all attributes named "type", but not on elements with the same name.
Defining indexes on qnames may result in a considerably larger index, but it allows the query engine to apply all available optimization techniques, which can improve query times by an order of magnitude. As always, there's a trade-off here between performance and storage space. In many cases, the performance win can be dramatic enough to justify the increase in index size.
Important:
To be on the safe side, and to benefit from current and future improvements in the
query engine, you should prefer qname
over
path
, unless you really need to exclude certain nodes from
indexing.