I recently became aware that the implementation of arbitrary-length property paths I created for Sesame’s SPARQL engine suffered a serious flaw: it didn’t scale. At all. The chosen strategy (see my earlier blog post on this topic) of evaluating paths of ever-increasing length using multi-joins slowed down to a crawl (through tar) on any serious dataset at about length 7 or 8. Queries involving such paths took 15 minutes or more to complete. Clearly, this would need a little improvement.

Looking at the implementation, it turned out there were two main performance bottlenecks:

  1. complete re-evaluation of all paths of length (n – 1) whenever we expand the path length
  2. a computationally heavy cycle-detection check

After fiddling about with some ineffectual half-measures for one or both of these problems, I decided to take a step back and completely rethink the algorithm. It occurred to me that with a proper breadth-first search implementation we could eliminate the computationally expensive multi-join altogether (and, even more importantly, the hugely inefficient repeat performances on known paths for every increase of the length). We could simply use a FIFO queue to push intermediate values onto, and keep doing simple lookups of patterns of length 1.

Let me clarify with an example. Suppose you have the following path-expression:

  ex:alice foaf:knows+ ?x 

In the naive join-based algorithm, this would be evaluated by first looking up all paths of length 1:

  BGP(ex:alice, foaf:knows, ?x)

If the result of this is not empty, we continue by looking up paths of length 2, adding a filter-condition to guard against cycles:

  Filter(And(?a1 != ex:alice, ?a1 != ?x),
     Join(BGP(ex:alice, foaf:knows, ?a1),
          BGP(?a1, foaf:knows, ?x)))

I won’t bore you with the details for paths of length 3 and further, but I trust you appreciate that this quickly becomes quite big and complex. Also, notice that the expression for length 2 contains BGP(ex:alice, foaf:knows, ?a1), which is essentially the same expression we already evaluated for paths of length 1. We already know the answers to this bit, yet we still create a new join that re-evaluates the damn thing, again and again, for every increase of the length.

Enter the new algorithm. I got rid of joins completely. Instead, I introduced a FIFO queue of values. We iterate over results for length 1, reporting all resulting solutions and putting all bindings found for ?x onto the queue. Then, when all results for length 1 are exhausted, we examine the queue. If it is not empty, we pop the first value and create a new BGP in which we replace the start with the popped value. We iterate over solutions to this BGP, and again we report all solutions and add all bindings of ?x thus found to the queue. We continue in this fashion until the last iteration is exhausted and the queue is empty.

This is really classic breadth-first search, so no new wheels have been invented here, but it is amazing (although in retrospect really quite understandable) how much more performant this graph traversal mechanism is than the earlier join-based approach. One experiment I did (using the DBPedia category thesaurus) was the following query:

   ?sub skos:broader+ <>

This query (which effectively retrieves all subcategories of Category:Business) took over 15 minutes to complete in the original approach. After switching to the new BFS-based approach, it completes in roughly 120ms.

One thing I haven’t touched on yet in this new implementation is how to detect cycles. In the original approach, we retrieved solutions that encompassed the complete path (joins with intermediate variables) and we used filters to check that all of the intermediate vars are pairwise distinct. Clearly, we can’t do that in this approach, because we no longer have all these intermediate vars available at once.

The solution is actually rather simple, if you stop and think about what a cycle actually is: a cycle is a path where at the end you have gotten back to where you started (in other news: water is quite wet). Therefore, you simply have to mark where you’ve been before (that is, which solutions you’ve previously reported), and you’re home free.

A list internally records all reported combinations of the start value (ex:alice) and the end value (each binding of ?x) of our path. Whenever we find a new solution during iteration, we check if the reported combination already exists in that list, and if so, we discard it.
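The queue-based traversal plus this reported-combinations check can be sketched in a few lines of Python (a minimal sketch of the idea, not the actual Sesame Java code; the graph is modelled as a set of (subject, predicate, object) triples, and I use a set rather than a list for the reported combinations):

```python
from collections import deque

def arbitrary_length_path(graph, start, pred):
    """Evaluate `start pred+ ?x` by breadth-first traversal: pop a node,
    look up all patterns of length 1 from it, report each (start, end)
    combination at most once, and push new endpoints onto the queue."""
    reported = set()            # (start, end) combinations already reported
    queue = deque([start])      # FIFO queue of intermediate values
    while queue:
        node = queue.popleft()
        for s, p, o in graph:   # simple lookup of patterns of length 1
            if s == node and p == pred and (start, o) not in reported:
                reported.add((start, o))
                yield o                 # a binding for ?x
                queue.append(o)         # expand the path from here later
```

On a small cyclic example – Alice knows Bob, Bob knows Carol, Carol knows Alice – this reports each of the three people exactly once; the reported set doubles as the cycle guard, since a node we have already reported is never enqueued again.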

Why not just record the end values? Well… think about it: do all path expressions start with a fixed start point? If the subject of the path is a variable, the same end value can occur in solutions for several different start values, so we need the combination to tell them apart.

This approach to cycle detection has two consequences:

  • Cycle detection has become a cheap lookup in a list.
    This is quick, but puts a burden on memory. In practice, however, the result set of a property path stays small enough to keep such a list around without problems.
  • We now automatically filter out duplicate results.

The second consequence is really useful in practice: the old approach reported back all possible paths of all possible lengths, so if, for example, Bob is a friend of Alice but also a friend of a friend of Alice, we would get Bob back twice – clearly redundant information. In the new algorithm, Bob is only reported once, saving bandwidth and computation time, and sparing the user a lot of hoop-jumping to get unique results.

The new and improved algorithm will be available in the next Sesame release (2.6.1).

It’s not yet up on the news page (mainly due to New Zealand being so far ahead of Europe – ha!), but Sesame 2.5.1 is now available for download.

This is a maintenance release that fixes a number of issues in the newly developed SPARQL 1.1 Query/Update support. Additionally, the Rio parser has improved support for validation of XML Schema datatypes and language tags.

For a full overview of changes, see the ChangeLog.

We are very pleased to announce the release of Sesame 2.5.0. This is a major new release with some very exciting features:

  • SPARQL 1.1 Query Language support
    Sesame 2.5 features near-complete support for the SPARQL 1.1 Query Language Last Call Working Draft, including all new builtin functions and operators, improved aggregation behavior, and more.
  • SPARQL 1.1 Update support
    Sesame 2.5 has full support for the new SPARQL 1.1 Update Working Draft. The Repository API has been extended to support the creation of SPARQL Update operations, and the SAIL API has been extended to allow Update operations to be passed directly to the underlying backend implementation for optimized execution. The Sesame Workbench application has also been extended to allow easy execution of SPARQL Update operations on your repositories.
  • SPARQL 1.1 Protocol support
    Sesame 2.5 fully supports the SPARQL 1.1 Protocol for RDF Working Draft. The Sesame REST protocol has been extended to allow update operations via SPARQL on repositories. A Sesame server therefore now automatically publishes any repository as a fully compliant SPARQL endpoint.
  • Binary RDF support
    Sesame 2.5 includes a new binary RDF serialization format, derived from the existing binary tuple results format. Its main features are reduced parsing overhead and minimal memory requirements (for handling really long literals, among other things).

Sesame 2.5 SPARQL improvements have been made possible by Ontotext, in cooperation with Aduna. We hope you enjoy the new features and look forward to receiving your feedback.

See the release notes for a full overview of all changes.

The Sesame framework comes with a pre-packaged web service (often referred to as the Sesame server). This web service enables deployment of Sesame as an online RDF database server, with multiple SPARQL query endpoints and full update capabilities.

In the default setup of the Sesame server, however, there is no security at all: everybody can access all available Sesame repositories, can query them, add data, remove data, and even change the server configuration (e.g. creating new databases or removing existing ones). Clearly this is not a desirable setup for a server which is publicly accessible.

Fortunately, it is possible to restrict access to a Sesame server, using standard Java web application technology: the Deployment Descriptor.

In this recipe, we will look at setting up some basic security constraints for a Sesame server, using the web application’s deployment descriptor, in combination with basic HTTP authentication. For the purpose of this explanation, we assume you have a default Sesame server running on Apache Tomcat 6.

Sesame HTTP REST protocol

The Sesame server implements a RESTful protocol for HTTP communication (a superset of the SPARQL protocol). What this comes down to is that the protocol defines a specific location, reachable via a specific URL using a specific HTTP method (GET, POST, etc.), for each repository and each operation on a repository.

This is good news for our security, as it means we can easily reuse HTTP-based security restrictions: since each operation on a repository maps to a specific URL and a specific method, we can have fairly fine-grained access control by simply restricting access to particular URL patterns and/or HTTP methods.
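For example, a deployment descriptor fragment along the following lines (the role name and realm are illustrative, and the URL pattern assumes the default layout of the Sesame webapp) would require BASIC authentication for all write operations on repositories, while leaving GET-based querying open:

```xml
<security-constraint>
  <web-resource-collection>
    <web-resource-name>Repository write operations</web-resource-name>
    <url-pattern>/repositories/*</url-pattern>
    <!-- only these methods are restricted; GET stays open for querying -->
    <http-method>POST</http-method>
    <http-method>PUT</http-method>
    <http-method>DELETE</http-method>
  </web-resource-collection>
  <auth-constraint>
    <role-name>rdf-writer</role-name>
  </auth-constraint>
</security-constraint>

<login-config>
  <auth-method>BASIC</auth-method>
  <realm-name>Sesame server</realm-name>
</login-config>

<security-role>
  <role-name>rdf-writer</role-name>
</security-role>
```

Tomcat then needs a user with the rdf-writer role in its realm configuration; anonymous users can still query, but any attempt to add or remove data triggers an authentication challenge.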

The SPARQL query language is extensible by nature: it allows implementors to define their own custom operators if the standard set of operators is not sufficient for the needs of some application.

Sesame’s SPARQL engine has been designed with this extensibility in mind: it allows you to define your own custom functions and use them in your SPARQL queries like any other function. In this new recipe, I’ll show how to create a simple custom function. Specifically, we are going to implement a boolean function that detects whether some string literal is a palindrome.
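As a sketch of the logic such a function would implement (in Python rather than Sesame’s Java API, purely to illustrate the check itself; treating it as case- and whitespace-insensitive is my own assumption):

```python
def is_palindrome(literal: str) -> bool:
    """True if the string reads the same forwards and backwards,
    ignoring case and whitespace (an assumption for this sketch)."""
    s = "".join(literal.lower().split())
    return s == s[::-1]
```

In the actual recipe this logic lives inside an implementation of the engine’s custom-function interface, which receives the string literal as an argument value and returns a boolean literal to the query engine.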

I am very proud to announce the release of Sesame 2.4.0. This is a major new release featuring support for SPARQL 1.1 Query.

Sesame 2.4.0 implements all features of SPARQL 1.1 Query as outlined in the October 14 W3C Working Draft, with the exception of basic federated query. SPARQL 1.1 for Sesame is developed by Ontotext in cooperation with Aduna.

The list of new SPARQL query features includes:

  • Use of expressions in the SELECT clause
  • Property paths
  • Subqueries
  • Negation: (NOT) EXISTS and MINUS
  • Set membership: (NOT) IN
  • Conditionals: IF
  • Various new builtin functions: COALESCE, BNODE, IRI, isNumeric, strLang, strDt

Apart from this impressive array of new query language features, Sesame 2.4.0 also implements a number of bug fixes and improvements, including scalability and performance improvement in the Native store. For a full overview, see the release notes.

I’ve just finished the implementation of SPARQL 1.1 property paths in Sesame. The major thing still missing was the implementation of negated property sets.

Negated property sets enable you to formulate a query like, for example: “give me back two resources x and y which are related in any direction via some property, but not via foaf:knows”. In SPARQL 1.1, this would look like:

   SELECT ?x ?y
   WHERE { 
           ?x !(foaf:knows|^foaf:knows) ?y .
   }

The current SPARQL 1.1 draft defines a new abstract syntax symbol for negated property sets, called NegatedPropertySet. The SPARQL algebra, in turn, maps this abstract symbol directly to a new algebraic operator: the algebra is extended with an additional operator in order to support negated property sets, together with a specific evaluation semantics for that new operator.

However, although perhaps useful in terms of brevity, it is in fact not necessary to extend the algebra in this way. Negated property sets do not actually add expressivity to the language (in contrast to, for example, arbitrary-length property paths): the above query could equally have been formulated in SPARQL 1.0:

   SELECT ?x ?y
   WHERE { 
       { ?x ?p1 ?y . FILTER (?p1 != foaf:knows) }
       UNION
       { ?y ?p2 ?x . FILTER (?p2 != foaf:knows) }
   }

This simple fact makes it possible to implement negated property sets without having to extend Sesame’s query model with an additional algebra operator. The advantage of this is that all of Sesame’s existing query optimizing/rewriting/evaluation strategies can immediately handle negated property sets, without having to be ‘recalibrated’ or indeed extended to take a complex additional operator into account.

So, the SPARQL parser processes a negated property set and translates it to the necessary collection of Joins, Filter comparisons and Unions. The algorithm is roughly as follows:

 let NPS be a negated property set with elements e_1...e_n.
 let s be the subject variable of the NPS.
 let ap be the (anonymous) predicate variable of the NPS.
 let O be the set of object variables of the NPS. 
 let F and F_i be two sets of filter conditions.
 let p(e) be the predicate IRI of e.

 for each e in NPS :
    create a filter condition f: p(e) != ap .
    if e is inverted: add f to F_i, otherwise add f to F .
 let J be a Join on basic graph patterns. 
 if F is not empty:
    for each o in O :
       add BGP(s, ap, o) to J .

 let I be a Join on basic graph patterns. 
 if F_i is not empty:
    for each o in O :
       add BGP(o, ap, s) to I .
 if I and J are both not empty:
    return Union(Filter(J, F), Filter(I, F_i)) .
 else if I is not empty :
    return Filter(I, F_i).
 else if J is not empty :
    return Filter(J, F).

The end result of applying this algorithm to the example SPARQL 1.1 query we saw above would be the following (slightly adapted for readability) Sesame query algebra expression:

Projection({x, y}, 
   Union(
      Filter(StatementPattern(y, ap, x), Compare(!=, ap, foaf:knows)),
      Filter(StatementPattern(x, ap, y), Compare(!=, ap, foaf:knows))))

Short and sweet. Of course, it gets less short and sweet with more complex property sets, or with property sets combined with other property path features, but the algorithm caters to that: such more complex expressions simply result in a larger set of unions and joins.
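As a sanity check, the rewriting step can be sketched in Python (not the actual Sesame parser code; expressions are modelled as plain nested tuples, and the anonymous predicate variable name is made up for illustration):

```python
def rewrite_negated_property_set(subj, obj_vars, elements, pred_var="_ap"):
    """Rewrite a negated property set into unions, joins, and filters.

    `elements` is a list of (predicate_iri, inverted) pairs; `obj_vars`
    holds the object variable(s) of the path in which the set occurs.
    """
    fwd_filters, inv_filters = [], []           # F and F_i in the text
    for iri, inverted in elements:
        cond = ("!=", pred_var, iri)            # f: p(e) != ap
        (inv_filters if inverted else fwd_filters).append(cond)

    # J holds forward patterns, I holds inverted patterns
    J = [("BGP", subj, pred_var, o) for o in obj_vars] if fwd_filters else []
    I = [("BGP", o, pred_var, subj) for o in obj_vars] if inv_filters else []

    if J and I:
        return ("Union",
                ("Filter", fwd_filters, ("Join", J)),
                ("Filter", inv_filters, ("Join", I)))
    if I:
        return ("Filter", inv_filters, ("Join", I))
    return ("Filter", fwd_filters, ("Join", J))
```

For the example !(foaf:knows|^foaf:knows) this produces a union of two filtered patterns, matching the shape of the algebra expression above.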

I have more or less finished implementing parsing and basic evaluation for SPARQL 1.1 property paths in Sesame. The only thing still missing is negated property sets. Negated sets of properties… Who the heck wants that? Honestly.

Anyway. Apart from the fact that the property path syntax is not quite simple to parse, there are some concepts involved which make evaluation of property-paths tricky – especially if arbitrary-length paths are involved.

Arbitrary-length paths are paths specified with a ‘+’ or ‘*’ modifier. For example, the path ?x foaf:knows+/foaf:name ?name specifies a match where either ?x knows someone with name ?name, or knows someone who knows someone with name ?name, or… you get the point. The trouble with such queries in ‘traditional’ evaluation is that there is no a priori length: you cannot simply have the SPARQL parser build a fixed set of joins.

A path of fixed length n will usually be translated by the parser to n-1 joins on individual statement patterns. In order to allow evaluation of non-fixed length paths, we need something that amounts to graph traversal. Fortunately, Sesame’s default evaluation strategy (essentially lazy iteration over bindings) is well suited for this task. What I have done is the following:

  1. the SPARQL parser translates an arbitrary-length property path to a new algebra operator, called (rather appropriately I thought) ArbitraryLengthPath.
  2. the default EvaluationStrategy for this operator is based on a dynamically expanding iterator, which creates new joins while being evaluated.

Effectively, this new iterator (called PathIteration, currently located as an inner class in EvaluationStrategyImpl) implements a depth-first graph traversal strategy. It reports results iteratively and expands the path sequence length (using an ever-increasing iterative creation of nested joins) until it reaches a length at which the nested join no longer produces any matches. Cycle detection is implemented by the simple expedient of adding a boolean comparison operator (NEQ) on top of the join: if the start node of the path has the same value as the end node, we have a cycle.
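A rough Python sketch of this evaluation strategy (not the Java implementation; the graph is a set of (subject, predicate, object) triples, and a solution here is the complete path rather than just the end binding – note that for brevity the sketch keeps the already-found paths around between rounds rather than re-deriving them with fresh multi-joins):

```python
def path_by_expanding_joins(graph, start, pred):
    """Evaluate `start pred+ ?x` by growing nested joins: paths of length
    n + 1 are built by joining one more statement pattern onto each path
    of length n, until no join produces any further matches. Intermediate
    nodes must stay pairwise distinct (the cycle check)."""
    paths = [(start,)]
    while paths:
        extended = []
        for path in paths:
            if len(path) > 1 and path[-1] == path[0]:
                continue  # a path that closed a cycle cannot be extended
            for s, p, o in graph:
                # join the next pattern; reject repeated intermediate nodes
                if s == path[-1] and p == pred and o not in path[1:]:
                    yield path + (o,)
                    extended.append(path + (o,))
        paths = extended
```

Every endpoint is reported once per distinct path, so the same binding for ?x can turn up multiple times – for instance, in a diamond-shaped graph the node at the far end is reported once for each route leading to it.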