Wednesday, June 22, 2005

Metastore Scalability Concerns

I'm sure that I am not the only semhead concerned about scalability issues once we start to pump millions of RDF documents into our datastore. The company that was once behind the open-source Kowari metastore had a commercial offering described as...

The Tucana Knowledge Server has been developed to fill the market need for managing large quantities of RDF statements. Acknowledging the problems that traditional relational database management systems (RDBMSs) have with storing large quantities of RDF data, the Tucana Knowledge Server implements a native RDF database and consists of high-level APIs, a query engine and an underlying data store. TKS is implemented entirely in Java and is a scalable, distributed, secure, transaction-safe database built on a directed graph data model and optimized for searching metadata.

A single instance of the Knowledge Server has been used to store 350 million RDF statements and Tucana continues to improve the engine to maintain its position as the most scalable RDF data store available. Multiple instances of the Tucana Knowledge Server can be combined and treated as a "virtual database", offering another approach towards scalability. Any instance of the Knowledge Server may be used as the entry point for such a "federated" query, and will subsequently query any number of remote servers, collect their intermediate results and join on them to produce a single, coordinated result.

The problem is that Tucana seems to have gone out of business. There are myriad reasons why that might have happened, and I'm trying to find out more about it and about whether it is still viable to base a solution on Kowari. Another question I'd like to put to the community at large is whether it makes sense to set up a hybrid architecture in which low-volume data flows into the metastore but high-volume data flows into a relational database. When we need to run a report based on a semantic query against the metastore, we would first load it with the relevant instance data, culled from the relational database and transformed back into RDF dynamically.
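To make the hybrid idea a little more concrete, here is a rough sketch of the flow, not a real design. It uses Python with an in-memory SQLite database standing in for the relational side and a plain set of (subject, predicate, object) tuples standing in for the metastore. Everything named here is an assumption made up for illustration: the `orders` table, the `http://example.org/` namespace, and the `rows_to_triples` and `match` helpers. None of it reflects Kowari's actual API.

```python
import sqlite3

# Hypothetical namespace, used only for this illustration.
EX = "http://example.org/"


def build_demo_db():
    """Create an in-memory relational store holding the high-volume data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 7, 19.5), (2, 7, 5.0), (3, 8, 42.0)],
    )
    return conn


def rows_to_triples(conn, customer_id):
    """Cull only the rows a report needs and turn them into RDF-style triples."""
    cursor = conn.execute(
        "SELECT id, customer_id, amount FROM orders WHERE customer_id = ?",
        (customer_id,),
    )
    triples = set()
    for order_id, cust, amount in cursor:
        subject = f"{EX}order/{order_id}"
        triples.add((subject, f"{EX}placedBy", f"{EX}customer/{cust}"))
        triples.add((subject, f"{EX}amount", str(amount)))
    return triples


def match(store, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Low-volume data lives in the "metastore" permanently.
metastore = {(f"{EX}customer/7", f"{EX}name", "Acme")}

# At report time, load the relevant instance data from the relational side.
conn = build_demo_db()
metastore |= rows_to_triples(conn, customer_id=7)

# The "semantic query" now sees both kinds of data in one place.
orders = match(metastore, p=f"{EX}placedBy", o=f"{EX}customer/7")
```

The point of the sketch is the shape of the flow: the relational database filters down to the slice a report needs, and only that slice is converted to triples and loaded. Whether the load step is fast enough at real volumes is exactly the open question.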

The side-by-side metastore/database idea also stands to facilitate adoption of this architecture, since it is a lot easier to sell the powers-that-be on a combined solution than it is to convince them to go with a (somewhat) unproven technology.

I had to qualify "unproven" with a parenthetical "somewhat" in the statement above because I believe that this technology is only unproven in the wide commercial arena. It is my understanding that metastores and a lot of the technologies that underlie the Semantic Web are actually quite proven in use by medical research teams and defense contractors.