Thursday, June 18, 2009

Triple Stores Aren't

Once a thing has acquired a name, it rarely escapes that name, even if the underlying concept has changed so much that the name is no longer accurate. Only when the name causes misunderstandings will people start to adopt a new, more accurate one. I am trained as an engineer, but I know very little about engines; so far that has never caused any problems for me. It's sometimes funny when someone worries about getting lead poisoning from a pencil lead, but it doesn't cause great harm. It's no big deal that there's hardly any nickel in nickels. Columbus "discovered" the "Indians" in 1492; we've known for a long time that these people were not in India, but it's only recently that we've started using the more respectful and more accurate term "Native Americans".

I'm going to see some old friends this evening, and I'm sure they'll be pretty much how I remember them, but I'll really notice how the kids have grown. That's what this week has been like for me at the Semantic Technology Conference. I've not really worked in the semantic technology area for at least 7 years (though I've been making good use of its ideas), but a lot of the issues and technologies were like old friends, wiser and more complex. Being away for a while makes me very aware of how things have changed, in ways that people who have been in the field the whole time might not be conscious of, because the change has occurred gradually. One of the things I've noticed also involves a name that's no longer accurate. It might confuse newcomers to the field, and may even cause harm by lulling people into thinking they know something that isn't true. It's the fact that triple stores are no longer triple stores.

RDF (subject, predicate, object) triples are the "atom" of knowledge in a semantic-technology information store. One of the foundational insights of semantic technology is that there is great flexibility and development efficiency to be gained by moving data models out of relational database table designs and into semantic models. Once you've done that, you can use very simple 3-column tables to store the three pieces of the triples. You need to do much more sophisticated indexing, but it's the same indexing for any data model. Thus, the triple store.
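To make the "3 columns plus generic indexing" idea concrete, here is a minimal in-memory sketch. The `TripleStore` class, its method names, and the `ex:` identifiers are all made up for illustration; real stores use far more elaborate indexes, but the point is that the same three indexes serve any data model.

```python
from collections import defaultdict


class TripleStore:
    """Toy triple store: a set of triples plus one index per column."""

    def __init__(self):
        self.triples = set()
        self.by_subject = defaultdict(set)
        self.by_predicate = defaultdict(set)
        self.by_object = defaultdict(set)

    def add(self, subject, predicate, obj):
        # The same three indexes are maintained no matter what the data means.
        triple = (subject, predicate, obj)
        self.triples.add(triple)
        self.by_subject[subject].add(triple)
        self.by_predicate[predicate].add(triple)
        self.by_object[obj].add(triple)

    def match(self, subject=None, predicate=None, obj=None):
        # None acts as a wildcard; bound positions are looked up in their index.
        pools = []
        if subject is not None:
            pools.append(self.by_subject.get(subject, set()))
        if predicate is not None:
            pools.append(self.by_predicate.get(predicate, set()))
        if obj is not None:
            pools.append(self.by_object.get(obj, set()))
        if not pools:
            return set(self.triples)
        return set.intersection(*pools)


store = TripleStore()
store.add("ex:nickel", "ex:madeOf", "ex:copper")
store.add("ex:nickel", "ex:type", "ex:Coin")
store.add("ex:pencilLead", "ex:madeOf", "ex:graphite")

made_of = store.match(predicate="ex:madeOf")
nickel_facts = store.match(subject="ex:nickel")
```

Notice that nothing in the store knows what a coin or a pencil is; the "schema" lives entirely in the data.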

As I discussed in my "snotty" rants on reification, relying on just the triples keeps you from doing many things that you need to do in many types of problems. It's much more natural to treat the triple as a first-class object, either by reification or by objectification (letting the triple have its own identifier). What I've learned at this conference is that all the triple stores in serious use today use more than 3 columns to store the triples. Instead of triples, RDF atoms are now stored as 4-tuples, 5-tuples, 6-tuples or 7-tuples.

Essentially all the semantic technology information stores use at least one extra column for a graph id (used to identify the graph that a particular triple is part of). At the conference, I was told that this is needed in order to implement the contextual part of SPARQL. (FROM NAMED, I assume. Note to self: study SPARQL on the plane going home!) In addition, some of the data stores have a triple id column. In a post on the Freebase Blog, Scott Meyer reported that Freebase uses tuples which have IDs, 6 "primitives" and "a few odds and ends" to store an RDF "triple" (the pieces which store the triple are called left, right, type and value). Freebase is an append-only data store, so it needs to keep track of revisions, and it also tracks the creator of the tuple.
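Here is a rough sketch of what a 5-column "triple" table might look like, using Python's built-in `sqlite3`. The table name, column names, and sample data are my own assumptions for illustration, not the layout of any product I saw at the conference (and certainly not Freebase's). It shows why the graph column matters: two graphs can say different things about the same subject, and a query can be scoped to one of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE statements (
        id        INTEGER PRIMARY KEY,  -- statement id (objectification)
        graph     TEXT NOT NULL,        -- which graph the statement belongs to
        subject   TEXT NOT NULL,
        predicate TEXT NOT NULL,
        object    TEXT NOT NULL
    )
    """
)

rows = [
    ("ex:columbusLog", "ex:columbus", "ex:landedIn", "India"),
    ("ex:modernAtlas", "ex:columbus", "ex:landedIn", "The Bahamas"),
]
conn.executemany(
    "INSERT INTO statements (graph, subject, predicate, object) VALUES (?, ?, ?, ?)",
    rows,
)


def objects_in_graph(conn, graph, subject, predicate):
    """Return the objects asserted for (subject, predicate) within one graph only."""
    cur = conn.execute(
        "SELECT object FROM statements "
        "WHERE graph = ? AND subject = ? AND predicate = ? ORDER BY id",
        (graph, subject, predicate),
    )
    return [row[0] for row in cur]


def objects_anywhere(conn, subject, predicate):
    """Same lookup with the graph column ignored, as a pure triple store would do."""
    cur = conn.execute(
        "SELECT object FROM statements "
        "WHERE subject = ? AND predicate = ? ORDER BY id",
        (subject, predicate),
    )
    return [row[0] for row in cur]


atlas_view = objects_in_graph(conn, "ex:modernAtlas", "ex:columbus", "ex:landedIn")
merged_view = objects_anywhere(conn, "ex:columbus", "ex:landedIn")
```

With the graph column, you can ask what one particular source says; without it, the two claims are just mixed together.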

Is there anything harmful in the misnomerization of "triple", enough for the community to try their best to start talking about "tuples"? I think there is. Linked Data is the best example of how a focus on the three-ness of triples can fool people into sub-optimal implementations. I heard this fear expressed several times during the conference, although not in those words. More than once, people expressed concern that once data had been extracted via SPARQL and gone into the Linked Data cloud, there was no way to determine where the data had come from, what its provenance was, or whether it could be trusted. They were absolutely correct, if the implementation was such that the raw triple was allowed to separate from its source. If there were a greater understanding of the un-three-ness of real RDF tuplestores, then implementers of Linked Data would be more careful not to obliterate the id information that could enable trust and provenance. I come away from the conference both excited by Linked Data and worried that the Linked Data promoters seemed to brush off this concern.
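The provenance worry can be shown in a few lines. This is only a sketch under my own assumptions: the `Quad` type, the function names, and the `example.org` sources are hypothetical, and real Linked Data publishing involves much more than this. The point is just that stripping a 4-tuple down to a triple throws away the one field that could answer "who said so?".

```python
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    source: str  # stands in for the graph id / provenance column


quads = [
    Quad("ex:nickel", "ex:madeOf", "ex:copper", "http://example.org/source-a"),
    Quad("ex:nickel", "ex:madeOf", "ex:copper", "http://example.org/source-b"),
]


def strip_to_triples(quads):
    """Lossy export: keep only the three 'triple' fields."""
    return {(q.subject, q.predicate, q.obj) for q in quads}


def sources_for(quads, triple):
    """Provenance lookup, possible only while the fourth field is still attached."""
    return sorted(
        {q.source for q in quads if (q.subject, q.predicate, q.obj) == triple}
    )


triples = strip_to_triples(quads)
provenance = sources_for(quads, ("ex:nickel", "ex:madeOf", "ex:copper"))
```

After `strip_to_triples`, the two independent assertions collapse into one anonymous triple, and there is nothing left from which `sources_for` could be rebuilt.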

I'll write some more thoughts from the conference tomorrow, after I've googled a few things with Bing.
