Weekend Triple Billionaire at SWAT4LS 2009


In November 2009, I was at the SWAT4LS (Semantic Web Applications and Tools for Life Sciences) 2009 conference in Amsterdam. While cleaning up my writing directory, I noticed that I never blogged about this, so here are some seriously late notes.

My colleague Jerven and I presented a poster and a ten-minute highlight talk called “Weekend Triple Billionaire”. It’s about the scaling problems we see at UniProt when working with RDF. Here are the workshop proceedings, and a direct link to the PDF.

According to our (limited) testing and research, current triple stores are not able to store and query our data set of three billion triples. We summarized this in a short paper, written in the style of a problem statement, and a poster. Here’s the abstract:

The UniProt Knowledgebase offers both manually curated and automatically generated information on proteins, and is one of the leading biological databases. While it is one of the largest free data sets that is available in RDF, our infrastructure and website are not based on RDF. We present numbers about the volume and growth of UniProt and show why this volume of data prevents using RDF triple stores and SPARQL with currently available tools.

I think the talk was well received, and we saw a lot of interest in putting up a publicly accessible triple store for UniProt, as it’s one of the most important Life Sciences databases.

Lots of the talks were really interesting. Alan Ruttenberg presented Science Commons and the new CC0 Creative Commons license for scientific data. He also talked about some of the infrastructure behind Neurocommons. I found it interesting that it’s written in a mix of Java, Jython and Common Lisp. Also, their new RDF Herd package manager might solve an issue with the Semantic Web today: it’s like RPM for RDF, and its incremental updates provide clean versioning and more efficient data transfer.

Michael Schroeder from TU Dresden presented some impressive results from GoPubmed. Their text mining apparently has F-scores over 80%, and they are working on generalizing the GoPubmed approach to the whole web via GoWeb. I guess it also helps a lot that their web design and usability are way above the norm for academic projects, as their users are biologists, not computer nerds.

Finally, Barend Mons gave the most speculative and forward-looking talk with his keynote about the Concept Web Alliance. His vision is a future where traditional scientific publications play a minor role in comparison to “nano publications”, which can be as small as one RDF triple and can go live as research progresses. For more info, check their web page, which does a better job of presenting the idea than I can here.
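To make “as small as one RDF triple” a bit more concrete, here is a minimal sketch in Python of what a single subject-predicate-object statement could look like when written out as one N-Triples line. The example.org URIs and the names Triple and to_ntriples are made up for illustration, and this says nothing about how the Concept Web Alliance actually represents nano publications.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement; all three parts are URIs in this simplified sketch."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(t: Triple) -> str:
    # N-Triples puts each URI in angle brackets and ends the statement with " ."
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


# Hypothetical claim: the URIs below are placeholders, not real identifiers.
claim = Triple(
    "http://example.org/protein/P1",
    "http://example.org/vocab/interactsWith",
    "http://example.org/protein/P2",
)

line = to_ntriples(claim)
print(line)
```

The point is just the size: the whole “publication” is a single line of text.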

Of course, personal discussions and getting to know people are maybe the biggest point of such meetings, and it was indeed great to have lively discussions with the attendees. In addition to the aforementioned, it was great to talk to Deyan Peychev from Ontotext, who are doing some serious OWL reasoning with our UniProt RDF. He had valuable suggestions for improving our OWL, some of which are already implemented. It was also a pleasure to meet Erik Antezana of OntoPerl fame.

It was a great day with lots of new inspiration and valuable face time with some of the leading researchers in the field. Hope to see you next year, or next month in Tokyo!