RDF or not in Gen2Phen – 6th Assembly Meeting

This is a personal account and not necessarily my employer’s view.

Until two weeks ago, I had never heard of Gen2Phen. Then my colleague Livia asked me to join her at their 6th general assembly meeting and present something about UniProt in RDF.

Gen2Phen is a big consortium, including SIB, working on genotype-to-phenotype information. They have two years to go in their grant, and are thinking about adopting SemWeb technologies to enhance data exchange and integration, improve data interpretation, and, not least, impress funding agencies. So they invited someone from SIB (me, in the end) to speak about our experiences.

My presentation consisted of two parts: an introduction to RDF and why we provide it, and a tour of UniProt’s RDF. I had aimed for 15 minutes, but got only five due to the packed schedule. So I explained the very gist of “why RDF”, showed some examples, and talked about the problems we are encountering.

The problems got, predictably, the most attention. There are plenty of Semantic Web “believers” spreading the vision; hands-on experience with complex data sets such as UniProt’s is rarer. I need to write about this in depth at some point. Suffice it to say, I think I dampened some enthusiasm, despite repeatedly stressing that I see RDF and related technologies as valuable building blocks in the bigger picture, and as clear steps forward on some problems. But the Semantic Web seems to be an all-or-nothing affair for most people.

Tony Brooks is right in saying that with only two years left to go for Gen2Phen, it might be rather late to start with SemWeb technology. A large modeling effort and uncertain scalability challenges could delay the benefits until it’s too late. On the other hand, it’s not that much work to start experimenting: install Virtuoso and D2R, fire up Protégé, write some RDF using Jena, and get a feeling for the whole thing. Design an RDF schema that expresses the basics of the information at the heart of Gen2Phen, and see if existing systems can adopt it as an input and output format. That would be my recommendation, which I may or may not have gotten across; it was a packed event about an unfamiliar project where the SemWeb was only one of many sessions, so communication was somewhat difficult.
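
To make that concrete: here is roughly the scale of experiment I have in mind, sketched in Perl with RDF::Trine (one of the CPAN modules I cover in the Biohackathon post below); a Java version with Jena would look much the same. The gen2phen namespace and all its terms are invented for the sketch; designing the real ones would be the actual work.

    #!/usr/bin/perl
    # A first RDF-writing experiment: say "variant rs123 is associated
    # with phenotype 456" in triples and print the result as Turtle.
    # All gen2phen terms are hypothetical.
    use strict;
    use warnings;
    use RDF::Trine;

    my $model = RDF::Trine::Model->new( RDF::Trine::Store::Memory->new );

    my $g2p = RDF::Trine::Namespace->new('http://example.org/gen2phen#');
    my $rdf = RDF::Trine::Namespace->new('http://www.w3.org/1999/02/22-rdf-syntax-ns#');

    my $variant   = $g2p->variant_rs123;
    my $phenotype = $g2p->phenotype_456;

    $model->add_statement( RDF::Trine::Statement->new( $variant, $rdf->type, $g2p->Variant ) );
    $model->add_statement( RDF::Trine::Statement->new( $variant, $g2p->associatedWith, $phenotype ) );
    $model->add_statement( RDF::Trine::Statement->new(
        $phenotype, $g2p->label, RDF::Trine::Node::Literal->new('hypertension') ) );

    print RDF::Trine::Serializer->new('turtle')->serialize_model_to_string($model);

If an afternoon of this feels comfortable, the bigger modeling and scalability questions can be tackled from a position of experience.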

The meeting as such was very nice. Good conversations and awesome food — La Maison de la Lozere in Montpellier was brilliant. So was the city itself; I enjoyed wandering around the beautiful old town.

One other presentation I found interesting was Gudmundur Thorisson’s about ORCID. This initiative aims to unambiguously identify researchers by an ID instead of by name, which may be shared by many people. When an article is submitted, ORCID will map its DOI to the IDs of its authors. Also, and perhaps even more importantly, ORCID aims to do the same for data sets. Science really needs more, larger, better data sets in the open for people to analyze and train their algorithms on, but currently there is very little incentive for researchers to publish them. ORCID is not really functioning yet, but it is backed by more than 120 organizations, and so has a decent chance of becoming the de facto norm in academia.

Biohackathon 2010

In February I attended the third Biohackathon in Tokyo, sponsored by the Japanese Database Center for Life Science (DBCLS) and the Computational Biology Research Center (CBRC). As I have been travelling some more since then, I am only now getting around to writing up my personal summary of the week. Here we go.

The Biohackathon is an annual meeting of bioinformatics developers. Toshiaki Katayama of the University of Tokyo, founder of BioRuby, brought the hackathon idea to Japan and led the organization in the most perfect way. From the locations and the hotel to the network and the catering (and the fact that there was catering!), it was all top notch. Not to mention the generosity of the sponsoring institutions in actually inviting us all!

Now, where to start? It was such a packed and amazing week, and I feel very lucky to have gotten the chance to attend. Plus, it was my first trip to Japan, so the country itself was exciting enough! The schedule of the hackathon was simple: the first day was a symposium with lots of talks and the chance to learn about the other attendees and their projects. Days two to five were dedicated solely to hacking and discussion as people saw fit. It was my first meeting of that kind, and it was exciting to have that much freedom to turn the week into an interesting and useful time.

Arriving on Sunday morning, we first got our feet wet in Japan by ordering in a noodle kitchen, randomly picking something from the menu. We wandered around the neighborhood of Tokyo University, or Todai, a charming part of town with small, old houses and narrow lanes I didn’t expect in Tokyo, ended up in a quite amazing whisky bar, and made some new friends. Good start.

The first actual hackathon day took us to the CBRC in Odaiba, a new and shiny stretch of the city along the bay, dedicated to science and technology. But before enjoying the view from the cafeteria, we settled down to listen to talks and introduce ourselves to each other in the breaks. With around 60 attendees, the hackathon had a good size, allowing diversity while staying manageable. The idea of posting a mini-bio for each attendee along the walls was fantastic, as you could stroll around and get a good idea of who was there and what backgrounds they came from.

A few of the participants presented the projects they were working on, and they were all very interesting. You can find the list of speakers and their slides on the wiki. My colleague Jerven Bolleman presented our RDF efforts at UniProt. The day ended with a very nice buffet and some more socializing, and left everyone energized and motivated for a week of hacking.

The rest of the week took place at DBCLS on the Todai campus, where people could form groups to their liking and pick among several rooms for quiet hacking. Inspired by the BioRuby and BioPython folks who were present, I started exploring the RDF support in Perl. We do all our RDF work in Java, as do most Semantic Web people, but I feel that puts off many people. Perl hits a sweet spot with its conciseness and pragmatism, and its position in bioinformatics is traditionally strong. I believe that good Perl support would be a major step towards making biologists and bioinformaticians warm up to RDF & co; I recently wrote a somewhat passionate mail about this on the hackathon mailing list, which I will post here, too.

Anyway, there are quite a few RDF-related modules on CPAN, most of them gathered at http://www.perlrdf.org, and I set out to try and compare them and write some example code, possibly something to explore the UniProt RDF. While I didn’t get that far, due to participating in lots of other discussions, it was very interesting to try this out, and I put a “State of RDF in Perl” page on the wiki and some example code on GitHub. I also exchanged a lot of mails with Greg Williams of RDF::Trine, which was great. I’ll blog about this subject later.
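
For the curious, here is the flavor of example code I was after, a small sketch assuming RDF::Trine and RDF::Query from CPAN, and assuming the .rdf download URL for a UniProt entry still works like this:

    #!/usr/bin/perl
    # Load the RDF/XML for one UniProt entry and explore it with SPARQL.
    use strict;
    use warnings;
    use RDF::Trine;
    use RDF::Query;

    my $model = RDF::Trine::Model->new( RDF::Trine::Store::Memory->new );

    # Fetch and parse one protein entry into the in-memory model.
    RDF::Trine::Parser->parse_url_into_model(
        'http://www.uniprot.org/uniprot/P12345.rdf', $model );

    print "Loaded ", $model->size, " triples\n";

    # List everything that has an rdf:type.
    my $query   = RDF::Query->new('SELECT ?s ?type WHERE { ?s a ?type }');
    my $results = $query->execute($model);
    while ( my $row = $results->next ) {
        print $row->{s}->as_string, " a ", $row->{type}->as_string, "\n";
    }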

While there were many different groups hacking away, on text mining and RDF generation and all kinds of things, one topic struck me as the theme of this Biohackathon: URIs. How do you publish your own data with stable, sensible, and dereferenceable URIs, and what do you use in your RDF when linking to others who don’t have such nice URIs? This question was discussed many times throughout the week.

Francois Belleau of Bio2RDF led many of the discussions (thanks!), which focused mostly on central naming schemes and services for URIs. There seems to be a conflict between keeping content dereferenceable and keeping URLs very stable for use as resource identifiers. For the latter goal you don’t need URLs at all; any string will do, as long as it’s unique and stable. So this goal would benefit from a central registry, as advocated by Francois with URIs like lsrn.org/uniprot/P12345, because it would provide a predictable way of naming things uniquely. But it adds a single point of failure to the dereferencing of content. Andrea Splendiani remarked that he never followed a single URL from RDF anyway, while I argued that linking content is the point of the web and keeps the Semantic Web hackable; that will have to be yet another future blog post, I guess! Using providers’ actual URLs is often crappy because they don’t follow a predictable scheme (a=x&b=y vs. b=y&a=x), and you often get only HTML back anyway.
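
To illustrate that last complaint with made-up URLs: two provider URLs that differ only in parameter order name the same thing, but as identifier strings they are different. You can canonicalize them yourself, as below, but nothing forces providers or RDF authors to do so, which is exactly what a predictable central scheme would fix.

    #!/usr/bin/perl
    # Two URLs for the same (made-up) resource, differing only in
    # query-parameter order: different as strings, same once canonicalized.
    use strict;
    use warnings;
    use URI;

    sub canonical {
        my $uri  = URI->new(shift);
        my %form = $uri->query_form;
        # Rebuild the query string with the keys in sorted order.
        $uri->query_form( map { $_ => $form{$_} } sort keys %form );
        return "$uri";
    }

    my $url1 = 'http://example.org/entry?db=uniprot&id=P12345';
    my $url2 = 'http://example.org/entry?id=P12345&db=uniprot';

    print $url1 eq $url2                       ? "same\n" : "different\n"; # different
    print canonical($url1) eq canonical($url2) ? "same\n" : "different\n"; # same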

Opinions differed, and they still do. We arrived at an agreement on “Polite URIs” towards the end, but the discussion has been re-started on the mailing list.

And we haven’t even mentioned the dismal state of versioned URIs (like UniProt’s non-existing ones…), which I also discussed with Andrea. He proposed including the entry version in the URI. Whole releases could be handled via named graphs, although that sounds complicated. I was concerned about people who don’t care about versions and just want to say “this protein”; for them (i.e., their reasoners), uniprot/P12345/v1 is not the same as uniprot/P12345/v2, but it should be. This seems impossible to resolve; it’s one or the other. Uh, ideas, anyone?
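
One compromise we kept circling, sketched here with RDF::Trine: keep a version-agnostic URI for the people who just mean “this protein”, and hang explicit version URIs off it. The hasVersion/currentVersion predicates and their example.org namespace are invented for this sketch.

    #!/usr/bin/perl
    # A version-agnostic entry URI pointing at explicit version URIs.
    # The predicates under example.org are hypothetical.
    use strict;
    use warnings;
    use RDF::Trine;

    my $model = RDF::Trine::Model->new( RDF::Trine::Store::Memory->new );

    my $base  = 'http://purl.uniprot.org/uniprot/P12345';
    my $entry = RDF::Trine::Node::Resource->new($base);       # "this protein"
    my $v1    = RDF::Trine::Node::Resource->new("$base/v1");
    my $v2    = RDF::Trine::Node::Resource->new("$base/v2");

    my $ex = RDF::Trine::Namespace->new('http://example.org/versioning#');

    # The agnostic URI knows all its versions, and the current one.
    for my $v ( $v1, $v2 ) {
        $model->add_statement(
            RDF::Trine::Statement->new( $entry, $ex->hasVersion, $v ) );
    }
    $model->add_statement(
        RDF::Trine::Statement->new( $entry, $ex->currentVersion, $v2 ) );

    print RDF::Trine::Serializer->new('ntriples')->serialize_model_to_string($model);

The catch, of course: everyone has to agree to link through the agnostic URI, which brings us right back to the naming discussion above.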

I guess you got the idea by now – there was so much more happening this week that I can’t summarize it all. Fortunately, others also wrote about it. Brad Chapman wrote about his SPARQL and Python hacking, and the #biohackathon2010 Twitter tag has lots of interesting tidbits.

Let’s end by paraphrasing Toshiaki’s closing notes: a “clique of the world-top-level developers in bioinformatics” met, some great coding and discussion took place, and now that data providers understand the Semantic Web a lot better, services will come.

Thanks to all organizers, the people at DBCLS and CBRC who made this possible, to the participants who brought so much enthusiasm and knowledge to the event, and to Toshiaki in particular for tirelessly working throughout the week to keep everything running smoothly. And for taking us out for great dinners and giving us a tour of the Human Genome Center super computer in the week after the hackathon!

Sayonara!

Getting ready for Biohackathon 2010 in Tokyo!

My coworker Jerven and I were fortunate enough to be invited to the third Biohackathon at Tokyo University, sponsored by the Database Center for Life Science (DBCLS) and the Computational Biology Research Center (CBRC). This Saturday we’re gonna take off for a week of hacking and socializing, and then I’m gonna spend a few days on my own exploring Tokyo and its surroundings. I can’t wait!

This year’s hackathon will be all about “Interpretation of biological knowledge with Semantic Web technologies”. As the community in this field has a strong interest in UniProtKB data, we’ll see what we can do to make it easier for people to integrate it with other semantic efforts.

Weekend Triple Billionaire at SWAT4LS 2009

In November 2009, I was at the SWAT4LS (Semantic Web Applications and Tools for Life Sciences) 2009 conference in Amsterdam. While cleaning up my writing directory, I noticed that I never blogged about this, so here are some seriously late notes.

My colleague Jerven and I presented a poster and a ten-minute highlight talk called “Weekend Triple Billionaire”, about the scaling problems we see at UniProt when working with RDF. Here are the workshop proceedings, and a direct link to the PDF.

According to our (limited) testing and research, current triple stores are not able to store and query our three billion triples. We summarized this in a problem-statement-style short paper and a poster. Here’s the abstract:

The UniProt Knowledgebase offers both manually curated and automatically generated information on proteins, and is one of the leading biological databases. While it is one of the largest free data sets that is available in RDF, our infrastructure and website are not based on RDF. We present numbers about the volume and growth of UniProt and show why this volume of data prevents using RDF triple stores and SPARQL with currently available tools.
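
To put that number in perspective, with ballpark figures of my own rather than from the paper: if an indexed triple store needs on the order of 100 bytes per triple across its indexes, three billion triples come to roughly 300 GB of index data, far more than fits in RAM on hardware most groups can afford, so both loading and querying fall off the fast path.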

I think the talk was well received, and we saw a lot of interest in putting up a publicly accessible triple store for UniProt, as it’s one of the most important Life Sciences databases.

Lots of the talks were really interesting. Alan Ruttenberg presented Science Commons and the new Creative Commons CC0 waiver for scientific data. He also talked about some of the infrastructure behind Neurocommons; I found it interesting that it’s written in a mix of Java, Jython, and Common Lisp. Their new RDF Herd package manager might also solve a real issue with the Semantic Web today: it’s like RPM for RDF, with incremental updates, and thus provides clean versioning and more efficient data transfer.

Michael Schroeder from TU Dresden presented some impressive feats of GoPubMed. Their text mining apparently reaches F-scores over 80%, and they are working on generalizing the GoPubMed approach to the whole web via GoWeb. I guess it also helps a lot that their web design and usability are way above the norm for academic projects, as their users are biologists, not computer nerds.

Finally, Barend Mons gave the most speculative and forward-looking talk with his keynote about the Concept Web Alliance and his vision of a future in which traditional scientific publications play a minor role compared to “nanopublications”, which can be as small as one RDF triple and can go live as research progresses. For more information, check their web page, which does a better job of presenting the idea than I can here.

Of course, personal discussions and getting to know people are maybe the biggest point of such meetings. And it was indeed great to have lively discussions with the attendees. In addition to the people mentioned above, I enjoyed talking to Deyan Peychev from Ontotext, who are doing some serious OWL reasoning with our UniProt RDF. He had valuable suggestions for improving our OWL, some of which are already implemented. It was also a pleasure to meet Erik Antezana of OntoPerl fame.

It was a great day with lots of new inspiration and valuable face time with some of the leading researchers of the field. Hope to see you next year, or next month in Tokyo!