Biohackathon 2010

In February I went to the third Biohackathon in Tokyo, sponsored by the Japanese Database Center for Life Science (DBCLS) and the Computational Biology Research Center (CBRC). As I’ve been travelling some more since then, I’m only now getting around to writing up my personal summary of the week. Here we go.

The Biohackathon is an annual meeting of bioinformatics developers. Toshiaki Katayama of the University of Tokyo, and founder of BioRuby, brought the hackathon idea to Japan and led the organization of the hackathon in the most perfect way. From the locations and the hotel to the network and the catering (and the fact that there was catering!), it was all top notch. Not to mention the generosity of the sponsoring institutions in actually inviting us all!

Now, where to start? It was such a packed and amazing week, and I feel very lucky to have gotten the chance to attend. Plus, it was my first trip to Japan, so the country itself was exciting enough! The schedule of the hackathon was simple: the first day was a symposium with lots of talks and the chance to learn about the other attendees and their projects. Days two to five were dedicated solely to hacking and discussion as people saw fit. It was my first meeting of that kind, and it was exciting to have that much freedom to turn the week into an interesting and useful time.

Arriving on Sunday morning, we first got our toes wet in Japan by ordering in a noodle kitchen, randomly picking something from the menu. We wandered around the neighborhood of Tokyo University, or Todai, a charming part of town with small, old houses and narrow lanes I didn’t expect in Tokyo, and ended up in a quite amazing whisky bar, where we made some new friends. Good start.

The first actual hackathon day took us to the CBRC in Odaiba, a new and shiny stretch of the city along the bay, dedicated to science and technology. But before enjoying the view from the cafeteria, we settled down to listen to talks and introduce ourselves to each other in the breaks. With around 60 attendees, the hackathon had a good size, allowing for diversity while staying manageable. The idea of posting a mini-bio for each attendee along the walls was fantastic, as you could stroll around and get a good idea of who was there and what backgrounds they came from.

A few of the participants presented the projects they were working on, and they were all very interesting. You can find the list of speakers and their slides on the wiki. My colleague Jerven Bolleman presented our RDF efforts at UniProt. The day ended with a very nice buffet and some more socializing, and left everyone energized and motivated for a week of hacking.

The rest of the week took place at DBCLS on the Todai campus, where people could form groups to their liking and pick among several rooms for quiet hacking. Inspired by the BioRuby and BioPython folks who were present, I started exploring the RDF support in Perl. We do all our RDF work in Java, as do most Semantic Web people, but I feel that puts off many people. Perl hits a sweet spot with its conciseness and pragmatism, and its position in bioinformatics is traditionally strong. I believe that good Perl support would be a major step toward getting biologists and bioinformaticians to warm up to RDF & co – I recently wrote a somewhat passionate mail about this on the hackathon mailing list, which I will post here, too. Anyway, there are quite a few RDF-related modules on CPAN, most of them gathered at [http://www.perlrdf.org], and I set out to try and compare them and write some example code, possibly something to explore the UniProt RDF. While I didn’t get that far due to participating in lots of other discussions, it was very interesting to try this out, and I put a State of RDF in Perl page on the wiki and some example code on GitHub. I also exchanged a lot of mails with Greg Williams of RDF::Trine, which was great. I’ll blog about this subject later.
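
To give a flavour of what this looks like (with the caveat that this is just a quick sketch from memory, not the code from the hackathon): assuming RDF::Trine and RDF::Query are installed, and assuming the per-entry RDF lives at a URL like the one below and uses the UniProt core vocabulary, loading an entry and asking it a question is roughly this:

    use strict;
    use warnings;
    use RDF::Trine;
    use RDF::Query;

    # Fetch the RDF for one UniProt entry into an in-memory model.
    # (The URL pattern and the vocabulary terms below are assumptions on my part.)
    my $url   = 'http://www.uniprot.org/uniprot/P12345.rdf';
    my $model = RDF::Trine::Model->temporary_model;
    RDF::Trine::Parser->parse_url_into_model($url, $model);

    # Ask for the entry's mnemonic with a small SPARQL query.
    my $sparql = 'PREFIX up: <http://purl.uniprot.org/core/> '
               . 'SELECT ?name WHERE { ?protein a up:Protein ; up:mnemonic ?name . }';
    my $iterator = RDF::Query->new($sparql)->execute($model);
    while (my $row = $iterator->next) {
        print $row->{name}->literal_value, "\n";
    }

Not exactly Java levels of ceremony, which is rather the point.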

While there were many different groups hacking away, on text mining and RDF generation and all kinds of things, one topic struck me as the theme of this Biohackathon: URIs. How do you publish your own data with stable, sensible, and dereferenceable URIs, and what do you use in your RDF when linking to others who don’t have such nice URIs? These questions were discussed many times throughout the week.

Francois Belleau of bio2rdf led many of the discussions (thanks!), which focused mostly on central naming schemes/services for URIs. There seems to be a conflict between keeping content dereferenceable and keeping URLs very stable for use as resource identifiers. For the latter goal you don’t need URLs at all: any string will do, as long as it’s unique and stable. So this goal would benefit from a central registry, as advocated by Francois (something like lsrn.org/uniprot/P12345), because it would provide a predictable way of naming things uniquely. But it adds a single point of failure to the dereferencing of content. Andrea Splendiani remarked that he never followed a single URL from RDF anyway, while I argued that linking content is the point of the web and keeps the Semantic Web hackable – that will have to be yet another future blog post, I guess! Using providers’ actual URLs is often crappy because they don’t provide a predictable scheme (a=x&b=y vs. b=y&a=x), and you only get HTML anyway.
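
To make the trade-off a bit more concrete, here is a toy sketch (the URIs are purely illustrative, the choice of rdfs:seeAlso is mine, and none of this is the scheme we actually agreed on): use the stable, registry-style URI as the identifier in your triples, and attach the provider’s own URL so a client still has somewhere to go for content:

    use strict;
    use warnings;
    use RDF::Trine;

    my $model = RDF::Trine::Model->temporary_model;

    # A stable, registry-style identifier (the kind lsrn.org would hand out; illustrative only)...
    my $id      = RDF::Trine::Node::Resource->new('http://lsrn.org/uniprot/P12345');
    # ...pointing at the provider's own, less predictable URL for dereferencing.
    my $provurl = RDF::Trine::Node::Resource->new('http://www.uniprot.org/uniprot/P12345');
    my $seealso = RDF::Trine::Node::Resource->new('http://www.w3.org/2000/01/rdf-schema#seeAlso');

    $model->add_statement(RDF::Trine::Statement->new($id, $seealso, $provurl));
    print RDF::Trine::Serializer->new('turtle')->serialize_model_to_string($model);

The obvious downside, of course, is that resolving the identifier itself then depends on that one registry.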

Opinions differed, and they still do. We arrived at an agreement on “Polite URIs” towards the end, but the discussion has been re-started on the mailing list.

And we haven’t even mentioned the dismal state of versioned URIs (like UniProt’s non-existing ones…), which I also discussed with Andrea. He proposed including the entry version in the URI. Whole releases could be handled via named graphs, although that sounds complicated. I was concerned about people who don’t care and just want to say “this protein” – for them (i.e., their reasoners), uniprot/P12345/v1 is not the same as uniprot/P12345/v2, but it should be. This seems impossible to resolve; it’s one or the other. Uh, ideas anyone?
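
Just to sketch what Andrea’s proposal could look like on the wire (the URI shapes simply follow the shorthand above, and the property choice is mine – dcterms:hasVersion is just one candidate; this only records the relationship, it doesn’t solve the reasoning problem):

    use strict;
    use warnings;
    use RDF::Trine;

    my $model = RDF::Trine::Model->temporary_model;

    # A version-agnostic URI for "this protein" (URI shapes are just my notation)...
    my $entry = RDF::Trine::Node::Resource->new('http://purl.uniprot.org/uniprot/P12345');
    # ...linked to version-specific snapshots with dcterms:hasVersion (my choice of property).
    my $has_version = RDF::Trine::Node::Resource->new('http://purl.org/dc/terms/hasVersion');
    for my $v ('v1', 'v2') {
        my $snapshot = RDF::Trine::Node::Resource->new("http://purl.uniprot.org/uniprot/P12345/$v");
        $model->add_statement(RDF::Trine::Statement->new($entry, $has_version, $snapshot));
    }
    print RDF::Trine::Serializer->new('turtle')->serialize_model_to_string($model);

Someone who only wants to say “this protein” points at the unversioned URI, and anyone who cares about a snapshot follows the version links – whether that would satisfy the reasoner crowd is exactly the open question.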

I guess you get the idea by now – there was so much more happening this week that I can’t summarize it all. Fortunately, others also wrote about it. Brad Chapman wrote about his SPARQL and Python hacking, and the #biohackathon2010 Twitter tag has lots of interesting tidbits.

Let’s end by paraphrasing Toshiaki’s closing notes: a “clique of the world-top-level developers in bioinformatics” met, some great coding and discussion took place, and now that data providers understand the Semantic Web a lot better, services will come.

Thanks to all the organizers, the people at DBCLS and CBRC who made this possible, to the participants who brought so much enthusiasm and knowledge to the event, and to Toshiaki in particular for tirelessly working throughout the week to keep everything running smoothly. And for taking us out for great dinners and giving us a tour of the Human Genome Center supercomputer in the week after the hackathon!

Sayonara!