RDF or not in Gen2Phen – 6th Assembly Meeting

This is a personal account and not necessarily my employer’s view.

Until two weeks ago, I had never heard of Gen2Phen. Then my colleague Livia asked me to join her at their 6th general assembly meeting and present something about UniProt in RDF.

Gen2Phen is a big consortium, including SIB, working on genotype-to-phenotype information. They have two years to go in their grant, and are thinking about adopting SemWeb technologies to enhance data exchange and integration, improve data interpretation, and impress funding agencies. So they invited someone from SIB (me, in the end) to speak about our experiences.

My presentation consisted of two parts: an introduction to RDF and why we provide it, and a tour of UniProt’s RDF. I had aimed for 15 minutes, but got only five due to the packed schedule. So I explained the very gist of “why RDF”, showed some examples, and talked about the problems we are encountering.

The problems got, predictably, the most attention. There are plenty of Semantic Web “believers” spreading the vision; hands-on experience with complex data sets such as UniProt’s is rarer. I need to write about this in depth at some point. Suffice it to say, I think I dampened some enthusiasm, despite repeatedly stressing that I see RDF and related technologies as valuable building blocks in the bigger picture, and as clear steps forward on some problems. But the Semantic Web seems to be an all-or-nothing affair for most people.

Tony Brooks is right that with only two years left in Gen2Phen, it might be late to start with SemWeb technology. A large modeling effort and uncertain scalability challenges could delay the benefits until it’s too late. On the other hand, it’s not that much work to start experimenting: install Virtuoso and D2R, fire up Protege, write some RDF using Jena, and get a feeling for the whole thing. Design an RDF schema that expresses the basics of the information at the heart of Gen2Phen, and see if existing systems can add it as in- and output format. That would be my recommendation, which I may or may not have gotten across; it was a packed event about an unfamiliar project where the SemWeb was only one of many sessions, so communication was somewhat difficult.
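
To make “start experimenting” concrete, here is a minimal sketch of such a first step, in Perl with RDF::Trine since that’s what I reach for (Jena gives you the equivalent in Java). Every URI and property name below is invented for illustration, not an actual Gen2Phen schema:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use RDF::Trine qw(iri literal statement);

    # Build an in-memory model and add a toy genotype-to-phenotype association.
    my $model = RDF::Trine::Model->temporary_model;

    # Placeholder vocabulary and resources: nothing here is real Gen2Phen.
    my $g2p  = 'http://example.org/g2p#';
    my $snp  = iri('http://example.org/variant/rs0000001');
    my $phen = iri('http://example.org/phenotype/some-trait');

    $model->add_statement( statement($snp, iri($g2p . 'associatedWith'), $phen) );
    $model->add_statement( statement($snp, iri($g2p . 'pValue'), literal('0.001')) );

    # Serialize to Turtle to see what the exchange format would look like.
    my $ser = RDF::Trine::Serializer->new('turtle');
    print $ser->serialize_model_to_string($model);

An afternoon of this, plus loading the output into Virtuoso and querying it back, tells you more about the fit than any number of vision talks.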

The meeting as such was very nice. Good conversations and awesome food — La Maison de la Lozere in Montpellier was brilliant. So was the city itself; I enjoyed wandering around the beautiful old town.

One other presentation I found interesting was Gudmundur Thorisson’s about ORCID. This initiative aims to unambiguously identify researchers by an ID instead of their name, which may be shared by many people. ORCID will then map an article’s DOI to the IDs of the authors when it’s submitted. Also, and perhaps even more importantly, ORCID aims to do the same for data sets. Science really needs more, larger, better data sets in the open for people to analyze and train their algorithms on, but currently there is very little benefit for researchers in publishing them. ORCID is not really functioning yet, but it is backed by more than 120 organizations and so has a decent chance of becoming the de facto norm in academia.

FrOSCamp 2010 Zuerich

So, another one of those belated meeting/event reports: on 2010-09-17, I was in Zurich for the first-ever FrOSCamp. It was an Open Source/Free Software event with an exhibition floor, talks, and “a fancy party with creative commons licensed beer and music”—what’s not to like!

I presented my “Praktisches RDF in Perl” talk that I recycled from the German Perl Workshop, to spread the word some more. This time, I had prepared an English version, but as I only had German speakers in the audience, I presented in German.

Unfortunately, my presentation only drew a handful of people this time. Note to self: work on the abstract some more. I had suspected that my FrOSCamp abstract was wordy and not catchy, but didn’t get around to rewriting it. At least the audience was quite engaged and asked lots of questions, which I prefer to a larger crowd that’s half asleep.

The presentation was recorded and is now online as slides+audio. This was a first for me. I could forget about it while presenting, but I was pretty nervous listening to it for the first time, not sure what mess of incoherent rambling and half-finished sentences to expect. Fortunately, I found it ok in the end. Of course, I found several things to improve, but I guess that’s expected for someone who doesn’t present often and is just getting started. My list of the main points to improve is:

  • The introduction should be much shorter and more focussed. A bit like a sales pitch, not as in being obnoxious and fake, but as in focussed on getting the audience’s attention and appreciation for the topic.
  • Too many sentences didn’t flow properly. Simply doing one or two more dry runs should fix that.
  • Have some more visualizations such as diagrams on the slides.

On the other hand, I was pleased with a few things about my presentation: the style of having little text on the slides and more verbal explanation worked well, the code samples seemed to be the right size to digest during a talk, and the questions at the end showed that people had gotten the key points.

Before my presentation, I got to see Renee Baecker’s talk about Perl::Critic. I use it on my code and thus knew the basics, but I appreciated the advanced example towards the end, where Renee walked us through writing our own critic rules. This works via PPI, so you can find patterns in the AST that match the constructs you want to check. I also found it interesting to hear Renee’s personal experience with the severity levels: he’s typically on 3, sometimes 2, but 1 is too harsh for him.
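
For flavor, here is roughly the skeleton of such a policy. The rule itself (flagging TODO comments) and the package name are made up for illustration, but the structure is the standard Perl::Critic one:

    package Perl::Critic::Policy::Local::ProhibitTodoComments;
    use strict;
    use warnings;
    use base 'Perl::Critic::Policy';
    use Perl::Critic::Utils qw( :severities );

    sub supported_parameters { return () }
    sub default_severity     { return $SEVERITY_LOW }
    sub default_themes       { return qw( local ) }

    # Perl::Critic hands us every PPI element of this class...
    sub applies_to           { return 'PPI::Token::Comment' }

    # ...and we return a violation for each one we don't like.
    sub violates {
        my ( $self, $elem, $doc ) = @_;
        return unless $elem->content =~ /\bTODO\b/;
        return $self->violation(
            'TODO marker found in comment',      # brief description
            'Finish the work or file a ticket',  # longer explanation
            $elem,
        );
    }

    1;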

Other than that, I was mainly hanging out at the Perl booth, a first for me! The booth was staffed by Renee and Roman from Winterthur (CH), two really nice guys whom I had a great time with, discussing everything from Perl modules to freelancing.

BTW, remember the blurb from the FrOSCamp website I quoted at the top, about creative commons licensed beer? That wasn’t a joke. FreeBeer is an organic beer, produced by an independent brewery near Zurich, and the recipe is online under a CC license. And it tastes great! A cloudy, full-bodied blonde, just the way I like it :-)

Semantic hacking: RDF in Perl

Last week, I attended the German Perl Workshop 2010. It was a fun event and I’ll write more on it in the next post.

I gave a 20-minute presentation there called “Semantisches Hacking: RDF in Perl”. At Swiss-Prot, we do all our RDF work in Java, but I got interested in how things look on the Perl side, and the Biohackathon in February got me started on exploring that.

Executive summary: the RDF-in-Perl community is organized at www.perlrdf.org, and the core modules are RDF::Trine and RDF::Query by Gregory Williams. For example code, have a look at my simple demo scripts.

At the workshop, I had an audience of about 50-100 people, none of whom had ever worked with RDF or seriously looked into it. So I first introduced RDF in the simplest way possible, as there wasn’t much time, then showed off RDF::Trine and RDF::Query with code examples.
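
The examples were along these lines: load a few triples, then ask SPARQL questions about them (a self-contained sketch rather than the actual workshop code):

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use RDF::Trine;
    use RDF::Query;

    # Parse a bit of Turtle into an in-memory model with RDF::Trine.
    my $model  = RDF::Trine::Model->temporary_model;
    my $parser = RDF::Trine::Parser->new('turtle');
    my $turtle = join "\n",
        '@prefix foaf: <http://xmlns.com/foaf/0.1/> .',
        '<http://example.org/alice> foaf:name "Alice" .',
        '<http://example.org/bob>   foaf:name "Bob" .';
    $parser->parse_into_model( 'http://example.org/', $turtle, $model );

    # Query the model with SPARQL via RDF::Query.
    my $query = RDF::Query->new(
        'PREFIX foaf: <http://xmlns.com/foaf/0.1/> '
      . 'SELECT ?name WHERE { ?person foaf:name ?name }'
    );
    my $iter = $query->execute($model);
    while ( my $row = $iter->next ) {
        print $row->{name}->literal_value, "\n";
    }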

The talk was well received, and I had some interesting conversations afterward with people who wanted to know more about RDF. Their questions mainly centered around ontologies/vocabularies, the additional time required to do this properly, and how to build an app on top of a triple store. In my presentation, I had talked about integrating RDF into existing apps, for instance via Trine’s support for RDFa, as_hashref, JSON, and other options.
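
The as_hashref route, for instance, is a one-liner that hands the whole graph to existing code as a plain nested hash in the RDF/JSON layout, so the rest of an app never has to know about triple stores. A sketch, assuming a $model like the one above:

    # { subject URI => { predicate URI => [ { type, value, ... } ] } }
    my $data = $model->as_hashref;
    for my $subject ( keys %$data ) {
        for my $predicate ( keys %{ $data->{$subject} } ) {
            print "$subject $predicate $_->{value}\n"
                for @{ $data->{$subject}{$predicate} };
        }
    }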

Here are the links to the slides (in German), the scripts I took the code snippets from, and the workshop page for the talk.

I think I managed to raise some awareness of RDF and perlrdf.org and to convey the core ideas to an audience where almost no one had had any exposure to these topics, with example code the audience seemed able to follow. So I’d say it was a success.

Biohackathon 2010

In February I attended the third Biohackathon in Tokyo, sponsored by the Japanese Database Center for Life Science (DBCLS) and the Computational Biology Research Center (CBRC). As I’ve been travelling some more since then, I only got around to writing up my personal summary of the week just now. Here we go.

The Biohackathon is an annual meeting of bioinformatics developers. Toshiaki Katayama of the University of Tokyo, founder of BioRuby, brought the hackathon idea to Japan and led the organization flawlessly. From the locations and the hotel to the network and the catering (and the fact that there was catering!), it was all top notch. Not to mention the generosity of the sponsoring institutions in actually inviting us all!

Now, where to start? It was such a packed and amazing week, and I feel very lucky to have gotten the chance to attend. Plus, it was my first trip to Japan, so the country itself was exciting enough! The schedule of the hackathon was simple: the first day was a symposium with lots of talks and the chance to learn about the other attendees and their projects. Days two to five were dedicated solely to hacking and discussion as people saw fit. It was my first meeting of that kind, and it was exciting to have that much freedom to turn the week into an interesting and useful time.

Arriving on Sunday morning, we first got our feet wet in Japan by placing an order in a noodle kitchen, picking something from the menu at random. We wandered around the neighborhood of Tokyo University, or Todai, a charming part of town with small, old houses and narrow lanes I didn’t expect in Tokyo, and ended up making some new friends in a quite amazing whisky bar. Good start.

The first actual hackathon day took us to the CBRC in Odaiba, a new and shiny stretch of the city along the bay, dedicated to science and technology. But before enjoying the view from the cafeteria, we settled down to listen to talks and introduce ourselves to each other in the breaks. With around 60 attendees, the hackathon was a good size, allowing diversity while staying manageable. Posting a mini-bio for each attendee along the walls was a fantastic idea, as you could stroll around and get a good sense of who was there and what backgrounds they came from.

A few of the participants presented the projects they’re working on, and they were all very interesting. You can find the list of speakers and their slides on the wiki. My colleague Jerven Bolleman presented our RDF efforts at UniProt. The day ended with a very nice buffet and some more socializing, and left everyone energized and motivated for a week of hacking.

The rest of the week took place at DBCLS on the Todai campus, where people could form groups to their liking and pick among several rooms for quiet hacking. Inspired by the BioRuby and BioPython folks who were present, I started exploring the RDF support in Perl. We do all our RDF work in Java, as do most Semantic Web people, but I feel that puts many people off. Perl hits a sweet spot with its conciseness and pragmatism, and its position in bioinformatics is traditionally strong. I believe that good Perl support would be a major step toward making biologists and bioinformaticians warm up to RDF & co; I recently wrote a somewhat passionate mail about this on the hackathon mailing list, which I will post here, too.

Anyway, there are quite a few RDF-related modules on CPAN, most of them gathered at http://www.perlrdf.org, and I set out to try and compare them and to write some example code, possibly something to explore the UniProt RDF. While I didn’t get that far due to participating in lots of other discussions, it was very interesting to try this out, and I put a State of RDF in Perl page on the wiki and some example code on github. I also exchanged a lot of mails with Greg Williams of RDF::Trine, which was great. I’ll blog about this subject later.
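
To give an idea of the UniProt exploration I had in mind: fetching one entry’s RDF off uniprot.org and poking around in it takes only a few lines with RDF::Trine. A sketch without error handling (the .rdf URL form is what uniprot.org serves; everything else is generic Trine):

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use RDF::Trine;

    # Pull the RDF/XML for a single UniProt entry straight off the web.
    my $model = RDF::Trine::Model->temporary_model;
    RDF::Trine::Parser->parse_url_into_model(
        'http://www.uniprot.org/uniprot/P12345.rdf', $model );

    printf "Loaded %d triples about P12345\n", $model->size;

    # List the predicates used in the entry, to get a feel for the vocabulary.
    my %seen;
    my $iter = $model->get_statements( undef, undef, undef );
    while ( my $st = $iter->next ) {
        my $p = $st->predicate->uri_value;
        print "$p\n" unless $seen{$p}++;
    }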

While there were many different groups hacking away, on text mining and RDF generation and all kinds of things, one topic struck me as the theme of this Biohackathon: URIs. How do you publish your own data with stable, sensible, and dereferenceable URIs, and what do you use in your RDF when linking to others who don’t have such nice URIs? This question was discussed many times throughout the week.

Francois Belleau of Bio2RDF led many of the discussions (thanks!), which focused mostly on central naming schemes/services for URIs. There seems to be a conflict between keeping content dereferenceable and keeping URLs very stable for use as resource identifiers. For the latter goal you don’t need URLs at all; any string will do, as long as it’s unique and stable. So that goal would benefit from a central registry like the one Francois advocates, e.g. lsrn.org/uniprot/P12345, because it would provide a predictable way of naming things uniquely. But it adds a single point of failure to the dereferencing of content. Andrea Splendiani remarked that he never followed a single URL from RDF anyway, while I argued that linking content is the point of the web and keeps the Semantic Web hackable; that will have to be yet another future blog post, I guess! Using providers’ actual URLs is often painful because they don’t follow a predictable scheme (a=x&b=y vs. b=y&a=x), and you often only get HTML back anyway.

Opinions differed, and they still do. We arrived at an agreement on “Polite URIs” towards the end, but the discussion has been re-started on the mailing list.

And we haven’t even mentioned the dismal state of versioned URIs (like UniProt’s non-existing ones…), which I also discussed with Andrea. He proposed including the entry version in the URI. Whole releases could be handled via named graphs, although that sounds complicated. I was concerned about people who don’t care and just want to say “this protein”: for them (i.e., their reasoners), uniprot/P12345/v1 is not the same as uniprot/P12345/v2, but it should be. This seems impossible to resolve; it’s one or the other. Ideas, anyone?
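
To make the trade-off concrete, here is roughly what Andrea’s proposal could look like in Turtle. The linking predicates and the /v1, /v2 URI forms are invented for illustration; nothing like them exists in UniProt’s RDF today:

    @prefix ex: <http://example.org/versioning#> .

    # Hypothetical: an unversioned URI that points at its frozen versions.
    <http://purl.uniprot.org/uniprot/P12345>
        ex:currentVersion <http://purl.uniprot.org/uniprot/P12345/v2> ;
        ex:hasVersion     <http://purl.uniprot.org/uniprot/P12345/v1> ,
                          <http://purl.uniprot.org/uniprot/P12345/v2> .

Someone who just wants to say “this protein” keeps using the unversioned URI, and someone who cares pins a version. But the catch from our discussion remains: a plain reasoner still sees /v1 and /v2 as distinct resources unless it is given (and trusts) extra triples about how they relate.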

I guess you got the idea by now – there was so much more happening this week that I can’t summarize it all. Fortunately, others also wrote about it. Brad Chapman wrote about his SPARQL and Python hacking, and the #biohackathon2010 Twitter tag has lots of interesting tidbits.

Let’s end by paraphrasing Toshiaki’s closing notes: a “clique of the world-top-level developers in bioinformatics” met, some great coding and discussion took place, and now that data providers understand the Semantic Web a lot better, services will come.

Thanks to all organizers, the people at DBCLS and CBRC who made this possible, to the participants who brought so much enthusiasm and knowledge to the event, and to Toshiaki in particular for tirelessly working throughout the week to keep everything running smoothly. And for taking us out for great dinners and giving us a tour of the Human Genome Center super computer in the week after the hackathon!

Sayonara!

Weekend Triple Billionaire at SWAT4LS 2009

In November 2009, I was at the SWAT4LS (Semantic Web Applications and Tools for Life Sciences) 2009 conference in Amsterdam. While cleaning up my writing directory, I noticed that I never blogged about this, so here are some seriously late notes.

My colleague Jerven and I presented a poster and a ten-minute highlight talk called Weekend Triple Billionaire. It’s about scaling problems we see at UniProt when working with RDF. Here are the workshop proceedings, and a direct link to the PDF.

According to our (limited) testing and research, current triple stores are not able to store and query our three billion triples. We summarized this in a short, problem-statement-style paper and a poster. Here’s the abstract:

The UniProt Knowledgebase offers both manually curated and automatically generated information on proteins, and is one of the leading biological databases. While it is one of the largest free data sets that is available in RDF, our infrastructure and website are not based on RDF. We present numbers about the volume and growth of UniProt and show why this volume of data prevents using RDF triple stores and SPARQL with currently available tools.

I think the talk was well received, and we saw a lot of interest in putting up a publicly accessible triple store for UniProt, as it’s one of the most important Life Sciences databases.

Lots of the talks were really interesting. Alan Ruttenberg presented Science Commons and the new CC0 Creative Commons license for scientific data. He also talked about some of the infrastructure behind Neurocommons; I found it interesting that it’s written in a mix of Java, Jython, and Common Lisp. Their new RDF Herd package manager might also solve a real issue with the Semantic Web today: it’s like RPM for RDF, with incremental updates, which gives you clean versioning and more efficient data transfer.

Michael Schroeder from TU Dresden presented some impressive feats of GoPubMed. Their text mining apparently achieves F-scores over 80%, and they are working on generalizing the GoPubMed approach to the whole web with GoWeb. I guess it also helps a lot that their web design and usability are way above the norm for academic projects, since their users are biologists, not computer nerds.

Finally, Barend Mons gave the most speculative and forward-looking talk with his keynote about the Concept Web Alliance and his vision of a future where traditional scientific publications play a minor role in comparison to “nano publications” which can be as small as one RDF triple and can go live as research progresses. For more info, check their web page which does a better job of presenting the idea than I can do here.

Of course, personal discussions and getting to know people are maybe the biggest point of such meetings. And it was indeed great to have lively discussions with the attendees. In addition to the aforementioned, I enjoyed talking to Deyan Peychev from Ontotext, who are doing some serious OWL reasoning with our UniProt RDF. He had valuable suggestions for improving our OWL, some of which are already implemented. It was also a pleasure to meet Erik Antezana of OntoPerl fame.

It was a great day with lots of new inspiration and valuable face time with some of the leading researchers of the field. Hope to see you next year, or next month in Tokyo!