RDF or not in Gen2Phen – 6th Assembly Meeting

This is a personal account and not necessarily my employer‘s view.

Until two weeks ago, I had never heard of Gen2Phen. Then my colleague Livia asked me to join her to go to their 6th general assembly meeting and present something about UniProt in RDF.

Gen2Phen is a big consortium, including SIB, working on genotype-to-phenotype information. They have two years to go in their grant, and are thinking about adopting SemWeb technologies to enhance data exchange and integration, data interpretation, and to impress funding agencies. Therefore, they invited someone—me in the end— from SIB to speak about our experiences.

My presentation consisted of two parts, an introduction to RDF and why we provide it, and a tour of UniProt‘s RDF. I aimed for 15 minutes, and got only five to present it due to the packed schedule. So I explained the very gist of “why RDF”, showed some examples, and talked about the problems we are encountering.

The problems got, predictably, most attention. Semantic Web “believers” spreading the vision are plenty. Hands-on experiences with complex data sets such as UniProt’s are rarer. I need to write about this in depth at some point. Suffice it to say, I think I dampened some enthusiasm. This despite the fact that I repeatedly stressed that I think of RDF and related technologies as valuable building blocks in the bigger picture, and as clear steps forward on some problems. But the Semantic Web seems to be an all-or-nothing affair for most people.

Tony Brooks is right in saying that given there are only two years left to go for Gen2Phen, it might be late to start with SemWeb technology. A large modeling effort and uncertain scalability challenges could delay the benefits until it’s too late. On the other hand, it’s not that much work to start experimenting. Install Virtuoso and D2R, fire up Protege, write some RDF using Jena, and get a feeling for the whole thing. Design some RDF schema that expresses the basics of the information at the heart of Gen2Phen, and see if existing systems can add it as in- and output format. That would be my recommendation, which I might or might not have gotten across — it was a packed event about an unfamiliar project where the SemWeb was only one of many sessions, so communication was somewhat difficult.

The meeting as such was very nice. Good conversations and awesome food — La Maison de la Lozere in Montpellier was brilliant. So was the city itself; I enjoyed wandering around the beautiful old town.

One other presentation I found interesting was Gudmundur Thorisson’s about ORCID. This initiative aims to unambiguously researchers with an ID instead of their name, which might occur many times. ORCID will then map an article’s DOI to the IDs of the authors, when it’s submitted. Also, and perhaps even more important, ORCID aims to do the same for data sets. Science really needs more, larger, better data sets in the open for people to analyze and train their algorithms on, but currently there is very little benefit for researchers to publish them. ORCID is not really functioning yet, but is backed by more than 120 organizations, and so has a decent chance at becoming the de facto norm in academia.