Archive for January, 2010

Weekend Triple Billionaire at SWAT4LS 2009


In November 2009, I was at the SWAT4LS (Semantic Web Applications and Tools for Life Sciences) 2009 conference in Amsterdam. While cleaning up my writing directory, I noticed that I never blogged about this, so here are some seriously late notes.

My colleague Jerven and I presented a poster and a ten-minute highlight talk called Weekend Triple Billionaire. It’s about scaling problems we see at UniProt when working with RDF. Here are the workshop proceedings, and a direct link to the PDF.

According to our (limited) testing and research, current triple stores are not able to store and query our data of three billion triples. We summarized this into a problem statement-style short paper and a poster. Here’s the abstract:

The UniProt Knowledgebase offers both manually curated and automatically generated information on proteins, and is one of the leading biological databases. While it is one of the largest free data sets that is available in RDF, our infrastructure and website are not based on RDF. We present numbers about the volume and growth of UniProt and show why this volume of data prevents using RDF triple stores and SPARQL with currently available tools.

I think the talk was well received, and we saw a lot of interest in putting up a publicly accessible triple store for UniProt, as it’s one of the most important Life Sciences databases.

Lots of the talks were really interesting. Alan Ruttenberg presented Science Commons and the new CC0 Creative Commons license for scientific data. He also talked about some of the infrastructure behind Neurocommons. I found it interesting that it’s written in a mix of Java, Jython and Common Lisp. Also, their new RDF Herd package manager might solve an issue with the Semantic Web today: it’s like RPM for RDF, with incremental updates, and thus provides clean versioning and more efficient data transfer.

Michael Schroeder from the TU Dresden presented some impressive feats of GoPubmed. Their text mining apparently has F-scores over 80%, and they are working on generalizing the GoPubmed approach to the whole web via GoWeb. I guess it also helps a lot that their web design and usability are way above the norm for academic projects, as users are biologists, not computer nerds.

Finally, Barend Mons gave the most speculative and forward-looking talk with his keynote about the Concept Web Alliance. His vision is a future where traditional scientific publications play a minor role in comparison to “nano publications”, which can be as small as one RDF triple and can go live as research progresses. For more info, check their web page, which does a better job of presenting the idea than I can do here.

Of course, personal discussions and getting to know people are maybe the biggest point of such meetings. And it was indeed great to have lively discussions with the attendees. In addition to the aforementioned, it was great to talk to Deyan Peychev from Ontotext, who are doing some serious OWL reasoning with our UniProt RDF. He had valuable suggestions for improving our OWL, some of which are already implemented. It was also a pleasure to meet Erik Antezana of OntoPerl fame.

It was a great day with lots of new inspiration and valuable face time with some of the leading researchers of the field. Hope to see you next year, or next month in Tokyo!

A simple Markdown journal in Emacs


I have wanted to get into daily journaling for a long time. Keeping a journal makes it easy to find and go back to all kinds of things you have encountered and thus saves time. But more important, I believe, is that journaling structures your thoughts, like any kind of writing. You need to think clearly about something before you can write it down. Then, the act of writing it down anchors it more deeply in your memory.

Alas, all my previous attempts at journaling failed. Whatever the reason, whether on paper or on the computer, I never felt entirely at home with the solutions I tried. So what’s a hacker to do? Write his own solution.

I know that there are already many ways to write a journal in Emacs. Well, here’s another one: simple-journal. It’s tiny and simple, it produces a Markdown format that I love, and having written it myself, I feel more inclined to actually use it.

The journal looks like this:

### 2010-01-10

- **18:15** - "XML serializations should be hidden away from
  human view lest small children accidentally see them and become
  frightened." - from the paper *Representing disjunction and
  quantifiers in RDF*, McDermottDou02.pdf.

### 2010-01-17

- **14:45** - Set up a minimal Wicket application with Netbeans (a
  first for me, version 6.8) and Jetty. I want to try out working
  asynchronously with JSON using Wicket. Here are the steps to get the
  application running, serving up an empty page:

  - Start a plain Java SE project in Netbeans.
  - ... 

Being in Markdown, it’s very readable, and can readily be converted to well-formatted HTML. The format is completely hard-coded for the moment. The goal was just to quickly get something simple and small working. Nevertheless, I’m always happy about feedback and ideas.

There’s one item on the TODO list that I’d really like to have, but I’m wondering how to implement it in a somewhat simple and efficient way: showing all entries that have “TODO” in them, or that start with “TODO”. Do I have to go through the buffer to collect them and show them in a temporary buffer? That would be a bit more programming than I feel is appropriate for such a task, and the temp buffer wouldn’t be in sync with the journal. Planet Emacs, ideas? ;-)
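
For what it’s worth, the collecting step itself is small. Here is a minimal sketch in Python rather than Emacs Lisp, assuming the journal uses exactly the layout shown above (`### YYYY-MM-DD` headings, `- **HH:MM** - text` entries, indented continuation lines). The names `parse_entries` and `todo_entries` are made up for this illustration and are not part of simple-journal.

```python
import re

# Patterns for the hard-coded journal layout shown above.
DATE_RE = re.compile(r"^### (\d{4}-\d{2}-\d{2})\s*$")
ENTRY_RE = re.compile(r"^- \*\*(\d{2}:\d{2})\*\* - (.*)$")


def parse_entries(text):
    """Return a list of (date, time, body) tuples from journal text."""
    entries = []
    date = None
    current = None  # [date, time, body lines] of the entry being read

    def flush():
        nonlocal current
        if current is not None:
            entries.append((current[0], current[1], " ".join(current[2])))
            current = None

    for line in text.splitlines():
        date_match = DATE_RE.match(line)
        entry_match = ENTRY_RE.match(line)
        if date_match:
            flush()
            date = date_match.group(1)
        elif entry_match:
            flush()
            current = [date, entry_match.group(1), [entry_match.group(2).strip()]]
        elif current is not None and line.strip():
            # Indented continuation line (or nested list item) of the entry.
            current[2].append(line.strip())
    flush()
    return entries


def todo_entries(text, prefix_only=False):
    """Return entries containing "TODO", or only those starting with it."""
    found = []
    for date, time, body in parse_entries(text):
        matches = body.startswith("TODO") if prefix_only else "TODO" in body
        if matches:
            found.append((date, time, body))
    return found
```

Because the list is recomputed from the journal text on every call, there is no second copy to keep in sync; the cost is one pass over the file.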

Intro to Perl 6 by Damian Conway


These are some lightly edited notes from a one-day intro to Perl 6 course given by Damian Conway. I had the pleasure of attending this session in summer 2009 at EPFL Lausanne, organized by my employer, the Swiss Institute of Bioinformatics.

If these short notes make you curious about Perl 6, see its website, and don’t miss the Perl 6 Advent Calendar, which shows off a lot of cool features (with code examples, in contrast to my notes).

First a nugget of wisdom that Damian Conway shared in the evening at the local Perl/Linux user group: Geeks are people who manipulate reality via language. In that, they are like the wizards in stories. Nice!

The big changes

  • Everything’s an object. That’s one of the biggest changes from Perl 5. However, it’s very declarative and easy to use, and still hidden from view until you need it.
  • Perl 6 is statically typed! This comes as a surprise, but the consequences are not drastic (unless the programmer wants them to be) because of the Any type. It’s like Object in Java, but it’s implicit.

    If you want/need to be explicit: my Str $value.

    The type system supports constraints on values, a very cool feature: my $short_string of Str where {.chars < 10}.

    Each type is also a method that casts, as in print Str($obj).

  • The punctuation has been changed around in a “Huffmanization of punctuation”. They looked at a lot of CPAN code and made the most often used punctuation the shortest and easiest. And indeed most of the punctuation characters have a new meaning, so expect some re-learning if you’re a Perl 5 programmer.
  • Conway estimates that a Perl 6 program should be 20 to 40% shorter than the equivalent Perl 5 program.
  • It has blocks, written as {}, and they are first class. {} means “block” everywhere, even in strings.
  • Ah, sigils. Perl 6 still has them, but they differ from Perl 5: they never change for a given variable. For instance, an element of array @foo is now accessed as @foo[0]. Don’t we lose information here in comparison to $foo[0], namely that foo contains scalar values? Conway says that this is true in comparison to Perl 4, but the references in Perl 5 already destroyed this.

    There is also the new sigil & for subs, but it’s not required for calling. Passing a sub looks like foo(&bar).

  • One of the most innovative parts of Perl 6 is junctions. They are, if I understood correctly, Conway’s invention and are conceptually inspired by quantum physics. The analogy is that something can be in multiple states until it is observed. Transferred to a programming language, that gives us data-centric parallelism. The all, any, and none operators for lists execute the tests in parallel for each list element (with the actual number of threads depending on the hardware). If the list is of blocks, these are also executed in parallel. This gives a concise, readable, and safe-to-use parallelization facility.

    I have this hastily scribbled note saying that if a sub gets one of these lists or list expressions, |list| copies of it are executed in parallel. I have to check what exactly I meant by this.
  • Perl 6 has extensive introspection:
    • .WHAT # type object
    • .WHERE
    • .WHICH
    • .HOW
    • .WHENCE # auto-vivifier
    • .WHY # comment

    For instance, $foo.WHAT.methods.

  • The language has built-in support for grammars and rules, essentially giving you a parser and lexer built into the language, using the normal (powerful) regular expressions.

    grammar G {
        rule R {}   # whitespace significant
        token T {}  # whitespace ignored
    }

    The compiler is being written in this! So you can manipulate Perl 6 code easily as the parse tree is built in.

  • Perl 6 is defined in operational semantics via its comprehensive unit test suite. A compiler is a Perl 6 compiler when it passes all tests. There are currently about 20,000 of them; should be around 100,000!
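
To make the junction idea above a little more concrete, here is a rough analogue in Python, not Perl 6. It assumes a junction-style test can be read as “apply a predicate to every element, possibly concurrently, then collapse the results into one boolean”. The helper names `all_of`, `any_of`, and `none_of` are hypothetical, and the thread pool only stands in for whatever scheduling a Perl 6 implementation might choose.

```python
from concurrent.futures import ThreadPoolExecutor


def _results(predicate, values):
    """Apply predicate to every value via a thread pool, keeping input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(predicate, values))


def all_of(predicate, values):
    """True if the predicate holds for every value."""
    return all(_results(predicate, values))


def any_of(predicate, values):
    """True if the predicate holds for at least one value."""
    return any(_results(predicate, values))


def none_of(predicate, values):
    """True if the predicate holds for no value."""
    return not any(_results(predicate, values))


if __name__ == "__main__":
    sizes = [3, 8, 12]
    print(any_of(lambda n: n > 10, sizes))   # at least one element is large
    print(all_of(lambda n: n > 10, sizes))   # not every element is large
    print(none_of(lambda n: n < 0, sizes))   # no element is negative
```

Unlike a real junction, this sketch evaluates every element before collapsing the result, and the calling code has to name the combinator explicitly instead of writing an ordinary comparison.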

I have a lot more notes about smaller features and nice syntactic sugar, which I cut here so as not to bore the reader. Overall, I’m pretty excited about Perl 6 now, even though it’s not finished yet. It just has so many things that look so handy!

Interestingly, it’s kind of the antithesis to Lisp in that regard. It comes with a ton of syntactic constructs that do specific things, while Lisp is minimal and malleable so you can construct your own language. Both approaches have their merits, and both ways of developing software depend on how well they are executed. I think the Perl 6 designers have succeeded in making the standard language very well suited to writing elegant and succinct programs.