Thursday, October 30, 2008

Word clouds for TDWG 2008

Some days ago I found a great application called Wordle. The application lets you generate "word clouds" from text that you provide. It is very similar to a tag cloud, with the difference that it analyzes the words in a document and not just the tags: it counts how often each word occurs.
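To give an idea of what such a frequency analysis involves, here is a minimal sketch in Python. It is not how Wordle itself works internally (I have not seen its code); the stop-word list and the function name `word_frequencies` are made up for the illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; Wordle's real list is not known here.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "for"}


def word_frequencies(text, top_n=10):
    """Return the top_n most frequent words, ignoring case and stop words."""
    # Lowercase the text and keep only runs of ASCII letters as words.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


sample = "Google data and biodiversity data: Google maps the data."
print(word_frequencies(sample, top_n=3))
```

A word cloud then only needs to scale the font size of each word by its count.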

I gave it a try and thought it could be interesting for others. For example, here is the word cloud for the Proceedings of TDWG 2008. Basically I just copied all the text from the PDF and gave it to Wordle to analyze.

Click on the image to view it in their Java applet.

Then I did the same with Markus's post about TDWG 2008. And here is the result:

They look similar, but I think Markus's has a higher geek level :D

Finally, I could not resist checking how this blog would look, and here it is:

It seems that the most used words are Google and Data, even more than biodiversity itself (10 counts for Google and 6 for biodiversity).

Finally, if you are interested in displaying word clouds on maps, there is a recent post on one of my favorite blogs, cartogrammar. I don't see much use for it apart from the fact that they look nice, and they do indeed look nice!

Saturday, October 25, 2008

TDWG 2008 in Fremantle

The great TDWG conference has just finished this year in Fremantle, Western Australia. Along with 180 other biodiversity informatics people, I attended, and I thought I would point out the most interesting things from my side.

Darwin Core
The most exciting thing to me is the chance to come up soon with a new Darwin Core for ratification that more closely resembles Dublin Core and possibly consists of 3 namespaces.

While working on the IPT it seemed that most information we are dealing with in biodiversity informatics is centered around 3 entities only: Taxon, Occurrence and SamplingSite. It would make sense in my eyes to separate elements of a new DarwinCore standard according to those core entities. The normative standard will be decoupled from the implementation technology and only consist of natural language definitions with URIs for the elements of the standard. Specific application schemas (called profiles in DC) will then define/recommend how to use DarwinCore within an XML, RDF, XHTML, OGC application schema or tab file environment. Even datatyping can be left to the respective schemas and e.g. dates can be expressed in their native formats. I will give examples of the different representations soon in another blog entry.

Combining these 3 core entities with the notion of one-to-many "extensions", the star schema of the IPT, one can handle quite a rich definition of data. Extensions for multiple identifications can be added to the occurrences, and SPM-like species descriptions, geographic distributions or species status for invasiveness to the taxon. And still the whole standard and the exchanged data can be extremely simple! Btw, simplicity was probably the most mentioned idea at TDWG this year (a really nice talk was about creating a Species Index with SiteMaps by Roger Hyam in our Wild Ideas! session). Controlled vocabularies like BasisOfRecord or Ranks should be expressed in simple ASCII files like the ISO country code one, with a code, label, definition and examples. This also allows for easy translation into different languages because the codes stay the same.
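As a rough illustration of the vocabulary idea, here is a small Python sketch that reads such a file and looks up a translated label by code. The file content, the codes and the Spanish labels are invented for the example, and I assume a tab-delimited layout with the four columns mentioned above; none of this is part of any agreed standard.

```python
import csv
import io

# Invented sample content; the real vocabulary files are not shown in this post.
VOCAB_TSV = (
    "code\tlabel\tdefinition\texamples\n"
    "S\tSpecimen\tA preserved physical specimen\therbarium sheet\n"
    "O\tObservation\tA record of an organism seen in the field\tbird sighting\n"
)


def load_vocabulary(text):
    """Parse a tab-delimited vocabulary into a dict keyed by code."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["code"]: row for row in reader}


def label_for(vocab, code, translations=None):
    """Return the translated label if one exists, else the default label."""
    translations = translations or {}
    return translations.get(code, vocab[code]["label"])


vocab = load_vocabulary(VOCAB_TSV)
spanish = {"S": "Espécimen", "O": "Observación"}  # hypothetical translation table

print(label_for(vocab, "S"))
print(label_for(vocab, "S", spanish))
```

Because the data itself only carries the code, a translation is just another small table keyed by the same codes.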

Integrated Publishing Toolkit
Surprisingly many people asked for IPT demos after screenshots were shown in some GBIF talks. Tim and I gave demos nearly every day and the publishing tool was well received in general. Alpha testers (mainly for usability) for the public instance at GBIF were gathered, and new ideas arose or vague ideas materialised, e.g. validation/annotations:
 * validation can be done through external services that adhere to a simple API. The validation would be asynchronous: the IPT sends a token, a link to the full dataset (either dwc xml or tab file dumps) and a callback handler. Once the validation is done, the handler receives the token and a link to the validation report, an xml file that contains annotations (unstructured text) about records together with some probability and potential suggestions (a list) of property changes.
 * provide API to push datasets into the IPT via REST service
 * BCI collection and institution code validation and lookup of GUIDs during mapping.
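The asynchronous validation flow from the first bullet could look roughly like the following Python sketch. Everything here is hypothetical: the class and function names, the example.org URLs and the in-memory queue that stands in for a real remote service. It only shows the token and callback handshake, not an actual IPT API.

```python
import uuid
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ValidationRequest:
    token: str                            # identifies this validation job
    dataset_url: str                      # link to the full dataset dump
    callback: Callable[[str, str], None]  # called with (token, report_url)


class ToyValidator:
    """Stand-in for an external validation service (hypothetical)."""

    def __init__(self) -> None:
        self.pending: List[ValidationRequest] = []

    def submit(self, request: ValidationRequest) -> None:
        # Returns immediately; the real work happens later.
        self.pending.append(request)

    def run_pending(self) -> None:
        # Pretend to validate each dataset, then notify the caller.
        while self.pending:
            request = self.pending.pop(0)
            report_url = f"http://validator.example.org/reports/{request.token}.xml"
            request.callback(request.token, report_url)


received = {}


def on_report(token: str, report_url: str) -> None:
    """Callback handler on the IPT side: remember where the report lives."""
    received[token] = report_url


validator = ToyValidator()
token = str(uuid.uuid4())
validator.submit(ValidationRequest(token, "http://ipt.example.org/dataset.txt", on_report))
validator.run_pending()
print(received[token])
```

In a real setup the callback would be an HTTP endpoint rather than a Python function, and the report behind the link would be the xml file with the annotations.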

Greg Whitbread will likely be leading the Technical Architecture Group. Refining and shrinking the core ontology (owl) was seen as an important issue over the next years.

Originally I had planned to question the uptake of LSIDs again, having been in favour of PURLs since the beginning. After long debates that some of us seemed to have experienced before, we came to the following conclusions, while keeping the LSID recommendation:
 * LSIDs look much more stable in printed publications. That means they should really be resolvable through proper LSID resolution via DNS SRV records
 * LSIDs are on the agenda of many projects already
 * Proxying LSIDs over HTTP removes many troubles, especially when used with RDF. The strong recommendation is to always use the proxied version in rdf:about attributes.
 * Changing the domain used within the LSID might cause problems (e.g. when an institution changes its name). It might be better to have a single central LSID authority
 * pure UUIDs with a central resolver would be great. The resolver would have to know about all existing UUIDs in our domain, but it could be mirrored and sharded easily. Maybe something to think about
 * central PURLs and LSIDs (only) require a registration of services/authorities
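To make the proxying point concrete, here is a small Python sketch that checks the basic shape of an LSID (urn:lsid:authority:namespace:object, with an optional revision) and prepends an HTTP proxy address. The proxy base URL and the sample LSID are placeholders I made up, not a recommendation for a particular resolver.

```python
def parse_lsid(lsid):
    """Split an LSID into its parts, raising ValueError if the shape is wrong."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or any(p == "" for p in parts):
        raise ValueError(f"not a valid LSID: {lsid!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not a valid LSID: {lsid!r}")
    return {
        "authority": parts[2],
        "namespace": parts[3],
        "object": parts[4],
        "revision": parts[5] if len(parts) == 6 else None,
    }


def proxied(lsid, proxy_base="http://lsid.example.org/"):
    """Return the HTTP-proxied form of an LSID (proxy_base is a placeholder)."""
    parse_lsid(lsid)  # validate first; raises on malformed input
    return proxy_base + lsid


sample_lsid = "urn:lsid:example.org:names:12345"  # hypothetical identifier
print(parse_lsid(sample_lsid)["authority"])
print(proxied(sample_lsid))
```

The proxied string is an ordinary URL, which is what makes it comfortable to use as a resource identifier in RDF.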

Great introductory talks were given by Rich Pyle about zoological and botanical nomenclatural differences and by Nico Cellinese about the phylocode. If you have always felt on shaky ground with nomenclature or taxon concepts, you should really watch these talks!

Friday, October 24, 2008

Identifying good images on Google cache for scientific names

I have been working on ways to represent taxonomic trees in a more intuitive way for non-biologists. The best I have found until now is to provide a contextual image together with the name. This helps a lot, especially at higher ranks where most of the names are very unfamiliar to most people, like me for example.
The problem is where to get images for all scientific names. The best I managed to find is the Google AJAX API. This is the content you get when you search for images on Google. The service is fast and reliable, but it has one problem: it is not content aware. Well, they are trying, check this and much better this, but still it is not. So sometimes the results you get back just don't make any sense. There are some very pornographic examples of this, but to keep it kid-friendly check the Phylum Labyrinthulomycota. The first image you will get is from this nice researcher:
Well, that is not of great help to get an idea of what this phylum is about. Therefore I had been thinking for a while about building a little application where people can help me select pictures that actually help to get an idea of what is behind a cold scientific name. And today I had a little bit of time and wanted to deploy something on AppEngine.

So here I am presenting a very simple application to ask for collaboration on this task. There are only 13 million names that I need to find a picture for, but I think I have enough friends :D

The application is very simple and I have not added any visual effects, but I just wanted to give it a try.
The rankings I get from this will be released soon through an API for anybody else to use.

Things I would like to incorporate:
1) Make it a game, more precisely a GWAP, a "game with a purpose", like Google Image Labeler or any other toy from Luis von Ahn. (I have wanted to link to his site for sooo long).

2) Allow multiple rankings. Right now names only get evaluated once, so a malicious contributor can ruin all this.

3) More sources, especially Flickr.

4) Nice UI so that Sergio, our new blogger here, is happy :)
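For point 2, one simple way to combine several rankings would be a majority vote, sketched below in Python. This is not what the application does today; the function name and the thresholds (`min_votes`, `min_share`) are assumptions picked for the illustration.

```python
from collections import Counter


def best_image(votes, min_votes=3, min_share=0.5):
    """Pick the image most contributors chose for one name.

    votes is a list of image identifiers, one per contributor.
    Returns None when there are too few votes or no clear majority,
    so a single malicious vote cannot decide the result alone.
    """
    if len(votes) < min_votes:
        return None
    image, count = Counter(votes).most_common(1)[0]
    # Require strictly more than min_share of all votes.
    if count / len(votes) > min_share:
        return image
    return None


print(best_image(["img-a", "img-a", "img-b"]))
print(best_image(["img-a"]))
```

Names without a clear winner would simply stay in the queue and be shown to more people.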

When I get a decent amount of reviews from people I will post some stats and an example application.

Come on, send the link to everybody and help everybody understand scientific names better!