Saturday, October 25, 2008

TDWG 2008 in Fremantle

This year's TDWG conference has just finished in Fremantle, Western Australia. I attended along with 180 other biodiversity informatics people, and I thought I would point out the things I found most interesting.

Darwin Core
The most exciting thing to me is the chance to put a new Darwin Core up for ratification soon, one that more closely resembles Dublin Core and possibly consists of 3 namespaces.

While working on the IPT it seemed that most of the information we deal with in biodiversity informatics is centered around only 3 entities: Taxon, Occurrence and SamplingSite. In my eyes it would make sense to separate the elements of a new DarwinCore standard according to those core entities. The normative standard will be decoupled from the implementation technology and will consist only of natural language definitions with URIs for the elements of the standard. Specific application schemas (called profiles in Dublin Core) will then define or recommend how to use DarwinCore within an XML, RDF, XHTML, OGC application schema or tab file environment. Even datatyping can be left to the respective schemas, so that dates, for example, can be expressed in their native formats. I will give examples of the different representations soon in another blog entry.

Combining these 3 core entities with the notion of one-to-many "extensions", the star schema of the IPT, one can handle quite a rich definition of data. Extensions for multiple identifications can be added to occurrences, and SPM-like species descriptions, geographic distributions or invasiveness status can be added to the taxon. And still the whole standard and the exchanged data can be extremely simple! Btw, simplicity was probably the most mentioned idea at TDWG this year (a really nice talk in our Wild Ideas! session was Roger Hyam's on creating a Species Index with SiteMaps). Controlled vocabularies like BasisOfRecord or Ranks should be expressed in simple ASCII files like the ISO country code one, with a code, label, definition and examples. This also allows for easy translation into different languages because the codes stay the same.
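To make the star schema idea a bit more concrete, here is a minimal sketch in Python. It assumes a core Taxon tab file and one distribution extension file that points back to the core through a shared taxonID column. The file contents, the column names and the attach_extension helper are all made up for illustration; they are not part of any Darwin Core draft or of the IPT.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-file dumps. Column names are illustrative, not normative.
CORE_TAXA = (
    "taxonID\tscientificName\trank\n"
    "t1\tPuma concolor\tspecies\n"
    "t2\tFelis catus\tspecies\n"
)
EXT_DISTRIBUTION = (
    "taxonID\tcountryCode\tstatus\n"
    "t1\tAR\tnative\n"
    "t1\tUS\tnative\n"
    "t2\tAU\tinvasive\n"
)


def read_tab(text):
    """Parse a tab-delimited dump into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def attach_extension(core_rows, ext_rows, key, name):
    """Attach zero or more extension rows to each core record (one-to-many)."""
    by_key = defaultdict(list)
    for row in ext_rows:
        # Drop the foreign key; it is implied by the parent record.
        by_key[row[key]].append({k: v for k, v in row.items() if k != key})
    return [dict(core, **{name: by_key.get(core[key], [])}) for core in core_rows]


taxa = attach_extension(
    read_tab(CORE_TAXA), read_tab(EXT_DISTRIBUTION),
    key="taxonID", name="distribution",
)
for taxon in taxa:
    print(taxon["scientificName"], [d["countryCode"] for d in taxon["distribution"]])
```

Each extension is just another flat file keyed on the core identifier, so further extensions can be attached in the same way without touching the core file.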

Integrated Publishing Toolkit
Surprisingly many people asked for IPT demos after screenshots were shown in some GBIF talks. Tim and I gave demos nearly every day and the publishing tool was generally well received. We gathered alpha testers (mainly for usability) for the public instance at GBIF, and new ideas arose or vague ones materialised, e.g. for validation/annotations:
 * validation can be done through external services that adhere to a simple API. The validation would be asynchronous: the IPT sends a token, a link to the full dataset (either dwc xml or tab file dumps) and a callback handler. Once the validation is done, the handler receives the token and a link to the validation report, an xml file that contains annotations (unstructured text) about records, together with some probability and a list of suggested property changes.
 * provide an API to push datasets into the IPT via a REST service
 * BCI collection and institution code validation and lookup of GUIDs during mapping.
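The asynchronous validation handshake described in the first bullet could look roughly like the following in-memory Python sketch. No such API has been specified yet, so every class, field name and example.org URL here is an assumption, and the network calls are replaced by plain method calls.

```python
import uuid
from dataclasses import dataclass


@dataclass
class ValidationRequest:
    token: str         # opaque id chosen by the publisher (the IPT)
    dataset_url: str   # link to the full dataset dump
    callback_url: str  # where the validator should report back


@dataclass
class ValidationResult:
    token: str
    report_url: str    # link to the XML report with annotations


class Validator:
    """Stand-in for an external validation service (hypothetical API)."""

    def __init__(self):
        self.queue = []

    def submit(self, request):
        # Accept the job and return immediately; the work happens later.
        self.queue.append(request)

    def run_pending(self, deliver):
        # Process queued jobs and notify each callback handler.
        while self.queue:
            req = self.queue.pop(0)
            report_url = f"http://validator.example.org/reports/{req.token}.xml"
            deliver(req.callback_url, ValidationResult(req.token, report_url))


class Publisher:
    """Stand-in for the IPT side: tracks tokens until a report arrives."""

    def __init__(self):
        self.pending = {}  # token -> dataset_url
        self.reports = {}  # dataset_url -> report_url

    def request_validation(self, validator, dataset_url):
        token = uuid.uuid4().hex
        self.pending[token] = dataset_url
        validator.submit(ValidationRequest(
            token, dataset_url, "http://ipt.example.org/validation/callback"))
        return token

    def handle_callback(self, callback_url, result):
        # Tokens we never issued are ignored rather than trusted.
        dataset_url = self.pending.pop(result.token, None)
        if dataset_url is not None:
            self.reports[dataset_url] = result.report_url


publisher = Publisher()
validator = Validator()
token = publisher.request_validation(
    validator, "http://ipt.example.org/dumps/birds.txt")
validator.run_pending(publisher.handle_callback)
print(publisher.reports)
```

The token is what lets the publisher match a report to the request it sent earlier, so the validator does not need to keep a connection open while it works through a large dataset.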

Greg Whitbread will likely be leading the Technical Architecture Group. Refining and shrinking the core ontology (OWL) was seen as an important issue for the next few years.

Originally I had planned to question the uptake of LSIDs again, having been in favour of PURLs since the beginning. After long debates that some of us seemed to have experienced before, we came to the following conclusions, while keeping the LSID recommendation:
 * LSIDs look much more stable in printed publications. That means they should really be resolvable through proper LSID resolution via DNS SRV records
 * LSIDs are on the agenda of many projects already
 * Proxying LSIDs with http removes many troubles, especially when used with RDF. The strong recommendation is to always use the proxied version in rdf:about attributes.
 * Changing the domain used within the LSID might cause problems (e.g. when an institution changes its name). It might be better to have a single central LSID authority
 * pure UUIDs with a central resolver would be great. The resolver would have to know about all existing UUIDs in our domain, but it could be mirrored and sharded easily. Maybe something to think about
 * central PURLs and LSIDs (only) require a registration of services/authorities
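As a small illustration of the proxying point, the sketch below checks that a string has the basic LSID shape (urn:lsid:authority:namespace:object, with an optional revision) and then prefixes it with an http proxy. The example LSID is invented, and the default proxy address is an assumption that should be replaced by whatever resolver is actually in use.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty LSID component in {text!r}")
    return Lsid(*parts[2:])


def proxied(text, proxy="http://lsid.tdwg.org/"):
    """Return the http-proxied form of an LSID, e.g. for use in rdf:about.

    The default proxy address is an assumption; pass your own resolver.
    """
    parse_lsid(text)  # raises ValueError if the input is malformed
    return proxy + text


example = "urn:lsid:example.org:names:12345"  # made-up identifier
print(parse_lsid(example))
print(proxied(example))
```

The proxied form stays a plain http URI that any RDF tool can dereference, while the original LSID remains visible inside it.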

Great introductory talks were given by Rich Pyle about zoological and botanical nomenclatural differences and by Nico Cellinese about the PhyloCode. If you have always felt on shaky ground with nomenclature or taxon concepts you should really watch these talks!


Javier de la Torre said...

Thanks Markus for this report. It is a great overview for people that could not attend, like myself.

I have already watched some of the talks and I really liked the one from Rich Pyle you mention. I hope he puts it on SlideShare so we can create a single post on the blog about it. It is a great introduction to the complexity of handling scientific names, or better, taxon concepts.

I was hoping to see more discussion on geospatial issues and biodiversity, especially about species distributions. It still intrigues me that there are not more people at TDWG working on this. If we really want to move biodiversity informatics to a more REAL phase with REAL applications of the data, then I think we have to move from primary data to species distributions! We need primitives to create species distribution APIs and standards to share the data. Please, no more points, more polygons!
There are discussions on niche modeling, like always, but nobody seems to take care of how to share the models, how to compare or merge them, etc...

We want to be able to answer questions like: is my road going over endangered species distributions, where should I place a national park, what species could live in my garden, what species can I find on the trail I will hike in the Pyrenees, etc. For this we will need to start integrating species distribution data.
Well, I will be working on this in the next months, so hopefully next year I can attend TDWG and start the standardization process. By the way, where is the next TDWG going to be?

Congrats on the IPT, Markus and Tim. I hope we finally get the ultimate provider software we have all been dreaming of. It would be cool if you could create a little screencast about it so you don't need to show it live continuously ;)

Peter Desmet said...

Hi Javier,

Just discovered this blog via the GBIF IPT Google Code page (can't wait till this tool comes out) and you have some really interesting posts up here. Keep it up!

If nobody told you yet, the next TDWG-meeting is going to be in Europe, no extra details yet.

PS: I like Wordle too. :-)