Friday, July 18, 2008

Visualizing GBIF density data in 3D using Processing

OK, I admit it, this is pretty nerdy, but I could not resist. Last week Radiohead released their latest video, House of Cards, and I think it is great, especially because it was made without cameras or lighting, only "scanners". Basically, they captured 3D point positions and then visualized them using software. The making-of is pretty cool. Even cooler is that they released the data and the code they used to generate the video (well, kind of). They used an open source tool called Processing, which is basically a programming environment for images, animation and interactions. It is very simple to use and I loved it from the very beginning.

So after playing a bit with it and the Radiohead data, I could not resist creating my own experiment. And since I am working so much with GBIF density data these days, I thought it could be cool data to represent. I am not sure it has any "real" use, but nevertheless it is a lot of fun and maybe a little artistic. The idea is to represent the GBIF density data for Fungi, using the count of each cell as the z coordinate in 3D space. This is what the output looks like:

You can download the output also as a native application for:
Or use it as a Java applet at this place. It does not work on my computer, but you know Java applets, they never work as intended ;)

The source code used to generate this program looks like:

import processing.xml.*;
import processing.opengl.*;

XMLElement xml;

ArrayList points = new ArrayList();

int maxCount = 0;

int currentX;
int currentY;

int frameCounter = 1;

void setup() {
  size(1024, 768, OPENGL);

  // Draw lines at a width of 1, for now.
  // [...]

  xml = new XMLElement(this, "");
  XMLElement[] densityRecords = xml.getChildren()[1].getChildren();
  int numSites = densityRecords.length;
  for (int i = 0; i < numSites; i++) {
    XMLElement[] densityRecord = densityRecords[i].getChildren();
    // [...] coord = new [...]
    // Keep track of the largest cell count seen so far
    if (int(densityRecord[4].getContent()) > maxCount) {
      maxCount = int(densityRecord[4].getContent());
    }
  }
}

void draw() {
  // Let's adjust our center slightly
  // [...]
  // Let's draw things bigger
  // [...]
  // We'll use a black background
  // [...]
  // The data has 0,0,0 at the center and we want to draw that point at the center of our screen
  translate(width/2, height/2);

  for (int i = 0; i < points.size(); i++) {
    // [...]
    if (z > 1000) {
      z = z/10;
    }

    z = (z*10000)/maxCount;

    // println(x+"-"+y+"-"+z);
    // [...]
  }

  // This is used to save frames to create a video for YouTube
  // [...]
}

A lot of things could be improved to make it more interesting, but that was nerdy enough. I have not had much time to work lately due to weddings, mountain climbing, birthdays, etc. Hopefully you will see more things coming from my side soon.
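
The step that turns a cell count into a z height can be looked at on its own. What follows is only a minimal sketch in plain Java, not the exact formula from the sketch above: it assumes a simple linear mapping in floating point, and the names DensityHeight, toHeight and maxHeight are hypothetical.

```java
public class DensityHeight {
    /**
     * Linearly maps a cell count to a z height in the range [0, maxHeight].
     * maxCount is the largest count seen across all cells.
     */
    public static float toHeight(int count, int maxCount, float maxHeight) {
        if (maxCount <= 0) {
            return 0f; // no data at all: keep everything flat
        }
        // Cast before dividing so the ratio is not truncated by integer division.
        return maxHeight * ((float) count / maxCount);
    }

    public static void main(String[] args) {
        int[] counts = {0, 250, 1000};
        int maxCount = 1000;
        for (int count : counts) {
            System.out.println(count + " -> " + toHeight(count, maxCount, 300f));
        }
    }
}
```

Working in floats avoids the truncation and possible overflow of doing the same scaling with int arithmetic, at the cost of a cast per point.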

Monday, July 7, 2008

Cascading Hadoop

When running MapReduce jobs, it very quickly becomes apparent that simple jobs need to be queued up, with one operation run on the output of another, or perhaps on multiple outputs from another, almost like a cascaded effect, one could say... Enter Cascading, a project that aims to build fault-tolerant workflows on top of Hadoop.

I have only just started to play with it (like Hadoop, it is not in the public Maven repositories, so of course I wrote my own pom to do a local install). It is a new project with a small team (1 guy?), but it looks promising, although I think it lacks a few getting-started examples for common operations. I figure if people blog about it like here, the examples will pop up pretty quickly.

One of the nice features that immediately attracted me to it was the fact that it can do visualisations of the workflow for you like so (this is just an example workflow, not the code below):

And here is the example. I am using my standard subset of GBIF data, and grouping together the records by scientific name and then sorting them on resource and basisOfRecord (some databases can't mix group by and order by in SQL without temp table creation, so this seemed like a nice example).

public void run(String input, String output) {
    // source is the input file - here I read from the local file system
    // (i.e. not the distributed fs)
    Tap source = new Lfs( new TextLine(), input );

    // sink is the output file - here I write to the local file system
    // (i.e. not the distributed fs)
    Tap sink = new Lfs( new TextLine(), output, true );

    // my tab file schema
    Fields dwcFields = new Fields( "resource", "kingdom", "phylum", "class", "order", "family",
        "genus", "scientificName", "basisOfRecord", "latitude", "longitude" );

    // parse the data
    Pipe pipe = new Each( "parser", new Fields( "line" ), new RegexSplitter(dwcFields));

    // define some group and sort fields
    Fields groupFields = new Fields("scientificName");
    Fields sortFields = new Fields("resource", "basisOfRecord");

    // a group by with a sort...
    // note that this takes the previous pipe
    pipe = new GroupBy(pipe, groupFields, sortFields);

    // connect the assembly to the SOURCE and SINK taps
    Flow parsedLogFlow = new FlowConnector().connect( source, sink, pipe );

    // start execution of the flow (either locally or on the cluster)
    parsedLogFlow.start();

    // block until the flow completes
    parsedLogFlow.complete();
}

So this was very simple, and it was only the first night playing. Note that this code makes no mention of a MapReduce job, or of anything more complex than a simple tap, pipe, sink workflow...
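
To make clear what a group by with a sort is expected to produce, here is a small sketch in plain Java collections, with no Cascading or Hadoop involved. It only illustrates the intended result (rows grouped by scientificName, each group ordered by resource and then basisOfRecord) and says nothing about how Cascading implements it. The class GroupSortSketch, the Rec type and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupSortSketch {
    /** One parsed row, reduced to the three fields used for grouping and sorting. */
    public static final class Rec {
        public final String scientificName;
        public final String resource;
        public final String basisOfRecord;

        public Rec(String scientificName, String resource, String basisOfRecord) {
            this.scientificName = scientificName;
            this.resource = resource;
            this.basisOfRecord = basisOfRecord;
        }
    }

    /** Groups rows by scientificName, then sorts each group by resource and basisOfRecord. */
    public static Map<String, List<Rec>> groupAndSort(List<Rec> rows) {
        // TreeMap keeps the group keys themselves in alphabetical order.
        Map<String, List<Rec>> groups = new TreeMap<>();
        for (Rec r : rows) {
            groups.computeIfAbsent(r.scientificName, k -> new ArrayList<>()).add(r);
        }
        Comparator<Rec> order = Comparator
            .comparing((Rec r) -> r.resource)
            .thenComparing(r -> r.basisOfRecord);
        for (List<Rec> group : groups.values()) {
            group.sort(order);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Rec> rows = Arrays.asList(
            new Rec("Puma concolor", "B", "specimen"),
            new Rec("Amanita muscaria", "B", "observation"),
            new Rec("Puma concolor", "A", "specimen"),
            new Rec("Puma concolor", "A", "observation"));
        for (Map.Entry<String, List<Rec>> e : groupAndSort(rows).entrySet()) {
            for (Rec r : e.getValue()) {
                System.out.println(e.getKey() + "\t" + r.resource + "\t" + r.basisOfRecord);
            }
        }
    }
}
```

This in-memory version only works while everything fits in one JVM, which is exactly the limit the tap, pipe, sink workflow is meant to remove.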

I will proceed by trying to do a much more complex workflow - I think splitting the world data into the 2.8 grids I proposed earlier (6 zoom levels), followed by doing some breakdowns for various analytics I anticipate producing. Then I will report back with metrics running from EC2.
What I would really like to do is have some nice metadata that accompanies the data files at each step and gives the semantics of the file, e.g. something that describes the columns in the tab file. So I expect to use the TDWG vocabularies and do some RDF (perhaps RDF represented as JSON?). This way I can set up the Fields automatically, based on the content of the file, and accept different input formats.
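
As a much simpler stand-in for the RDF idea, the column names could come from a header row in the tab file itself. The sketch below assumes exactly that (a tab-separated header line, which is not what the paragraph above proposes), and the names HeaderFields and columnNames are hypothetical. The resulting array could then presumably be handed to the Fields constructor instead of a hard-coded list.

```java
import java.util.Arrays;

public class HeaderFields {
    /** Splits a tab-separated header line into trimmed column names. */
    public static String[] columnNames(String headerLine) {
        if (headerLine == null || headerLine.trim().isEmpty()) {
            throw new IllegalArgumentException("header line is missing or empty");
        }
        // A limit of -1 keeps trailing empty columns instead of silently dropping them.
        String[] names = headerLine.split("\t", -1);
        for (int i = 0; i < names.length; i++) {
            names[i] = names[i].trim();
        }
        return names;
    }

    public static void main(String[] args) {
        String header = "resource\tkingdom\tscientificName\tbasisOfRecord";
        System.out.println(Arrays.toString(columnNames(header)));
    }
}
```

A header row only carries names, not semantics, so it would not replace a proper vocabulary-based description of each column.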

Tuesday, July 1, 2008

BADE (Biodiversity Atlas Distribution Editor) introduction

I have been working hard over the last few weeks on the Biodiversity Atlas editor. I spent most of the time figuring out realistic ways to use the Google Maps API with biodiversity data. As Tim always says, we have to move on from points, so here we are with polygons! The problem is that polygons do not perform that well in web mapping interfaces. I ended up trying Yahoo Maps for Flash, UMap and finally Google Maps for Flash. I think Google has the fastest API, and I figured out more or less how to avoid killing the client and keep the interface responsive, as expected in these RIA days. Of course, the technology I am using is Flex. One of these days I have to write a post about why I love Flex so much, but to justify myself for now: I really think it is the only option right now for building these semi-GIS applications on the web. With Flex I can handle 10 or 20 times more polygons than with JavaScript, and I definitely feel more productive.

In any case, what is BADE? This project started as a subproject of Biodiversity Atlas and ended up as a standalone project. The idea at the beginning was to create a small module so users could contribute distributions directly from the web, and I started investigating ways to let people create distributions. With all the performance problems, I had to think a lot about data models and about how the application could handle a lot of data and geometries at the same time. Creating grids and handling different scales or precisions was especially hard, so hard that for now we have stuck to 1 degree cells and left the rest for future development. In any case, once I had a good data model and a feasible way to draw a lot of data on the map, I thought this was a good start for a small application for doing analysis! So that is where I focused. What are the main ideas behind it?
  1. Be able to "draw" the distributions of species on a grid system.
  2. Be able to import other sources of data, like GBIF, to complete your data or to just use external data
  3. Import from CSV and Shapefiles and export to everything we can.
  4. Encourage users, without forcing them, to share their work in Biodiversity Atlas, to create a coherent and comprehensive source of distribution data.
  5. Let people work collaboratively online, as Google Docs lets you do, or like all these incredible new web 2.0 apps out there.
  6. Continuous addition of analysis tools that work out of the box with your existing data.
Most of the points are still not covered, but I wanted to release early and ask for feedback as soon as possible. What can the app do right now?
  1. Create and edit datasets.
  2. "Draw" occurrences
  3. Import data from GBIF
  4. Save the document and reload it.
Not much, but the core is already there and right now adding functionalities should be easy and fast.
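
The 1 degree cells mentioned earlier can be illustrated with a small function that maps a coordinate to a cell number. This is only a sketch in Java (the application itself is written in Flex), and the numbering scheme, counting rows from the south pole and columns from longitude -180, is an assumption rather than BADE's actual data model. The names OneDegreeGrid and cellId are hypothetical.

```java
public class OneDegreeGrid {
    /**
     * Returns the id of the 1 degree cell containing the given point.
     * Rows run 0..179 from latitude -90 upwards, columns 0..359 from longitude -180 eastwards.
     */
    public static int cellId(double latitude, double longitude) {
        if (latitude < -90 || latitude > 90 || longitude < -180 || longitude > 180) {
            throw new IllegalArgumentException("coordinate out of range");
        }
        int row = (int) Math.floor(latitude + 90);
        int col = (int) Math.floor(longitude + 180);
        // Points exactly on the north or east edge belong to the last row or column.
        if (row == 180) row = 179;
        if (col == 360) col = 359;
        return row * 360 + col;
    }

    public static void main(String[] args) {
        System.out.println(cellId(40.4, -3.7));
        System.out.println(cellId(0, 0));
    }
}
```

A single integer per cell keeps a drawn distribution compact: a species range becomes a set of ids rather than a set of polygons.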

So here you can see some screenshots and, more importantly, you can try the application for yourself! Please, I want to hear your feedback!