Sunday, November 30, 2008

Reproducing Spatial Joins using Hadoop and EC2

Finally I got some time to continue researching the use of Hadoop for the processing of the GBIF dataset with respect to the protected areas of the world.

The problem
The volume of point data keeps growing, and with it the need to cross-reference the GBIF data with external sources - in this case, the protected areas of the world.

The last 3 years of working on reasonably large databases (>100G) have taught me that this join was not going to be an easy one, despite the obvious partitioning strategy that can be invoked here (spatial partitioning).
Therefore Javi and I set about this task by
  1. Using PostGIS and the traditional approach: joining a table of 150M point records with one of 120,000 polygon records - findings will be posted shortly
  2. Using Hadoop and the Amazon EC2 to process the join
Using Hadoop
This is ongoing but I thought I'd post the findings so far.

Strategy 1: 
Partition by cell
Start by splitting the 150M records from one file into one file per 1-degree cell.
Loop over each polygon:
  • Determine the 1-degree cells covered by the polygon
  • Use those cells' files as merge input to Hadoop and pass in the polygon for the Map to run contains()

Hadoop would not handle so many file splits, no matter what configuration I tried. Mailing list responses suggest this is not an ideal use of Hadoop: it prefers single large files rather than many smaller ones.

Using 10-degree cells it does work, but takes a LONG time... so long that I killed it. This was expected, since it is really only the equivalent of a DB join with an index on an INTEGER teg_deg column.
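Partitioning by cell boils down to mapping each point to an integer cell id. A minimal sketch of that mapping (the class and method names, and the exact numbering scheme, are my own illustration, not the job's actual code):

```java
public class CellIndex {
    // Map a lat/lng to an integer id for its 1-degree cell.
    // Rows run south to north, columns west to east.
    public static int cellId(double lat, double lng) {
        int row = (int) Math.floor(lat + 90);   // 0..179 for lat in [-90, 90)
        int col = (int) Math.floor(lng + 180);  // 0..359 for lng in [-180, 180)
        return row * 360 + col;                 // unique id per cell
    }

    public static void main(String[] args) {
        System.out.println(cellId(40.4, -3.7)); // the 1-degree cell around Madrid
    }
}
```

The same scheme at 10-degree resolution (divide and floor by 10) gives the coarser partition used in the fallback above.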

Strategy 2: 
1-Degree in-memory index of polygons
So this approach runs the join the opposite way round. The polygons are loaded into memory and a
HashMap<Integer, Collection<Polygon>>
is built for each Map operation to use.

The logic is effectively "give me the polygons intersecting the 1-degree cell this point falls in", then test those candidates properly. This worked OK, but requires a LOT of memory for the polygons to be held.
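A self-contained sketch of such a grid index (the class and method names are mine, and java.awt.geom.Path2D stands in for the real JTS geometries so the example needs no extra libraries):

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;
import java.util.*;

public class GridIndex {
    private final Map<Integer, List<Path2D>> index = new HashMap<>();

    // Same 1-degree cell numbering as used for the point partitioning.
    public static int cellId(double lat, double lng) {
        return ((int) Math.floor(lat + 90)) * 360 + (int) Math.floor(lng + 180);
    }

    // Register a polygon under every 1-degree cell its bounding box touches.
    public void add(Path2D poly) {
        Rectangle2D b = poly.getBounds2D();
        for (int lat = (int) Math.floor(b.getMinY()); lat <= (int) Math.floor(b.getMaxY()); lat++)
            for (int lng = (int) Math.floor(b.getMinX()); lng <= (int) Math.floor(b.getMaxX()); lng++)
                index.computeIfAbsent(cellId(lat, lng), k -> new ArrayList<>()).add(poly);
    }

    // Candidate polygons come only from the point's cell; each is then
    // tested properly with contains().
    public List<Path2D> containing(double lat, double lng) {
        List<Path2D> hits = new ArrayList<>();
        for (Path2D p : index.getOrDefault(cellId(lat, lng), List.of()))
            if (p.contains(lng, lat)) hits.add(p);
        return hits;
    }
}
```

The memory cost is clear from add(): every polygon is stored once per cell its bounding box covers, so large polygons multiply the footprint.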

About 6 hours to process 60,000 polygons.

Strategy 3: 
RTree index of polygons
Very similar to strategy 2, this approach indexes all the polygons by RTree. Thus for each point, the index provides only candidate polygons whose bounding box contains the point, and the number of candidate polygons to test is far smaller than that of strategy 2.

45 minutes for a 130M x 30,000 cross-reference, but it requires the larger EC2 instance sizes. The memory needed for the RTree index means it will need to be processed in batches.
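The actual job used the JTS RTree implementation; as a self-contained illustration of why the bounding-box index cuts the candidate set down, here is a miniature packed R-tree leaf level (entries sorted by x and cut into pages, each page carrying the union of its bounding boxes). The names and structure are my own sketch, not JTS code:

```java
import java.awt.geom.Rectangle2D;
import java.util.*;

public class MiniRtree {
    private static class Leaf {
        Rectangle2D mbr;                                  // union of the page's boxes
        final List<Rectangle2D> boxes = new ArrayList<>();
        void add(Rectangle2D b) {
            boxes.add(b);
            mbr = (mbr == null)
                ? new Rectangle2D.Double(b.getX(), b.getY(), b.getWidth(), b.getHeight())
                : mbr.createUnion(b);
        }
    }

    private final List<Leaf> leaves = new ArrayList<>();

    // Bulk-load: sort by min x, then pack consecutive entries into pages.
    public MiniRtree(List<Rectangle2D> boxes, int pageSize) {
        List<Rectangle2D> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingDouble(Rectangle2D::getMinX));
        for (int i = 0; i < sorted.size(); i += pageSize) {
            Leaf leaf = new Leaf();
            for (Rectangle2D b : sorted.subList(i, Math.min(i + pageSize, sorted.size())))
                leaf.add(b);
            leaves.add(leaf);
        }
    }

    // Only pages whose MBR contains the point are opened; the real join then
    // runs the exact contains() test on each surviving polygon.
    public List<Rectangle2D> query(double x, double y) {
        List<Rectangle2D> hits = new ArrayList<>();
        for (Leaf leaf : leaves)
            if (leaf.mbr.contains(x, y))
                for (Rectangle2D b : leaf.boxes)
                    if (b.contains(x, y)) hits.add(b);
        return hits;
    }
}
```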
Hadoop requires tuning, just like a database does for large joins. Here is the final tuning for strategy 3:
  • // for the R-Tree index
  • mapred.output.compress=false // for easy reading of output
  • // we have 7G per node, so 4G for 4 Map tasks is ok
  • mapred.job.reuse.jvm.num.tasks=-1 // Reuse the Map as it has the index built (takes 90 seconds for startup)
  • dfs.block.size=134217728 // use larger blocks - in retrospect this was probably a mistake and will be removed for the next test
More to follow...

Thanks to the Hadoop-core users mailing list for their insights, and to Andrea of OpenGEO for pointing me at the JTS RTree implementation.

Wednesday, November 5, 2008

Do you know any species distribution project?

I am trying to build a general view of how species distributions are handled in different projects worldwide. I am especially interested in projects that have maps published on the Internet. So I would like to ask you for links to projects related to species distributions.

I am also interested in different ways people handle occurrence status or absence. 

And finally, I would also like links to any mapping application related to species distributions that you have seen. And if possible, comment on what you liked.

Please post your links as comments to this post or send them to me. I will post the results of this little research in the future.


Saturday, November 1, 2008

Flickr creating polygons out of points, what about species distributions?

This weekend I found a very interesting post from the Flickr programmers called The Shape of Alpha. What these people are basically doing is generating shapefiles from the geotagged pictures in Flickr. When a user geotags a picture, providing lat/long, they reverse-geocode it to find the neighborhood, the town, the country, etc. where those coordinates are. They store this information in the database (as Where On Earth (WOE) IDs), and this allows for much better performance when people later search for pictures in "Madrid".

But this post talks about a different idea, and it is awesome. Now that they have all these geotagged pictures and know each one belongs to a particular neighborhood or country (WOEID), they are creating shapefiles for these areas based on the coordinates of the different pictures! Maybe it is easier to explain with a picture:

The one on the left is a polygon for London and the one on the right for the United States. They were created by aggregating all the coordinates of the different pictures found in London and in the US.
Again: they take all the coordinates (POINTs) in the DB for a certain WOEID, and from them they generate a polygon. If they have enough data, it looks more or less OK. They have a good discussion about the quality of the polygons and the threshold set to generate them. I very much like their introduction to alpha shapes and the links they provide.
Additionally they also provide the source code of the tool they use to process the data. It is called Clustr.
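Clustr computes alpha shapes, which need a Delaunay triangulation under the hood. As a crude, self-contained stand-in that still illustrates the idea of deriving an area from a point cloud, here is a convex hull (the limit of an alpha shape as alpha grows to infinity) via Andrew's monotone chain; the class and method names are my own:

```java
import java.util.*;

public class Hull {
    // Cross product of OA x OB; positive means a counter-clockwise turn.
    private static double cross(double[] o, double[] a, double[] b) {
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]);
    }

    // Andrew's monotone chain convex hull; points are {x, y} pairs.
    public static List<double[]> convexHull(List<double[]> pts) {
        List<double[]> p = new ArrayList<>(pts);
        p.sort((u, v) -> u[0] != v[0] ? Double.compare(u[0], v[0])
                                      : Double.compare(u[1], v[1]));
        if (p.size() < 3) return p;
        List<double[]> hull = new ArrayList<>();
        for (double[] pt : p) {                     // build the lower hull
            while (hull.size() >= 2 && cross(hull.get(hull.size() - 2),
                                             hull.get(hull.size() - 1), pt) <= 0)
                hull.remove(hull.size() - 1);
            hull.add(pt);
        }
        int lower = hull.size() + 1;
        for (int i = p.size() - 2; i >= 0; i--) {   // build the upper hull
            double[] pt = p.get(i);
            while (hull.size() >= lower && cross(hull.get(hull.size() - 2),
                                                 hull.get(hull.size() - 1), pt) <= 0)
                hull.remove(hull.size() - 1);
            hull.add(pt);
        }
        hull.remove(hull.size() - 1);               // last point repeats the first
        return hull;
    }
}
```

A real alpha shape hugs the point cloud much more tightly than this (a hull around the US pictures would swallow the Gulf of Mexico), which is exactly why Flickr went to the trouble.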

Now, let's apply this concept to biodiversity, if you haven't already figured it out. Think of their geotagged pictures as PRIMARY DATA, the WOEs as SCIENTIFIC NAMES, and the polygons you get out of them as DERIVED SPECIES DISTRIBUTION polygons.
I have been investigating this for a long time already, especially through the Biodiversity Atlas project, which is stopped right now but will hopefully start again soon. We can take their source code, apply it to GBIF data, and generate "derived, unchecked and incomplete" species distributions in a massive way! And just as at Flickr, the more primary data we get into the system, the better the distributions will start to look.

But there is another idea... why not ask Flickr to process their polygons not only by WOE ID but also by tag? If we tagged pictures in Flickr with scientific names, or better GUIDs, they could then generate the distributions themselves.

I don't particularly see Flickr as the best place to handle species distribution discussions, and will hopefully try to convince at least one big biodiversity project to let me try this approach. Most of you can probably imagine the incredible API we could create once we have a lot of species distributions accessible in such a system. I will write another post about it soon, but think of:
  • Which species could live in my garden?
  • Which species' habitats will this new road cross? 
  • Where should I create a new Protected Area to preserve as much biodiversity as possible?
  • What species could I find in the track I will hike this weekend?
So maybe instead of so many niche-modelling projects, we should start thinking about how to manage the vast amount of primary data (and other sources) we already have, and how to curate and complete it. I dream of a scientific community joined together to create a complete information system for knowing where species are. Imagine something like Wikipedia, but for species distributions.