Since I was running against the full 135-million-record GBIF index and the IUCN Categories I-VI national shapefiles (60,753 polygons), I went back to Hadoop running in single-node mode (i.e. the simplest of setups), as I was not sure the MapReduce 'Lite' I wrote would cut it.
Input file: 135 Million records of 12 DwC fields (13G)
Stack: Hadoop for MapReduce, Geotools for the polygon operations
I went for the brute-force approach: the Map held a List of all the polygons, and each record was tested against every one.
Of course this did not perform... but I wanted to post some benchmarks:
Reference: a Map that does nothing, reading the full GBIF index input: 500 secs
Per polygon: 1,500 secs to produce the species-by-polygon output
(Note: this is MapReduce running in a non-clustered, single-server environment)
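The brute-force Map can be sketched roughly as follows. This is a plain-Java illustration, not the actual code: the class and method names are hypothetical, and a standard ray-casting point-in-polygon test stands in for the Geotools containment call.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the brute-force approach: every occurrence record is tested
// against every polygon inside the Map step, so the cost per record is
// O(number of polygons) -- 135M records x 60,753 polygons worst case.
public class BruteForceMap {

    // A polygon as parallel arrays of vertex coordinates (one ring).
    static class SimplePolygon {
        final double[] xs, ys;
        SimplePolygon(double[] xs, double[] ys) { this.xs = xs; this.ys = ys; }

        // Standard ray-casting point-in-polygon test (stands in for Geotools).
        boolean contains(double x, double y) {
            boolean inside = false;
            for (int i = 0, j = xs.length - 1; i < xs.length; j = i++) {
                if ((ys[i] > y) != (ys[j] > y)
                        && x < (xs[j] - xs[i]) * (y - ys[i]) / (ys[j] - ys[i]) + xs[i]) {
                    inside = !inside;
                }
            }
            return inside;
        }
    }

    final List<SimplePolygon> polygons = new ArrayList<>();

    // Called once per record: returns the indexes of all polygons
    // containing the record's coordinate.
    List<Integer> map(double lon, double lat) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < polygons.size(); i++) {
            if (polygons.get(i).contains(lon, lat)) hits.add(i);
        }
        return hits;
    }
}
```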
I tried this using only the polygon bounding boxes, to remove Geotools from the equation in each Map operation, and the results were the same.
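For the bounding-box-only run, the per-record test collapses to four comparisons, roughly as below (field names are illustrative); that even this showed no speed-up suggests the time goes into pushing 135M records through the framework rather than the geometry maths.

```java
// Axis-aligned bounding box: containment is four comparisons,
// i.e. as cheap as a per-record spatial test can get.
public class BoundingBox {
    final double minX, minY, maxX, maxY;

    BoundingBox(double minX, double minY, double maxX, double maxY) {
        this.minX = minX; this.minY = minY;
        this.maxX = maxX; this.maxY = maxY;
    }

    boolean contains(double x, double y) {
        return x >= minX && x <= maxX && y >= minY && y <= maxY;
    }
}
```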
It is clear that to do this kind of processing, the data needs to be sliced up to reduce the number of combinations.
I am pondering using the same tiling algorithm that the map guys seem to have settled upon (e.g. GE superoverlay style). By doing this, a preselect based on the intersection of the protected area's bounding box with the tiled data would massively reduce the processing. I am thinking 7 zoom levels, resulting in 32,768 distinct tiles, and therefore working only on 2.8x2.8-degree cells. Currently GBIF models to 1x1-degree and 0.1x0.1-degree cells, but these do not easily port to MapReduce for partitioning the input data.
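A tile key along those lines could be computed as below. This is a minimal sketch assuming a plain lat/lon grid of 360/2^7 = 2.8125-degree cells; the exact tile count and numbering depend on the tiling scheme chosen.

```java
// Maps a record's coordinate to a single int tile key at zoom 7,
// usable as a partition key for the MapReduce input.
public class TileIndex {
    static final int ZOOM = 7;
    static final int COLS = 1 << ZOOM;            // 128 columns of longitude
    static final double CELL = 360.0 / COLS;      // 2.8125 degrees per cell

    static int tileFor(double lon, double lat) {
        // Shift into [0, 360) / [0, 180) ranges, then bucket by cell size.
        // (lon == 180 or lat == 90 would need clamping in real use.)
        int col = (int) Math.floor((lon + 180.0) / CELL);
        int row = (int) Math.floor((lat + 90.0) / CELL);
        return row * COLS + col;
    }
}
```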
(As a side effect, the tiles could be processed in parallel into specific mapping views, ready for immediate serving through Geoserver...)