Sunday, November 30, 2008

Reproducing Spatial Joins using Hadoop and EC2

Finally I got some time to continue researching the use of Hadoop for the processing of the GBIF dataset with respect to the protected areas of the world.

The problem
There is a growing number of point data, and an increasing need to perform cross referencing of the GBIF data with external sources, in this case the protected areas of the world.

The last 3 years of working on reasonably large databases (>100G) has told me that this join was not going to be an easy one, despite the obvious partitioning strategy that can be invoked here (spatial partitioning).
Therefore Javi and I set about this task by
  1. Using PostGIS and the traditional approach of joining 2 tables of 150M point records with 120,000 polygon records - finding will be posted shortly
  2. Using Hadoop and the Amazon EC2 to process the join
Using Hadoop
This is ongoing but I thought I'd post the findings so far.

Strategy 1: 
Partition by cell
Start by splitting the 150M records from one file into a file per 1 degree cell.
Loop over each polygon:
  • Determine 1 degree cells covered polygon
  • Use the input files as a merge input to hadoop and pass in the Polygon for the Map to do the contains()

Result  
Hadoop will not handle so many file splits no matter what configuration I try so far. Mailing list responses suggest this is not the ideal use of Hadoop and it prefers single large files, rather than many smaller ones.

Amendment
Using 10 degree cells, it does work but takes a LONG time... so long I killed it. This was expected since it is really the equivalent of only a DB join with an index on an INTEGER teg_deg column.

Strategy 2: 
1-Degree in-memory index of polygons
So this approach goes through the join the opposite way. The polygons are loaded into memory and a
HashMap<Integer, Collection<Polygon>>
for each Map operation to use.  

The logic is effectively give me polygons intersecting the 1 degree square for this point, and then test them properly. This worked ok, but requires a LOT of memory for the polygons to be held.

Result
About 6 hours to process 60,000 polygons.

Strategy 3: 
RTree index of polygons
Very similar to strategy 2, this approach indexes all the polygons by RTree. Thus for each point, the index provides only candidate polygons whose bounding box contains the point, and the number of candidate polygons to test is far smaller than that of strategy 2.

Result
45 minutes for a 130M x 30,000 cross reference but requires the larger EC2 instance sizes. The RTree index means it will need processed in batches.
Hadoop requires tuning, just like a database does for large joins. Here is the final tuning for strategy 3:
  • mapred.child.java.opts=-Xmx1G // for the R-Tree index
  • mapred.output.compress=false // for easy reading of output
  • mapred.tasktracker.map.tasks.maximum=4 // we have 7G per node, so 4G for 4 Map tasks is ok
  • mapred.job.reuse.jvm.num.tasks=-1 // Reuse the Map as it has the index built (takes 90 seconds for startup)
  • dfs.block.size=134217728 // use larger blocks - in retropsect this was a probable mistake and will be removed for next test
More to follow...

Thanks to the Hadoop-core Users mailing list for their insights, and Andrea of OpenGEO for pointing me at the JTS RTree implementation


5 comments:

Dave Martin said...

Very nice blog, just curious - with the 150k polygons loaded into postgis, how long does it take for postgis to answer the contains() query for a single point?
Wondered if you considered just chundering you way through the 150 million records one by one, hence avoiding the join.
I can imagine this may take 10 hours or so but its a simple way of doing it. Obviously the possibilities with hadoop are more exciting though...

Javier de la Torre said...

Congratulations Tim! This is a great work.

I will follow in the next days with a post over the PostGIS way. Fort hose curious the best I could get was using an xlarge instance in ec2 with 8 cores. With postgresql 8.3 you can perform simultaneous scans over the occurrence table and perform well. This opens the possibility to parallel processing. But the best result I could get was 6 hours for the same job Tim has done in 45 min.

I do not do a join, I go more or less record by record finding if the points fall inside a polygon. So more or less I can process 7.000 points per second. But again, this is using 8 simultaneous queries (one per processor) on different parts of the occurrence table.

The strategy with PostGIS is very similar to what Tim does. I use first an Rtree index for bounding box check and then distance (contains is slower) to really check if the point is inside the polygon. The difference between the Hadoop way and with one single PostGIS database is that I am limited to the number of processors the machine has while with Hadoop you can use 20 machines at the same time (Tim, how many instances were you using by the way?). The only thing I can say is that I am using a machine while I think Tim is using 20, but considering prize model of ec2 it is curious: Hadoop, 45min 12$, PostGIS: 6 hours 4.8$. And this is very curious, because if Tim would had used 40 instances, the time would had been 22.5min to process but the same prize!, for the same prize you could had used 160 instances and process the data in 5.6min!

And I think you are probably not using all the cores in each of the instances... amazing how this map/reduce together with ec2 works...

Tim Robertson said...

Hi,
So Dave - effectively this is the Strategy 3 of the Hadoop approach. It churns through all the records one by one (but in parallel) and then does the "what polygons are you in" (not it is a many polygons to one point join due to the overlap of the protected areas). The polygon RTree index I keep in memory, just like you would want a database to.
Javier is doing the same in PostGIS, but this is what a database what effectively do in a join anyway, but either one way (loop over point table) or the other (loop over polygons) - I *think*...
So Javier is testing what you describe and we will publish a summary report when we have both finished tuning as best we can. I still think I can tune Hadoop better - this is where it is clearly not as strong as a DB as I am literally profiling RTree for memory usage, then predicting Hadoop framework overhead, then seeing how many parallel processes I can run on one EC2 instance etc etc.

Javi - Hadoop spawns multiple JVMs per Node (EC2 instance). Therefore I have assumed, but not checked, that since 4 JVMs are spawned on a 4 processor (2x2) machine, Fedora is smart enough to assign a processor to each. If it doesn't then I would be better using more smaller instances. I might write them and ask if I can the 20 machine limit removed for my account ;o)

You guys did well understanding my post - I read it today and wonder how much wine I had drunk before I wrote it.

Tom White said...

Interesting.

You should be able to get better performance out of the large instances on EC2 by using all their drives (if you're not already): see https://issues.apache.org/jira/browse/HADOOP-4745 for the code. This should work better than using a corresponding number of small instances because you get better IO from EC2.

krokodil said...

Very interesting. Please post further results if any. I am interested in spatial data processing in hadoop as probably many others.