I have been playing recently with the Hadoop distributed filesystem, which ships with a MapReduce framework, for processing large volumes of 'point' style biodiversity data. The results were reasonably promising, but Hadoop is designed to process LARGE quantities of data (terabytes), not the modest few hundred gigabytes that I am processing. Furthermore, Hadoop is designed to run on multiple machines, not the 'dual node on one machine' setup I was running. The MapReduce idea, however, really does fit the bill nicely for what I want to process, so I went looking for a MapReduce 'lite' in Java — but alas, everything I found seemed geared for enterprise development. So I started coding...
Inside iBiodiversity you will find a simple MapReduce implementation in Java. Working on 11 million point occurrence records, my implementation generates the species per 1x1 degree cell index. This index is needed to efficiently support a UI offering a clickable map that pulls up the distinct species the system has data for. Tuning-wise, you can configure the size of the pages to be worked on (P) and the in-memory sort size (M) before a temporary file is written. I found a P:M ratio of 10:1 produced the best results, with processing of the 11M records taking 175 seconds in only 256M of process memory.
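To show the shape of the computation, here is a minimal, in-memory sketch of the species-per-cell step — this is illustrative only, not the actual iBiodiversity code, and the `Occurrence` record and `cellId` naming are my assumptions. The real implementation works on pages of records and spills sorted runs to temporary files; this version just demonstrates the map (record to cell key) and reduce (distinct species per cell) phases:

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch only, not the iBiodiversity implementation.
public class SpeciesPerCell {

    // One point occurrence record: a species name plus lat/lng (assumed shape).
    record Occurrence(String species, double lat, double lng) {}

    // Map phase: key each record by its 1x1 degree cell.
    static String cellId(Occurrence o) {
        return (int) Math.floor(o.lat) + ":" + (int) Math.floor(o.lng);
    }

    // Reduce phase: collect the distinct species seen in each cell.
    static Map<String, Set<String>> speciesByCell(List<Occurrence> records) {
        return records.stream().collect(Collectors.groupingBy(
                SpeciesPerCell::cellId,
                Collectors.mapping(Occurrence::species, Collectors.toSet())));
    }
}
```

Because the reduce is a simple set union, partial results from separate pages (or separate machines) can be merged in any order — which is exactly what makes the job MapReduce-friendly.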
This looks fairly promising as a means of quickly writing code that can run in parallel to process data into formats for custom views of biodiversity data (species by cell, cells by species, cells by month by species, species by cell by decade, species interactions at the same time / place, aggregate counts, etc.). If it works on my MapReduce 'lite', then we know that as the data grows it will port to Hadoop easily.
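Each of those views amounts to running the same reduce over a different map key. A hedged sketch of that idea (the record shape, names, and decade bucketing are my assumptions, not the iBiodiversity API):

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.*;

// Illustrative sketch: each custom view is just a different key function
// applied over the same occurrence records.
public class Views {

    record Occurrence(String species, int year, double lat, double lng) {}

    static String cell(Occurrence o) {
        return (int) Math.floor(o.lat) + ":" + (int) Math.floor(o.lng);
    }

    // Generic "distinct species grouped by key" reduce, parameterised on the map key.
    static Map<String, Set<String>> speciesBy(List<Occurrence> recs,
                                              Function<Occurrence, String> key) {
        return recs.stream().collect(Collectors.groupingBy(key,
                Collectors.mapping(Occurrence::species, Collectors.toSet())));
    }

    // Species by cell by decade: the key is the cell plus the decade bucket.
    static Map<String, Set<String>> speciesByCellByDecade(List<Occurrence> recs) {
        return speciesBy(recs, o -> cell(o) + "|" + (o.year / 10 * 10));
    }
}
```

The same `speciesBy` helper covers species by cell (`key = cell`), cells by month by species, and so on — only the key function changes, which is why the approach should translate to Hadoop's mappers with little rework.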
What next? It will go into some index (H2, SOLR and Compass are looking like likely candidates, and all are semi-tested already in iBiodiversity) with a service on top for Javi to do some UI magic.