Sunday, June 29, 2008

Hadoop on Amazon EC2 to generate Species by Cell Index

I finally took the plunge onto Amazon's Elastic Compute Cloud (EC2) to do some proper distributed processing with Hadoop, after spending the past few weeks playing with MapReduce and distributed processing.

This is a post about my experiences and lessons, but for those who just want the result:

MacBook Pro, 2G JVM, single-node Hadoop, species per 1-degree cell for all GBIF occurrences: 2449 secs
EC2, 20 small instances, species per 1-degree cell for all GBIF occurrences: 472 secs

And the details:

The experiment was simply to generate the species per cell (1 degree x 1 degree) index for the entire GBIF occurrence record store (a 135M record index was used) and run some comparisons of local versus cloud execution.

So, the code (about 20 lines of 'real' code, and it could maybe be optimised further - the reduce, for example):

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/**
 * Generates the species by cell index
 * @author timrobertson
 */
public class SpeciesByCell extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text>,
               Reducer<IntWritable, Text, IntWritable, Text> {

  // assuming these are relatively heavy objects, reuse them as the tutorials do
  private final static IntWritable cellId = new IntWritable();
  private Text speciesMap = new Text();
  private Text speciesReduce = new Text();

  // reuse the compiled pattern for performance
  private Pattern tabPattern = Pattern.compile("\t");

  /**
   * Outputs Cell:Species
   */
  public void map(LongWritable key, Text value,
      OutputCollector<IntWritable, Text> output, Reporter reporter) {
    String data = value.toString();
    if (data != null) {

      // scientific name is column 2, latitude column 11, longitude column 12
      // (String.split is not the fastest way to split a string)
      String[] parts = tabPattern.split(data);
      if (parts.length >= 12) {
        try {
          cellId.set(SpeciesByCell.toCellId(Float.parseFloat(parts[10]), Float.parseFloat(parts[11])));
          speciesMap.set(parts[1]);
          output.collect(cellId, speciesMap);
        } catch (NumberFormatException e) {
          // skip records with unparsable coordinates
        } catch (UnableToGenerateCellIdException e) {
          // skip records outside the valid latitude/longitude range
        } catch (IOException e) {
          // skip records that cannot be collected
        }
      }
    }
  }

  /**
   * Distincts the species
   */
  public void reduce(IntWritable key, Iterator<Text> values,
      OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
    Set<String> species = new HashSet<String>();
    while (values.hasNext()) {
      species.add(values.next().toString());
    }
    for (String s : species) {
      speciesReduce.set(s);
      output.collect(key, speciesReduce);
    }
  }

  /**
   * Gives a cell id
   */
  public static int toCellId(Float latitude, Float longitude) throws UnableToGenerateCellIdException {
    if (latitude == null || longitude == null
        || latitude < -90 || latitude > 90
        || longitude < -180 || longitude > 180) {
      throw new UnableToGenerateCellIdException("Latitude[" + latitude + "], Longitude[" + longitude + "] cannot be "
          + "converted to a cell id");
    } else {
      int la = (int) Math.floor(latitude + 90);
      int lo = (int) Math.floor(longitude + 180);
      return (la * 360) + lo;
    }
  }
}
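
For completeness, a driver is needed to wire the job up. Here is a rough sketch only, assuming the old org.apache.hadoop.mapred API that ships with 0.17 and that main() sits in SpeciesByCell itself (matching the Maven manifest below); reusing the reducer as a combiner is my assumption:

// sketch only: a main() added to SpeciesByCell, plus these extra imports:
// org.apache.hadoop.fs.Path, org.apache.hadoop.mapred.FileInputFormat,
// org.apache.hadoop.mapred.FileOutputFormat, org.apache.hadoop.mapred.JobClient,
// org.apache.hadoop.mapred.JobConf
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SpeciesByCell.class);
    conf.setJobName("speciesByCell");

    // map and reduce both emit (IntWritable cellId, Text species)
    conf.setOutputKeyClass(IntWritable.class);
    conf.setOutputValueClass(Text.class);

    conf.setMapperClass(SpeciesByCell.class);
    conf.setCombinerClass(SpeciesByCell.class); // assumption: the reducer doubles as a combiner
    conf.setReducerClass(SpeciesByCell.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}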


Hadoop says it supports GZip files - so I GZipped the input data, which is simply 13 columns of DwC. Unzipped, the data is 13G, gzipped 972M.
After uploading and running, however, I found some pretty unusual results. Falling back to my favourite invasive (Passer domesticus) to run a smaller test, I found that with the GZipped input the map worked on only 246,768 rows, not the full 900,000+. So I conclude that either I am not running it right, or Hadoop does not support GZipped input. I don't see how the distributed chunking could work if the file is GZipped, and looking through the docs I think it is only on the output that you can use GZip. So I started again, extracted the full 13G and put that onto HDFS for processing.
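
(On the output side, compression can apparently be switched on through the job configuration; a rough sketch using the classic property names, which may differ between versions:)

// sketch: GZip the job output via the JobConf used in the driver sketch above
conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec",
    org.apache.hadoop.io.compress.GzipCodec.class,
    org.apache.hadoop.io.compress.CompressionCodec.class);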

Now, since I am a Maven user, I needed to build an executable jar; here is the pom section that sets the manifest:

<build>
  <defaultGoal>package</defaultGoal>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <configuration>
        <archive>
          <manifest>
            <mainClass>com.ibiodiversity.index.mapreduce.SpeciesByCell</mainClass>
            <packageName>com.ibiodiversity.index.mapreduce</packageName>
          </manifest>
        </archive>
      </configuration>
    </plugin>
  </plugins>
</build>


Now, we put the input file onto S3 for use in EC2:
bin/hadoop fs -put /path/to/source s3://<id>:<secret>@<bucket>/path/to/target

(OK, do NOT use s3://<id>:<secret>@<bucket>/ - it blew up when I tried to move the file out of S3 onto the local HDFS because it was sitting in the bucket root. My second attempt (another 2 hours of transfer) went to ibiodiversity-gbif-dwc/dwc/allData.)

Of course, my secret key has a / in it, so the Hadoop client blows up even when it is escaped as %2f... in the end I pulled the Hadoop code into Eclipse and hacked around it directly.

Uploading to S3 using the Hadoop FsShell takes a LONG time from my house! 3 minutes per 32MB = 186kb/s.
Now that it is on S3, I can run many index generations without this latency, but each time there is a new version of the data it would need to be uploaded again. Clearly it would be better to harvest straight to S3 (perhaps in pages?).

1.5 hours (of hoping the network stays alive) later... (make that 3.5 hours total waiting time, but it is there for next time)

Now I fire up a master instance (standard Hadoop 0.17.0 AMI), connect to it, and copy across the input data using:
cd /usr/local/hadoop-0.17.0
bin/hadoop fs -mkdir logs
bin/hadoop distcp s3://<id>:<secret>@<bucket>/path/to/logs logs


The dreaded / problem again!!!
This time there was no easy way to hack around it, so top tip: keep regenerating your AWS secret key until it has none of the awkward characters (/, _, -) in it!!!
Now the copy works fine.
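
(An alternative I did not try, so treat it as an untested sketch: put the credentials in hadoop-site.xml rather than in the URI, which should sidestep the escaping problem entirely. With that in place, the source should reduce to s3://<bucket>/path/to/logs.)

<!-- untested sketch: S3 credentials in hadoop-site.xml instead of the s3:// URI -->
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>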

Copy my jar file up to the master (run locally):
. bin/hadoop-ec2-env.sh
scp $SSH_OPTS /tmp/speciesByCell.jar root@$MASTER_HOST_DNS:
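
Then, logged into the master, the job itself is launched with something along these lines (a sketch only; 'logs' and 'output' stand for whatever input and output paths were actually used, and it assumes the main method sketched earlier):

bin/hadoop jar ~/speciesByCell.jar logs output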


So, as always, we need a benchmark from a single node.
MacBook Pro 2.4GHz, 2G JVM memory:

org.apache.hadoop.mapred.Counters Counters: 11
org.apache.hadoop.mapred.Counters Map-Reduce Framework
org.apache.hadoop.mapred.Counters Map input records=134475707
org.apache.hadoop.mapred.Counters Map output records=106297454
org.apache.hadoop.mapred.Counters Map input bytes=14209537679
org.apache.hadoop.mapred.Counters Map output bytes=2406590553
org.apache.hadoop.mapred.Counters Combine input records=106297454
org.apache.hadoop.mapred.Counters Combine output records=1130567
org.apache.hadoop.mapred.Counters Reduce input groups=41845
org.apache.hadoop.mapred.Counters Reduce input records=9565669
org.apache.hadoop.mapred.Counters Reduce output records=7259291
org.apache.hadoop.mapred.Counters File Systems
org.apache.hadoop.mapred.Counters Local bytes read=3038621493011
org.apache.hadoop.mapred.Counters Local bytes written=73325771174

Finished in 2449 secs!

And the 20-node cluster:

08/06/29 17:31:04 INFO mapred.JobClient: Counters: 17
08/06/29 17:31:04 INFO mapred.JobClient: File Systems
08/06/29 17:31:04 INFO mapred.JobClient: Local bytes read=298092555
08/06/29 17:31:04 INFO mapred.JobClient: Local bytes written=597267841
08/06/29 17:31:04 INFO mapred.JobClient: HDFS bytes read=14210393954
08/06/29 17:31:04 INFO mapred.JobClient: HDFS bytes written=53736532
08/06/29 17:31:04 INFO mapred.JobClient: Job Counters
08/06/29 17:31:04 INFO mapred.JobClient: Launched map tasks=231
08/06/29 17:31:04 INFO mapred.JobClient: Launched reduce tasks=1
08/06/29 17:31:04 INFO mapred.JobClient: Data-local map tasks=207
08/06/29 17:31:04 INFO mapred.JobClient: Rack-local map tasks=5
08/06/29 17:31:04 INFO mapred.JobClient: Map-Reduce Framework
08/06/29 17:31:04 INFO mapred.JobClient: Map input records=134475707
08/06/29 17:31:04 INFO mapred.JobClient: Map output records=106297454
08/06/29 17:31:04 INFO mapred.JobClient: Map input bytes=14209537679
08/06/29 17:31:04 INFO mapred.JobClient: Map output bytes=2406590553
08/06/29 17:31:04 INFO mapred.JobClient: Combine input records=104849073
08/06/29 17:31:04 INFO mapred.JobClient: Combine output records=780651
08/06/29 17:31:04 INFO mapred.JobClient: Reduce input groups=41845
08/06/29 17:31:04 INFO mapred.JobClient: Reduce input records=9416965
08/06/29 17:31:04 INFO mapred.JobClient: Reduce output records=7259291

Finished in 472 secs!


Neither of these runs was tuned in any way (mappers, reducers, block size, etc.).

Cost of running: well, under 1 hour, so ~$2 (20 instances x $0.10, plus a little for transfers).

(Thanks to Tom White for publishing guidelines on running Hadoop on EC2 here, on which I based this test.)




Wednesday, June 25, 2008

Convex Hull over GBIF "points"

Yesterday I found a post where the convex hull was explained and the source code made available for JavaScript and PHP. "The convex hull may be easily visualized by imagining an elastic band stretched open to encompass the given points; when released, it will assume the shape of the required convex hull." (from Wikipedia). I had talked about it months ago with Tim, and I thought it would be fun to adapt it to AS3 and use it in BiodiversityAtlas. Back then we discussed that one way to create "polygons" from plain GBIF points would be to compute a convex hull over them.
But when I finished I realized it might not make much sense... the convex hull over GBIF points always produces enormous polygons over land and water that don't look very nice. Look at the next picture.
This is the Puma concolor convex hull over GBIF data. Well, this is exactly what a convex hull is, so I am not sure what I was expecting.

There are other types of convex hull, like the orthogonal convex hull, that might make more sense in some scenarios, but in general this approach does not look good to me.

By the way, the convex hull algorithm I am using, quickhull, is amazingly fast.
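
For anyone who wants the gist without digging out the AS3, here is a rough Java sketch of the same idea; note it uses Andrew's monotone chain variant rather than quickhull, and it is purely illustrative, not the BiodiversityAtlas code:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class ConvexHullSketch {

    // cross product of vectors OA and OB; positive means a counter-clockwise turn
    private static double cross(double[] o, double[] a, double[] b) {
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]);
    }

    /** Returns the hull as [x,y] points in counter-clockwise order (monotone chain). */
    public static List<double[]> hull(List<double[]> points) {
        List<double[]> p = new ArrayList<double[]>(points);
        Collections.sort(p, new Comparator<double[]>() {
            public int compare(double[] a, double[] b) {
                return a[0] != b[0] ? Double.compare(a[0], b[0]) : Double.compare(a[1], b[1]);
            }
        });
        if (p.size() < 3) {
            return p;
        }
        double[][] h = new double[2 * p.size()][];
        int k = 0;
        // build the lower hull
        for (int i = 0; i < p.size(); i++) {
            while (k >= 2 && cross(h[k - 2], h[k - 1], p.get(i)) <= 0) k--;
            h[k++] = p.get(i);
        }
        // build the upper hull
        for (int i = p.size() - 2, lower = k + 1; i >= 0; i--) {
            while (k >= lower && cross(h[k - 2], h[k - 1], p.get(i)) <= 0) k--;
            h[k++] = p.get(i);
        }
        List<double[]> result = new ArrayList<double[]>();
        for (int i = 0; i < k - 1; i++) { // the last point repeats the first
            result.add(h[i]);
        }
        return result;
    }
}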

If you want to try it for yourself, I have uploaded the test app. If anybody is interested in the source code, let me know and I will post it.

Sunday, June 22, 2008

Mapreduce for species in a Protected Area

The World Commission on Protected Areas (WCPA) publish shapefiles for protected areas, so I thought I'd run these against the GBIF index and get some species lists.
Since I was running against the full 135-million-record GBIF index and the IUCN Categories I-VI national shapefiles containing 60,753 polygons, I went back to Hadoop running in single-node mode (i.e. the simplest of the simple), as I am not sure the MapReduce 'Lite' I wrote would cut it.

Input file: 135 Million records of 12 DwC fields (13G)
Stack: Hadoop for Mapreduce, Geotools for Polygon

I went for the brute force approach: the Map held a List<Polygon> and, for each GBIF point record, I looped over all the polygons (60,000 x 135M = 8,100,000,000,000 comparisons ;o)
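
To give a flavour of that brute force map (a sketch only; it uses the JTS geometry classes that Geotools builds on, and the class and method names here are mine, not the real job):

import java.util.List;

import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.GeometryFactory;
import com.vividsolutions.jts.geom.Point;
import com.vividsolutions.jts.geom.Polygon;

// sketch of the brute force check: for one parsed GBIF record,
// test every protected area polygon for containment of the point
public class BruteForcePointInPolygon {
    private final GeometryFactory factory = new GeometryFactory();

    // 'polygons' stands for the 60,753 protected area polygons held by the Map
    public void findAreas(double latitude, double longitude, List<Polygon> polygons) {
        Point point = factory.createPoint(new Coordinate(longitude, latitude));
        for (Polygon polygon : polygons) {
            if (polygon.contains(point)) {
                // in the real Map this is where (protectedArea, species) would be collected
                System.out.println("Point falls in: " + polygon);
            }
        }
    }
}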

Of course this did not perform... but I wanted to post some benchmarks:

Reference: A Map that does nothing, and the GBIF index input: 500secs
Per Polygon: 1500secs to produce the species by polygon
(Note: this is running MapReduce in a non-clustered, single-server environment)

I tried this using only the polygon bounding boxes, to remove Geotools from the equation in each Map operation, and the results were the same.

It is clear that to do this kind of processing, the data needs to be sliced up to reduce the number of combinations.
I am pondering using the same tiling algorithm that the map guys seem to have settled upon (e.g. GE superoverlay style). By doing this, a preselect based on the intersection of the protected area bounding box with the tiled data would massively reduce the processing. I am thinking of 7 zoom levels, resulting in 32768 distinct tiles, and therefore working only on roughly 2.8x2.8 degree cells. GBIF currently models to 1x1 degree and 0.1x0.1 degree cells, but this does not easily port to MapReduce for partitioning the input data.
(As a side effect, the tiles could be processed in parallel for specific mapping views ready for immediate serving through Geoserver...)
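
To make the tiling idea a little more concrete, something along these lines is what I have in mind for a tile id (illustrative only; the cell size and numbering scheme are assumptions, nothing is settled):

/**
 * Illustrative only: id of the tile containing a point, for a grid dividing the world
 * into (360/cellSize) x (180/cellSize) tiles numbered row by row from the south-west.
 */
public static int toTileId(double latitude, double longitude, double cellSizeDegrees) {
    int columns = (int) Math.ceil(360d / cellSizeDegrees);
    int column = (int) Math.floor((longitude + 180d) / cellSizeDegrees);
    int row = (int) Math.floor((latitude + 90d) / cellSizeDegrees);
    return (row * columns) + column;
}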

Tuesday, June 17, 2008

Displaying Shapefiles in Google Maps for Flash (or any other AS3 mapping engine)

Recently I have been exploring different ways to let the user upload shapefiles to a Flex application and display them on Google Maps for Flash or UMap. I am working on this for BiodiversityAtlas, where users will be able to upload species distributions as SHP files.

There are two easy options you can try:

1) Load the Shapefile natively in Flex using vanrijkom classes. Read the geometries and create overlays for the mapping api.
2) Process them in PHP using the ShapeFile.inc.php class developed by Juan Carlos Ulloa, send them over AMFPHP as AS3 objects similar to the mapping API's, and overlay them.

I have tried both ways now. You can check the code of the first try and a demo at:

http://biodiversityatlas.com.s3.amazonaws.com/shps/shapeFileReader.html

I am still not sure which way I am going to go for BiodiversityAtlas, as both look fine to me. I will probably use the PHP way because, if the user has to upload the file to the server anyway, I can process it there, and I might store the geometries directly in PostGIS even before displaying them... so that looks like my route.

But in any case it is great to see that I can go both ways and that both are surprisingly fast.

If anybody is interested in the PHP example, please let me know.

Sunday, June 15, 2008

Generic data harvester

Work is underway on a generic harvester for biodiversity data sources (DiGIR, TAPIR, BioCASe, OAI-PMH + others when they come up - LSID + TaxonOccurrence etc).

The goals of this project are to remove the need for application developers to spend time on data harvesting, scheduling of harvesting, consoles for log display, or basic processing.  The framework is extensible, with each protocol residing in a very simple subproject (new protocols and versions are easy to add) and the UI generated automatically for parameter input (no need to write JSP for a new protocol!!!).  Everything is internationalised, including all logs, which come with simple JSON+AJAX based log tailing in the browser.  If we can work with the wrapper providers and offer a single generic solution to TDWG, perhaps we should be aiming to get to the stage where data providers are certified to work with the TDWG harvester?

The code is almost stable, and ready to accept contributors (Java developers)...

(Built on top of AppFuse, Java, Spring, Hibernate, Struts2, Maven and MySQL, soon to be H2)

MapReduce 'Lite'

I have been playing recently with the Hadoop distributed filesystem, which ships with a MapReduce framework, for processing large volumes of 'point' style biodiversity data.  The results were reasonably promising, but Hadoop is designed to process LARGE quantities of data (terabytes), not the mediocre few 100G that I am processing.  Furthermore, Hadoop is designed to run on multiple machines, not the 'dual node on one machine' setup that I was using.  The MapReduce idea, however, really does fit the bill nicely for what I want to process, so I went looking for a MapReduce 'lite' in Java but, alas, everything seems geared towards enterprise-scale deployment.  So I started coding...

Inside iBiodiversity you will find a simple MapReduce implementation in Java.  Working on 11 million point occurrence records, my implementation generates the species per 1x1 degree cell index.  This is needed to efficiently support a UI offering a clickable map that pulls up the distinct species the system has data for.  Tuning-wise, you can configure the size of the pages to be worked on (P) and the in-memory sort size (M) before a temporary file is written.  I found a P:M of 10:1 produced the best results, with the processing of 11M records taking 175s in only 256M of process memory.
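
To make the idea concrete, here is a minimal in-memory sketch of the map -> group by key -> reduce pattern; the real iBiodiversity implementation additionally works on pages of P records and spills sorted runs of M records to temporary files before merging, which is not shown here:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal in-memory sketch of the map -> group by key -> reduce pattern.
public class MapReduceLite<IN, K extends Comparable<K>, V, OUT> {

    public interface Collector<K, V> {
        void emit(K key, V value);
    }

    public interface Mapper<IN, K, V> {
        void map(IN record, Collector<K, V> collector);
    }

    public interface Reducer<K, V, OUT> {
        OUT reduce(K key, List<V> values);
    }

    public List<OUT> run(Iterable<IN> input, Mapper<IN, K, V> mapper, Reducer<K, V, OUT> reducer) {
        // TreeMap keeps the keys sorted, standing in for the sort/shuffle phase
        final Map<K, List<V>> grouped = new TreeMap<K, List<V>>();
        Collector<K, V> collector = new Collector<K, V>() {
            public void emit(K key, V value) {
                List<V> values = grouped.get(key);
                if (values == null) {
                    values = new ArrayList<V>();
                    grouped.put(key, values);
                }
                values.add(value);
            }
        };
        for (IN record : input) {
            mapper.map(record, collector);
        }
        List<OUT> results = new ArrayList<OUT>();
        for (Map.Entry<K, List<V>> entry : grouped.entrySet()) {
            results.add(reducer.reduce(entry.getKey(), entry.getValue()));
        }
        return results;
    }
}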

This looks fairly promising as a means of quickly writing code that can run in a parallel manner to process data into formats for custom views of biodiversity data (species by cell, cells by species, cells by month by species, species by cell by decade, species interactions at the same time/place, aggregate counts, etc.).  If it works on my MapReduce 'Lite' then we know that, as the data grows, it will port to Hadoop easily.

What next?  It will go into some index (H2, SOLR and Compass look like the likely candidates, all semi-tested already in iBiodiversity) with a service on top for Javi to do some UI magic.

Thursday, June 12, 2008

Using Google Charts with GBIF stats data

While developing the iPhone UI I wanted to give the Google Charts API a try. This is a pretty amazing, powerful API for creating charts easily. The charts always come back as simple images, so there is no chance to create interactive stats as with Flash and especially Flex Charting. In any case, it is so simple to use that I had to give it a try.

But let's take a look at one example. Say you want a world map with different colours per country depending on how many occurrences there are in the GBIF database for a certain taxon. For example, Pinales:


Well it is pretty simple, the URL looks like this:

http://chart.apis.google.com/chart?cht=t&chs=300x163&chco=ffffff,edf0d4,13390a&chtm=world&chf=bg,s,EAF7FE&chld=
GBDEUSNOFRCZPLSEATAUESIEPGMXNLKRCACHBELUNZITNCIMILIDPSJOFILIJPRUD
KSICRGRSYPEVNTWPTHNCNNIBOMGECHUGIARGTSKPYSVVEKPTZCOPAMABYTRDOM
MBRLABZEGCLLBZAGEKEMYHTADAMINPHLTSBPRUGVAKZTHMKGLUZUABGGFCDPK
ALETTJHRCMKGAQSZNFRSZMBILYGYMCSMIRLSCYROAZEEVUMZDZISNGMWSAZWLV
GANPLKKHTCBTMN&chd=s:6aZTNMLLGEDDCCCCCBBBBBBBBBBAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAA

Well, it looks complicated at first glance, but that is just because of the encoding of the data to represent. Step by step it is easier; this is how it was done:

1) Connect to the GBIF API service to retrieve occurrences per taxon and per country:

2) Generate the URL for charting:

$churl = "http://chart.apis.google.com/chart?cht=t&chs=300x163&chco=ffffff,edf0d4,13390a&chtm=world&chf=bg,s,EAF7FE";
$country_list = "";
$country_data = "";
foreach ($json['Resultset']['Result'] as $country) {
    $country_list .= $country['id'];
    $country_data .= $country['count'].",";
}
$country_data = substr($country_data, 0, strlen($country_data) - 1);
$churl .= "&chld=".$country_list;
$churl .= google_chart_encode($country_data, "s");


That's it! You've got the URL. This was done in PHP, but it could even have been done in JavaScript.

I also used it to represent the amount of data per country currently available in GBIF. If anybody is interested, please let me know and I will provide the code.



Wednesday, June 11, 2008

Taxonomic Browser in Flex

Last week I was playing with the iPhone and I loved its interface. While working on biodiversityatlas.com I wanted to implement some of those UI ideas in the taxonomic browser I am constructing.

The current component lacks design, effects and a bit of performance tuning, but in general I am satisfied with it, so I wanted to publish it here.

The images you see come from Google Images: for every taxon displayed, a request is sent to the Google API to look for the first image. Curiously, we discovered that the results change depending on which country you are in. Of course the images are not always correct, but they give a nicer way to help you navigate through the higher ranks. For people like me who don't know much about the taxonomic tree, it helps to find a Pinus at least!

I haven't implemented much functionality for viewing a taxon's details, but that is because BiodiversityAtlas is about distributions, not so much about what data is available for a taxon, as in the iPhone UI.

In any case, this is the taxonomic browser I am implementing now in BiodiversityAtlas. So please, if you like it, give me some feedback :)

Thursday, June 5, 2008

GBIF on your Iphone

I have been playing for the last 3 days with my new iPod Touch. I mainly bought it to try programming against the iPhone SDK, but first I wanted to start with something easier. I looked at the available frameworks for developing AJAX iPhone apps and came across iUI. This is a very easy to use library of CSS and JavaScript, so I couldn't resist giving it a try.

I decided to take a look at GBIF. I know the IT guys there in Copenhagen and I couldn't resist trying their APIs, especially the JSON ones. I contacted Dave Martin, who helped me a lot with the APIs, and in 3 days I came up with the video below. Of course there are a lot of things that could be improved, but I think it is nice for a short try.

I have made a video for those who do not have an iPod/iPhone yet. I know it is rotated; I couldn't find a way to rotate videos on my Mac, so sorry for the neck pain while watching it.

I know the music is also... well, I did not want to spend more time on it. Here it goes!



And if you want to try it yourself, here is the link (you should use Safari or another WebKit browser):


I hope you like it. Any comments or feedback is welcome.

Javi.