Monday, November 16, 2009

Amazon EC2, EBS RAID-0 & PostGIS build script

EC2's dirty secret
Javier's post was a great tutorial on building out a PostGIS database on Amazon EC2. We all know EC2, but it does have its drawbacks, and they are mainly related to disk IO. When using EC2 & EBS with large datasets you can easily run into IO bottlenecks. Individually these are not such a big deal, but when you are conducting global analyses, poor disk IO on EC2 & EBS can quickly become a problem.

Clean living?
To help alleviate this, there is a trend of people stringing together EBS volumes and creating their own software RAID-0 arrays to achieve higher read and write throughput.

Nope, a Bash script.
I pieced together bits and bobs to create a script that builds out a PostGIS database on an n-volume RAID array on EC2. It's pretty simple stuff, but it should mean that you can get your 20-volume RAID-0 PostGIS test rig up and running in minutes instead of hours.

You can grab it from Github:

Monday, November 9, 2009

Automated informatics pipelines, public datasets, and the cloud

Although Biodivertido typically focuses on biodiversity informatics, I'm going to step it back to informatics of another stripe in the hopes of making a few points and getting people primed for Javier's TDWG talk Moving Biodiversity to the Cloud on Friday.

A couple of days ago I became curious whether I could use CloudBurst to compare novel influenza sequences to the entire known sequence record of influenza in the hopes of detecting novel reassortants on the fly. The intended use of CloudBurst is to take small sequence reads from high-throughput sequencing projects and query them against a complete genome for a variety of reasons, including SNP discovery, genotyping and personal genomics [1]. So why not take a novel influenza and break it up into many small overlapping fragments and query them against a genome composed of all known sequences to detect reassortment?

Comparisons similar to these can likely be done using some manipulation of BLAST or a similar algorithm. But in my case, I was particularly interested in CloudBurst because of its available implementation in EC2 using Hadoop to utilise the cluster environment. This means that, if successful, I could later wrap my method into an automated workflow for detecting and reporting potentially novel influenza reassortants as soon as new sequences are reported.

I posted an overview of the steps it took below. In the end, as a first pass detection of reassortment it appears promising or at least interesting. All in all, I went from idea, to data assembly, to trial implementation in about 2 hours (not counting a few hours trying to get Hadoop to run the CloudBurst.jar. Hadoop version number = VERY important). That is great for the study of influenza, but what about biodiversity informatics?

There are a few things that made this so simple.

First, a rapid adoption of EC2 by the bioinformatics community. Look for example at the JCVI-maintained Bio-linux distribution for use on EC2. It comes prepackaged with a whole smörgåsbord of very fun little tools. It costs 34 cents per hour (not 10, because it requires a 64-bit instance) and takes about 1 minute to set up your very own copy. Where is our (the biodiversity informatics community's) GISing, niche modeling, distribution mapping, PDing, SGAing, Hadoop spatial joining machine image?

Second, a growing community of users who are exchanging knowledge about how to tackle these very large datasets in the cloud. Biodiversity informatics is growing its own such community; the writers of this blog can attest to that.

Third, cloud-hosted datasets. Amazon Web Services is making it fairly simple to host very large datasets that anyone in the community can access. The GenBank image is ~200 GB and takes 5 minutes to set up as a mounted volume for my own use. This targets a completely different consumer than our data portals and APIs. It opens up the data for the community of informatics enthusiasts doing cool things on the cloud (think spelling correction, retrospective georeferencing, spatial joining).

With all that said I'd like to return to think about how this relates to biodiversity informatics. I've been talking with a variety of people over the past few weeks and all of them show some level of interest in moving to the cloud. What I'm worried about now is that we will adopt many of the same stances as a community as we have had without the cloud. The case study I developed above relied heavily on easily accessible data and tools. We as a community must move forward with this as a primary goal. Likely, there will always be a need for portals and APIs, but for really big questions sometimes it is just easier to have the entire dataset ready for access. Why hasn't our community got it together enough to launch a unified public dataset in the cloud?

I guess data quality concerns and ownership are two primary concerns. I'm sorry, but those are bad reasons, time to grow up, it isn't 2001 any longer.

Once in place we can begin to build new methods (or reapply existing ones) for parsing out duplicate records, linking data to geographic areas, merging error types into targeted datasets, and sharing findings with the owners of the original data. The 'snapshot' approach as implemented in AWS Public Data Sets means that we consumers constantly rely on the original providers to include the newest and most up-to-date records; none of our hard work will be included in future snapshots if we don't come up with methods for reporting corrections to the source.

It is important to restate: I don't think portals and APIs will go away. They serve a completely different consumer community: people pulling smaller subsets of data manually, or developing tools to do so via APIs, rather than people interested in looking at the entire dataset at once. I do think that by providing public datasets, new methods and technologies for enhancing the portals and APIs will arise, as well as still unknown methods for improving the datasets at source, ultimately enhancing our knowledge about the world's biodiversity.


METHODS

After downloading Hadoop (version 0.18.3), here is what I ran.

Launch Hadoop cluster, create and mount a volume containing the GenBank influenza data

>src/contrib/ec2/bin/hadoop-ec2 launch-cluster cloud 5

>ec2-create-volume --snapshot snap-fe3ec297 -z us-east-1d

>ec2-attach-volume vol-7b49a612 -i MASTERNODEIDHERE -d /dev/sdh

>src/contrib/ec2/bin/hadoop-ec2 login cloud

$mkdir /inf

$mount /dev/sdh /inf

Here I ran a small Python script to extract all sequences (hemagglutinin only) known prior to the original swine flu outbreak, concatenate those sequences into a single “genome” saved as data/genome.fa, and record a map of where in the genome each sequence ended.

Next, a group of sequences (again, hemagglutinin only) from the earliest swine flu outbreaks was broken up into numerous overlapping fragments and saved as data/segs.fa, again keeping a map of where each fragment belonged. These are the 'novel sequences' I would test for any cases of reassortment.
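The original script is not shown here, so the following is only a minimal sketch of the two preparation steps just described: concatenating known sequences into one "genome" while recording where each one ends, and cutting a novel sequence into overlapping fragments. The function names, the fragment size and the step are all assumptions made for illustration, not the values used in the run.

```python
def concatenate(records):
    """Join (name, sequence) pairs into one 'genome'.

    Returns the genome string and a list of (end_offset, name) pairs,
    in genome order, recording where each source sequence ends.
    """
    parts = []
    end_map = []
    offset = 0
    for name, seq in records:
        parts.append(seq)
        offset += len(seq)
        end_map.append((offset, name))
    return "".join(parts), end_map


def fragment(name, seq, size=36, step=12):
    """Yield (fragment_id, fragment) for overlapping windows over one sequence.

    size and step are illustrative defaults (assumptions). A tail shorter
    than one step may be left uncovered by the last window.
    """
    last_start = max(len(seq) - size, 0)
    for start in range(0, last_start + 1, step):
        yield f"{name}:{start}", seq[start:start + size]


def source_of(position, end_map):
    """Map a 0-based genome position back to the sequence it came from."""
    for end, name in end_map:
        if position < end:
            return name
    raise ValueError("position is past the end of the genome")
```

Encoding the start offset in each fragment id is one simple way to keep the fragment map; an alignment hit at a genome position can then be traced back with source_of.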

Download and extract CloudBurst

$wget http://downloads.sourceforge.net/project/cloudburst-bio/cloudburst/CloudBurst-1.0.1/CloudBurst-1.0.1.tgz

$tar xzf CloudBurst-1.0.1.tgz

$mv CloudBurst-1.0.1/ data/

Convert fasta files to Hadoop ready files

$java -jar data/ConvertFastaForCloud.jar data/genome.fa data/genome.br

$java -jar data/ConvertFastaForCloud.jar data/segs.fa data/segs.br

Move the data to hdfs

$/usr/local/hadoop-0.17.0/bin/hadoop fs -put ~/data /data

Run the analysis

$/usr/local/hadoop-0.17.0/bin/hadoop jar ~/data/CloudBurst.jar /data/genome.br /data/segs.br /results 40 3 0 1 50 15 5 5 128 5

Copy results from hdfs

$/usr/local/hadoop-0.17.0/bin/hadoop fs -get /results results

Convert the results back to human readable format

$java -jar data/PrintAlignments.jar results >results.txt

Profit!

Parsing meaning out of the results was a bit more labor-intensive and I forewent any automation (for now), using OpenOffice spreadsheets to map matched portions back to their original viruses and report accession numbers. What I found was interesting, exciting and promising, even matching some of the findings reported in Kingsford et al., 2009, from only 2 hours of work and a couple of dollars. I was pretty satisfied.

Wednesday, October 21, 2009

Install PostgreSQL 8.4 and PostGIS 1.4.0 in Ubuntu 9.04

I am a big fan of the new PostGIS 1.4.0 (and also of Paul Ramsey). I always have trouble installing PostGIS on Ubuntu, so I thought that this time I would document it and blog it here. So this is just a log of the steps required to install it on an EC2 instance with Ubuntu 9.04. I hope it can be useful for someone else.

Just for the record. The EC2 instance I used was ami-ccf615 from http://alestic.com .


Once logged in (on a totally fresh instance):




apt-get update
apt-get install vim
#PostgreSQL 8.4 is still not available on the regular package servers, so edit the sources
vim /etc/apt/sources.list
#  and add these two lines:
#  deb http://ppa.launchpad.net/pitti/postgresql/ubuntu jaunty main
#  deb-src http://ppa.launchpad.net/pitti/postgresql/ubuntu jaunty main

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 8683D8A2
sudo apt-get update
sudo apt-get install postgresql-8.4
#This changes the port from 5433 to 5432
sudo sed -i.bak -e 's/port = 5433/port = 5432/' /etc/postgresql/8.4/main/postgresql.conf

sudo /etc/init.d/postgresql-8.4 stop
sudo /etc/init.d/postgresql-8.4 start
apt-get install postgresql-server-dev-8.4 libpq-dev
apt-get install libgeos-dev
wget http://postgis.refractions.net/download/postgis-1.4.0.tar.gz
apt-get install proj
tar xvfz postgis-1.4.0.tar.gz
cd postgis-1.4.0
./configure
make
make install
sudo su postgres
#change the postgres password to "atlas" so that you can later login
psql -c"ALTER user postgres WITH PASSWORD 'atlas'"

#create the database (enter the password "atlas" if prompted)
createdb geodb
createlang -dgeodb plpgsql
psql -dgeodb -f /usr/share/postgresql/8.4/contrib/postgis.sql
psql -dgeodb -f /usr/share/postgresql/8.4/contrib/spatial_ref_sys.sql
psql -dgeodb -c"select postgis_lib_version();"
#This should return 1.4.0
exit


Good Luck!

Sunday, September 20, 2009

Using Geowebcache tiles stored in s3 from Google Maps for Flash

Last week I was in a code sprint for my next project at a very nice cottage. During the sprint I set up a Geoserver instance with Geowebcache to generate tiles from the World Database on Protected Areas. The tiles will be stored in S3 and distributed with CloudFront, as the main goal is speed worldwide. The tiles will be used in a set of Flash maps around the project.

The biggest issue came when trying to write a class to load the tiles from the default folder structure created by Geowebcache. I know I could have created my own tile folder structure generator for Geowebcache by creating my own BlobStore and modifying the FilePathGenerator to fit my needs, but I am not a Java lover, and I thought there might be good reasons why the good people at Geowebcache had used the current one.
The problem is that the structure is not very obvious. I imagine that is because you don't want to have too many files within one single folder, or you will start hitting file system limits.

In order to use Google Maps for Flash you need to convert a request based on zoom, X and Y tile coordinates into a URL from which to get the tile.

If you use Geowebcache for accessing your tiles this is not a problem: use their gmap service and it will deliver your tile from wherever it is in the file system. The issue is that I uploaded my whole Geowebcache cache to S3, so I needed the Flash client to know where to look for an x,y,z tile. With the help of Craig Mills we ported the code in Geowebcache to AS3.

The trickiest part was figuring out that Geowebcache starts its tiles from the bottom left, while Google starts from the top left. I don't know why they made this decision, but it is a bit of a pain. So tile z:1,x:0,y:0 in Google is z:1,x:0,y:1 in Geowebcache.
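The row flip can be sketched in a few lines. This is not the AS3 port mentioned above, just an illustration in Python, and it assumes the usual Google-style grid with 2^zoom rows per zoom level; the function name is made up for the example.

```python
def google_to_gwc(zoom, x, y):
    """Convert a Google (top-left origin) tile index to a bottom-left origin one.

    Assumes the grid has 2**zoom rows at each zoom level, so only the row
    index needs to be mirrored; zoom and column stay the same.
    """
    rows = 2 ** zoom
    return zoom, x, rows - 1 - y
```

With this, Google tile z:1,x:0,y:0 maps to z:1,x:0,y:1, as in the example above, and applying the function twice returns the original index.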

I thought it might be useful for someone that has similar needs so I have uploaded my classes here in case someone needs them.

Just a bit more information. I use the following command to trigger the tiling process:

curl -u geowebcache:secured -d "threadCount=01&type=reseed&gridSetId=EPSG:900913&format=image/png&zoomStart=00&zoomStop=08&minX=&minY=&maxX=&maxY=" http://localhost:8080/geowebcache/rest/seed/ppe:ppeblue

Remove the empty tiles that were generated (by checking whether they are smaller than 663 bytes)

find . -size -663c -type f -exec rm -f '{}' \;

Remove the empty folders

find . -empty -type d -exec rm -rf '{}' \;

And finally upload (well, sync) to S3

s3cmd sync --guess-mime-type --delete-removed --force --no-progress --acl-public /var/tiles/gwc/ppe_ppeblue/ s3://mybuket/ppe_ppeblue/

Friday, August 14, 2009

Presentation at Geoweb 09: Biodiversity: Where is important

I was an invited speaker at this year's Geoweb 09 conference in Vancouver. This year the conference focused on Cityscapes: trying to figure out how the geoweb could help with urban infrastructure in cities and things like that. It was curious that someone like me, focused mainly on biodiversity, got invited as a speaker there. But considering that cities have a huge effect on biodiversity, I think it was great to be there to offer some perspective on what we are doing in Biodiversity Informatics in relation to the geoweb. Hopefully some people got an idea of the openness of this community, what primary data is, and how we are sharing it. Here is my presentation on slideshare and links to videos on youtube (in 5 parts).






The conference was great, and some people claimed it was very hot, but really, these people don't know what a hot summer is if they haven't been in Madrid in August.
I especially enjoyed meeting Aaron Cope from Flickr, Paul Ramsey from the PostGIS team and lots of other people there.

I might start getting into the discussion on how biodiversity informatics can better share its data to play better on the geoweb, especially since there is some great discussion going on. Related to why OGC does not work with us is the question of how we can share our data more effectively so that it plays better on the Geoweb.

Finally, I think it would be interesting if more people from biodiversity informatics started getting into these issues, and this conference is a great place to discuss them. So, hopefully next year I will not be the only one talking about biodiversity at Geoweb 10!

Friday, July 3, 2009

Biodiversity databases and OGC standards don't play well together


Some days ago Tim and I had some discussions about how to provide OGC services for biodiversity databases, like for example the Global Registry of Migratory Species. This time the reason for the discussion to start again was the discovery of the new INSPIRE Geoportal Viewer. For those who don't know it, the INSPIRE directive is pushing the creation of a common infrastructure for sharing geospatial data within Europe. They plan to do that by using open standards like the ones from the Open Geospatial Consortium (OGC).

The typical use case they always describe for these Spatial Data Infrastructures (SDIs) is having a registry of services (Catalog Service) where a user can find geospatial data. The data is available through Web services, like Web Map Service (WMS) or Web Feature Service (WFS). With a web or desktop GIS client you discover and connect to the services, display different layers, do some analysis, print results, etc. All by using open standards and mixing data from lots of different places.

The INSPIRE registry and Geoportal Viewer is the typical example. On the viewer you can select from a list of available services (registered on the catalog service) or use your WMS web service URL.

When you select a service, you get a list of available layers on it. For example in the Spanish SDI you get this:
The way this works internally is by doing a GetCapabilities request to the WMS server. The returned XML document lists the available layers, with metadata on how to query them, their owner, etc.

But what happens when you have a database like the GBIF cache with 1.8 million species? You cannot create a layer per species, or the GetCapabilities document will be many megabytes in size and impossible for any client to parse. In any case, who wants to be given a list of 1.8M species to select a layer from?

Well, the way to make it work is to put a filter on the WMS request, specifying for example the species_id you are interested in. But those generic clients do not support specifying this kind of filter.
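To make the idea concrete, here is a hedged sketch of what such a filtered request could look like. It assumes a GeoServer-style server, where CQL_FILTER is a vendor-specific GetMap parameter rather than part of the WMS standard; the base URL, the layer name and the species_id column are hypothetical.

```python
from urllib.parse import urlencode


def getmap_url(base_url, layer, species_id, bbox, width=256, height=256):
    """Build a WMS 1.1.1 GetMap URL restricted to one species.

    CQL_FILTER is a GeoServer vendor parameter (an assumption about the
    server); other servers would need a different filtering mechanism.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
        # int() keeps arbitrary text out of the filter expression
        "CQL_FILTER": f"species_id = {int(species_id)}",
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and layer name, for illustration only
print(getmap_url("http://example.org/wms", "occurrences", 1234, (-180, -90, 180, 90)))
```

The point is that a single layer plus a per-request filter replaces millions of per-species layers, but a generic client has no way to discover or fill in that filter from the capabilities document.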

To me that means that OGC clients in their current state, like the ones used by INSPIRE, GEOSS, national SDIs, etc., are not able to handle biodiversity OGC services. Or, to put it a different way, OGC services are not prepared to handle biodiversity databases with lots of species.

What are the possibilities?

1) OGC supports, in its capabilities documents, things like "Hey! I am not a service with a set of layers, I am a datastore with potentially millions of layers. So if you want to grab anything from me, you are going to provide a filter in your request". This would require OGC to do some work and, more importantly, software clients to support it. I do not think this will happen within the next few years.

2) Create a set of interesting layers in biodiversity. We could match our biodiversity databases against the IUCN list of endangered species, create richness maps, etc., but access to primary data per species would not be possible.

If you think about the potential customers, we should probably be thinking of a predefined list of layers that we could all create on our OGC services and that might be interesting for lots of people. Richness, endangered species, kingdoms, whatever...

Another possibility is that we create portals where the user filters for a species and then gets a "customized, dynamic GetCapabilities document" that includes the filter in the URL. That would be possible. But with Catalog Services, like GEOSS, where there will be thousands of services, is biodiversity going to be so special that the user will go to one of our websites before continuing in their wonderful world of web services workflows? I doubt it.

Next week I am going to Geoweb 09 as an invited speaker to talk about biodiversity and the challenges of sharing it on the Geoweb. I would love to hear what you think about using OGC services within our community or any other issue related to geospatial data and analysis.

Thursday, June 25, 2009

RSS feeds used by publishers

One of my current tasks is working on tools to index publications in order to find scientific names. One of the first things to figure out is how to discover publications. Many publishers provide various RSS feeds for their latest issue(s), a feature that uBio RSS is making use of, scanning about 980 journal feeds as of today.

I am trying to put some recommendations together for publishers on how to encode their RSS feeds, or how to use other formats, to make their digital publications discoverable. If you have any recommendations I'd be glad to know about them. Advice on how best to expose back catalogues of all available publications would be especially interesting, as RSS feeds natively only show the latest ones (there are paging extensions for Atom, but they have no widespread support). Sitemaps or OAI-PMH seem like good candidates, although something easier than OAI would be preferred.

Wondering which RSS format is currently most widely used by publishers and which extensions they use to encode their metadata, I wrote a little tool today that reads all current feeds known to uBio and checks their RSS format. Here are the results, not yet analyzing the namespaces and extension formats:


rss_0.92 = 3
rss_1.0 = 336
rss_2.0 = 431
rss_0.91U = 6
atom_1.0 = 2


So clearly the RDF-based RSS 1.0 (often together with PRISM) and the simple RSS 2.0 format are the most used.
If only there were a simple way to page. Maybe Microsoft's Simple Sharing Extensions could help?
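The tool itself is not shown in this post, so the following is only a rough sketch of how such a format check could work: look at the root element of each feed and tally the results. The function names are made up for the example, and unlike the results above it does not try to distinguish RSS 0.91 variants such as "0.91U".

```python
import xml.etree.ElementTree as ET
from collections import Counter

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ATOM_NS = "http://www.w3.org/2005/Atom"


def feed_format(xml_text):
    """Guess the feed format from the root element of the document."""
    root = ET.fromstring(xml_text)
    if root.tag == f"{{{RDF_NS}}}RDF":
        return "rss_1.0"  # RSS 1.0 feeds are RDF documents
    if root.tag == "rss":
        return "rss_" + root.get("version", "unknown")
    if root.tag == f"{{{ATOM_NS}}}feed":
        return "atom_1.0"
    return "unknown"


def tally(feed_documents):
    """Count formats over an iterable of feed documents (as strings)."""
    return Counter(feed_format(doc) for doc in feed_documents)
```

Fetching the feeds and handling malformed XML are left out; a real run over hundreds of publisher feeds would need both.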