Monday, March 30, 2009

SpatialKey and biodiversity primary data analysis

A few days ago the good people at Universal Mind opened the beta program for their new product, SpatialKey. For those who are not in the Flex community, Universal Mind is a well-known company that develops Rich Internet Applications. So it was a great pleasure to see, months ago, that they were working on a new geospatial product for data analysis.

I was lucky to get an invitation to the beta program and be able to take a look. They are promising some great things, but for the moment the beta is limited in certain ways that I will describe later. The best way to get an overview is to watch some of their ubercool videos (maybe too much for my taste).

I wanted to give it a try as soon as possible and, coincidentally, I had just finished working on the new WDPA-GBIF widget. The widget lets you visualize primary biodiversity data from GBIF for all the protected areas in the world. Check out, for example, the protected areas in Australia, then select an area like the Great Barrier Reef. You will be able to download the data in multiple formats.

So I downloaded the data for the Canadian Rocky Mountains and imported it into SpatialKey. The import is currently limited to 10,000 records and 25 columns, so I had to delete a number of columns and records. I think SpatialKey should let you discard data visually once it has been uploaded.
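For anyone hitting the same limits, the trimming can be scripted instead of done by hand. This is only a minimal sketch: the function name `trim_csv` and the column names in the example are made up for illustration, and the real GBIF download will have its own headers.

```python
import csv
import io

MAX_RECORDS = 10_000  # beta import limit mentioned in the post
MAX_COLUMNS = 25      # beta import limit mentioned in the post


def trim_csv(src, dst, keep_columns, max_records=MAX_RECORDS):
    """Copy only keep_columns and at most max_records rows from src to dst.

    src and dst are open text file objects. Returns the number of rows written.
    """
    if len(keep_columns) > MAX_COLUMNS:
        raise ValueError("more columns than the import limit allows")
    reader = csv.DictReader(src)
    # extrasaction="ignore" silently drops the columns we do not keep.
    writer = csv.DictWriter(dst, fieldnames=keep_columns, extrasaction="ignore")
    writer.writeheader()
    written = 0
    for row in reader:
        if written >= max_records:
            break
        writer.writerow(row)
        written += 1
    return written


if __name__ == "__main__":
    # Hypothetical column names, just to show the call.
    source = io.StringIO("species,latitude,longitude,notes\nUrsus arctos,51.4,-116.2,x\n")
    target = io.StringIO()
    trim_csv(source, target, ["species", "latitude", "longitude"])
    print(target.getvalue())
```

Note that this keeps the first rows it finds. A random sample would probably be a fairer subset for a heatmap.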

In SpatialKey you manage your datasets separately from the reports you create from them. Right now the reports cannot be exported: you cannot print a report or distribute it as a widget. I know they are working on it, but for the time being I only have two ways to show you what it looks like: share my report with anybody who gives me their email, or record a little screencast for everybody to watch here. For the first option, if you are really interested in getting into the system to take a look, send me an email. The screencast follows:

So here are the things that I like a lot:
  • The heatmaps are just gorgeous. I would love to know how they do it.
  • The timeline filter is great. It has some usability issues, but it is great.
  • The way grids are displayed for summaries. The hover effect is very good and the tooltip very clear.
  • The filter "pods" are nice, but I wonder what would happen when you have hundreds of thousands of records to search or select from. I suppose that when there is a lot of data only the search would be enabled and not the selection.
  • Great look and feel.
Other comments:
  • Is it necessary to refresh on every map movement? I understand it is on the zoom and if you have the filter by visible area disabled.
  • Not having the possibility right now to share the reports as widgets to embed on the blog.
  • It would be nice to also let the user provide a polygon or geometry to define the boundaries of the analysis. In this case, for example, it would help a lot to visualize the borders of the protected area.
And finally, the things I really wonder how they work internally:
  • The heatmaps!
  • The data structures they use for dynamically regrouping the data on the client.
  • If it is true that they can handle millions of records, what does the server infrastructure look like? I know it is Java, but what about the data store? How do they handle the creation of dynamic indexes? Would it work with GBIF data?
My general impression of the tool is great. It looks awesome and works really well. It looks very similar to some of the ideas we have for developing analytical tools for biodiversity data with GBIF. Tim, give us your impressions please!

I would love to see more and more analytical tools like this for biodiversity. What do they call them? Something like Business Intelligence. I think we need some of this in our community. For the time being I will try to get into talks with Universal Mind about the applicability of SpatialKey to huge primary biodiversity datasets like GBIF.


Simon Tokumine said...

Javier, thanks for an awesome screencast. Did you do it in your pajamas...?

SpatialKey looks like an amazing tool. You mentioned setting up the color scale manually in the interpolated "heatmap". It would be really cool if we could take this a step further and manually define the interpolation function, for example a spline or our own custom interpolation, as well as the standard inverse-distance that the tool currently has.

I'd also be interested in knowing what sort of aggregation they are doing for their interpolation. Based on the speed of the tool, I'm guessing SK sums the species counts and then interpolates on this frequency variable. In GBIF's case, as not all the data is at one resolution, an interesting approach may be to create a probability surface per species (based on data resolution) and then aggregate these to yield a final heatmap.

Even simpler though, it would be great to be able to specify the weighting by allowing tweaking of the power variable in the inverse-distance formula to be based on resolution of the data you have entered. This small change could make the interpolations of subsets of data at similar resolutions more meaningful.
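To make the role of the power variable concrete, here is a minimal sketch of textbook inverse-distance weighting. It is not SpatialKey's implementation (which is not public); the function name `idw` and the sample layout are assumptions for illustration only.

```python
import math


def idw(samples, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    samples is a list of (sx, sy, value) tuples. A larger power makes
    each sample's influence fall off faster, so the surface becomes
    more local; a smaller power smooths it out.
    """
    numerator = 0.0
    denominator = 0.0
    for sx, sy, value in samples:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            # Exactly on a sample point: return its value directly.
            return value
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


if __name__ == "__main__":
    counts = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
    print(idw(counts, 0.5, 0.0, power=2.0))
    print(idw(counts, 0.5, 0.0, power=4.0))  # pulled closer to the nearest sample
```

Tying the power to resolution would then mean choosing a higher power for precise point records and a lower one for coarse grid records.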

Simon Tokumine said...

Actually, this links in nicely with your alpha shapes post. An alternative to creating probability surfaces would be to use a set of alpha shapes, perhaps weighted by some other factor such as data quality or resolution.

Doug McCune said...

Javier, I'm the lead Flex developer on the SpatialKey team. Thanks for the fantastic screencast! It's great to see the application out there being used and it's always cool to see people who instantly understand the value in what we're trying to do. I probably can't address some of your questions about exactly how we do certain things technically, but I can answer many of the other points you raise.

Regarding using the filter pods when you have hundreds of thousands of records, they'll still work the same way. We have clients that currently have around a half million records in the system and use the exact same filtering pods. Most of our pods limit the total number of items displayed (like in a timeline) by being smart about how to group the data, so if you have 100 years of data down to the millisecond level, we won't try to show a column for every millisecond, we'll only show a column for every year. The one pod that will be limited is the pod used for Text fields (in your screencast you use two of them), which shows the groupings by unique values. This pod is limited to only return a certain number of records (sorted by value). So in this case you might not get the full list of all unique values coming back, but you could still search using the text search to find the ones you're interested in (and you would always get to see the top couple hundred values no matter what).
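The timeline grouping described above can be pictured with a small sketch. This is a guess at the general idea only, not SpatialKey's actual code: the function names, the three granularities and the `max_columns` threshold are all assumptions.

```python
from collections import Counter
from datetime import datetime


def bucket_key(ts, unit):
    """Return the label of the timeline column that ts falls into."""
    if unit == "day":
        return f"{ts.year:04d}-{ts.month:02d}-{ts.day:02d}"
    if unit == "month":
        return f"{ts.year:04d}-{ts.month:02d}"
    return f"{ts.year:04d}"


def timeline_columns(timestamps, max_columns=100):
    """Pick the finest granularity that needs at most max_columns columns.

    Returns (unit, [(label, record_count), ...]) sorted by label.
    Falls back to yearly columns if even those exceed max_columns.
    """
    for unit in ("day", "month", "year"):
        counts = Counter(bucket_key(ts, unit) for ts in timestamps)
        if len(counts) <= max_columns:
            break
    return unit, sorted(counts.items())


if __name__ == "__main__":
    records = [datetime(2000, 1, 1), datetime(2001, 3, 5), datetime(2002, 7, 9)]
    print(timeline_columns(records, max_columns=100))
```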

We have lots of features planned to allow sharing of reports, but these features have not been rolled out to the beta trial accounts. But in the future we'll have the ability to do screenshot exports straight from the app as well as the ability to embed the interactive report in your own blog or link directly to it without requiring authentication (if you choose to make your report publicly accessible). We'll be rolling out some of these features in the coming months, so keep an eye out (we'll be sure to send you an update email once we release some of that stuff).

We're also working hard on polygon support. In terms of polygons, we'll be allowing you to draw your own custom shapes from within the app, and probably also import custom shapes from existing KML files. Once you have your polygons in the system you'll be able to use those polygons as filters just like any other filter in the application. You'll also be able to use those polygons for aggregation purposes too. We're trying to really plan out the polygon support so we make it powerful yet truly simple to use. One of our largest goals is to make powerful GIS software that removes most of the complications of existing GIS tools.

A few minor points: that runtime error at the end of the screencast has been fixed. You caught the beta revision at a point when we hadn't caught that yet. There was a window of about 2 days when that error was live, you just got lucky :) Also, if you want to select a single item in the grouped list, you can double click on the item (as opposed to selecting None, then selecting the item).

As we get further along with the beta phase we'll be rolling out accounts that will have much higher limits, both in terms of number of records per dataset and number of columns. I'd love to see how your full datasets work in the system. Thanks again for the great writeup (and sorry this comment got so long!). If you ever have any more questions or have more feedback, we'd love to hear it, just shoot us an email (the contact info is on the SpatialKey webpage or straight within the app you can send feedback).


Javier de la Torre said...

Hey Simon! Yes it was pretty late :)
Our experience with the data we have from GBIF is that we have a lot of places with very little data (1 record) and places with a lot of data (thousands), but little in the middle. That's probably because of the mix of grid and point data inside GBIF. That makes it difficult to get a nice visualization.

It would be great to have time to do a little app that tests these different interpolations. I am a little bit negative about the possible results because, from my point of view, the biggest impediment is the mix of grid data with point data. Without being able to separate these two things, any visualization will probably bring some confusion.

The idea of using the different alpha shapes to represent probability surfaces also sounds very cool. Has anybody already tried Clustr?

Javier de la Torre said...

Hey Doug,

Thanks for the long answer, I really appreciate getting this info.
I am already in contact with Tom from your team. In any case, here are some comments.

I note your comment that your largest goal is to create GIS software. I really think it is only a matter of time before someone creates the buzzword for GIS applications. One of the reasons, I think, is the mix of server and client development that would be needed. Some GIS operations cannot be done on the client, and handling huge amounts of data needs this server/client strategy. I suppose you had to do quite a lot of this in SpatialKey. It is wonderful that all this is hidden from the user (apart from all those refreshes).

Handling polygons for data analysis would just be incredible. But I suppose you would need to preprocess a lot of things. For example, in the last project I did we handled the intersection of 100,000 polygons with 160 million points. Agh, rather than explaining it here, there is a screencast here:

In our case the processing of the stats takes around 16 hours using PostGIS. This is basically finding which points fall in which polygons and calculating different stats.
But this use case is for a defined set of polygons, the world's protected areas. The other typical use case is that users would be able to create their own polygons and get different stats live. That sounds very much like what you are trying to do. Can you tell us at least how you are trying to handle this intersection? A relational database? Would it be possible to quickly determine which points, in a 200M datastore, fall inside the polygon?
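For readers who have not met the operation being discussed, this is a minimal sketch of the point-in-polygon test: a cheap bounding-box check first, then ray casting only for the points that survive it. It is neither the PostGIS job described above nor anything from SpatialKey, and the function names are made up. At the scale of hundreds of millions of points you would rely on a spatial index rather than a linear scan like this one.

```python
def bounding_box(polygon):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def point_in_polygon(x, y, polygon):
    """Ray casting: count how many edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def points_in_polygon(points, polygon):
    """Return the points inside polygon, using the bounding box as a prefilter."""
    min_x, min_y, max_x, max_y = bounding_box(polygon)
    hits = []
    for x, y in points:
        if min_x <= x <= max_x and min_y <= y <= max_y:
            if point_in_polygon(x, y, polygon):
                hits.append((x, y))
    return hits


if __name__ == "__main__":
    square = [(0, 0), (4, 0), (4, 4), (0, 4)]
    print(points_in_polygon([(2, 2), (5, 5), (-1, 2), (3, 1)], square))
```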

Finally, have you considered a pod that handles GroupingCollections? In our case we have a set of columns that define the taxonomy of the species. It would be great to have a pod that represents a dynamically generated tree based on the aggregations of the "active" records, with counts on each leaf. I have been using a GroupingCollection in my application for such things and I have to say it very quickly becomes unusable with a lot of data.
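To illustrate what such a pod would have to compute, here is a minimal sketch that builds the tree with counts in a single pass over the active records. It is plain Python rather than Flex, and the function name and the column names (family, genus, species) are just examples.

```python
def build_taxonomy_tree(records, levels):
    """Aggregate records into a nested tree with a record count on every node.

    records is a list of dicts; levels lists the taxonomy columns from
    the broadest to the most specific. Missing values are grouped under
    "(unknown)".
    """
    root = {"count": 0, "children": {}}
    for record in records:
        node = root
        node["count"] += 1
        for level in levels:
            name = record.get(level) or "(unknown)"
            child = node["children"].setdefault(name, {"count": 0, "children": {}})
            child["count"] += 1
            node = child
    return root


if __name__ == "__main__":
    active = [
        {"family": "Ursidae", "genus": "Ursus", "species": "Ursus arctos"},
        {"family": "Ursidae", "genus": "Ursus", "species": "Ursus americanus"},
        {"family": "Cervidae", "genus": "Cervus", "species": "Cervus canadensis"},
    ]
    tree = build_taxonomy_tree(active, ["family", "genus", "species"])
    print(tree["children"]["Ursidae"]["count"])
```

With a lot of data the aggregation itself would probably have to happen on the server, with the client receiving only the counts for the branches that are expanded.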

Thanks again Doug.