Pig >> mail # dev >> A major addition to Pig. Working with spatial data

A major addition to Pig. Working with spatial data
Hi all,
  First, sorry for the long email. I wanted to put all my thoughts here and
get your feedback.
  I'm proposing a major addition to Pig that will greatly increase its
functionality and user base. It is simply to add spatial support to the
language and the framework. I've already started working on that but I
don't want it to be just another branch. I want it, eventually, to be
merged into the trunk of Apache Pig. So, I'm sending this email mainly to
reach out to the main contributors of Pig and gauge the feasibility of this.
 This addition is part of a larger project we have been working on at the
University of Minnesota called Spatial Hadoop
[http://spatialhadoop.cs.umn.edu]. It's about building a MapReduce framework
(Hadoop) that is capable of maintaining and analyzing spatial data
efficiently. I'm the main guy behind that project, and since we released its
first version, we have received very encouraging responses from different
groups in the research and industrial communities. I'm sure the addition we
want to make to Pig Latin will be widely accepted by people in the spatial
community.
 I'm proposing a plan now, while we're still in the early phases of this
task, so that I can discuss it with the main contributors and assess its
feasibility. First of all, I think that we need to change the core of Pig
to be able to support spatial data. Providing a set of UDFs alone is not
enough. The main reason is that Pig Latin does not provide a way to create
a new data type, which is needed for spatial data. Once we have the spatial
data types we need, the functionality can be expanded using more UDFs.

Here's the plan as I see it.
1- Introduce a new primitive data type Geometry which represents all
spatial data types. In the underlying system, this will map to
com.vividsolutions.jts.geom.Geometry. This is a class from the Java Topology
Suite (JTS) [http://www.vividsolutions.com/jts/JTSHome.htm], a stable and
efficient open-source Java library for spatial data types and algorithms.
It is very popular in the spatial community and a C++ port of it is used in
PostGIS [http://postgis.net/] (a spatial library for Postgres). JTS also
conforms to the Simple Features specification of the Open Geospatial
Consortium (OGC) [http://www.opengeospatial.org/], an open standard for
spatial data types. The Geometry data type is read from and written to text
files using the Well Known Text (WKT) format. There is also a way to convert
it to/from binary so that it can work with binary files and streams.
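
As a rough illustration of the text and binary round trip described in
step 1, here is a minimal sketch that handles only 2D points and uses
nothing beyond the JDK. JTS itself provides WKTReader/WKTWriter and
WKBReader/WKBWriter for the full set of geometry types; the PointGeom class
below is a hypothetical stand-in for illustration, not part of JTS or Pig.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Locale;

/** Hypothetical stand-in for a Geometry value; supports 2D points only. */
final class PointGeom {
    final double x;
    final double y;

    PointGeom(double x, double y) {
        this.x = x;
        this.y = y;
    }

    /** Parses text of the form "POINT (x y)". */
    static PointGeom fromWkt(String wkt) {
        String s = wkt.trim();
        if (!s.toUpperCase(Locale.ROOT).startsWith("POINT")) {
            throw new IllegalArgumentException("Only POINT is supported: " + wkt);
        }
        int open = s.indexOf('(');
        int close = s.lastIndexOf(')');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("Malformed WKT: " + wkt);
        }
        String[] parts = s.substring(open + 1, close).trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected two coordinates: " + wkt);
        }
        return new PointGeom(Double.parseDouble(parts[0]), Double.parseDouble(parts[1]));
    }

    /** Formats the point as WKT using Java's default double formatting. */
    String toWkt() {
        return "POINT (" + x + " " + y + ")";
    }

    /** Encodes as WKB: byte-order flag, geometry type code, then x and y. */
    byte[] toWkb() {
        ByteBuffer buf = ByteBuffer.allocate(21).order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) 1);   // 1 = little-endian byte order
        buf.putInt(1);       // 1 = Point
        buf.putDouble(x);
        buf.putDouble(y);
        return buf.array();
    }

    /** Decodes a WKB point, honoring the byte-order flag in the first byte. */
    static PointGeom fromWkb(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        buf.order(buf.get() == 1 ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
        int type = buf.getInt();
        if (type != 1) {
            throw new IllegalArgumentException("Only Point (type 1) is supported, got " + type);
        }
        double px = buf.getDouble();
        double py = buf.getDouble();
        return new PointGeom(px, py);
    }
}
```

A loader would call something like PointGeom.fromWkt on each text field, and
a binary storage format would use the toWkb/fromWkb pair. A real
implementation would delegate both to JTS rather than parse by hand.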
2- Add functions that manipulate spatial data types. These will be added as
UDFs and we will not need to mess with the internals of Pig. Most probably,
there will be one new class for each operation (e.g., union or
intersection). I think it would be good to put these new operations inside
the core of Pig so that users can use them without having to write the fully
qualified class name. Also, since there is no way to implicitly cast a
spatial data type to a non-spatial data type, there will not be any
conflicts between existing operations and new operations. All new
operations, and only the new operations, will work on spatial data types.
Here is an initial list of operations that can be added. All of these
operations are already implemented in JTS, and the UDFs added to Pig will
be just wrappers around them.
**Predicates (used for spatial filtering)


**Aggregate functions
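
To make the "one class per operation, thin wrapper" idea in step 2 concrete,
here is a small JDK-only sketch. Everything in it is an assumption for
illustration: SpatialFunc is a hypothetical stand-in for Pig's UDF base
class, and Box is a hypothetical axis-aligned rectangle standing in for a
JTS geometry. The real wrappers would extend Pig's UDF class and call JTS.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for Pig's UDF base class; not the real Pig API. */
interface SpatialFunc<T> {
    T exec(List<Object> input);
}

/** Hypothetical stand-in for a JTS geometry: an axis-aligned rectangle. */
final class Box {
    final double minX, minY, maxX, maxY;

    Box(double minX, double minY, double maxX, double maxY) {
        this.minX = minX;
        this.minY = minY;
        this.maxX = maxX;
        this.maxY = maxY;
    }

    /** True when the two rectangles share at least one point. */
    boolean intersects(Box o) {
        return minX <= o.maxX && o.minX <= maxX && minY <= o.maxY && o.minY <= maxY;
    }

    /** Smallest rectangle covering both inputs (a bounding box, not a true union). */
    Box union(Box o) {
        return new Box(Math.min(minX, o.minX), Math.min(minY, o.minY),
                       Math.max(maxX, o.maxX), Math.max(maxY, o.maxY));
    }
}

/** Predicate wrapper: unpacks the arguments and delegates to the geometry. */
final class Intersects implements SpatialFunc<Boolean> {
    @Override
    public Boolean exec(List<Object> input) {
        Box a = (Box) input.get(0);
        Box b = (Box) input.get(1);
        return a.intersects(b);
    }
}

/** Operation wrapper: same shape, different delegated call. */
final class Union implements SpatialFunc<Box> {
    @Override
    public Box exec(List<Object> input) {
        Box a = (Box) input.get(0);
        Box b = (Box) input.get(1);
        return a.union(b);
    }
}

final class WrapperDemo {
    public static void main(String[] args) {
        List<Object> input = Arrays.asList(new Box(0, 0, 2, 2), new Box(1, 1, 3, 3));
        System.out.println(new Intersects().exec(input));
        Box u = new Union().exec(input);
        System.out.println(u.minX + " " + u.minY + " " + u.maxX + " " + u.maxY);
    }
}
```

Each wrapper holds no logic beyond unpacking its arguments, which is why
new operations can be added without touching Pig's internals.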

3- The third step is to implement spatial indexes (e.g., Grid or R-tree). A
Pig loader and Pig output classes will be created for those indexes. Note
that currently we have SpatialOutputFormat and SpatialInputFormat for those
indexes inside the Spatial Hadoop project, but we need to tweak them to
work with Pig.

4- (Advanced) Implement more sophisticated algorithms for spatial
operations that utilize the indexes. For example, we can have a specific
algorithm for spatial range query or spatial join. Again, we already have
algorithms built for different operations implemented in Spatial Hadoop as
MapReduce programs, but they will need to be modified to run in the Pig
environment and interoperate with other operations.
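
To show why an index helps a spatial range query (steps 3 and 4), here is a
minimal in-memory sketch of a uniform grid over points. It is not the
Spatial Hadoop implementation and does not involve MapReduce or Pig; the
GridIndex class, its cell size parameter, and its method names are all
hypothetical and chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical uniform grid index over 2D points. */
final class GridIndex {
    private final double cellSize;
    private final Map<Long, List<double[]>> cells = new HashMap<>();

    GridIndex(double cellSize) {
        this.cellSize = cellSize;
    }

    /** Grid coordinate of a value along one axis. */
    private long cellOf(double v) {
        return (long) Math.floor(v / cellSize);
    }

    /** Packs a (col, row) pair into one key; assumes both fit in 32 bits. */
    private static long key(long col, long row) {
        return (col << 32) ^ (row & 0xffffffffL);
    }

    /** Adds a point to the bucket of the cell that contains it. */
    void insert(double x, double y) {
        cells.computeIfAbsent(key(cellOf(x), cellOf(y)), k -> new ArrayList<>())
             .add(new double[] {x, y});
    }

    /** Returns points inside the query rectangle, visiting only overlapping cells. */
    List<double[]> rangeQuery(double minX, double minY, double maxX, double maxY) {
        List<double[]> result = new ArrayList<>();
        for (long col = cellOf(minX); col <= cellOf(maxX); col++) {
            for (long row = cellOf(minY); row <= cellOf(maxY); row++) {
                List<double[]> bucket = cells.get(key(col, row));
                if (bucket == null) {
                    continue; // empty cell, nothing to scan
                }
                for (double[] p : bucket) {
                    // Cells on the border may hold points outside the query.
                    if (p[0] >= minX && p[0] <= maxX && p[1] >= minY && p[1] <= maxY) {
                        result.add(p);
                    }
                }
            }
        }
        return result;
    }
}
```

The query touches only the cells that overlap the query rectangle instead
of scanning every point. In a distributed setting, the analogous saving
would come from skipping whole partitions that do not overlap the query.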

This is my whole plan for the spatial extension to Pig. I've already
started with the first step but, as I mentioned earlier, I don't want to do
the work for our project and then have it forgotten. I want to
contribute to Pig and do my research at the same time. If you think the
plan is plausible, I'll open JIRA issues for the above tasks and start
submitting patches. I'll conform to the standards of the project, such as
adding tests and commenting the code well.
Sorry for the long email and hope to hear back from you.
Best regards,
Ahmed Eldawy