Pig >> mail # user >> best practice for Pig + MySql for meta data lookups


William Oberman 2012-09-11, 15:17
Bill Graham 2012-09-11, 15:33
William Oberman 2012-09-11, 15:54
Re: best practice for Pig + MySql for meta data lookups
Instead of UDFs and Maps, try to work with LoadFuncs and Tuples if you can.
For example, you could read from Cassandra using CassandraStorage [1]
and produce a Tuple of objects. If your data in Cassandra is JSON, you could
use a UDF to convert it to Tuples. You can then join or cogroup
those Tuples with others that you've imported from the DB.

[1] I've never used this:
http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java
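
To make the data flow concrete, here is a minimal sketch in plain Python
(not Pig) of the same steps: convert each JSON record to a tuple, load the
lookup rows exported from the DB, join on X, and aggregate by Y. All field
names and sample values (x, y, count, a1, A, ...) are made up for
illustration, and the comma-separated lookup text only assumes a delimited
text export such as the one Sqoop can write to HDFS. In Pig the join and
the grouping would be JOIN and GROUP statements rather than Python
functions.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical stand-in for rows read from Cassandra: each value is JSON.
cassandra_rows = [
    '{"x": "a1", "count": 3}',
    '{"x": "a2", "count": 5}',
    '{"x": "b1", "count": 2}',
    '{"x": "a1", "count": 1}',
]

# Hypothetical stand-in for the MySQL lookup table after export to
# delimited text (columns: x, y).
lookup_csv = "a1,A\na2,A\nb1,B\n"


def json_to_tuple(raw):
    """Play the role of the UDF: flatten one JSON document into a tuple."""
    doc = json.loads(raw)
    return (doc["x"], doc["count"])


def load_lookup(text):
    """Read the exported lookup rows as (x, y) tuples."""
    return [tuple(row) for row in csv.reader(io.StringIO(text))]


def join_on_x(events, lookup):
    """Inner join of (x, count) with (x, y) on x."""
    y_by_x = dict(lookup)
    return [(x, y_by_x[x], count) for x, count in events if x in y_by_x]


def totals(joined):
    """Sum counts per Y, and per (Y, X) to compare each X within its Y."""
    per_y = defaultdict(int)
    per_xy = defaultdict(int)
    for x, y, count in joined:
        per_y[y] += count
        per_xy[(y, x)] += count
    return dict(per_y), dict(per_xy)


events = [json_to_tuple(raw) for raw in cassandra_rows]
joined = join_on_x(events, load_lookup(lookup_csv))
per_y, per_xy = totals(joined)
print(per_y)
print(per_xy)
```

Because the lookup table is small, an in-memory dictionary is enough for
this sketch. The per-Y totals show how each group is doing overall, and
the per-(Y, X) totals show how the members of a group compare.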

On Tue, Sep 11, 2012 at 8:54 AM, William Oberman <[EMAIL PROTECTED]> wrote:

> Great news (for me)! :-)  My relational data is small (both relative to the
> big data and in absolute terms).
>
> I'm reading about Sqoop now, and it seems relatively straightforward.
>
> My current problem is not having done this kind of combining of data before
> in MR (which for me means Pig).  Right now I have to pipe my Cassandra data
> through a UDF, as the data itself is JSON (and I map it to a Map of well
> defined fields).  I was originally thinking I could just add a new field to
> my Map in the UDF, but I don't know how to read from HDFS in a UDF (and
> even if I knew how to read HDFS, I don't know how to read data produced by
> Sqoop stored in HDFS).
>
> Now I'm wondering if this is the wrong mental model entirely.  I haven't
> figured out the details (obviously!), but it seems possible that using Pig
> itself (without resorting to UDFs) I could
> -load my Cassandra data
> -load my HDFS data
> -combine them
> But I'm puzzled about how to do the 2nd and 3rd items.
>
> It's hard to get specific without getting *really* specific, but all of the
> new problems I have seem to boil down to something like:
> 1.) Inside Pig I have a Map that contains a field with value X
> 2.) I have meta data in MySql that maps that X to a more general grouping Y
> 3.) I want to create reporting data based on both X and Y
> The goal is to see how Y is doing overall, and how each X_i of Y is
> doing relative to the others...
>
> will
>
>
> On Tue, Sep 11, 2012 at 11:33 AM, Bill Graham <[EMAIL PROTECTED]>
> wrote:
>
> > That approach makes sense. We have similar situations where we pull
> > relational data into HDFS and then join/agg with it via MR. In other cases
> > we'll export aggregated HDFS data into a relational DB and then do
> > additional aggs using SQL. That option of course only works if your data
> > sizes are within reason.
> >
> >
> > On Tue, Sep 11, 2012 at 8:17 AM, William Oberman
> > <[EMAIL PROTECTED]> wrote:
> >
> > > Hello,
> > >
> > > My setup is Pig + Hadoop + Cassandra for my "big data" and MySql for my
> > > "relational/meta data".  Up until now that has been fine, but now I need to
> > > start creating metrics that "cross the lines".  In particular, I need to
> > > create aggregations of Cassandra data based on lookups from MySql.
> > >
> > > After doing some research, it seems like my best option is using something
> > > like Sqoop to map the meta/relational data I need from MySql -> HDFS, and
> > > then use HDFS inside of Pig for the actual lookups.  I'd like to confirm
> > > that general strategy is correct (or any other tips).
> > >
> > > Thanks!
> > >
> > > will
> > >
> >
> >
> >
> > --
> > *Note that I'm no longer using my Yahoo! email address. Please email me at
> > [EMAIL PROTECTED] going forward.*
> >
>
William Oberman 2012-09-11, 18:09
William Oberman 2012-09-12, 14:41