Pig, mail # user - best practice for Pig + MySql for meta data lookups


William Oberman 2012-09-11, 15:17
Bill Graham 2012-09-11, 15:33
William Oberman 2012-09-11, 15:54
Re: best practice for Pig + MySql for meta data lookups
Bill Graham 2012-09-11, 16:58
Instead of UDFs and Maps, try to work with LoadFuncs and Tuples if you can.
For example, you could read from Cassandra using CassandraStorage[1]
and produce a Tuple of objects. If your data is JSON in Cassandra, you could
use a UDF to convert that to Tuples. You can then join or cogroup
those tuples with others that you've imported from the DB.
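
A rough sketch of that join, assuming the MySQL metadata has already been
exported to HDFS as comma-delimited text and the Cassandra rows have already
been converted to plain tuples. All paths, relation names, and field names
(x, y, hits) are hypothetical, and I haven't run this against a real cluster:

```pig
-- Hypothetical path and schema: the X -> Y mapping exported from MySQL.
meta = LOAD '/data/meta/x_to_y' USING PigStorage(',')
       AS (x:chararray, y:chararray);

-- Hypothetical path and schema: the big data, already flattened to tuples.
events = LOAD '/data/events' USING PigStorage('\t')
         AS (x:chararray, hits:long);

-- The small relation goes last so Pig can hold it in memory on each mapper.
joined = JOIN events BY x, meta BY x USING 'replicated';

-- Totals for each X within its Y.
by_xy    = GROUP joined BY (meta::y, events::x);
x_totals = FOREACH by_xy GENERATE FLATTEN(group) AS (y, x),
                                  SUM(joined.hits) AS total;

-- Totals for each Y overall.
by_y     = GROUP x_totals BY y;
y_totals = FOREACH by_y GENERATE group AS y, SUM(x_totals.total) AS total;

STORE x_totals INTO '/reports/by_x';
STORE y_totals INTO '/reports/by_y';
```

The 'replicated' hint only makes sense while the metadata fits in memory;
dropping the USING clause falls back to a regular reduce-side join.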

1 - I've never used this:
http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java

On Tue, Sep 11, 2012 at 8:54 AM, William Oberman
<[EMAIL PROTECTED]> wrote:

> Great news (for me)! :-)  My relational data is small (both relative to the
> big data, but also absolutely).
>
> I'm reading about Sqoop now, and it seems relatively straightforward.
>
> My current problem is not having done this kind of combining of data before
> in MR (which for me means Pig).  Right now I have to pipe my Cassandra data
> through a UDF, as the data itself is JSON (and I map it to a Map of well
> defined fields).  I was originally thinking I could just add a new field to
> my Map in the UDF, but I don't know how to read from HDFS in a UDF (and
> even if I knew how to read HDFS, I don't know how to read data produced by
> Sqoop stored in HDFS).
>
> Now I'm wondering if this is the wrong mental model entirely.  I haven't
> figured out the details (obviously!), but it seems possible that using Pig
> itself (without resorting to UDFs) I could
> -load my Cassandra data
> -load my HDFS data
> -combine them
> But, I'm puzzling on the how for the 2nd and 3rd items.
>
> It's hard to get specific without getting *really* specific, but all of the
> new problems I have seem to boil down to something like:
> 1.) Inside Pig I have a Map that contains a field with value X
> 2.) I have meta data in MySql that maps that X to a more general grouping Y
> 3.) I want to create reporting data based on both X and Y
> The goal being to see how Y is doing overall, and how each X_i of Y is
> doing relative to the others....
>
> will
>
>
> On Tue, Sep 11, 2012 at 11:33 AM, Bill Graham <[EMAIL PROTECTED]>
> wrote:
>
> > That approach makes sense. We have similar situations where we pull
> > relational data into HDFS and then join/agg with it via MR. In other cases
> > we'll export aggregated HDFS data into a relational DB and then do
> > additional aggs using SQL. That option of course only works if your data
> > sizes are within reason.
> >
> >
> > On Tue, Sep 11, 2012 at 8:17 AM, William Oberman
> > <[EMAIL PROTECTED]> wrote:
> >
> > > Hello,
> > >
> > > My setup is Pig + Hadoop + Cassandra for my "big data" and MySql for my
> > > "relational/meta data".  Up until now that has been fine, but now I need
> > > to start creating metrics that "cross the lines".  In particular, I need
> > > to create aggregations of Cassandra data based on lookups from MySql.
> > >
> > > After doing some research, it seems like my best option is using
> > > something like Sqoop to map the meta/relational data I need from
> > > MySql -> HDFS, and then use HDFS inside of Pig for the actual lookups.
> > > I'd like to confirm that general strategy is correct (or any other tips).
> > >
> > > Thanks!
> > >
> > > will
> > >
> >
> >
> >
> > --
> > *Note that I'm no longer using my Yahoo! email address. Please email me
> > at [EMAIL PROTECTED] going forward.*
> >
>
William Oberman 2012-09-11, 18:09
William Oberman 2012-09-12, 14:41