Re: best practice for Pig + MySql for meta data lookups
Bill Graham 2012-09-11, 16:58
Instead of UDFs and Maps, try to work with LoadFuncs and Tuples if you can.
For example, you could read from Cassandra using CassandraStorage
and produce a Tuple of objects. If your data is JSON in Cassandra, you could
use a UDF to convert that to Tuples. You can then join or cogroup
those tuples with others that you've imported from the DB.
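To make the join-then-aggregate idea concrete, here is a rough sketch in plain Python rather than Pig. It only illustrates the logic that a JOIN followed by a GROUP would perform. The sample rows, the field names x and y, and the function name are all made up for illustration, and it assumes the lookup table is small enough to hold in memory.

```python
from collections import Counter

# Hypothetical sample data: tuples parsed from the Cassandra JSON,
# each carrying a single field x.
events = [("x1",), ("x2",), ("x1",), ("x3",)]

# Hypothetical lookup rows imported from the DB: (x, y) pairs
# mapping each X to its more general grouping Y.
meta = [("x1", "yA"), ("x2", "yA"), ("x3", "yB")]


def join_and_aggregate(events, meta):
    """Inner-join events to meta on x, then count per y and per (y, x)."""
    x_to_y = dict(meta)  # the small lookup side is held in memory
    per_y = Counter()
    per_yx = Counter()
    for (x,) in events:
        y = x_to_y.get(x)
        if y is None:
            continue  # inner join: drop events with no lookup row
        per_y[y] += 1
        per_yx[(y, x)] += 1
    return per_y, per_yx


per_y, per_yx = join_and_aggregate(events, meta)
```

Here per_y shows how each Y is doing overall, and per_yx shows how each X within a Y compares to its siblings.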
1 - I've never used this:
On Tue, Sep 11, 2012 at 8:54 AM, William Oberman
> Great news (for me)! :-) My relational data is small (both relative to the
> big data, but also absolutely).
> I'm reading about Sqoop now, and it seems relatively straightforward.
> My current problem is not having done this kind of combining of data before
> in MR (which for me means Pig). Right now I have to pipe my Cassandra data
> through a UDF, as the data itself is JSON (and I map it to a Map of well
> defined fields). I was originally thinking I could just add a new field to
> my Map in the UDF, but I don't know how to read from HDFS in a UDF (and
> even if I knew how to read HDFS, I don't know how to read data produced by
> Sqoop stored in HDFS).
> Now I'm wondering if this is the wrong mental model entirely. I haven't
> figured out the details (obviously!), but it seems possible that using Pig
> itself (without resorting to UDFs) I could
> -load my Cassandra data
> -load my HDFS data
> -combine them
> But, I'm puzzling on the how for the 2nd and 3rd items.
> It's hard to get specific without getting *really* specific, but all of the
> new problems I have seem to boil down to something like:
> 1.) Inside Pig I have a Map that contains a field with value X
> 2.) I have meta data in MySql that maps that X to a more general grouping Y
> 3.) I want to create reporting data based on both X and Y
> The goal being to see how Y is doing overall, and how each X_i of Y is
> doing relative to the others....
> On Tue, Sep 11, 2012 at 11:33 AM, Bill Graham <[EMAIL PROTECTED]>
> > That approach makes sense. We have similar situations where we pull
> > relational data into HDFS and then join/agg with it via MR. In other cases
> > we'll export aggregated HDFS data into a relational DB and then do
> > additional aggs using SQL. That option of course only works if your data
> > sizes are within reason.
> > On Tue, Sep 11, 2012 at 8:17 AM, William Oberman
> > <[EMAIL PROTECTED]> wrote:
> > > Hello,
> > >
> > > My setup is Pig + Hadoop + Cassandra for my "big data" and MySql for my
> > > > "relational/meta data". Up until now that has been fine, but now I need to
> > > > start creating metrics that "cross the lines". In particular, I need to
> > > > create aggregations of Cassandra data based on lookups from MySql.
> > >
> > > > After doing some research, it seems like my best option is using something
> > > > like Sqoop to map the meta/relational data I need from MySql -> HDFS,
> > > > then use HDFS inside of Pig for the actual lookups. I'd like to confirm
> > > > that general strategy is correct (or any other tips).
> > >
> > > Thanks!
> > >
> > > will
> > >
> > --
> > *Note that I'm no longer using my Yahoo! email address. Please email me
> > [EMAIL PROTECTED] going forward.*