Pig, mail # user - best practice for Pig + MySql for meta data lookups


Re: best practice for Pig + MySql for meta data lookups
William Oberman 2012-09-11, 18:09
Thanks (again)!

I'm already using CassandraStorage to load the JSON strings.  I used Maps
because I liked being able to name the fields, but I could easily change my
UDF (and my Pig script) to use tuples instead.  Maybe this is because I
found Pig (and Hadoop) coming from the world of Cassandra rather than vice
versa.

I'll look into Join and Cogroup more, and I'll see if I can puzzle through
how to load Sqoop persisted data into Pig.
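
For reference, the lookup discussed in this thread (map each X to its more general grouping Y, then report on both) amounts to a join followed by a group-by. Below is a minimal sketch of that logic in plain Python, not Pig. The field names (`x`, `count`), the sample rows, and the helper names are all made up for illustration; it only shows the shape of the computation that a Pig join and group would perform on the real data.

```python
import json
from collections import defaultdict

# Hypothetical rows standing in for the JSON strings loaded from Cassandra.
raw_events = [
    '{"x": "a1", "count": 3}',
    '{"x": "a2", "count": 5}',
    '{"x": "b1", "count": 2}',
    '{"x": "zz", "count": 7}',  # no metadata row exists for this X
]

# Hypothetical (X, Y) rows standing in for the MySql metadata exported to HDFS.
meta_rows = [("a1", "A"), ("a2", "A"), ("b1", "B")]


def to_tuple(json_str):
    """Convert one JSON string into a fixed-position tuple (x, count)."""
    record = json.loads(json_str)
    return (record["x"], record["count"])


def join_and_aggregate(events, meta):
    """Inner-join events to metadata on X, then total counts per Y and per (Y, X)."""
    x_to_y = dict(meta)  # the metadata is small, so hold it in memory
    per_y = defaultdict(int)
    per_yx = defaultdict(int)
    for x, count in events:
        y = x_to_y.get(x)
        if y is None:
            continue  # inner join: drop events with no metadata match
        per_y[y] += count
        per_yx[(y, x)] += count
    return dict(per_y), dict(per_yx)


events = [to_tuple(s) for s in raw_events]
per_y, per_yx = join_and_aggregate(events, meta_rows)

# Report each X_i alongside its share of the total for its group Y.
for (y, x), total in sorted(per_yx.items()):
    print(f"{y} {x} {total} share_of_Y={total / per_y[y]:.2f}")
```

Holding the small table in memory and streaming the large one past it is the same idea as a map-side join, which is why small relational data is convenient here.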

will

On Tue, Sep 11, 2012 at 12:58 PM, Bill Graham <[EMAIL PROTECTED]> wrote:

> Instead of UDFs and Maps, try to work with LoadFuncs and Tuples if you can.
> For example, you could read from Cassandra using CassandraStorage[1]
> and produce a Tuple of objects. If your data is JSON in Cassandra, you could
> use a UDF to convert it to Tuples. You can then join or cogroup
> those tuples with others that you've imported from the DB.
>
> 1 - I've never used this:
>
> http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java
>
> On Tue, Sep 11, 2012 at 8:54 AM, William Oberman
> <[EMAIL PROTECTED]> wrote:
>
> > Great news (for me)! :-)  My relational data is small (both relative to
> > the big data, but also absolutely).
> >
> > I'm reading about Sqoop now, and it seems relatively straightforward.
> >
> > My current problem is not having done this kind of combining of data
> > before in MR (which for me means Pig).  Right now I have to pipe my
> > Cassandra data through a UDF, as the data itself is JSON (and I map it to
> > a Map of well defined fields).  I was originally thinking I could just add
> > a new field to my Map in the UDF, but I don't know how to read from HDFS
> > in a UDF (and even if I knew how to read HDFS, I don't know how to read
> > data produced by Sqoop stored in HDFS).
> >
> > Now I'm wondering if this is the wrong mental model entirely.  I haven't
> > figured out the details (obviously!), but it seems possible that using
> > Pig itself (without resorting to UDFs) I could
> > -load my Cassandra data
> > -load my HDFS data
> > -combine them
> > But, I'm puzzling on the how for the 2nd and 3rd items.
> >
> > It's hard to get specific without getting *really* specific, but all of
> > the new problems I have seem to boil down to something like:
> > 1.) Inside Pig I have a Map that contains a field with value X
> > 2.) I have meta data in MySql that maps that X to a more general grouping Y
> > 3.) I want to create reporting data based on both X and Y
> > The goal is to see how Y is doing overall, and how the X_i within Y are
> > doing relative to each other....
> >
> > will
> >
> >
> > On Tue, Sep 11, 2012 at 11:33 AM, Bill Graham <[EMAIL PROTECTED]>
> > wrote:
> >
> > > That approach makes sense. We have similar situations where we pull
> > > relational data into HDFS and then join/agg with it via MR. In other
> > > cases we'll export aggregated HDFS data into a relational DB and then do
> > > additional aggs using SQL. That option of course only works if your
> > > data sizes are within reason.
> > >
> > >
> > > On Tue, Sep 11, 2012 at 8:17 AM, William Oberman
> > > <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hello,
> > > >
> > > > My setup is Pig + Hadoop + Cassandra for my "big data" and MySql for
> > > > my "relational/meta data".  Up until now that has been fine, but now I
> > > > need to start creating metrics that "cross the lines".  In particular,
> > > > I need to create aggregations of Cassandra data based on lookups from
> > > > MySql.
> > > >
> > > > After doing some research, it seems like my best option is using
> > > > something like Sqoop to map the meta/relational data I need from
> > > > MySql -> HDFS, and then use HDFS inside of Pig for the actual lookups.
> > > > I'd like to confirm that general strategy is correct (or any other
> > > > tips).
> > > >
> > > > Thanks!
> > > >
> > > > will
> > > >
> > >
> > >
> > >
> > > --
> > > *Note that I'm no longer using my Yahoo! email address. Please email me