Pig user mailing list: best practice for Pig + MySql for meta data lookups


William Oberman 2012-09-11, 15:17
Bill Graham 2012-09-11, 15:33
William Oberman 2012-09-11, 15:54
Bill Graham 2012-09-11, 16:58

Re: best practice for Pig + MySql for meta data lookups
Thanks (again)!

I'm already using CassandraStorage to load the JSON strings.  I used Maps
because I liked being able to name the fields, but I could easily change my
UDF (and my Pig script) to use tuples instead.  Maybe that's because I came
to Pig (and Hadoop) from the world of Cassandra rather than vice versa.

I'll look into Join and Cogroup more, and I'll see if I can puzzle through
how to load Sqoop-persisted data into Pig.

will
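
A minimal sketch of that last piece, assuming Sqoop's default text import
(comma-delimited part files under the --target-dir you gave it); the path and
column names here are placeholders:

  -- Sqoop's default text import writes comma-delimited part files to HDFS
  meta = LOAD '/user/will/sqoop/x_to_y' USING PigStorage(',')
         AS (x:chararray, y:chararray);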

On Tue, Sep 11, 2012 at 12:58 PM, Bill Graham <[EMAIL PROTECTED]> wrote:

> Instead of UDFs and Maps, try to work with LoadFuncs and Tuples if you can.
> For example you could read from Cassandra using CassandraStorage [1]
> and produce a Tuple of objects. If your data is JSON in Cassandra you could
> use a UDF to convert that to Tuples. You can then join or cogroup
> those tuples with others that you've imported from the DB.
>
> 1 - I've never used this:
>
> http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java
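
A rough Pig sketch of that flow; the JsonToTuple UDF, jar name, keyspace,
column family, paths, and field names below are all made up, and
CassandraStorage's exact output schema varies by Cassandra version:

  REGISTER myudfs.jar;   -- jar holding the hypothetical JsonToTuple UDF
  raw   = LOAD 'cassandra://MyKeyspace/MyColumnFamily'
          USING org.apache.cassandra.hadoop.pig.CassandraStorage();
  -- turn the JSON column data into typed tuples
  facts = FOREACH raw GENERATE FLATTEN(com.example.JsonToTuple(columns))
          AS (x:chararray, val:long);
  -- relational/meta data previously imported from MySql (e.g. via Sqoop)
  meta  = LOAD '/user/will/sqoop/x_to_y' USING PigStorage(',')
          AS (x:chararray, y:chararray);
  -- JOIN gives one flat record per match; COGROUP keeps the two bags separate
  joined    = JOIN facts BY x, meta BY x;
  cogrouped = COGROUP facts BY x, meta BY x;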
>
> On Tue, Sep 11, 2012 at 8:54 AM, William Oberman
> <[EMAIL PROTECTED]>wrote:
>
> > Great news (for me)! :-)  My relational data is small (both relative to
> > the big data and in absolute terms).
> >
> > I'm reading about Sqoop now, and it seems relatively straightforward.
> >
> > My current problem is that I haven't done this kind of data combining in
> > MR (which for me means Pig) before.  Right now I have to pipe my Cassandra
> > data through a UDF, as the data itself is JSON (and I map it to a Map of
> > well-defined fields).  I was originally thinking I could just add a new
> > field to my Map in the UDF, but I don't know how to read from HDFS in a
> > UDF (and even if I did, I don't know how to read the data Sqoop stores in
> > HDFS).
> >
> > Now I'm wondering if this is the wrong mental model entirely.  I haven't
> > figured out the details (obviously!), but it seems possible that using
> > Pig itself (without resorting to UDFs) I could
> > - load my Cassandra data
> > - load my HDFS data
> > - combine them
> > But I'm puzzling over the how for the 2nd and 3rd items.
> >
> > It's hard to get specific without getting *really* specific, but all of
> > the new problems I have seem to boil down to something like:
> > 1.) Inside Pig I have a Map that contains a field with value X
> > 2.) I have meta data in MySql that maps that X to a more general grouping Y
> > 3.) I want to create reporting data based on both X and Y
> > The goal being to see how Y is doing overall, and how each X_i of Y is
> > doing relative to the others....
> >
> > will
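
A sketch of that reporting step in Pig, with placeholder loads standing in
for the Cassandra-derived tuples and the MySql lookup table (all paths and
field names are made up):

  facts = LOAD '/tmp/facts' AS (x:chararray, val:long);      -- per-event data keyed by X
  meta  = LOAD '/tmp/x_to_y' AS (x:chararray, y:chararray);  -- maps each X to its grouping Y
  joined = JOIN facts BY x, meta BY x;

  -- how each Y is doing overall
  by_y  = GROUP joined BY meta::y;
  y_agg = FOREACH by_y GENERATE group AS y, SUM(joined.facts::val) AS total;

  -- how the X_i within each Y compare to each other
  by_xy  = GROUP joined BY (meta::y, facts::x);
  xy_agg = FOREACH by_xy GENERATE FLATTEN(group) AS (y, x),
                                  SUM(joined.facts::val) AS total;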
> >
> >
> > On Tue, Sep 11, 2012 at 11:33 AM, Bill Graham <[EMAIL PROTECTED]>
> > wrote:
> >
> > > That approach makes sense. We have similar situations where we pull
> > > relational data into HDFS and then join/agg with it via MR. In other
> > > cases we'll export aggregated HDFS data into a relational DB and then do
> > > additional aggs using SQL. That option of course only works if your data
> > > sizes are within reason.
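
For the export direction, the Pig side is just a delimited STORE; a tiny
sketch (paths and names are placeholders), whose output sqoop export or a
plain bulk load can then push into MySql:

  agg = LOAD '/tmp/aggregated' AS (y:chararray, total:long);  -- stand-in for an aggregated relation
  STORE agg INTO '/user/will/out/y_totals' USING PigStorage('\t');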
> > >
> > >
> > > On Tue, Sep 11, 2012 at 8:17 AM, William Oberman
> > > <[EMAIL PROTECTED]>wrote:
> > >
> > > > Hello,
> > > >
> > > > My setup is Pig + Hadoop + Cassandra for my "big data" and MySql for
> > > > my "relational/meta data".  Up until now that has been fine, but now
> > > > I need to start creating metrics that "cross the lines".  In
> > > > particular, I need to create aggregations of Cassandra data based on
> > > > lookups from MySql.
> > > >
> > > > After doing some research, it seems like my best option is using
> > > > something like Sqoop to map the meta/relational data I need from
> > > > MySql -> HDFS, and then use HDFS inside of Pig for the actual
> > > > lookups.  I'd like to confirm that this general strategy is correct
> > > > (or get any other tips).
> > > >
> > > > Thanks!
> > > >
> > > > will
> > > >
> > >
> > >
> > >
> > > --
> > > *Note that I'm no longer using my Yahoo! email address. Please email me

William Oberman 2012-09-12, 14:41