Re: best practice for Pig + MySql for meta data lookups
Bill Graham 2012-09-11, 16:58
Instead of UDFs and Maps, try to work with LoadFuncs and Tuples if you can.
For example, you could read from Cassandra using CassandraStorage[1] and
produce a Tuple of objects. If your data is JSON in Cassandra, you could use a
UDF to convert it to Tuples. You can then join or cogroup those tuples with
others that you've imported from the DB.

1 - I've never used this:
http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java

On Tue, Sep 11, 2012 at 8:54 AM, William Oberman <[EMAIL PROTECTED]> wrote:

> Great news (for me)! :-) My relational data is small (both relative to the
> big data, but also absolutely).
>
> I'm reading about Sqoop now, and it seems relatively straight forward.
>
> My current problem is not having done this kind of combining of data before
> in MR (which for me means Pig). Right now I have to pipe my Cassandra data
> through a UDF, as the data itself is JSON (and I map it to a Map of well
> defined fields). I was originally thinking I could just add a new field to
> my Map in the UDF, but I don't know how to read from HDFS in a UDF (and
> even if I knew how to read HDFS, I don't know how to read data produced by
> Sqoop stored in HDFS).
>
> Now I'm wondering if this is the wrong mental model entirely. I haven't
> figured out the details (obviously!), but it seems possible that using Pig
> itself (without resorting to UDFs) I could
> -load my Cassandra data
> -load my HDFS data
> -combine them
> But, I'm puzzling on the how for the 2nd and 3rd items.
>
> It's hard to get specific without getting *really* specific, but all of the
> new problems I have seem to boil down to something like:
> 1.) Inside Pig I have a Map that contains a field with value X
> 2.) I have meta data in MySql that maps that X to a more general grouping Y
> 3.) I want to create reporting data based on both X and Y
> The goal being to see how Y is doing overall, and how each X_i of Y are
> doing relative to each other....
>
> will
>
>
> On Tue, Sep 11, 2012 at 11:33 AM, Bill Graham <[EMAIL PROTECTED]> wrote:
>
> > That approach makes sense. We have similar situations where we pull
> > relational data into HDFS and then join/agg with it via MR. In other cases
> > we'll export aggregated HDFS data into a relational DB and then do
> > additional aggs using SQL. That option of course only works if your data
> > sizes are within reason.
> >
> >
> > On Tue, Sep 11, 2012 at 8:17 AM, William Oberman
> > <[EMAIL PROTECTED]> wrote:
> >
> > > Hello,
> > >
> > > My setup is Pig + Hadoop + Cassandra for my "big data" and MySql for my
> > > "relational/meta data". Up until now that has been fine, but now I need
> > > to start creating metrics that "cross the lines". In particular, I need
> > > to create aggregations of Cassandra data based on lookups from MySql.
> > >
> > > After doing some research, it seems like my best option is using
> > > something like Sqoop to map the meta/relational data I need from
> > > MySql -> HDFS, and then use HDFS inside of Pig for the actual lookups.
> > > I'd like to confirm that general strategy is correct (or any other tips).
> > >
> > > Thanks!
> > >
> > > will
> > >
> >
> > --
> > *Note that I'm no longer using my Yahoo! email address. Please email me at
> > [EMAIL PROTECTED] going forward.*
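
The load, join, and aggregate flow discussed in this thread (JSON records carrying a value X, a small metadata table mapping X to a grouping Y, and reports over both) can be sketched in plain Python to make the data flow concrete. This is only an illustration of the logic, not Pig code and not anyone's actual setup: the field names (`x`, `value`), the tab-separated metadata layout, and all sample values are assumptions made up for the example. In Pig the same steps would roughly correspond to two loads followed by a JOIN and a GROUP.

```python
import json
from collections import defaultdict

# Hypothetical sample data; every name and value here is made up.
# Metadata rows as they might look after exporting a small MySQL table
# to tab-separated text: X <tab> Y
META_LINES = [
    "x1\tgroupA",
    "x2\tgroupA",
    "x3\tgroupB",
]

# Event records as JSON strings, standing in for the Cassandra data.
EVENT_JSON = [
    '{"x": "x1", "value": 10}',
    '{"x": "x2", "value": 5}',
    '{"x": "x1", "value": 7}',
    '{"x": "x3", "value": 2}',
    '{"x": "x9", "value": 1}',  # no metadata row, dropped by the inner join
]


def parse_meta(lines):
    """Build an X -> Y lookup from tab-separated lines."""
    lookup = {}
    for line in lines:
        x, y = line.rstrip("\n").split("\t")
        lookup[x] = y
    return lookup


def to_tuples(json_records):
    """Convert each JSON record into a flat (x, value) tuple."""
    for raw in json_records:
        rec = json.loads(raw)
        yield (rec["x"], rec["value"])


def join_and_aggregate(events, x_to_y):
    """Inner-join events with metadata on X, then total by Y and by (Y, X)."""
    by_y = defaultdict(int)
    by_y_x = defaultdict(int)
    for x, value in events:
        y = x_to_y.get(x)
        if y is None:
            continue  # inner join: skip events without a metadata match
        by_y[y] += value
        by_y_x[(y, x)] += value
    return dict(by_y), dict(by_y_x)


if __name__ == "__main__":
    totals_y, totals_yx = join_and_aggregate(
        to_tuples(EVENT_JSON), parse_meta(META_LINES)
    )
    for y in sorted(totals_y):
        print(y, totals_y[y])
        for (group, x), total in sorted(totals_yx.items()):
            if group == y:
                # Share of each X within its group Y.
                print("  ", x, total, round(total / totals_y[y], 2))
```

Holding the whole X -> Y lookup in memory is reasonable here only because the relational data is described as small; for larger metadata the join would need to be done by the framework rather than a dictionary.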