Great news (for me)! :-) My relational data is small (both relative to the
big data and in absolute terms).
I'm reading about Sqoop now, and it seems relatively straightforward.
My current problem is not having done this kind of combining of data before
in MR (which for me means Pig). Right now I have to pipe my Cassandra data
through a UDF, as the data itself is JSON (and I map it to a Map of well
defined fields). I was originally thinking I could just add a new field to
my Map in the UDF, but I don't know how to read from HDFS in a UDF (and
even if I knew how to read HDFS, I don't know how to read data produced by
Sqoop stored in HDFS).
Now I'm wondering if this is the wrong mental model entirely. I haven't
figured out the details (obviously!), but it seems possible that using Pig
itself (without resorting to UDFs) I could
-load my Cassandra data
-load my HDFS data
-combine the two
But I'm puzzling over the how for the 2nd and 3rd items.
It's hard to get specific without getting *really* specific, but all of the
new problems I have seem to boil down to something like:
1.) Inside Pig I have a Map that contains a field with value X
2.) I have metadata in MySQL that maps that X to a more general grouping Y
3.) I want to create reporting data based on both X and Y
The goal being to see how Y is doing overall, and how the X_i within Y are
doing relative to each other.
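A minimal sketch of steps 1-3 above, written in plain Python rather than Pig, just to pin down the data flow. The record layout, the field name `x`, the `count` metric, and all sample values are made up for illustration. In Pig the same shape would presumably be a JOIN against the Sqoop-imported lookup followed by a GROUP.

```python
from collections import defaultdict

# Hypothetical lookup pulled from MySQL (e.g. via Sqoop): X -> Y.
x_to_y = {"x1": "groupA", "x2": "groupA", "x3": "groupB"}

# Hypothetical records, as they might look after the JSON-parsing UDF.
records = [
    {"x": "x1", "count": 5},
    {"x": "x2", "count": 3},
    {"x": "x1", "count": 2},
    {"x": "x3", "count": 7},
]


def summarize(records, x_to_y):
    """Return (total per Y, total per X nested under its Y)."""
    by_y = defaultdict(int)
    by_y_x = defaultdict(lambda: defaultdict(int))
    for rec in records:
        y = x_to_y.get(rec["x"])
        if y is None:
            continue  # inner-join behaviour: drop X values with no mapping
        by_y[y] += rec["count"]
        by_y_x[y][rec["x"]] += rec["count"]
    return dict(by_y), {y: dict(xs) for y, xs in by_y_x.items()}


by_y, by_y_x = summarize(records, x_to_y)

# Report each Y overall, then each X within it with its share of Y.
for y, total in sorted(by_y.items()):
    print(y, total)
    for x, sub in sorted(by_y_x[y].items()):
        print("  ", x, sub, "share=%.2f" % (sub / total))
```

Because the lookup is small, holding it in memory as a dict is the analogue of joining a large relation against a small one that fits in memory.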
On Tue, Sep 11, 2012 at 11:33 AM, Bill Graham <[EMAIL PROTECTED]> wrote:
> That approach makes sense. We have similar situations where we pull
> relational data into HDFS and then join/agg with it via MR. In other cases
> we'll export aggregated HDFS data into a relational DB and then do
> additional aggs using SQL. That option of course only works if your data
> sizes are within reason.
> On Tue, Sep 11, 2012 at 8:17 AM, William Oberman
> <[EMAIL PROTECTED]> wrote:
> > Hello,
> > My setup is Pig + Hadoop + Cassandra for my "big data" and MySQL for my
> > "relational/meta data". Up until now that has been fine, but now I need
> > to start creating metrics that "cross the lines". In particular, I need to
> > create aggregations of Cassandra data based on lookups from MySQL.
> > After doing some research, it seems like my best option is using something
> > like Sqoop to map the meta/relational data I need from MySQL -> HDFS, and
> > then use HDFS inside of Pig for the actual lookups. I'd like to confirm
> > that general strategy is correct (or any other tips).
> > Thanks!
> > will
> *Note that I'm no longer using my Yahoo! email address. Please email me at
> [EMAIL PROTECTED] going forward.*