Re: best practice for Pig + MySql for meta data lookups
Hello again,

I spent the last day trying to load one of my MySql tables into a Pig bag
of tuples based on a sqoop-import.  I have something that finally seems to
work, but I wanted to double-check there isn't a better/easier way, as
there was a *lot* of trial and error to get to this point.  Plus, if this
is correct and someone else struggles with this particular integration,
maybe they'll find this post someday.

Overall, my biggest problem was that I tried to test things using "ILLUSTRATE
bag_that_represents_table;".  If I had just done
"DUMP bag_that_represents_table;", I would have had success *much* sooner.
But I hit at least one bug and a classloader issue that only exist (as
far as I can tell) because of ILLUSTRATE (in CDH3u5, which ships Pig 0.8.1).
In particular:

1.) For the MySql -> HDFS step I had to add two "non-standard" parameters to
sqoop-import (the full command is sketched at the end of this item):
--as-sequencefile => I had trouble figuring out how to load the default
(text data) into Pig.  Eventually I found this:
https://review.cloudera.org/r/1670/
Which references this code:
https://issues.cloudera.org/secure/attachment/10684/0001-Hive-SerDe-and-Pig-LoadFunc-to-access-Sqoop-sequence.patch
And that loader code seems to work properly given an --as-sequencefile
sqoop-import.

--columns => I'm using Cloudera CDH3u5, so Pig 0.8.1, and I think there is
a bug in DisplayExamples.ShortenField(Object) that results in an NPE for
null values in ILLUSTRATE.  I changed my sqoop-import to only grab the
columns I wanted (which are all non-null).
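
For reference, the full import ended up looking roughly like the following.
The connect string, credentials, table, and column names are made up for
illustration; only --columns and --as-sequencefile are the "non-standard"
parts I mentioned above:

  sqoop-import \
    --connect jdbc:mysql://dbhost/mydb \
    --username myuser -P \
    --table TABLE_NAME \
    --columns "id,name,created_at" \
    --as-sequencefile \
    --target-dir /user/will/TABLE_NAME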

2.)  To get ILLUSTRATE to work, I had to add sqoop and my autogenerated
table jar (from the sqoop-import) to the *local* classpath.  In particular,
I had to add sqoop-1.3.0-cdh3u5.jar and TABLE_NAME.jar to /usr/lib/pig/lib
(example commands below).
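
Concretely, that was something like this on the box where I run pig (adjust
the source paths to wherever your sqoop jar and generated table jar actually
live):

  cp /usr/lib/sqoop/sqoop-1.3.0-cdh3u5.jar /usr/lib/pig/lib/
  cp ./TABLE_NAME.jar /usr/lib/pig/lib/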

3.) To get DUMP to work, I had to add "register XYZ.jar" statements to my
pig script for the two jars referenced above in #2 (see the snippet after
this item).
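
The top of my script ended up looking something like the sketch below.  The
loader class name is just a placeholder; use whatever LoadFunc the Cloudera
patch linked under #1 actually provides, and point the path at wherever your
sqoop-import wrote the sequence files:

  register /usr/lib/pig/lib/sqoop-1.3.0-cdh3u5.jar;
  register /usr/lib/pig/lib/TABLE_NAME.jar;

  -- loader from the Sqoop sequence-file patch (class name illustrative only)
  meta = LOAD '/user/will/TABLE_NAME'
         USING com.example.SqoopSequenceFileLoader();

  DUMP meta;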

My problem is that I did #3 first, and it took me forever to figure out that
ILLUSTRATE wasn't respecting the register commands when building the
classpath to find sqoop and the table jar.  No idea *why* that is true, but
the "class not found" errors definitely went away as soon as I added the
jars to my local pig LIB_DIR.

I'm not 100% convinced I had to use sequence files, as my problems at the
time may have been related to the classloader issues.  But now that it's
working, I plan to move on rather than figure that out :-)

will
On Tue, Sep 11, 2012 at 2:09 PM, William Oberman
<[EMAIL PROTECTED]> wrote:

> Thanks (again)!
>
> I'm already using CassandraStorage to load the JSON strings.  I used Maps
> because I liked being able to name the fields, but I could easily change my
> UDF (and my Pig script) to use tuples instead.  Maybe this is because I
> found Pig (and Hadoop) coming from the world of Cassandra rather than vice
> versa.
>
> I'll look into Join and Cogroup more, and I'll see if I can puzzle through
> how to load Sqoop persisted data into Pig.
>
> will
>
> On Tue, Sep 11, 2012 at 12:58 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>
>> Instead of UDFs and Maps, try to work with LoadFuncs and Tuples if you can.
>> For example you could read from Cassandra using CassandraStorage[1] and
>> produce a Tuple of objects. If your data is JSON in Cassandra you could
>> use a UDF to convert that to Tuples. Then you can join or cogroup those
>> tuples with others that you've imported from the DB.
>>
>> 1 - I've never used this:
>>
>> http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java
>>
>> On Tue, Sep 11, 2012 at 8:54 AM, William Oberman
>> <[EMAIL PROTECTED]> wrote:
>>
>> > Great news (for me)! :-)  My relational data is small (both relative to
>> > the big data and absolutely).
>> >
>> > I'm reading about Sqoop now, and it seems relatively straightforward.
>> >
>> > My current problem is that I haven't done this kind of combining of data
>> > before in MR (which for me means Pig).  Right now I have to pipe my Cassandra