Pig user mailing list: Custom DB Loader UDF


Thread:
Terry Siu  2012-08-31, 21:02
Ruslan Al-Fakikh  2012-08-31, 21:44
Terry Siu  2012-08-31, 22:01
Russell Jurney  2012-08-31, 22:02
Terry Siu  2012-08-31, 22:12
Russell Jurney  2012-08-31, 23:03
Ruslan Al-Fakikh  2012-08-31, 23:50
Russell Jurney  2012-09-01, 00:09
Re: Custom DB Loader UDF
You can also look at what Vertica did for their Pig connector:

https://github.com/vertica/Vertica-Hadoop-Connector/blob/master/pig-connector/com/vertica/pig/VerticaLoader.java

(it's Apache-licensed, so if you reuse any code, you have to indicate
the Vertica copyright and Apache license in your credits).

D

On Fri, Aug 31, 2012 at 5:09 PM, Russell Jurney
<[EMAIL PROTECTED]> wrote:
> I've thought about that, but getting stuff into Piggybank is hard - you
> have to peg it to a Pig release. My plan is to get MySQL working in github,
> then generalize into DbStorage in piggybank for 0.11.
>
> On Fri, Aug 31, 2012 at 4:50 PM, Ruslan Al-Fakikh <
> [EMAIL PROTECTED]> wrote:
>
>> Terry, Russell,
>>
>> Just a proposal:
>> maybe it should be added to DBStorage?
>>
>> http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/DBStorage.html
>> As far as I know it only stores data for now, but I think it can be
>> extended to load and store, like PigStorage.
>>
>> Ruslan
>>
>> On Sat, Sep 1, 2012 at 3:03 AM, Russell Jurney <[EMAIL PROTECTED]>
>> wrote:
>> > That would be awesome - I will generalize it and blog about what a great
>> > person you are :D
>> >
>> > On Fri, Aug 31, 2012 at 3:12 PM, Terry Siu <[EMAIL PROTECTED]>
>> wrote:
>> >
>> >> Thanks, Russell, I'll dig in to your recommendations. I'd be happy to
>> open
>> >> source it, but at the moment, it's not exactly general enough. However,
>> I
>> >> can certainly put it on github for your perusal.
>> >>
>> >> -Terry
>> >>
>> >> -----Original Message-----
>> >> From: Russell Jurney [mailto:[EMAIL PROTECTED]]
>> >> Sent: Friday, August 31, 2012 3:03 PM
>> >> To: [EMAIL PROTECTED]
>> >> Subject: Re: Custom DB Loader UDF
>> >>
>> >> I don't have an answer, and I'm only learning these APIs myself, but
>> >> you're writing something I'm planning on writing very soon - a
>> >> MySQL-specific LoadFunc for Pig. I would greatly appreciate it if you
>> would
>> >> open source it on github or contribute it to Piggybank :)
>> >>
>> >> The InputSplits should determine the number of mappers, but to debug you
>> >> might try forcing it by setting some properties in your script re:
>> >> inputsplits (see
>> >>
>> >>
>> https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/1QtL9bBwL0c
>> >> ):
>> >>
>> >> The input split size is determined by mapred.min.split.size,
>> >> dfs.block.size and mapred.map.tasks:
>> >>
>> >> goalSize = totalSize / mapred.map.tasks
>> >> minSize = max(mapred.min.split.size, minSplitSize)
>> >> splitSize = max(minSize, min(goalSize, dfs.block.size))
>> >>
>> >> minSplitSize is determined by each InputFormat such as
>> >> SequenceFileInputFormat.
>> >>
>> >>
>> >> I'd play around with those and see if you can get it doing what you
>> want.
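[Editor's sketch: the split-size rule quoted above, paraphrased in a few lines of Python. This is not code from the thread; it restates the pre-YARN (mapred.*) FileInputFormat logic that Russell quotes, with illustrative numbers.]

```python
# Sketch of the old FileInputFormat split-size rule quoted above.
# Property names follow the pre-YARN (mapred.*) configuration keys.

def compute_split_size(total_size, num_map_tasks, min_split_size,
                       dfs_block_size, format_min_split=1):
    """Return the split size Hadoop would use for one input file."""
    goal_size = total_size // num_map_tasks      # totalSize / mapred.map.tasks
    min_size = max(min_split_size, format_min_split)
    return max(min_size, min(goal_size, dfs_block_size))

# With default min split size and 64 MB blocks, asking for 6 map tasks
# over a 384 MB input yields 64 MB splits, i.e. 6 mappers.
size = compute_split_size(total_size=384 * 1024 * 1024,
                          num_map_tasks=6,
                          min_split_size=1,
                          dfs_block_size=64 * 1024 * 1024)
```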
>> >>
>> >> On Fri, Aug 31, 2012 at 2:02 PM, Terry Siu <[EMAIL PROTECTED]>
>> >> wrote:
>> >>
>> >> > Hi all,
>> >> >
>> >> > I know this question has probably been posed multiple times, but I'm
>> >> > having difficulty figuring out a couple of aspects of a custom
>> >> > LoaderFunc to read from a DB. And yes, I did try to Google my way to
>> an
>> >> answer.
>> >> > Anyhoo, for what it's worth, I have a MySql table that I wish to load
>> >> > via Pig. I have the LoaderFunc working using PigServer in a Java app,
>> >> > but I noticed the following when my job gets submitted to my MR
>> >> > cluster. I generated 6 InputSplits in my custom InputFormat, where
>> >> > each split specifies a non-overlapping range/page of records to read
>> >> > from. I thought that each InputSplit would correspond to a map task,
>> >> > but what I see in the JobTracker is that the submitted job only has 1
>> >> > map task which executes each split serially. Is my understanding even
>> >> > correct that a split can be effectively assigned to a single map task?
>> >> > If so, can I coerce the submitted MR job to properly get each of my
>> >> > splits to execute in its own map task?
>> >>
Ruslan Al-Fakikh  2012-08-31, 22:55