Pig, mail # user - Custom DB Loader UDF


Re: Custom DB Loader UDF
Ruslan Al-Fakikh 2012-08-31, 23:50
Terry, Russell,

Just a proposal:
maybe it should be added to DBStorage?
http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/DBStorage.html
As far as I know it only stores data for now, but I think it could be
extended to both load and store, like PigStorage.

Ruslan

On Sat, Sep 1, 2012 at 3:03 AM, Russell Jurney <[EMAIL PROTECTED]> wrote:
> That would be awesome - I will generalize it and blog about what a great
> person you are :D
>
> On Fri, Aug 31, 2012 at 3:12 PM, Terry Siu <[EMAIL PROTECTED]> wrote:
>
>> Thanks, Russell, I'll dig in to your recommendations. I'd be happy to open
>> source it, but at the moment, it's not exactly general enough. However, I
>> can certainly put it on github for your perusal.
>>
>> -Terry
>>
>> -----Original Message-----
>> From: Russell Jurney [mailto:[EMAIL PROTECTED]]
>> Sent: Friday, August 31, 2012 3:03 PM
>> To: [EMAIL PROTECTED]
>> Subject: Re: Custom DB Loader UDF
>>
>> I don't have an answer, and I'm only learning these APIs myself, but
>> you're writing something I'm planning on writing very soon - a
>> MySQL-specific LoadFunc for Pig. I would greatly appreciate it if you would
>> open source it on github or contribute it to Piggybank :)
>>
>> The InputSplits should determine the number of mappers, but to debug you
>> might try forcing it by setting some properties in your script re:
>> InputSplits (see
>> https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/1QtL9bBwL0c):
>>
>> The input split size is determined by mapred.min.split.size, dfs.block.size
>> and mapred.map.tasks.
>>
>> goalSize = totalSize / mapred.map.tasks
>> minSize = max(mapred.min.split.size, minSplitSize)
>> splitSize = max(minSize, min(goalSize, dfs.block.size))
>>
>> minSplitSize is determined by each InputFormat such as
>> SequenceFileInputFormat.
>>
>>
>> I'd play around with those and see if you can get it doing what you want.
>>
>> On Fri, Aug 31, 2012 at 2:02 PM, Terry Siu <[EMAIL PROTECTED]>
>> wrote:
>>
>> > Hi all,
>> >
>> > I know this question has probably been posed multiple times, but I'm
>> > having difficulty figuring out a couple of aspects of a custom
>> > LoadFunc to read from a DB. And yes, I did try to Google my way to an
>> > answer.
>> > Anyhoo, for what it's worth, I have a MySQL table that I wish to load
>> > via Pig. I have the LoadFunc working using PigServer in a Java app,
>> > but I noticed the following when my job gets submitted to my MR
>> > cluster. I generated 6 InputSplits in my custom InputFormat, where
>> > each split specifies a non-overlapping range/page of records to read
>> > from. I thought that each InputSplit would correspond to a map task,
>> > but what I see in the JobTracker is that the submitted job only has 1
>> > map task which executes each split serially. Is my understanding even
>> > correct that a split can be effectively assigned to a single map task?
>> > If so, can I coerce the submitted MR job to properly get each of my
>> > splits to execute in its own map task?
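
The idea of giving each split a non-overlapping range/page of records can be sketched as below. This is not Terry's implementation, which is not shown in the thread: the function name `page_ranges` and the (offset, limit) representation are assumptions made for illustration, roughly matching how a `LIMIT ... OFFSET ...` query could page through a table.

```python
def page_ranges(total_rows, num_splits):
    """Divide rows [0, total_rows) into contiguous (offset, limit) pages.

    Hypothetical helper: each returned pair would describe the records
    one split is responsible for reading.
    """
    if num_splits <= 0:
        raise ValueError("num_splits must be positive")
    base, extra = divmod(total_rows, num_splits)
    ranges = []
    offset = 0
    for i in range(num_splits):
        # The first `extra` pages take one additional row each,
        # so page sizes differ by at most one.
        limit = base + (1 if i < extra else 0)
        if limit == 0:
            continue  # fewer rows than splits: skip empty pages
        ranges.append((offset, limit))
        offset += limit
    return ranges


if __name__ == "__main__":
    for offset, limit in page_ranges(100, 6):
        print(offset, limit)
```

Because each page starts where the previous one ended, the ranges cover every row exactly once, which is the property the six splits described above rely on.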
>> >
>> > Thanks,
>> > -Terry
>> >
>>
>>
>>
>> --
>> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED]
>> datasyndrome.com
>>
>
>
>
> --
> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com

--
Best Regards,
Ruslan Al-Fakikh