Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig, mail # user - Custom DB Loader UDF


Copy link to this message
-
Re: Custom DB Loader UDF
Russell Jurney 2012-08-31, 23:03
That would be awesome - I will generalize it and blog about what a great
person you are :D

On Fri, Aug 31, 2012 at 3:12 PM, Terry Siu <[EMAIL PROTECTED]> wrote:

> Thanks, Russell, I'll dig in to your recommendations. I'd be happy to open
> source it, but at the moment, it's not exactly general enough. However, I
> can certainly put it on github for your perusal.
>
> -Terry
>
> -----Original Message-----
> From: Russell Jurney [mailto:[EMAIL PROTECTED]]
> Sent: Friday, August 31, 2012 3:03 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Custom DB Loader UDF
>
> I don't have an answer, and I'm only learning these APIs myself, but
> you're writing something I'm planning on writing very soon - a
> MySQL-specific LoadFunc for Pig. I would greatly appreciate it if you would
> open source it on github or contribute it to Piggybank :)
>
> The InputSplits should determine the number of mappers, but to debug you
> might try forcing it by setting some properties in your script re:
> inputsplits (see
>
> https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/1QtL9bBwL0c
> ):
>
> The input split size is detemined by map.min.split.size, dfs.block.size
> and mapred.map.tasks.
>
> goalSize = totalSize / mapred.map.tasks
> minSize = max {mapred.min.split.size, minSplitSize} splitSize= max
> (minSize, min(goalSize, dfs.block.size))
>
> minSplitSize is determined by each InputFormat such as
> SequenceFileInputFormat.
>
>
> I'd play around with those and see if you can get it doing what you want.
>
> On Fri, Aug 31, 2012 at 2:02 PM, Terry Siu <[EMAIL PROTECTED]>
> wrote:
>
> > Hi all,
> >
> > I know this question has probably been posed multiple times, but I'm
> > having difficulty figuring out a couple of aspects of a custom
> > LoaderFunc to read from a DB. And yes, I did try to Google my way to an
> answer.
> > Anyhoo, for what it's worth, I have a MySql table that I wish to load
> > via Pig. I have the LoaderFunc working using PigServer in a Java app,
> > but I noticed the following when my job gets submitted to my MR
> > cluster. I generated 6 InputSplits in my custom InputFormat, where
> > each split specifies a non-overlapping range/page of records to read
> > from. I thought that each InputSplit would correspond to a map task,
> > but what I see in the JobTracker is that the submitted job only has 1
> > map task which executes each split serially. Is my understanding even
> > correct that a split can be effectively assigned to a single map task?
> > If so, can I coerce the submitted MR job to properly get each of my
> > splits to execute in its own map task?
> >
> > Thanks,
> > -Terry
> >
>
>
>
> --
> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED]
> datasyndrome.com
>

--
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com