Pig, mail # user - Custom DB Loader UDF


Re: Custom DB Loader UDF
Russell Jurney 2012-08-31, 22:02
I don't have an answer, and I'm only learning these APIs myself, but you're
writing something I'm planning on writing very soon - a MySQL-specific
LoadFunc for Pig. I would greatly appreciate it if you would open source it
on github or contribute it to Piggybank :)

The InputSplits should determine the number of mappers, but to debug you
might try forcing it by setting some properties in your script that affect
input splits (see
https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/1QtL9bBwL0c):

The input split size is determined by mapred.min.split.size, dfs.block.size and
mapred.map.tasks.

goalSize  = totalSize / mapred.map.tasks
minSize   = max(mapred.min.split.size, minSplitSize)
splitSize = max(minSize, min(goalSize, dfs.block.size))

minSplitSize is determined by each InputFormat, such as
SequenceFileInputFormat.

I'd play around with those and see if you can get it doing what you want.
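
To make the arithmetic above concrete, here is a minimal sketch in plain Java.
It only mirrors the three quoted formulas; the class and method names
(SplitSizeSketch, goalSize, splitSize) and the sizes in main are made up for
illustration and are not Hadoop API. Note that this arithmetic describes
file-based InputFormats; a custom InputFormat returns whatever splits its own
getSplits() builds, so these properties may have no effect on it.

```java
// Illustrative only: mirrors the quoted split-size formulas, not a Hadoop class.
final class SplitSizeSketch {

    // goalSize = totalSize / mapred.map.tasks (a hint of 0 is treated as 1)
    static long goalSize(long totalSize, int numMapTasksHint) {
        return totalSize / Math.max(1, numMapTasksHint);
    }

    // splitSize = max(minSize, min(goalSize, dfs.block.size))
    static long splitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        final long MIB = 1024L * 1024;
        long totalSize = 1024 * MIB; // assumed: 1 GiB of input
        long blockSize = 64 * MIB;   // assumed dfs.block.size

        // Few requested map tasks: the block size caps the split at 64 MiB.
        System.out.println(splitSize(goalSize(totalSize, 4), 1, blockSize) / MIB);

        // Many requested map tasks: goalSize drops below the block size, giving 32 MiB.
        System.out.println(splitSize(goalSize(totalSize, 32), 1, blockSize) / MIB);

        // A large minimum split size wins over both, giving 128 MiB.
        System.out.println(splitSize(goalSize(totalSize, 32), 128 * MIB, blockSize) / MIB);
    }
}
```

In other words, raising mapred.map.tasks can only shrink splits below the block
size, and raising mapred.min.split.size can only grow them.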

On Fri, Aug 31, 2012 at 2:02 PM, Terry Siu <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> I know this question has probably been posed multiple times, but I'm
> having difficulty figuring out a couple of aspects of a custom LoadFunc
> to read from a DB. And yes, I did try to Google my way to an answer.
> Anyhoo, for what it's worth, I have a MySQL table that I wish to load via
> Pig. I have the LoadFunc working using PigServer in a Java app, but I
> noticed the following when my job gets submitted to my MR cluster. I
> generated 6 InputSplits in my custom InputFormat, where each split
> specifies a non-overlapping range/page of records to read from. I thought
> that each InputSplit would correspond to a map task, but what I see in the
> JobTracker is that the submitted job only has 1 map task which executes
> each split serially. Is my understanding even correct that a split can be
> effectively assigned to a single map task? If so, can I coerce the
> submitted MR job to properly get each of my splits to execute in its own
> map task?
>
> Thanks,
> -Terry
>
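
For anyone following along, the "non-overlapping range/page of records" idea
described in the quoted message can be sketched as below. This is not Terry's
code, which the thread does not show; the class and method names
(RangeSplitSketch, pageRanges) are hypothetical, and it assumes the total row
count is known up front. Each returned pair is a half-open [start, end) row
range that one InputSplit could carry.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: carves a row count into contiguous, non-overlapping pages.
final class RangeSplitSketch {

    // Returns numSplits half-open ranges [start, end) that together cover 0..totalRows.
    static List<long[]> pageRanges(long totalRows, int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        long base = totalRows / numSplits;   // rows every page gets
        long extra = totalRows % numSplits;  // leftover rows, spread over the first pages
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            long length = base + (i < extra ? 1 : 0);
            ranges.add(new long[] {start, start + length});
            start += length;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Assumed numbers: 100 rows carved into 6 pages, as in the 6-split case above.
        for (long[] r : pageRanges(100, 6)) {
            System.out.println(r[0] + " to " + r[1]);
        }
    }
}
```

Each range would translate into something like a LIMIT/OFFSET or key-range
predicate in the query a single split runs.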

--
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com