Pig >> mail # user >> Custom DB Loader UDF

Re: Custom DB Loader UDF

I might mislead you in some way here, since I haven't implemented our
loader myself, but what we have is something like:
@Override
public void setLocation(String location, Job job) throws IOException {
    String path = ... // load data to HDFS and return the path
    OurCustomInputFormat.setInputPaths(job, path);
}

public class OurCustomInputFormat<K, V> extends FileInputFormat<K, V> {

    @Override
    public RecordReader<K, V> createRecordReader(InputSplit is,
            TaskAttemptContext tac) throws IOException, InterruptedException {
        RecordReader reader = new LineRecordReader();
        reader.initialize(is, tac);
        return reader;
    }
}
Our loader is not very common either, and I am not allowed to
open-source it. Basically it is used for small portions of data for
replicated joins.
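As an aside on the paging approach discussed in this thread (non-overlapping record ranges, one per split), the boundary arithmetic can be sketched without any Hadoop dependencies. All names and numbers below are hypothetical, not from the actual loader:

```java
// Hypothetical sketch: compute the non-overlapping {start, end} record
// ranges (end exclusive) that each InputSplit of a paged DB table would
// cover. Row counts and split counts are illustrative.
public class PageRanges {
    static long[][] ranges(long totalRows, int numSplits) {
        long[][] out = new long[numSplits][2];
        long base = totalRows / numSplits;
        long rem = totalRows % numSplits; // first `rem` splits get one extra row
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            long size = base + (i < rem ? 1 : 0);
            out[i][0] = start;
            out[i][1] = start + size;
            start += size;
        }
        return out;
    }

    public static void main(String[] args) {
        for (long[] r : ranges(1_000_000, 6)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

Each {start, end} pair would then be carried by one InputSplit, so together the splits cover the whole table without overlap.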


On Sat, Sep 1, 2012 at 2:12 AM, Terry Siu <[EMAIL PROTECTED]> wrote:
> Thanks, Russell, I'll dig into your recommendations. I'd be happy to open source it, but at the moment, it's not exactly general enough. However, I can certainly put it on github for your perusal.
> -Terry
> -----Original Message-----
> From: Russell Jurney [mailto:[EMAIL PROTECTED]]
> Sent: Friday, August 31, 2012 3:03 PM
> Subject: Re: Custom DB Loader UDF
> I don't have an answer, and I'm only learning these APIs myself, but you're writing something I'm planning on writing very soon - a MySQL-specific LoadFunc for Pig. I would greatly appreciate it if you would open source it on github or contribute it to Piggybank :)
> The InputSplits should determine the number of mappers, but to debug you might try forcing it by setting some properties in your script re: input splits (see
> https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/1QtL9bBwL0c ):
> The input split size is determined by mapred.min.split.size, dfs.block.size and mapred.map.tasks:
> goalSize = totalSize / mapred.map.tasks
> minSize = max(mapred.min.split.size, minSplitSize)
> splitSize = max(minSize, min(goalSize, dfs.block.size))
> minSplitSize is determined by each InputFormat, such as SequenceFileInputFormat.
> I'd play around with those and see if you can get it doing what you want.
> On Fri, Aug 31, 2012 at 2:02 PM, Terry Siu <[EMAIL PROTECTED]> wrote:
>> Hi all,
>> I know this question has probably been posed multiple times, but I'm
>> having difficulty figuring out a couple of aspects of a custom
>> LoadFunc to read from a DB. And yes, I did try to Google my way to an answer.
>> Anyhoo, for what it's worth, I have a MySQL table that I wish to load
>> via Pig. I have the LoadFunc working using PigServer in a Java app,
>> but I noticed the following when my job gets submitted to my MR
>> cluster. I generated 6 InputSplits in my custom InputFormat, where
>> each split specifies a non-overlapping range/page of records to read
>> from. I thought that each InputSplit would correspond to a map task,
>> but what I see in the JobTracker is that the submitted job only has 1
>> map task which executes each split serially. Is my understanding even
>> correct that a split can be effectively assigned to a single map task?
>> If so, can I coerce the submitted MR job to properly get each of my
>> splits to execute in its own map task?
>> Thanks,
>> -Terry
> --
> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
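The split-size formula Russell quotes can be checked with plain arithmetic; a minimal sketch, using the Hadoop 1.x property names from the quote and illustrative values:

```java
// Sketch of the quoted split-size formula:
//   splitSize = max(minSize, min(goalSize, blockSize))
// Property names follow Hadoop 1.x (mapred.map.tasks, mapred.min.split.size);
// the numbers in main() are illustrative, not from the thread.
public class SplitSize {
    static long splitSize(long totalSize, int numMapTasks,
                          long formatMinSplitSize, long configuredMinSize,
                          long blockSize) {
        long goalSize = totalSize / numMapTasks;                   // totalSize / mapred.map.tasks
        long minSize = Math.max(configuredMinSize, formatMinSplitSize); // max(mapred.min.split.size, format minimum)
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        // 1 GB input, 6 requested map tasks, 64 MB blocks, tiny minimums:
        long s = splitSize(1L << 30, 6, 1, 1, 64L << 20);
        System.out.println(s); // goalSize (~170 MB) exceeds the block size, so blockSize wins
    }
}
```

With small minimums, the only lever that raises the split size above the block size is mapred.min.split.size, and the only one that lowers it is a large mapred.map.tasks, which is why tuning those properties can change how many map tasks the job gets.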

Best Regards,
Ruslan Al-Fakikh