-
Custom DB Loader UDF (Terry Siu, 2012-08-31, 21:02)
Hi all,

I know this question has probably been posed multiple times, but I'm having difficulty figuring out a couple of aspects of a custom LoadFunc that reads from a DB. And yes, I did try to Google my way to an answer.

Anyhoo, for what it's worth, I have a MySQL table that I wish to load via Pig. I have the LoadFunc working using PigServer in a Java app, but I noticed the following when my job gets submitted to my MR cluster. I generate 6 InputSplits in my custom InputFormat, where each split specifies a non-overlapping range/page of records to read. I thought that each InputSplit would correspond to a map task, but what I see in the JobTracker is that the submitted job has only 1 map task, which executes the splits serially.

Is my understanding even correct that each split should be assigned to its own map task? If so, can I coerce the submitted MR job to run each of my splits in its own map task?

Thanks,
-Terry
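To make the "non-overlapping range/page of records" idea concrete, here is a minimal sketch in plain Java of how a table could be divided into contiguous ranges, one per split. It is not Terry's code: the names `RangeSplits`, `Range` and `plan` are hypothetical, and the row count (1000) is made up for illustration. In a real loader each range would be wrapped in a Hadoop InputSplit by the custom InputFormat.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: divides totalRows rows into contiguous, non-overlapping ranges. */
final class RangeSplits {

    /** One half-open range [start, end) of row offsets. */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long length() {
            return end - start;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /** Plans up to numSplits ranges that together cover rows 0..totalRows exactly once. */
    static List<Range> plan(long totalRows, int numSplits) {
        if (totalRows < 0 || numSplits <= 0) {
            throw new IllegalArgumentException("totalRows must be >= 0 and numSplits > 0");
        }
        List<Range> ranges = new ArrayList<>();
        long base = totalRows / numSplits;   // rows every range gets
        long extra = totalRows % numSplits;  // the first 'extra' ranges get one more row
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            long size = base + (i < extra ? 1 : 0);
            if (size == 0) {
                break; // fewer rows than requested splits: stop rather than emit empty ranges
            }
            ranges.add(new Range(start, start + size));
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Assumed numbers: 1000 rows, 6 splits (6 is the count from the post above).
        for (Range r : plan(1000, 6)) {
            System.out.println(r + " -> LIMIT " + r.length() + " OFFSET " + r.start);
        }
    }
}
```

Each range maps naturally onto a `LIMIT ... OFFSET ...` query in the RecordReader. The range length is also a reasonable value to report from the split's `getLength()`, although whether that affects how the splits are scheduled depends on the framework settings.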
-
Re: Custom DB Loader UDF (Ruslan Al-Fakikh, 2012-08-31, 21:44)
Hi Terry,

I am not sure whether your architecture is correct, but here is what we do on my team: we override setLocation in LoadFunc so that it caches the DB data to HDFS. Basically, the simplest way is to copy the data from MySQL to HDFS with Sqoop and then read it in Pig as a normal input.

Ruslan

--
Best Regards,
Ruslan Al-Fakikh
-
RE: Custom DB Loader UDF (Terry Siu, 2012-08-31, 22:01)
Hi Ruslan,

Yep, I've heard of Sqoop and had originally thought of using it, but I wanted to give the LoadFunc a try first. With regard to overriding setLocation, I'm not sure I understand how you're using it to cache your DB data to HDFS. Ultimately, the location is used (per the documentation) "so that the input format can get itself set up properly before reading". I figured that in my case it's not necessary, so long as I pass the correct parameters to my InputFormat so that I can construct the splits and the RecordReaders correctly. That works for me, and I can store my generated Tuples in HDFS. Can you elaborate on your comment?

Thanks,
-Terry
-
Re: Custom DB Loader UDF (Russell Jurney, 2012-08-31, 22:02)
I don't have an answer, and I'm only learning these APIs myself, but you're writing something I'm planning on writing very soon - a MySQL-specific LoadFunc for Pig. I would greatly appreciate it if you would open source it on github or contribute it to Piggybank :)

The InputSplits should determine the number of mappers, but to debug you might try forcing it by setting some properties in your script re: input splits (see https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/1QtL9bBwL0c ):

The input split size is determined by mapred.min.split.size, dfs.block.size and mapred.map.tasks.

    goalSize  = totalSize / mapred.map.tasks
    minSize   = max(mapred.min.split.size, minSplitSize)
    splitSize = max(minSize, min(goalSize, dfs.block.size))

minSplitSize is determined by each InputFormat, such as SequenceFileInputFormat.

I'd play around with those and see if you can get it doing what you want.

--
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
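As a worked example of the split-size formula quoted above, here is a small plain-Java sketch that just does the arithmetic. The class and method names (`SplitSizeSketch`, `splitSize`, `approxSplitCount`) are hypothetical, the input sizes are made-up numbers, and the split count is a simple ceiling division, which only approximates what Hadoop does.

```java
/** Hypothetical calculator for the split-size formula quoted above. */
final class SplitSizeSketch {

    /**
     * splitSize = max(minSize, min(goalSize, blockSize)), where
     * goalSize = totalSize / requestedMaps and
     * minSize  = max(configuredMinSize, formatMinSize).
     */
    static long splitSize(long totalSize, int requestedMaps,
                          long configuredMinSize, long formatMinSize, long blockSize) {
        long goalSize = totalSize / Math.max(requestedMaps, 1); // avoid dividing by zero
        long minSize = Math.max(configuredMinSize, formatMinSize);
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    /** Approximate number of splits: ceiling of totalSize / splitSize. */
    static long approxSplitCount(long totalSize, long splitSize) {
        return (totalSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long blockSize = 64 * mib; // assumed dfs.block.size

        // 1 GiB of input, 6 requested maps: the block size caps the split size.
        long big = splitSize(1024 * mib, 6, 1, 1, blockSize);
        System.out.println(big / mib + " MiB per split, about "
                + approxSplitCount(1024 * mib, big) + " splits");

        // 60 MiB of input, 6 requested maps: goalSize (10 MiB) wins.
        long small = splitSize(60 * mib, 6, 1, 1, blockSize);
        System.out.println(small / mib + " MiB per split, about "
                + approxSplitCount(60 * mib, small) + " splits");
    }
}
```

Note that this formula describes file-based input formats. A custom InputFormat that builds its own list of splits in getSplits() does not go through it, so these properties may have no effect on a DB loader. Pig also has a split-combination feature (the `pig.splitCombination` property, on by default) that merges small splits into a single map task; that may be worth ruling out when 6 splits end up in 1 mapper.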
-
RE: Custom DB Loader UDF (Terry Siu, 2012-08-31, 22:12)
Thanks, Russell, I'll dig into your recommendations. I'd be happy to open source it, but at the moment it's not exactly general enough. However, I can certainly put it on github for your perusal.

-Terry
-
Re: Custom DB Loader UDF (Ruslan Al-Fakikh, 2012-08-31, 22:55)
Terry,

I may be misleading you in some way, since I haven't implemented our loader myself, but what we have is something like:

    @Override
    public void setLocation(String string, Job job) throws IOException {
        String path = ...//load data to hdfs and return the path
        OurCustomInputFormat.setInputPaths(job, path);
    }

where

    public class OurCustomInputFormat<K, V> extends FileInputFormat<K, V> {
        @Override
        public RecordReader<K, V> createRecordReader(InputSplit is, TaskAttemptContext tac)
                throws IOException, InterruptedException {
            // Raw type as posted; LineRecordReader yields one record per line of the cached file.
            RecordReader reader = new LineRecordReader();
            reader.initialize(is, tac);
            return reader;
        }
    }

Our loader is not very general either, and I am not allowed to open-source it. Basically, it is used for small portions of data for replicated joins.

Ruslan
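The cache-then-read pattern described above can be sketched without any Hadoop dependencies. In this plain-Java illustration a local temp file stands in for HDFS and a hard-coded list stands in for the DB query; the names `CacheThenReadSketch`, `cacheRows` and `readRows` are hypothetical and not part of Ruslan's loader.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for "dump DB rows to a file, then read it as ordinary line input". */
final class CacheThenReadSketch {

    /** Writes each row as one tab-delimited line and returns the path of the cache file. */
    static Path cacheRows(List<String[]> rows) {
        try {
            Path path = Files.createTempFile("db-cache-", ".tsv"); // stands in for an HDFS path
            List<String> lines = new ArrayList<>();
            for (String[] row : rows) {
                lines.add(String.join("\t", row));
            }
            Files.write(path, lines, StandardCharsets.UTF_8);
            return path;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the cache file back line by line, much as a line-oriented record reader would. */
    static List<String[]> readRows(Path path) {
        try {
            List<String[]> rows = new ArrayList<>();
            for (String line : Files.readAllLines(path, StandardCharsets.UTF_8)) {
                rows.add(line.split("\t", -1)); // -1 keeps trailing empty fields
            }
            return rows;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up rows standing in for the result of a DB query.
        List<String[]> rows = Arrays.asList(
                new String[] {"1", "alice"},
                new String[] {"2", "bob"});
        Path cached = cacheRows(rows);
        for (String[] row : readRows(cached)) {
            System.out.println(Arrays.toString(row));
        }
        cached.toFile().deleteOnExit();
    }
}
```

The appeal of this design is that once the data is a plain file, the ordinary file-based split logic applies and no DB-aware splits are needed; the cost is an extra copy, which fits the small replicated-join inputs mentioned above.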
-
Re: Custom DB Loader UDF (Russell Jurney, 2012-08-31, 23:03)
That would be awesome - I will generalize it and blog about what a great person you are :D
-
Re: Custom DB Loader UDF (Ruslan Al-Fakikh, 2012-08-31, 23:50)
Terry, Russell,

Just a proposal: maybe it should be added to DBStorage? http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/DBStorage.html

As far as I know, it only stores data for now, but I think it can be extended to both load and store, like PigStorage.

Ruslan
-
Re: Custom DB Loader UDF (Russell Jurney, 2012-09-01, 00:09)
I've thought about that, but getting stuff into Piggybank is hard - you have to peg it to a Pig release. My plan is to get MySQL working in github, then generalize it into DBStorage in Piggybank for 0.11.
-
Re: Custom DB Loader UDF (Dmitriy Ryaboy, 2012-09-02, 21:17)
You can also look at what Vertica did for their Pig connector: https://github.com/vertica/Vertica-Hadoop-Connector/blob/master/pig-connector/com/vertica/pig/VerticaLoader.java (it's Apache licensed, so if you reuse any code, you have to indicate the Vertica copyright and the Apache license in your credits).

D