Pig, mail # user - Custom DB Loader UDF


Re: Custom DB Loader UDF
Russell Jurney 2012-09-01, 00:09
I've thought about that, but getting code into Piggybank is hard: you
have to peg it to a Pig release. My plan is to get MySQL working on GitHub
first, then generalize it into DBStorage in Piggybank for 0.11.

On Fri, Aug 31, 2012 at 4:50 PM, Ruslan Al-Fakikh <
[EMAIL PROTECTED]> wrote:

> Terry, Russell,
>
> Just a proposal:
> maybe it should be added to DBStorage?
>
> http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/DBStorage.html
> As far as I know it only stores data for now, but I think it could be
> extended to both load and store, like PigStorage.
>
> Ruslan
>
> On Sat, Sep 1, 2012 at 3:03 AM, Russell Jurney <[EMAIL PROTECTED]>
> wrote:
> > That would be awesome - I will generalize it and blog about what a great
> > person you are :D
> >
> > On Fri, Aug 31, 2012 at 3:12 PM, Terry Siu <[EMAIL PROTECTED]> wrote:
> >
> >> Thanks, Russell, I'll dig in to your recommendations. I'd be happy to
> >> open source it, but at the moment, it's not exactly general enough.
> >> However, I can certainly put it on github for your perusal.
> >>
> >> -Terry
> >>
> >> -----Original Message-----
> >> From: Russell Jurney [mailto:[EMAIL PROTECTED]]
> >> Sent: Friday, August 31, 2012 3:03 PM
> >> To: [EMAIL PROTECTED]
> >> Subject: Re: Custom DB Loader UDF
> >>
> >> I don't have an answer, and I'm only learning these APIs myself, but
> >> you're writing something I'm planning on writing very soon - a
> >> MySQL-specific LoadFunc for Pig. I would greatly appreciate it if you
> >> would open source it on github or contribute it to Piggybank :)
> >>
> >> The InputSplits should determine the number of mappers, but to debug you
> >> might try forcing it by setting some properties in your script re:
> >> inputsplits (see
> >> https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/1QtL9bBwL0c
> >> ):
> >>
> >> The input split size is determined by mapred.min.split.size, dfs.block.size
> >> and mapred.map.tasks.
> >>
> >> goalSize = totalSize / mapred.map.tasks
> >> minSize = max(mapred.min.split.size, minSplitSize)
> >> splitSize = max(minSize, min(goalSize, dfs.block.size))
> >>
> >> minSplitSize is determined by each InputFormat such as
> >> SequenceFileInputFormat.
> >>
> >>
> >> I'd play around with those and see if you can get it doing what you want.
> >>
> >> On Fri, Aug 31, 2012 at 2:02 PM, Terry Siu <[EMAIL PROTECTED]>
> >> wrote:
> >>
> >> > Hi all,
> >> >
> >> > I know this question has probably been posed multiple times, but I'm
> >> > having difficulty figuring out a couple of aspects of a custom
> >> > LoadFunc to read from a DB. And yes, I did try to Google my way to an
> >> > answer.
> >> > Anyhoo, for what it's worth, I have a MySQL table that I wish to load
> >> > via Pig. I have the LoadFunc working using PigServer in a Java app,
> >> > but I noticed the following when my job gets submitted to my MR
> >> > cluster. I generated 6 InputSplits in my custom InputFormat, where
> >> > each split specifies a non-overlapping range/page of records to read
> >> > from. I thought that each InputSplit would correspond to a map task,
> >> > but what I see in the JobTracker is that the submitted job only has 1
> >> > map task which executes each split serially. Is my understanding even
> >> > correct that a split can be effectively assigned to a single map task?
> >> > If so, can I coerce the submitted MR job to properly get each of my
> >> > splits to execute in its own map task?
> >> >
> >> > Thanks,
> >> > -Terry
> >> >
> >>
> >>
> >>
> >> --
> >> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED]
> >> datasyndrome.com
> >>
> >
> >
> >
> > --
> > Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
>
>
>
> --
> Best Regards,
> Ruslan Al-Fakikh
>

--
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
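
The split-size rule quoted earlier in this thread can be sketched as follows. This is a minimal Python illustration, not Hadoop's code: the function and parameter names are made up for the example and simply mirror the quoted properties (mapred.map.tasks, mapred.min.split.size, dfs.block.size). It assumes a file-based InputFormat; a custom InputFormat such as the DB loader discussed here returns its own list of splits, so the rule may not apply to it directly.

```python
MB = 1024 * 1024


def split_size(total_size, map_tasks, block_size,
               min_split_size=1, format_min_split_size=1):
    """Return the split size in bytes under the quoted rule.

    total_size            -- total input size in bytes
    map_tasks             -- requested number of map tasks (mapred.map.tasks)
    block_size            -- block size in bytes (dfs.block.size)
    min_split_size        -- configured minimum (mapred.min.split.size)
    format_min_split_size -- minimum imposed by the InputFormat (minSplitSize)
    """
    # goalSize = totalSize / mapred.map.tasks (guard against zero tasks)
    goal_size = total_size // max(map_tasks, 1)
    # minSize = max(mapred.min.split.size, minSplitSize)
    min_size = max(min_split_size, format_min_split_size)
    # splitSize = max(minSize, min(goalSize, dfs.block.size))
    return max(min_size, min(goal_size, block_size))


def estimated_splits(total_size, size):
    """Rough split count: total size divided by split size, rounded up."""
    return -(-total_size // size)


if __name__ == "__main__":
    # Hypothetical inputs: 60 MB of data, 6 requested map tasks, 64 MB blocks.
    size = split_size(60 * MB, map_tasks=6, block_size=64 * MB)
    print(size // MB, "MB per split,", estimated_splits(60 * MB, size), "splits")
```

In this sketch the requested number of map tasks acts only as a hint: the resulting split size is capped by the block size and overridden by the minimum split size, so the split count is an estimate rather than a guarantee.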