Re: PigStorage's handling of InputFormat and OutputFormat
Makes sense. I will attach an updated patch that moves Tuple serialization to
StorageUtil.

Since we expect users to extend PigStorage, I would like to add a
getFieldDelimiter() method. Otherwise the extender has to parse and
remember the delimiter.

Raghu.
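
A minimal sketch of the two ideas above, not the actual patch. DelimitedText and DelimitedStorage are hypothetical stand-ins for StorageUtil and PigStorage, a plain List stands in for a Pig Tuple, and the delimiter parsing is deliberately simplified:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a shared utility class (not the real StorageUtil API).
final class DelimitedText {
    private DelimitedText() {}

    // Joins the fields with the delimiter; null fields become empty strings.
    static String join(List<?> fields, char delimiter) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(delimiter);
            }
            Object field = fields.get(i);
            if (field != null) {
                sb.append(field);
            }
        }
        return sb.toString();
    }
}

// Hypothetical stand-in for a PigStorage-like base class.
class DelimitedStorage {
    private final char fieldDelimiter;

    DelimitedStorage(String delimiterSpec) {
        // Simplified parsing (assumption): the two characters "\t" mean tab,
        // anything else uses its first character.
        this.fieldDelimiter = "\\t".equals(delimiterSpec) ? '\t' : delimiterSpec.charAt(0);
    }

    // The accessor under discussion: subclasses reuse the parsed delimiter
    // instead of parsing and remembering it themselves.
    protected char getFieldDelimiter() {
        return fieldDelimiter;
    }

    String serialize(List<?> fields) {
        return DelimitedText.join(fields, getFieldDelimiter());
    }
}

// An extender that changes serialization but not delimiter handling.
class UpperCaseStorage extends DelimitedStorage {
    UpperCaseStorage(String delimiterSpec) {
        super(delimiterSpec);
    }

    @Override
    String serialize(List<?> fields) {
        return DelimitedText.join(fields, getFieldDelimiter()).toUpperCase();
    }
}
```

With the accessor in place, a subclass such as UpperCaseStorage never needs to see the raw constructor argument.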

On Fri, Jul 22, 2011 at 3:10 PM, Alan Gates <[EMAIL PROTECTED]> wrote:

> "There are very few StoreFuncs that extend PigStorage" that we know of.  We
> don't know how our users are extending it for themselves.  And PigStorage is
> a public interface.  Breaking it is a non-starter.
>
> Alan.
>
> On Jul 22, 2011, at 2:57 PM, Raghu Angadi wrote:
>
> > Yes, I don't like the extra copies either. That's why I didn't mark the Jira
> > 'patch available'. A static helper method would also be useful.
> >
> > But I don't see how it breaks existing StoreFuncs or output
> > formats. Is there an example? There are very few StoreFuncs that extend
> > PigStorage.
> >
> > Raghu.
> >
> > On Fri, Jul 22, 2011 at 1:37 PM, Alan Gates <[EMAIL PROTECTED]> wrote:
> >
> >> At this point I'm -1 on this.  I don't want to break existing output
> >> formats or store functions.  And I don't see that much value here.  You can
> >> accomplish the same thing by putting the logic in a static method of
> >> PigTextOutputFormat and letting other users use it.  Also, the cost of an
> >> extra copy of the output is bad.  We don't want to slow down storing data.
> >>
> >> Alan.
> >>
> >> On Jul 22, 2011, at 12:24 PM, Raghu Angadi wrote:
> >>
> >>> Attached a patch to https://issues.apache.org/jira/browse/PIG-2187
> >>>
> >>> The only drawback is the extra copies required to make a Text().
> >>>
> >>>
> >>>
> >>> On Thu, Jul 21, 2011 at 1:21 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> I agree that the tuple -> text conversion is better placed in StoreFunc.
> >>>> Users may then have a better chance to reuse the OutputFormat.
> >>>>
> >>>> For backward compatibility, the signature of StoreFunc.getOutputFormat
> >>>> returns a generic OutputFormat object, so this is fine. However, existing
> >>>> StoreFuncs that use PigOutputFormat would need to change.
> >>>
> >>>
> >>> You mean existing classes that override PigStorage.getOutputFormat() and not
> >>> PigStorage.putNext()?
> >>> Yes, they would be affected, but fixing them is very simple: they just need
> >>> to override putNext().
> >>> As such there is no contract regarding getOutputFormat() for us to break :)
> >>>
> >>> Raghu.
> >>>
> >>>> I don't know how much impact that will have, but we need to be careful.
> >>>> We need to make a clear announcement and document it as an incompatible
> >>>> change if we do so.
> >>>>
> >>>> Daniel
> >>>>
> >>>> On Thu, Jul 21, 2011 at 11:12 AM, Raghu Angadi <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>>> The expectation from PigStorage.getInputFormat() is that it returns an
> >>>>> InputFormat<Writable, Text>, and PigStorage handles converting Text to
> >>>>> Tuple. This makes it easy for users to plug in some other input format.
> >>>>>
> >>>>> But the same is not true for PigStorage.getOutputFormat(). Here it
> >>>>> expects an OutputFormat<Writable, Tuple>, so the output format needs to
> >>>>> convert Tuple to Text.
> >>>>>
> >>>>> Not sure if this is intentional or not. I can submit a patch to move Tuple
> >>>>> handling into PigStorage. Then PigTextOutputFormat would be as thin as
> >>>>> PigTextInputFormat.
> >>>>>
> >>>>
> >>
> >>
>
>
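
The asymmetry described in the original message at the bottom of the thread, and the proposed fix, can be sketched roughly as follows. This is an illustration under stated assumptions, not Pig code: TextSink and StoreSymmetrySketch are hypothetical names, String stands in for Hadoop's Text, and List<Object> stands in for a Pig Tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Stand-in for a thin record writer that only knows about text lines,
// the way the load side only hands over text.
interface TextSink {
    void write(String line);
}

final class StoreSymmetrySketch {
    private StoreSymmetrySketch() {}

    // Load side as described: the input format yields text and the
    // storage layer turns it into a tuple.
    static List<Object> textToTuple(String line, char delimiter) {
        // Limit -1 keeps trailing empty fields.
        String[] parts = line.split(Pattern.quote(String.valueOf(delimiter)), -1);
        return new ArrayList<Object>(Arrays.asList(parts));
    }

    // Tuple -> text conversion owned by the storage layer.
    static String tupleToText(List<Object> tuple, char delimiter) {
        return tuple.stream()
                .map(field -> field == null ? "" : field.toString())
                .collect(Collectors.joining(String.valueOf(delimiter)));
    }

    // Proposed store side: convert first, then pass plain text to the sink,
    // so the sink never needs to understand tuples.
    static void putNext(List<Object> tuple, char delimiter, TextSink sink) {
        sink.write(tupleToText(tuple, delimiter));
    }
}
```

In this shape any writer that accepts text can be reused on the store side. The cost raised in the thread is still present: building the intermediate text value is an extra copy per record.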