Pig, mail # user - Best Practice: LOAD returns null


Re: Best Practice: LOAD returns null
Bill Graham 2012-04-11, 17:39
I'm not entirely following your empty data set proposal, but regardless I
think you should fail hard and fast if any part of the glob is missed. IIRC
Hadoop's filesystem API throws an exception when not all glob variants are
met. I'd recommend passing that exception up to the user, which should clearly
indicate why the job failed to execute.
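
To make the fail-fast idea concrete, here is a minimal sketch in Python. It
only illustrates the principle on the local filesystem with the standard
`glob` module; it is not what Pig or Hadoop actually run. The names
`resolve_inputs`, `MissingInputError`, and the `required` flag are
hypothetical. The flag is one possible way to separate the two cases from the
original question: a required input (such as a lookup table) aborts the job,
while an optional input (a time frame that may legitimately hold no data)
yields an empty list.

```python
import glob
import os
import tempfile


class MissingInputError(Exception):
    """Raised when a required glob pattern matches no files (hypothetical name)."""


def resolve_inputs(patterns, required=True):
    """Expand every pattern and return all matching paths.

    If `required` is true and any single pattern matches nothing,
    raise MissingInputError naming the unmatched patterns.
    If `required` is false, unmatched patterns are silently skipped.
    """
    matched, missing = [], []
    for pattern in patterns:
        hits = sorted(glob.glob(pattern))
        if hits:
            matched.extend(hits)
        else:
            missing.append(pattern)
    if missing and required:
        raise MissingInputError("no files match: " + ", ".join(missing))
    return matched


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        # Create data for hours 00 and 01 only; hour 02 is missing.
        for hour in ("00", "01"):
            open(os.path.join(base, "hour-%s.avro" % hour), "w").close()
        patterns = [os.path.join(base, "hour-%02d.avro" % h) for h in range(3)]

        # Optional input: the missing hour is tolerated.
        print(len(resolve_inputs(patterns, required=False)), "files found")

        # Required input: the missing hour aborts with a clear message.
        try:
            resolve_inputs(patterns)
        except MissingInputError as err:
            print("failed fast:", err)
```

The caller decides per input whether an empty match is acceptable, so the
error reaches the user with the unmatched pattern in the message instead of
surfacing later as a null relation.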
On Wed, Apr 11, 2012 at 1:13 AM, Markus Resch <[EMAIL PROTECTED]> wrote:

> Thanks, you are perfectly right, the LOAD needs to fail. But how do I
> proceed if it fails? AFAIK, I can't return an error to my caller or
> anything like that. One idea I had was to load a default (empty) data set
> of the given schema and union the result of LOAD with that to get a
> valid empty data set, which I can handle normally with an empty result.
> But that looks kind of complicated to me.
>
> Thanks
> Markus
>
> Am Dienstag, den 10.04.2012, 20:53 -0700 schrieb Bill Graham:
> > Typically, file pattern globbing is very strict, and a LOAD fails if not all
> > glob variants are met. This makes sense when you consider that someone might
> > pass a glob path covering each of the 24 hours in a day. If one of those
> > hours doesn't exist, you want the LOAD to fail.
> >
> > thanks,
> > Bill
> >
> >
> > On Tue, Apr 10, 2012 at 8:58 AM, Markus Resch <[EMAIL PROTECTED]> wrote:
> >
> > > Hey everyone,
> > >
> > > I have a new question about how best to handle a very common issue:
> > > We have a LOAD statement loading AVRO files, using a glob pattern to
> > > select them. For some weird reason this might return null because no
> > > file matches the pattern.
> > > There are two conceivable cases where this can happen:
> > > On purpose: no data was gathered in, e.g., this time frame.
> > > On error: some nasty guy deleted a very important lookup table for my
> > > join. The replicated join was a great hint, btw :).
> > >
> > >
> > > Do you have any suggestion about how to handle this?
> > >
> > > Thanks
> > >
> > > Markus
> > >
> > >
> >
> >
>