Pig, mail # user - explode operation


Re: explode operation
Stan Rosenberg 2012-01-30, 01:46
Hi Jonathan,

What you recommended below is not quite right.  The right solution
would need to do something similar to 'explode'.

Thanks,

stan

On Thu, Jan 26, 2012 at 3:04 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
> I think this might give you what you want
>
> X = LOAD 'input.txt' using PigStorage(',') AS (id1:chararray,
> id2:chararray, id3:chararray, id4:chararray, id5:chararray);
> Y_0 = foreach X generate FLATTEN(TOBAG(*));
> Y = filter Y_0 by $0 is not null;
>
> 2012/1/25 Prashant Kommireddi <[EMAIL PROTECTED]>
>
>> Sorry I misunderstood your initial question. You would have to write a
>> custom UDF to do this.
>>
>> Thanks,
>> Prashant
>>
>> On Jan 25, 2012, at 7:32 PM, Stan Rosenberg
>> <[EMAIL PROTECTED]> wrote:
>>
>> > To clarify, here is our input:
>> >
>> > X = LOAD 'input.txt' AS (id1:chararray, id2:chararray,
>> > id3:chararray, id4:chararray, id5:chararray);
>> >
>> > We want to compute Y that consists of a single column denoting the set
>> > of all (non-null) ids coming from X.
>> >
>> > stan
>> >
>> >
>> > On Wed, Jan 25, 2012 at 10:26 PM, Stan Rosenberg
>> > <[EMAIL PROTECTED]> wrote:
>> >> I don't see how flatten would help in this case.
>> >>
>> >> On Wed, Jan 25, 2012 at 10:19 PM, Prashant Kommireddi
>> >> <[EMAIL PROTECTED]> wrote:
>> >>> Hi Stan,
>> >>>
>> >>> Would using FLATTEN and then DISTINCT work?
>> >>>
>> >>> Thanks,
>> >>> Prashant
>> >>>
>> >>> On Wed, Jan 25, 2012 at 7:11 PM, Stan Rosenberg <
>> >>> [EMAIL PROTECTED]> wrote:
>> >>>
>> >>>> Hi Guys,
>> >>>>
>> >>>> I came across a use case that seems to require an 'explode' operation
>> >>>> which to my knowledge is not currently available.
>> >>>> That is, given a tuple (x,y,z), 'explode' would generate the tuples
>> >>>> (x), (y), (z).
>> >>>>
>> >>>> E.g., consider a relation that contains an arbitrary number of
>> >>>> different identifier columns, say,
>> >>>> social security id, student id, etc.  We want to compute the set of
>> >>>> all distinct identifiers.  Assume that the identifier columns are
>> >>>> numerous and intermingled with other columns that should be
>> >>>> projected out; this rules out, e.g., a solution using 'SPLIT'.
>> >>>>
>> >>>> To be concrete, if X = {(..., 2, 4, ..., 3), (..., 2,,...,5)} is such
>> >>>> a relation, then the answer we want is
>> >>>> Y={2,3,4,5}.
>> >>>>
>> >>>> Any suggestions?
>> >>>>
>> >>>> Thanks,
>> >>>>
>> >>>> stan
>> >>>>
>>
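
For readers following the thread, the 'explode' semantics described in the original question can be modeled outside Pig. The sketch below is plain Python, not Pig Latin and not a Pig UDF; the names explode, distinct_ids, and id_columns are hypothetical, and the two-row relation only mirrors the X = {(..., 2, 4, ..., 3), (..., 2,,...,5)} example, with the elided columns replaced by made-up placeholder values.

```python
from typing import Iterable, Iterator, Optional, Sequence, Set


def explode(row: Sequence[Optional[object]], id_columns: Iterable[int]) -> Iterator[object]:
    """Yield one value per selected column of `row`, skipping nulls (None)."""
    for i in id_columns:
        value = row[i]
        if value is not None:
            yield value


def distinct_ids(rows: Iterable[Sequence[Optional[object]]],
                 id_columns: Iterable[int]) -> Set[object]:
    """Explode every row over the identifier columns and keep the distinct values."""
    cols = list(id_columns)  # materialize once so it can be reused for every row
    return {value for row in rows for value in explode(row, cols)}


# Toy relation: columns 1..3 hold identifiers; columns 0 and 4 are
# placeholder non-identifier data that should be projected out.
X = [
    ("a", 2, 4, 3, "x"),
    ("b", 2, None, 5, "y"),
]

Y = distinct_ids(X, id_columns=[1, 2, 3])
print(sorted(Y))  # [2, 3, 4, 5]
```

Only the identifier columns are named explicitly, so the non-identifier columns never enter the result, and nulls are dropped before the distinct step. This states the intended result Y={2,3,4,5}; it does not settle how best to express the operation in Pig itself.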