Pig >> mail # user >> explode operation


Stan Rosenberg 2012-01-26, 03:11
Prashant Kommireddi 2012-01-26, 03:19
Stan Rosenberg 2012-01-26, 03:26
Stan Rosenberg 2012-01-26, 03:31
Prashant Kommireddi 2012-01-26, 03:46
Jonathan Coveney 2012-01-26, 20:04
Re: explode operation
Hi Jonathan,

What you recommended below is not quite right.  The right solution
would need to do something similar to 'explode'.

Thanks,

stan

On Thu, Jan 26, 2012 at 3:04 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
> I think this might give you what you want
>
> X = LOAD 'input.txt' using PigStorage(',') AS (id1:chararray,
> id2:chararray, id3:chararray, id4:chararray, id5:chararray);
> Y_0 = foreach X generate FLATTEN(TOBAG(*));
> Y = filter Y_0 by $0 is not null;
>
> 2012/1/25 Prashant Kommireddi <[EMAIL PROTECTED]>
>
>> Sorry I misunderstood your initial question. You would have to write a
>> custom UDF to do this.
>>
>> Thanks,
>> Prashant
>>
>> On Jan 25, 2012, at 7:32 PM, Stan Rosenberg
>> <[EMAIL PROTECTED]> wrote:
>>
>> > To clarify, here is our input:
>> >
>> > X = LOAD 'input.txt' AS (id1:chararray, id2:chararray,
>> > id3:chararray, id4:chararray, id5:chararray);
>> >
>> > We want to compute Y that consists of a single column denoting the set
>> > of all (non-null) ids coming from X.
>> >
>> > stan
>> >
>> >
>> > On Wed, Jan 25, 2012 at 10:26 PM, Stan Rosenberg
>> > <[EMAIL PROTECTED]> wrote:
>> >> I don't see how flatten would help in this case.
>> >>
>> >> On Wed, Jan 25, 2012 at 10:19 PM, Prashant Kommireddi
>> >> <[EMAIL PROTECTED]> wrote:
>> >>> Hi Stan,
>> >>>
>> >>> Would using FLATTEN and then DISTINCT work?
>> >>>
>> >>> Thanks,
>> >>> Prashant
>> >>>
>> >>> On Wed, Jan 25, 2012 at 7:11 PM, Stan Rosenberg <
>> >>> [EMAIL PROTECTED]> wrote:
>> >>>
>> >>>> Hi Guys,
>> >>>>
>> >>>> I came across a use case that seems to require an 'explode' operation
>> >>>> which to my knowledge is not currently available.
>> >>>> That is, given a tuple (x,y,z), 'explode' would generate the tuples
>> >>>> (x), (y), (z).
>> >>>>
>> >>>> E.g., consider a relation that contains an arbitrary number of
>> >>>> different identifier columns, say,
>> >>>> social security id, student id, etc.  We want to compute the set of
>> >>>> all distinct identifiers.  Assume that the number of identifier
>> >>>> columns is large and intermingled with other
>> >>>> columns that should be projected out; this rules out, e.g., a
>> >>>> solution using 'SPLIT'.
>> >>>>
>> >>>> To be concrete, if X = {(..., 2, 4, ..., 3), (..., 2,,...,5)} is such
>> >>>> a relation, then the answer we want is
>> >>>> Y={2,3,4,5}.
>> >>>>
>> >>>> Any suggestions?
>> >>>>
>> >>>> Thanks,
>> >>>>
>> >>>> stan
>> >>>>
>>
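
The suggestions made in this thread (FLATTEN, TOBAG, a null filter, DISTINCT) can be combined into one script. The following is a minimal, untested sketch in Pig Latin, not a solution confirmed by anyone in the thread, and it may not address whatever Stan found lacking in the FLATTEN(TOBAG(*)) version. The relation names ids and non_null and the field alias id are made up for illustration; the LOAD statement reuses the five-column schema given above with the default tab delimiter.

```pig
-- Load the identifier columns (schema taken from the thread; default tab delimiter).
X = LOAD 'input.txt' AS (id1:chararray, id2:chararray,
    id3:chararray, id4:chararray, id5:chararray);

-- "Explode": wrap the chosen columns in a bag, then flatten the bag so that
-- each input row yields one output row per listed column.
ids = FOREACH X GENERATE FLATTEN(TOBAG(id1, id2, id3, id4, id5)) AS id;

-- Drop the rows produced by null columns.
non_null = FILTER ids BY id IS NOT NULL;

-- Reduce to the set of distinct identifiers.
Y = DISTINCT non_null;
```

Naming the columns explicitly in TOBAG, rather than using *, is one way to leave out non-identifier columns when they are intermingled with the identifiers, at the cost of listing every identifier column by hand.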
Aniket Mokashi 2012-01-30, 07:25
Stan Rosenberg 2012-01-30, 16:05