Pig, mail # user - Conversion


Re: Conversion
Mark 2011-04-01, 19:14
How would I return a list of values?

(val1, val2, val3...)

I tried returning a List<Object>; however, I get a tuple that contains
a tuple with a list of values and I have to flatten it to get the
desired behavior.

((val1, val2, val3...))

Thanks

On 4/1/11 10:09 AM, Dmitriy Ryaboy wrote:
> Right, Pig always returns a Tuple that contains whatever your UDF returns --
> so if you return a string, it returns a Tuple with a String in it.
> Unfortunately that also means that if you return a Tuple, you get a Tuple in
> a Tuple.
>
> We probably shouldn't do that, but at this point changing the behavior can
> break a lot of people's existing pig code :(.
>
> D
>
> On Fri, Apr 1, 2011 at 7:30 AM, Mark<[EMAIL PROTECTED]>  wrote:
>
>> I created the following:
>>
>> http://pastie.org/1743857
>>
>> And I'm using it in the following way:
>>
>> register 'target/pig-1.0-SNAPSHOT.jar';
>> rows = LOAD 'foo' AS (user:chararray, item:long);
>> grouped = GROUP rows BY user;
>> final = FOREACH grouped GENERATE FLATTEN(com.mycompany.pig.udf.Foo(rows.item));
>>
>> Does that look about right? Is there any particular reason why I need to
>> flatten at the end? When I try to output a simple tuple from the EvalFunc it
>> is always a tuple inside a tuple.
>>
>> Thanks
>>
>>
>>
>> On 3/31/11 10:10 AM, Jonathan Coveney wrote:
>>
>>> You definitely can do this with a UDF. You simply take the Tuples as input
>>> and then begin concatenating them together. Be wary of memory limitations
>>> for the intermediate as it gets large. It may be more practical to let the
>>> output be a tuple whose elements are the rows.
>>>
>>> (199027860,199027860,149167529,203508790,198488630)
>>>
>>> then the input to your UDF will be a tuple whose first element is a bag, and
>>> then the output will be a tuple of all the elements. It is quite easy to
>>> write something that does this; take a look at the UDF documentation and ask
>>> if you need any help.
>>>
>>> 2011/3/31 Mark<[EMAIL PROTECTED]>
>>>
>>>> I have these "rows"
>>>> ({(155495400)})
>>>> ({(199027860),(199027860),(149167529),(203508790),(198488630)})
>>>> ({(174255619),(201077556),(199051606),(198778302)})
>>>>
>>>> I believe the correct way to explain them would be each row/tuple is a bag
>>>> that contains tuples of size 1? Is that right?
>>>>
>>>> Anyway, is there something native or a UDF I can use to convert them to
>>>> this format?
>>>>
>>>> (155495400)
>>>> (199027860 199027860 149167529 203508790 198488630)
>>>> (174255619 201077556 199051606 198778302)
>>>>
>>>> Maybe if I explain what we are trying to do it would help.
>>>>
>>>> We have logs of users to product views in a tab delimited format.
>>>>
>>>> foo\t1234
>>>> bar\t1234
>>>> foo\t4423
>>>> baz\t5563
>>>>
>>>> We simply want product views grouped by user and output on one line.
>>>>
>>>> 1234 4423
>>>> 1234
>>>> 5563
>>>>
>>>> The first line above would be from the user foo, the second from bar and
>>>> the third from baz.
>>>>
>>>> Thanks
>>>>
>>>>
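
Note on the conversion discussed in this thread: a minimal sketch of the bag-to-flat-tuple step, assuming nothing beyond what is described above. It deliberately uses plain java.util collections rather than Pig's own DataBag and Tuple types, so it runs without the Pig jar; the names BagFlattenSketch, flattenBag and toLine are hypothetical and are not part of Pig or of the UDF linked earlier. A real EvalFunc would do the same loop over the input bag, and per the replies above its result would still arrive wrapped in an outer tuple unless FLATTEN is applied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical stand-in for the UDF logic; not the Pig API.
final class BagFlattenSketch {

    // Models a bag of single-field tuples, e.g. {(1234),(4423)},
    // as a list of lists and returns one flat list: (1234,4423).
    static List<Object> flattenBag(List<List<Object>> bag) {
        List<Object> flat = new ArrayList<>();
        for (List<Object> tuple : bag) {
            flat.addAll(tuple); // copy every field of the inner tuple
        }
        return flat;
    }

    // Renders the flat values space-separated on one line.
    static String toLine(List<Object> values) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Object value : values) {
            joiner.add(String.valueOf(value));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // The views of user "foo" from the sample log above.
        List<List<Object>> bag =
            List.of(List.<Object>of(1234L), List.<Object>of(4423L));
        System.out.println(toLine(flattenBag(bag))); // prints "1234 4423"
    }
}
```

The whole bag is copied into memory here, which is the limitation Jonathan warns about for large groups.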