Pig >> mail # user >> Conversion


Re: Conversion
How would I return a list of values?

(val1, val2, val3...)

I tried returning a List<Object>; however, I get a tuple that contains
a tuple with a list of values, and I have to flatten it to get the
desired behavior.

((val1, val2, val3...))

Thanks
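
A minimal sketch of the nesting described above, assuming the built-in TOTUPLE function as a stand-in for the custom UDF (the pastie code is not reproduced in this thread) and the same 'foo' input and schema as in the quoted script below. A function that returns a tuple yields one field holding that tuple; FLATTEN un-nests it into top-level fields:

```pig
-- 'foo' and its schema come from the quoted script; TOTUPLE is only a
-- stand-in for a custom EvalFunc that returns a Tuple.
rows   = LOAD 'foo' AS (user:chararray, item:long);

-- One field per record, and that field is itself a tuple: ((user,item))
nested = FOREACH rows GENERATE TOTUPLE(user, item);

-- FLATTEN promotes the inner tuple's fields to the top level: (user,item)
flat   = FOREACH rows GENERATE FLATTEN(TOTUPLE(user, item));

DESCRIBE nested;
DESCRIBE flat;
```

Returning a Tuple from the UDF and wrapping the call in FLATTEN is therefore one way to get (val1, val2, val3...) as top-level fields.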

On 4/1/11 10:09 AM, Dmitriy Ryaboy wrote:
> Right, Pig always returns a Tuple that contains whatever your UDF returns --
> so if you return a string, it returns a Tuple with a String in it.
> Unfortunately that also means that if you return a Tuple, you get a Tuple in
> a Tuple.
>
> We probably shouldn't do that, but at this point changing the behavior can
> break a lot of people's existing pig code :(.
>
> D
>
> On Fri, Apr 1, 2011 at 7:30 AM, Mark<[EMAIL PROTECTED]>  wrote:
>
>> I created the following:
>>
>> http://pastie.org/1743857
>>
>> And I'm using it in the following way:
>>
>> register 'target/pig-1.0-SNAPSHOT.jar'
>> rows = LOAD 'foo' AS (user:chararray, item:long);
>> grouped = GROUP rows BY user;
>> final = GENERATE FLATTEN(com.mycompany.pig.udf.Foo(unique.item));
>>
>> Does that look about right? Is there any particular reason why I need to
>> flatten at the end? When I try to output a simple tuple from the EvalFunc it
>> is always a tuple inside a tuple.
>>
>> Thanks
>>
>>
>>
>> On 3/31/11 10:10 AM, Jonathan Coveney wrote:
>>
>>> You definitely can do this with a UDF. You simply take the Tuples as input
>>> and then begin concatenating them together. Be wary of memory limitations
>>> for the intermediate as it gets large. It may be more practical to let the
>>> output be a tuple whose elements are the rows.
>>>
>>> (199027860,199027860,149167529,203508790,198488630)
>>>
>>> then the input to your UDF will be a tuple whose first element is a bag, and
>>> then the output will be a tuple of all the elements. It is quite easy to
>>> write something that does this, take a look at the UDF documentation and ask
>>> if you need any help.
>>>
>>> 2011/3/31 Mark<[EMAIL PROTECTED]>
>>>
>>>> I have these "rows"
>>>> ({(155495400)})
>>>> ({(199027860),(199027860),(149167529),(203508790),(198488630)})
>>>> ({(174255619),(201077556),(199051606),(198778302)})
>>>>
>>>> I believe the correct way to explain them would be each row/tuple is a bag
>>>> that contains tuples of size 1? Is that right?
>>>>
>>>> Anyway, is there something native or UDF I can use to convert them to this
>>>> format?
>>>>
>>>> (155495400)
>>>> (199027860 199027860 149167529 203508790 198488630)
>>>> (174255619 201077556 199051606 198778302)
>>>>
>>>> Maybe if I explain what we are trying to do it would help.
>>>>
>>>> We have logs of user-to-product views in a tab-delimited format.
>>>>
>>>> foo\t1234
>>>> bar\t1234
>>>> foo\t4423
>>>> baz\t5563
>>>>
>>>> We simply want product views grouped by user and output on one line.
>>>>
>>>> 1234 4423
>>>> 1234
>>>> 5563
>>>>
>>>> The first line above would be from the user foo, the second from bar, and
>>>> the third from baz.
>>>>
>>>> Thanks
>>>>
>>>>
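
For the goal stated in the original message (product views grouped by user, one line per user), here is a sketch that avoids a custom UDF. It assumes a Pig release that provides the built-in BagToString function; older releases may not have it. The paths 'views.tsv' and 'views_by_user' and the alias names are hypothetical.

```pig
-- Input: tab-delimited lines of the form user<TAB>item, e.g. foo<TAB>1234.
-- 'views.tsv' is a hypothetical path.
rows    = LOAD 'views.tsv' USING PigStorage('\t') AS (user:chararray, item:long);

-- One record per user; the second field is a bag of that user's rows.
grouped = GROUP rows BY user;

-- rows.item projects a bag of single-field tuples such as {(1234),(4423)};
-- BagToString joins the values into one chararray using the given delimiter.
joined  = FOREACH grouped GENERATE BagToString(rows.item, ' ') AS items;

-- 'views_by_user' is a hypothetical output directory.
STORE joined INTO 'views_by_user';
```

Pig does not guarantee the order of tuples within a bag, so the items on each output line may not appear in the order they were logged.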