Pig >> mail # user >> Conversion


Re: Conversion
I created the following:

http://pastie.org/1743857

And I'm using it in the following way:

register 'target/pig-1.0-SNAPSHOT.jar';
rows = LOAD 'foo' AS (user:chararray, item:long);
grouped = GROUP rows BY user;
-- rows.item is the bag of item tuples collected for each user
final = FOREACH grouped GENERATE FLATTEN(com.mycompany.pig.udf.Foo(rows.item));

Does that look about right? Is there any particular reason why I need to
flatten at the end? When I try to output a simple tuple from the
EvalFunc it is always a tuple inside a tuple.
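
On the flatten question: in Pig, whatever an EvalFunc returns becomes a single field of the generated row, so a UDF that returns a tuple yields a row holding one tuple-valued field, which prints as a tuple inside a tuple. FLATTEN promotes the inner tuple's fields to top-level fields. The sketch below models those shapes in plain Java, with nested lists standing in for Pig's tuples and bags. It does not use the Pig API, and the names BagModel, bagToTuple, generate, and flatten are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the data shapes; this is NOT the Pig API.
// A "tuple" is a List<Object>; a "bag" is a List of tuples.
class BagModel {

    // Models the UDF: concatenate the fields of every tuple in the bag
    // into one flat tuple, e.g. {(1),(2)} -> (1,2).
    static List<Object> bagToTuple(List<List<Object>> bag) {
        List<Object> out = new ArrayList<>();
        for (List<Object> tuple : bag) {
            out.addAll(tuple);
        }
        return out;
    }

    // Models GENERATE udf(...): the UDF result becomes ONE field of the row.
    static List<Object> generate(List<Object> udfResult) {
        List<Object> row = new ArrayList<>();
        row.add(udfResult);
        return row;
    }

    // Models FLATTEN on tuple-valued fields: their elements become top-level fields.
    static List<Object> flatten(List<Object> row) {
        List<Object> out = new ArrayList<>();
        for (Object field : row) {
            if (field instanceof List) {
                out.addAll((List<?>) field);
            } else {
                out.add(field);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> bag = List.of(
            List.<Object>of(174255619L), List.<Object>of(201077556L));
        List<Object> row = generate(bagToTuple(bag));
        System.out.println(row);          // [[174255619, 201077556]]
        System.out.println(flatten(row)); // [174255619, 201077556]
    }
}
```

The first printed row is the "tuple inside a tuple" shape; the second is what the same row looks like once the tuple-valued field has been flattened.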

Thanks
On 3/31/11 10:10 AM, Jonathan Coveney wrote:
> You definitely can do this with a UDF. You simply take the Tuples as input
> and then begin concatenating them together. Be wary of memory limitations
> for the intermediate result as it gets large. It may be more practical to
> let the output be a tuple whose elements are the rows.
>
> (199027860,199027860,149167529,203508790,198488630)
>
> The input to your UDF will then be a tuple whose first element is a bag,
> and the output will be a tuple of all the elements. It is quite easy to
> write something that does this; take a look at the UDF documentation and ask
> if you need any help.
>
> 2011/3/31 Mark<[EMAIL PROTECTED]>
>
>> I have these "rows"
>>
>> ({(155495400)})
>> ({(199027860),(199027860),(149167529),(203508790),(198488630)})
>> ({(174255619),(201077556),(199051606),(198778302)})
>>
>> I believe the correct way to explain them would be each row/tuple is a bag
>> that contains tuples of size 1? Is that right?
>>
>> Anyway, is there something native, or a UDF, I can use to convert them to
>> this format?
>>
>> (155495400)
>> (199027860 199027860 149167529 203508790 198488630)
>> (174255619 201077556 199051606 198778302)
>>
>> Maybe if I explain what we are trying to do it would help.
>>
>> We have logs of user product views in a tab-delimited format.
>>
>> foo\t1234
>> bar\t1234
>> foo\t4423
>> baz\t5563
>>
>> We simply want product views grouped by user and output on one line.
>>
>> 1234 4423
>> 1234
>> 5563
>>
>> The above first line would be from the user foo, second bar and third baz.
>>
>> Thanks
>>
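
For reference, the end result described in the original question (tab-delimited user/product lines grouped into one space-separated line per user) can be sketched outside Pig in plain Java. This is only an illustration of the target output, not the Pig job itself; the class and method names are hypothetical. It keeps users in first-seen order, which a Pig GROUP does not guarantee, and it skips lines that have no tab.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone illustration of the desired output; not Pig code.
class ViewsByUser {

    // Groups "user<TAB>item" lines by user and joins each user's items with spaces.
    static List<String> groupViews(List<String> lines) {
        // LinkedHashMap keeps users in the order they first appear.
        Map<String, List<String>> byUser = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t", 2);
            if (parts.length < 2) {
                continue; // skip malformed lines without a tab
            }
            byUser.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        List<String> out = new ArrayList<>();
        for (List<String> items : byUser.values()) {
            out.add(String.join(" ", items));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> log = List.of("foo\t1234", "bar\t1234", "foo\t4423", "baz\t5563");
        for (String row : groupViews(log)) {
            System.out.println(row);
        }
        // Prints:
        // 1234 4423
        // 1234
        // 5563
    }
}
```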