Pig, mail # user - Group Data By UDF Result?

Joshua Penton 2012-10-16, 22:05
Re: Group Data By UDF Result?
Russell Jurney 2012-10-17, 01:38
The 'enormous intermediate data way':

queries = foreach my_row generate id, extract_query(field1) as query;
target_queries = cross queries, target;
result = filter target_queries by my_condition(queries::query, target::value);  -- plus any further conditions
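
For illustration only, here is a minimal sketch of what a LIKE-style my_condition could look like, written as plain Python with no Pig dependencies. The argument names (query, value), the helper like_to_regex, and the choice of SQL LIKE wildcards (% and _) are assumptions based on the original question, not the actual UDF used in this thread.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regular expression.

    Assumed wildcard rules: '%' matches any run of characters and
    '_' matches exactly one character; everything else is literal.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            # Escape regex metacharacters so they match literally.
            parts.append(re.escape(ch))
    # \Z anchors at the very end so the whole value must match.
    return re.compile("".join(parts) + r"\Z", re.DOTALL)


def my_condition(query, value):
    """Return True when value matches the LIKE-style pattern in query."""
    if query is None or value is None:
        return False
    return like_to_regex(query).match(value) is not None
```

Applied to every row of the CROSS, a predicate like this runs once per (query, target) pair, which is why the intermediate data becomes so large.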

The 'looping smaller chunks in ram in a UDF if your data partitions way':

queries = foreach my_row generate id, extract_query(field1) as query;
by_key = group queries by some_key;
also_by_key = group target by some_key;
crossed_groups = cross by_key, also_by_key;
result = filter crossed_groups by looping_udf(fields);
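
A rough sketch of the partitioned idea in plain Python, to show why it can be cheaper than a full CROSS. Everything here is hypothetical: the row layout (id, text), the names group_by and match_by_partition, and the key_fn and matches callables are illustration choices. Note also that this sketch only compares groups that share a key, which is closer to a COGROUP than to the CROSS of groups shown above, and it is only valid if matching rows are guaranteed to share the same key.

```python
from collections import defaultdict


def looping_udf(query_bag, target_bag, matches):
    """Compare every query in one group against every target in one group."""
    pairs = []
    for query_id, query in query_bag:
        for target_id, value in target_bag:
            if matches(query, value):
                pairs.append((query_id, target_id))
    return pairs


def group_by(rows, key_fn):
    """Bucket rows by key, similar in spirit to Pig's GROUP ... BY."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    return groups


def match_by_partition(queries, targets, key_fn, matches):
    """Run the nested loop only within groups that share a key."""
    by_key = group_by(queries, key_fn)
    also_by_key = group_by(targets, key_fn)
    result = []
    for key, query_bag in by_key.items():
        # Targets with no matching key are never compared at all.
        result.extend(looping_udf(query_bag, also_by_key.get(key, []), matches))
    return result
```

For example, with prefix matching one could partition on the first character of the text, so a query starting with "a" is never compared against targets starting with "b". The number of comparisons then depends on the group sizes rather than on the full product of both data sets.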

Russell Jurney http://datasyndrome.com

On Oct 16, 2012, at 3:06 PM, Joshua Penton <[EMAIL PROTECTED]> wrote:

> Greetings.
> I currently have two sets of data, let's call them QUERY and TARGETS. What I am currently trying to do is the following:
> 1. For each row in QUERY extract a 'query' property
> 2. For each 'query' extracted, locate all TARGETS rows whose 'value' property "matches" the 'query' property.
> Note: Determining the "matches" state involves the execution of a custom UDF to determine the validity of equality (essentially implementing a SQL LIKE-style request). As a result, there doesn't appear to be in-built Pig functionality to perform this comparison.
> I have tried multiple methods, including a FOREACH with a FILTER command, convoluted COGROUPing, and countless others, to no avail. The only method that I've found works is to compute a full CROSS between QUERY and TARGETS and perform the FILTER on the result. However, the execution time of this single task is on the order of 30 minutes and would only grow exponentially once operational data is introduced.
> So, am I missing something obvious or is there some standard method to implement this functionality?
> (Please be kind, for as embarrassingly long as I have been on the internet I have never before submitted information to a mailing list.)
Jonathan Coveney 2012-10-16, 22:34