Re: Group Data By UDF Result?
Jonathan Coveney 2012-10-16, 22:34
Howdy Joshua. This question comes up a fair amount, in various forms, and
here is the answer: unless you can figure out a way to reduce this to an
equi-join, then it is going to be tough.
Why is that? Because of how joining in map-reduce land works. The way
joining generally works is by hashing the join key in each relation and
sending equal hash values to the same reducer. Can you see why doing more
complicated equality operations is tough?
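To make the hashing argument concrete, here is a minimal Python sketch of a reduce-side join. It is only an illustration of the general idea, not Pig's actual implementation; NUM_REDUCERS, partition, and hash_join are hypothetical names invented for this example.

```python
from collections import defaultdict

NUM_REDUCERS = 4  # hypothetical reducer count, chosen only for illustration


def partition(key, num_reducers=NUM_REDUCERS):
    # Map side: route a record by the hash of its join key.
    # Identical keys always land on the same reducer.
    return hash(key) % num_reducers


def hash_join(left, right, num_reducers=NUM_REDUCERS):
    """Join two lists of (key, value) pairs the way a reduce-side join would."""
    # Shuffle: each reducer only ever sees the records routed to it.
    reducers = [{"left": defaultdict(list), "right": defaultdict(list)}
                for _ in range(num_reducers)]
    for key, value in left:
        reducers[partition(key, num_reducers)]["left"][key].append(value)
    for key, value in right:
        reducers[partition(key, num_reducers)]["right"][key].append(value)

    # Reduce: pair up values whose keys are exactly equal.
    joined = []
    for r in reducers:
        for key, left_values in r["left"].items():
            for lv in left_values:
                for rv in r["right"].get(key, []):
                    joined.append((key, lv, rv))
    return joined


queries = [("apple", "q1"), ("Apple", "q2")]
targets = [("apple", "t1")]
result = hash_join(queries, targets)
```

Here "Apple" and "apple" are different keys, so the q2 row never meets t1, even if a custom matcher would call them equal. They may even be shuffled to different reducers, so no reducer is guaranteed to see both rows at once.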
What is the algorithm around the equality?
Essentially, for a join to work, you need to be able to find a function f2
such that f1(x,y) = true iff f2(x) = f2(y),
where f1 is your current matching function.
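As a sketch of that condition, suppose (purely as an assumption, since the real UDF is not shown in this thread) that "matches" means equal after trimming whitespace and ignoring case. Then f2 can be the normalizing function, and the CROSS plus FILTER collapses into an ordinary equi-join on f2. The names below are hypothetical and the code is in Python rather than Pig Latin.

```python
from collections import defaultdict


def f1(x, y):
    # Hypothetical stand-in for the custom "matches" UDF:
    # equal after trimming whitespace and ignoring case.
    return x.strip().lower() == y.strip().lower()


def f2(x):
    # Canonical key: f1(x, y) is true exactly when f2(x) == f2(y).
    return x.strip().lower()


def cross_then_filter(queries, targets):
    # The slow approach: compare every query with every target.
    return [(q, t) for q in queries for t in targets if f1(q, t)]


def join_on_canonical_key(queries, targets):
    # The equi-join approach: bucket targets by f2, then look up each query.
    buckets = defaultdict(list)
    for t in targets:
        buckets[f2(t)].append(t)
    return [(q, t) for q in queries for t in buckets.get(f2(q), [])]


queries = ["Apple ", "pear"]
targets = ["apple", "APPLE", "plum"]
slow = cross_then_filter(queries, targets)
fast = join_on_canonical_key(queries, targets)
```

Both functions return the same pairs, but the second touches each row once instead of comparing every pair. Note that f2(x) = f2(y) is always reflexive, symmetric, and transitive, so an f2 can only exist when f1 behaves like an equivalence relation. A general LIKE-style pattern match usually does not, which is why it cannot be turned into an equi-join directly.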
2012/10/16 Joshua Penton <[EMAIL PROTECTED]>
> I currently have two sets of data, let's call them QUERY and TARGETS. What
> I am currently trying to do is the following:
> 1. For each row in QUERY extract a 'query' property
> 2. For each 'query' extracted locate all TARGET rows whose 'value'
> property "matches" the 'query' property.
> Note: Determining the "matches" state involves the execution of a custom
> UDF to determine the validity of equality. (Essentially implementing a SQL
> LIKE-style request) As a result there doesn't appear to be in-built Pig
> functionality to perform this comparison.
> I have tried multiple methods including utilizing a FOREACH with a FILTER
> command, convoluted COGROUPing, and countless other methods to no avail.
> The only method that I've found works is to compute a full CROSS between
> QUERY and TARGETS and perform the FILTER on the result. However, the
> execution time of this single task is on the order of 30
> minutes and would only grow exponentially once operational data is
> So, am I missing something obvious or is there some standard method to
> implement this functionality?
> (Please be kind, for as embarrassingly long as I have been on the internet
> I have never before submitted information to a mailing list.)