Hadoop tasks use a single thread, so there won't be multiple threads
accessing the UDF.
However, there's a flip side to thread safety if your UDF maintains state:
is it receiving all the data it should, or is the data being sharded over
multiple processes in a way that defeats the UDF? My favorite example is a
moving-average calculator (like you might use in finance). Most
full-featured SQL dialects have window functions for this purpose.
Suppose I'm averaging over the last 50 closing prices for a given financial
instrument. To do this, I cache the last 50 prices I've seen in the UDF as
each record is passed to me (keeping the data for each instrument properly
separated). If some records for an instrument go to one mapper task and
others go to a different mapper task, then at least some of my averages
will be wrong due to missing data.
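To make the problem concrete, here's a minimal sketch (plain Java, not a real Pig or Hive UDF signature) of the kind of per-instrument state such a UDF would carry. The class name and method are hypothetical; the point is that the accumulated deque lives inside one task's UDF instance, so a second instance in another task sees only its own shard of the prices:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a moving-average accumulator that keeps the last
// N closing prices per instrument -- the state a UDF instance might hold.
public class MovingAverage {
    private final int window;
    // One price buffer per instrument, keyed by symbol.
    private final Map<String, Deque<Double>> lastPrices = new HashMap<>();

    public MovingAverage(int window) {
        this.window = window;
    }

    // Record one closing price and return the current moving average
    // for that instrument. If records for this instrument are split
    // across tasks, each task's instance computes an average over an
    // incomplete buffer.
    public double update(String instrument, double price) {
        Deque<Double> prices =
            lastPrices.computeIfAbsent(instrument, k -> new ArrayDeque<>());
        prices.addLast(price);
        if (prices.size() > window) {
            prices.removeFirst();   // drop the oldest price
        }
        double sum = 0.0;
        for (double p : prices) {
            sum += p;
        }
        return sum / prices.size();
    }
}
```

Two instances of this class fed disjoint halves of a stream will each produce averages over half the data, which is exactly the sharding failure described above.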
On Sun, Mar 10, 2013 at 10:12 PM, Shaun Clowes <[EMAIL PROTECTED]> wrote:
> Hi All,
> Could anyone describe what the required thread safety for a UDF is? I
> understand that one is instantiated for each use of the function in an
> expression, but can there be multiple threads executing the methods of a
> single UDF object at once?
*Dean Wampler, Ph.D.*