Pig >> mail # user >> Multithreaded UDF

Re: Multithreaded UDF
Oh, this is much better than the custom loader hack I mentioned for batching
up input tuples.

On Wed, Nov 9, 2011 at 12:22 PM, Mridul Muralidharan
<[EMAIL PROTECTED]> wrote:

>
> A simple solution would be to tag each tuple with a random number (such
> that each number has multiple URLs associated with it, but not too many),
> and simply group on this field.
> In the reducer, you get a bag of URLs for each random number, at which
> point you can use multiple threads to fetch the content and associate each
> response with the appropriate input tuple.
>
>
> You only need to ensure that:
> a) Not too many tuples get associated with a single random number (to the
> extent that it causes spills to disk).
>
> b) Not too few tuples get associated with each of the random numbers you
> use, else it degenerates to the current case.
>
> c) You seed the random number generator sensibly, so that your tasks
> remain repeatable.
>
> Regards,
> Mridul
>
>
> On Wednesday 09 November 2011 07:04 PM, Daan Gerits wrote:
>
>> Hello,
>>
>> First of all, great job creating Pig, really a magnificent piece of
>> software.
>>
>> I do have a few questions about UDFs. I have a dataset with a list of
>> URLs I want to fetch. Since an EvalFunc can only process one tuple at a
>> time and the asynchronous abilities of the UDF are deprecated, I can only
>> fetch one URL at a time. The problem is that fetching a single URL takes a
>> reasonable amount of time (1 to 5 seconds, there is a delay built in),
>> which really slows down the processing. I already converted the UDF into
>> an Accumulator, but that only seems to get fired after a GROUP BY. It
>> would be nice to have some kind of queue UDF which queues the tuples until
>> a certain number is reached and then flushes the queue. That way I can add
>> tuples to an internal list and, on flush, start multiple threads to go
>> through the list of tuples.
>>
>> This is a workaround though, since the best solution would be to
>> reintroduce the asynchronous UDFs (in which case I can schedule the
>> threads as the tuples come in).
>>
>> Any ideas on this? I already saw someone trying almost the same thing,
>> but that thread didn't get a definite answer.
>>
>> Another option is to increase the number of reducer slots on the
>> cluster, but I'm afraid that would overload the nodes in case of a
>> heavy reduce phase.
>>
>> Best Regards,
>>
>> Daan
>>
>
>
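
For anyone reading this thread later, the bucket-and-group idea Mridul
describes above can be sketched outside of Pig roughly as follows. This is
only an illustration in plain Python, not Pig or UDF code: the function
names, the bucket count, the seed value and the thread count are all
assumptions made up for the example, and fake_fetch is a hypothetical
stand-in for the real HTTP fetch. In an actual job the grouping would be
done by Pig's GROUP BY and the threaded part would live in the UDF that
receives the bag.

```python
import random
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def tag_with_bucket(urls, num_buckets, seed=42):
    """Pair each URL with a bucket id drawn from a seeded generator."""
    rng = random.Random(seed)  # fixed seed (assumed value) keeps reruns repeatable
    return [(rng.randrange(num_buckets), url) for url in urls]


def group_by_bucket(tagged):
    """Collect URLs per bucket id, mimicking the bag a GROUP BY would produce."""
    bags = defaultdict(list)
    for bucket, url in tagged:
        bags[bucket].append(url)
    return dict(bags)


def fetch_bag(urls, fetch, max_threads=8):
    """Fetch all URLs of one bag concurrently, keeping each result with its input."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map yields results in input order, so zip re-pairs them correctly.
        return list(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url):
    """Hypothetical placeholder for the slow (1 to 5 second) real fetch."""
    return "content of " + url


if __name__ == "__main__":
    urls = ["http://example.com/%d" % i for i in range(20)]
    bags = group_by_bucket(tag_with_bucket(urls, num_buckets=4))
    for bucket, bag in sorted(bags.items()):
        print(bucket, len(fetch_bag(bag, fake_fetch)))
```

The number of buckets is the knob behind points a) and b): fewer buckets
give larger bags (more work per thread pool, but more risk of spilling),
more buckets give smaller bags and less benefit from threading.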