Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # user >> Multithreaded UDF


Copy link to this message
-
Re: Multithreaded UDF
oh, this is much better than custom loader hack I mentioned to batch up
input tuples.

On Wed, Nov 9, 2011 at 12:22 PM, Mridul Muralidharan
<[EMAIL PROTECTED]>wrote:

>
> A simple solution would be to tag each tuple with a random number (such
> that each number has multiple url's associated with it - but not too large
> a number of urls), and simply group based on this field.
> In the reducer, you get a bag of url's for each random number : at which
> point, you can use multiple threads to fetch content and associate their
> responses with the appropriate input tuple.
>
>
> You only need to ensure that :
> a) Too many tuples dont get associated with a single random number (to the
> extent that it causes spills to disk).
>
> b) Too few tuples dont get associated over all random numbers you use -
> else it degenerates to current case.
>
> c) You seed the random number sensible, in order not to hit problems with
> having your tasks being non-repeatable.
>
> Regards,
> Mridul
>
>
> On Wednesday 09 November 2011 07:04 PM, Daan Gerits wrote:
>
>> Hello,
>>
>> First of all, great job creating pig, really a magnificent piece of
>> software.
>>
>> I do have a few questions about UDFs. I have a dataset with a list of
>> url's I want to fetch. Since an EvalFunc can only process one tuple at a
>> time and the asynchronous abilities of the UDF are deprecated, I can only
>> fetch one url at a time. The problem is that fetching this one url takes a
>> reasonable amount of time (1 to 5 seconds, there is a delay built in) so
>> that really slows down the processing. I already converted the UDF into an
>> Accumulator but that only seems to get fired after a group by. If would be
>> nice to have some kind of Queue UDF which will queue the tuples until a
>> certain amount is reached and than flushes the queue. That way I can add
>> tuples to an internal list and on flush start multiple threads to go
>> through the list of Tuples.
>>
>> This is a workaround though, since the best solution would be to
>> reintroduce the asynchronous UDF's (in which case I can schedule the
>> threads as the tuples come in)
>>
>> Any idea's on this? I already saw someone trying almost the same thing,
>> but didn't get a definite answer from that one.
>>
>> An other option is to increase the number of reducer slots on the
>> cluster, but I'm afraid that would mean we overload the nodes in case of a
>> heavy reduce phase.
>>
>> Best Regards,
>>
>> Daan
>>
>
>
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB