On Mar 3, 2011, at 3:29 PM, Jacob R Rideout wrote:
> On Thu, Mar 3, 2011 at 2:04 PM, Keith Wiley <[EMAIL PROTECTED]> wrote:
>> On Mar 3, 2011, at 2:51 AM, Steve Loughran wrote:
>>> yes, but the problem is determining which one will fail. Ideally you should find the root cause, which is often some race condition or hardware fault. If it's the same server every time, turn it off.
>>> You can play with the specex parameters, maybe change when they get kicked off. The assumption in the code is that the slowness is caused by H/W problems (especially HDD issues) and it tries to avoid duplicate work. If every Map was duplicated, you'd be doubling the effective cost of each query, and annoying everyone else in the cluster. Plus increased disk and network IO might slow things down.
>>> Look at the options, have a play and see. If it doesn't have the feature, you can always try coding it in; if the scheduler API lets you do it, you won't be breaking anyone else's code.
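For reference, the "specex parameters" Steve mentions are, in Hadoop of this era (0.20/1.x), the per-cluster switches below in mapred-site.xml. The values shown are the usual defaults; verify the exact names and defaults against your version's mapred-default.xml:

```xml
<!-- mapred-site.xml: speculative-execution switches (Hadoop 0.20/1.x names) -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
</property>
```

These can also be toggled per job through the old-API JobConf, e.g. setReduceSpeculativeExecution(true); note that speculation only kicks in when the framework judges a task to be a straggler, not unconditionally at launch.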
>> Thanks. I'll take it under consideration. In my case, it would be really beneficial to duplicate the work. The task in question is a single task on a single node (numerous mappers feed data into a single reducer), so duplicating the reducer represents very little duplicated effort while mitigating a potential bottleneck in the job's performance, since the job simply is not done until the single reducer finishes. I would really like to be able to do what I am suggesting: duplicate the reducer and kill the clones after the winner finishes.
>> Anyway, thanks.
> What is your reason for needing a single reducer? I'd first try to see
> how I could parallelize that work, if possible.
No no no, I don't want to debate my high-level design. I'm just trying to hone and sharpen the current approach. First and foremost, the reducer algorithm is not very amenable to parallelization. It is theoretically parallelizable, but only at the cost of significant overhead which would probably negate the benefits. I appreciate your curiosity, but that's not my goal here; my goal was simply to inquire about a seemingly obvious optimization, i.e., to race concurrent tasks such that if one fails, the total job time is not impeded.
If I find some free time I'll try to write up a longer description of our program, but I don't have time for that now, I'm sorry. I don't mean to sound rude, I just don't have time for that...kinda the same reason I was hoping to make the reducer more failure-tolerant in the first place. I'm trying to get this data processed super fast.
Anyway, thanks for the input, sounds like I've got the answer: Hadoop does not natively support what I'm suggesting, although if I want to try to patch it perhaps I can find a way...at some point.
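As an aside, the race-the-clones pattern described above (run duplicates, take the winner, kill the losers) is exactly what java.util.concurrent's ExecutorService.invokeAny does in miniature, so a patch along these lines has plain-Java precedent. A minimal sketch, with hypothetical task names standing in for reducer attempts:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SpeculativeRace {

    // Run both clones; invokeAny blocks until one completes successfully,
    // then cancels (interrupts) the still-running loser.
    static String race() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Callable<String> slowClone = () -> { Thread.sleep(500); return "slow"; };
            Callable<String> fastClone = () -> { Thread.sleep(50);  return "fast"; };
            return pool.invokeAny(List.of(slowClone, fastClone));
        } finally {
            pool.shutdownNow(); // interrupt any clone still running
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(race()); // prints "fast"
    }
}
```

Hadoop's scheduler would additionally have to account for the duplicated slot and I/O cost, which is presumably why speculation is throttled rather than unconditional.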
Cheers! Thanks for the input on the matter.
Keith Wiley [EMAIL PROTECTED] keithwiley.com music.keithwiley.com
"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
-- Galileo Galilei