Zookeeper >> mail # dev >> Tool for interactive fault injection testing


Re: Tool for interactive fault injection testing
Hi Andrei,

Here are some thoughts about fault injection. Broadly speaking, I think we can classify faults into server faults and channel faults. Server faults can mess with the internals of a server and can cause it to crash, delay a response, or corrupt data. We have only limited support for tolerating data corruption, and I think that if you inject bit flips at various parts of the pipeline, we would be able to deal with them in only a few spots.

For channel faults, we could consider introducing delays, dropping messages at random, and disconnections at various points. Dropping messages at random may cause ZooKeeper to break because it makes the assumption in a few places that things are delivered in order and there is no gap. Introducing delays seems to be particularly interesting because we could test different ways of interleaving messages for leader election and Zab.
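
To make the two channel faults concrete, here is a minimal Java sketch. It is not ZooKeeper code: `FaultyChannelSketch`, `InOrderReceiver`, `FaultyChannel`, and their parameters are hypothetical names invented for illustration. It shows a channel that drops messages at random or delays them, in front of a receiver that assumes consecutive sequence numbers, so a dropped message surfaces as a gap.

```java
import java.util.Random;

/**
 * Hypothetical sketch, not ZooKeeper code: a channel that drops or delays
 * messages in front of a receiver that assumes in-order, gap-free delivery.
 */
public class FaultyChannelSketch {

    /** Receiver that expects consecutive sequence numbers starting at 0. */
    public static final class InOrderReceiver {
        private long expected = 0;
        private boolean gapDetected = false;

        public void deliver(long seq) {
            if (seq != expected) {
                gapDetected = true; // a dropped message surfaces as a gap
            }
            expected = seq + 1;
        }

        public boolean gapDetected() {
            return gapDetected;
        }
    }

    /** Channel that drops a message with probability dropPercent/100, else delays it. */
    public static final class FaultyChannel {
        private final Random random;
        private final int dropPercent;
        private final int maxDelayMillis;
        private int dropped = 0;

        public FaultyChannel(long seed, int dropPercent, int maxDelayMillis) {
            this.random = new Random(seed); // seeded so a run can be replayed
            this.dropPercent = dropPercent;
            this.maxDelayMillis = maxDelayMillis;
        }

        public void send(long seq, InOrderReceiver receiver) {
            if (random.nextInt(100) < dropPercent) {
                dropped++; // channel fault: the message is silently lost
                return;
            }
            if (maxDelayMillis > 0) {
                try {
                    // channel fault: hold the message for a random time
                    Thread.sleep(random.nextInt(maxDelayMillis + 1));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag
                }
            }
            receiver.deliver(seq);
        }

        public int dropped() {
            return dropped;
        }
    }

    public static void main(String[] args) {
        InOrderReceiver receiver = new InOrderReceiver();
        FaultyChannel channel = new FaultyChannel(42L, 10, 5);
        for (long seq = 0; seq < 100; seq++) {
            channel.send(seq, receiver);
        }
        System.out.println("dropped=" + channel.dropped()
                + " gapDetected=" + receiver.gapDetected());
    }
}
```

Because this sketch is single-threaded, the delay only slows delivery and never reorders messages; exploring different interleavings would need several senders, each with its own delay. A drop at the very end of the stream also goes unnoticed by this receiver, since nothing arrives afterwards to reveal the gap.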

-Flavio    

On Jul 2, 2012, at 10:27 PM, Andrei Savu wrote:

> Thanks Flavio! This is the rule I'm using for the demo:
>
> RULE NIO Server readPayload fails
> CLASS org.apache.zookeeper.server.NIOServerCnxn
> METHOD readPayload
> HELPER RandomHelper
> AT ENTRY
> IF nextInt(100) < 10
> DO throw new IOException("Injected by byteman");
> ENDRULE
>
>
> See:
>
> https://github.com/andreisavu/zookeeper-tester/blob/master/src/main/resources/functions/install_byteman.sh
>
>
> ~10% of all the payload reads result in an exception being thrown. There is an
> increase in latency but the cluster as a whole works as expected.
>
> I am planning to do more of this if you think it's useful.
>
> -- Andrei Savu
>
> On Mon, Jul 2, 2012 at 11:21 PM, Flavio Junqueira <[EMAIL PROTECTED]> wrote:
>
>> Sounds like great stuff, Andrei. Do you have a description of the faults
>> you have injected that I can access?
>>
>> -Flavio
>>
>> On Jul 2, 2012, at 10:14 PM, Andrei Savu wrote:
>>
>>> I was unable to find any issues so far. It seems like ZooKeeper does a
>>> great job at handling network failures.
>>>
>>> This tool deploys a ZooKeeper cluster on a cloud provider using Whirr,
>>> together with Byteman [1] (attached to the JVM).
>>>
>>> Faults are injected by using Byteman rules. See this tutorial:
>>>
>>> https://community.jboss.org/wiki/FaultInjectionTestingWithByteman#what_is_fault_injection_testing
>>>
>>> I am planning to improve the tool so that arbitrary rules can be injected
>>> through the web UI.
>>>
>>> As a workload generator I am using a distributed queue implementation
>>> that handles ConnectionLoss by retrying to post the message (duplicates
>>> are acceptable when measuring the latency).
>>>
>>> [1] http://www.jboss.org/byteman/
>>>
>>> -- Andrei Savu
>>>
>>> On Mon, Jul 2, 2012 at 7:39 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote:
>>>
>>>> Sounds interesting, but it's not clear to me from the provided docs
>>>> what it does and what I am expected to do (canned tests, or a
>>>> framework for me to use?). Have you been able to find any issues using
>>>> this?
>>>>
>>>> Patrick
>>>>
>>>> On Mon, Jul 2, 2012 at 3:15 AM, Andrei Savu <[EMAIL PROTECTED]> wrote:
>>>>> Hi guys,
>>>>>
>>>>> As part of my MSc project I have spent some time working on a tool for
>>>>> fault injection testing for Apache ZooKeeper based on JBoss Byteman and
>>>>> Apache Whirr.
>>>>>
>>>>> You can find the code on GitHub:
>>>>>
>>>>> https://github.com/andreisavu/zookeeper-tester
>>>>>
>>>>> Do you think this could be a useful addition to contrib (in a version
>>>>> that's a bit more generic)?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -- Andrei Savu / axemblr.com / Tools for Clouds
>>>>
>>
>>