Zookeeper, mail # user - Question about the Barrier Java example on the ZooKeeper documentation


Re: Question about the Barrier Java example on the ZooKeeper documentation
Flavio Junqueira 2011-03-08, 13:59
I believe the goal of the examples was never to be complete
solutions for barriers or queues, but just to give beginners a quick
bootstrap. It is true, though, that the documentation page does not
say so, and that can be misleading.

I see two possible action points out of this discussion:

1- State clearly at the beginning that the example discussed is not
correct if a process may finish its computation before another has
started, and that the example is there for illustration purposes only;
2- Add another example after the current one that discusses the
problem and shows how to fix it. This is an interesting option because
it illustrates how one could reason about a solution when developing
with ZooKeeper.
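
For illustration, the interleaving from the original question (quoted below) can be replayed in a small single-threaded Java sketch. It does not use the ZooKeeper client API: the class BarrierRaceSketch, its in-memory set of znode paths, and the paths /b1 and /ready are all assumptions made up for this sketch. It only contrasts the tutorial's "count the children" check with the "/ready znode" idea suggested in this thread.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory stand-in for a znode tree; NOT the ZooKeeper API. */
public class BarrierRaceSketch {
    private final Set<String> znodes = new HashSet<>(); // paths that currently exist
    private final String root = "/b1";                  // assumed barrier root
    private final int size;                             // number of participants

    BarrierRaceSketch(int size) { this.size = size; }

    void create(String path) { znodes.add(path); }
    void delete(String path) { znodes.remove(path); }
    boolean exists(String path) { return znodes.contains(path); }

    /** Counts the children currently registered under the barrier root. */
    int childCount() {
        int n = 0;
        for (String p : znodes) {
            if (p.startsWith(root + "/")) n++;
        }
        return n;
    }

    /** Tutorial-style check: proceed only while enough children exist. */
    boolean countCheck() { return childCount() >= size; }

    /** /ready-style check: whoever sees a full barrier creates /ready, which persists. */
    boolean readyCheck() {
        // With a real server a second create of /ready would fail; as noted in
        // the thread, that is harmless. Here the set simply ignores duplicates.
        if (childCount() >= size) create("/ready");
        return exists("/ready");
    }

    /** Replays steps 1 to 4 of the question; returns whether node2 passes the barrier. */
    static boolean replay(boolean useReady) {
        BarrierRaceSketch zk = new BarrierRaceSketch(2);

        zk.create("/b1/node1");                                   // step 1: node1 registers
        boolean node1Enters = useReady ? zk.readyCheck() : zk.countCheck();
        if (node1Enters) throw new IllegalStateException("node1 should wait at step 1");

        zk.create("/b1/node2");                                   // step 2: node2 registers, has not checked yet

        node1Enters = useReady ? zk.readyCheck() : zk.countCheck(); // step 3: node1 is notified and rechecks
        if (!node1Enters) throw new IllegalStateException("node1 should enter at step 3");
        zk.delete("/b1/node1");                                   // node1 finishes and deletes its znode

        return useReady ? zk.readyCheck() : zk.countCheck();      // step 4: node2 finally checks
    }

    public static void main(String[] args) {
        System.out.println("count-based check, node2 enters: " + replay(false));
        System.out.println("/ready-based check, node2 enters: " + replay(true));
    }
}
```

In this replay the count-based check leaves node2 waiting, because node1's znode is already gone when node2 looks, while the /ready flag stays visible after node1 has left. A real implementation would also need watches, the cleanup of /ready, and error handling, none of which are modeled here.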

If you are interested in helping us fix it, Semih, then you could
perhaps create a JIRA issue and assign it to yourself. I can help you out.

-Flavio

On Mar 7, 2011, at 11:23 AM, Semih Salihoglu wrote:

> Hi Mahadev,
>
> Sorry for the late response. I agree. Actually, in this other documentation,
> http://hadoop.apache.org/zookeeper/docs/r3.0.0/recipes.html, where there is
> only the pseudo-code, I think this situation is avoided. Here there is
> another znode, /ready, that all nodes have a watch on. After each node
> writes its own ephemeral child, it doesn't wait. It reads how many children
> have been written, and the last one writes the /ready znode and everyone
> wakes up. The only race condition in this one is that there can be two nodes
> trying to write /ready and only one of them will succeed, but this is OK.
>
> Thank you again,
>
> semih
>
> On Sat, Mar 5, 2011 at 6:41 PM, Mahadev Konar <[EMAIL PROTECTED]> wrote:
>
>> Semih,
>> You pointed it out right. It is possible to enter into a situation
>> like that. The recipe does have a bug. It can be fixed with the last
>> client creating a special znode and every node in the list watching
>> for that (so it'll be an indication for entering the barrier). No?
>>
>> thanks
>> mahadev
>>
>> On Sat, Mar 5, 2011 at 5:06 PM, Semih Salihoglu <[EMAIL PROTECTED]> wrote:
>>> Hi All,
>>>
>>> I am new to this group and to ZooKeeper. I was reading the Barrier tutorial
>>> in the ZooKeeper documentation:
>>> http://hadoop.apache.org/zookeeper/docs/current/zookeeperTutorial.html
>>> A barrier primitive is exactly how I want to use ZooKeeper. I have a
>>> question about this example. It's not really a ZooKeeper question; I think
>>> it's more a question about the Barrier primitive. Here it is: in the enter
>>> method of this Barrier implementation below
>>>
>>> boolean enter() throws KeeperException, InterruptedException {
>>>     zk.create(root + "/" + name, new byte[0], Ids.OPEN_ACL_UNSAFE,
>>>             CreateMode.EPHEMERAL_SEQUENTIAL);
>>>     while (true) {
>>>         synchronized (mutex) {
>>>             List<String> list = zk.getChildren(root, true);
>>>
>>>             if (list.size() < size) {
>>>                 mutex.wait();
>>>             } else {
>>>                 return true;
>>>             }
>>>         }
>>>     }
>>> }
>>>
>>> could there be a race condition? Let's say there are two
>>> machines/nodes, node1 and node2, that will use this code to synchronize
>>> over ZK. Let's say the following steps take place:
>>>
>>>
>>>  1. node1 calls the zk.create method and then reads the number of
>>> children, and sees that it's 1 and starts waiting.
>>>  2. node2 calls the zk.create method (doesn't call the
>>> zk.getChildren method yet, let's say it's very slow)
>>>  3. node1 is notified that the number of children on the znode
>>> changed, it checks that the size is 2 so it passes the barrier, it
>>> does its work and then leaves the barrier, deleting its node.
>>>  4. node2 calls zk.getChildren and because node1 has already left,
>>> it sees that the number of children is equal to 1. Since node1 will
