I think a hybrid approach is probably more pain than it's worth. The
configuration of the networking and the IP addresses across virtual and
physical hosts will be challenging but not impossible. Also, what are you
trying to isolate Accumulo from? MapReduce perhaps? A large Storm instance?
Either way, you'll have to think about how to virtualize and provision
those things, too. Now your host is dealing with VMs and HDFS services.
None of these are really show-stopper excuses, so you really could do what
you are trying to do, but you'd be paving your own way.
I'm pretty sure I agree with Josh on this one, but wanted to explain the
pure virtualization option.
The VMWare thing you mentioned might have been this thing: (technical, but
less breadth)
I'm a big proponent of these as they really do solve a couple fundamental
problems (disclaimer: I used to work for Pivotal, which helped push this
solution). The neat thing they added in the extensions was the
understanding of data locality between TaskTrackers and DataNodes if they
reside on the same physical host in different virtual machines. This means
that jobs would get assigned to TTs within the same "node group", which is
nice for a couple reasons. Most prominently, it allows you to separate the
HDFS and MR services into different VMs while maintaining data locality.
This is good for scaling compute separate from storage, particularly in a
multi-tenant environment. Another cool thing is you can "shut off" the
execution environment: spin down the VMs with the TTs but leave the DNs
alone. There are some other things they did to make this architecture make
So getting back to your question, hypothetically, you could have multiple
HDFS instances on the same cluster (neat), each supporting one or more
Accumulo instances, each of which can be handled independently of one
another. Your MR and other things can also use VMs and you have pretty good
compartmentalization of resource utilization. This would give you multi-tenancy
and would allow you to manage separate services running over HDFS as
separate clusters. You could also stop tablet servers while keeping HDFS
(and perhaps MapReduce) alive, which could be interesting if you want to
start up a proof of concept but don't need the service to be live all the time.
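As a rough illustration of that per-tenant layout (the hostname, port, and path here are placeholders, not from this thread), each Accumulo instance could point its volumes at its own HDFS namespace via the instance.volumes property in accumulo-site.xml:

```xml
<!-- Hypothetical accumulo-site.xml fragment: tenant A's Accumulo
     instance uses tenant A's HDFS instance exclusively. -->
<property>
  <name>instance.volumes</name>
  <value>hdfs://namenode-tenant-a:8020/accumulo</value>
</property>
```

With one such config per tenant, stopping a tenant's Accumulo (or its whole compute VM set) leaves the other tenants' HDFS and Accumulo instances untouched.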
In that VMWare paper they mention that performance actually increases with
this DN/TT separation scheme over bare metal, but be wary of the numbers.
There is no doubt overhead in having a virtualization layer. But, if
multi-tenancy and elasticity are important to you, this could be one way to
make that tradeoff.
On Tue, Nov 5, 2013 at 3:31 PM, Josh Elser <[EMAIL PROTECTED]> wrote:
> Hi Kesten,
> As you likely know (given your arguments against), applying virtualization to
> a Hadoop stack can introduce some unintended consequences. Hadoop has a lot
> of heartbeats between processes to determine system "aliveness". If your
> infrastructure is overloaded, Hadoop can really suffer from spikes in latency.
> Accumulo is much the same way, arguably a bit more. Accumulo's processes
> are very dependent on maintaining a lock in ZooKeeper (every 30 seconds by
> default) instead of RPC calls between DataNodes and NameNodes. Accumulo's
> node failure tends to be much more expensive than HDFS' because Accumulo
> wants to make sure every tablet is available without significant downtime.
> Hadoop has multiple replicas for each file so it can be a bit more lazy
> about noticing failure and re-replicating. What I've typically heard is
> that running Accumulo in a virtualized environment makes administration and
> use a bit more difficult.
> If you're considering running HDFS on bare metal, I would encourage you to
> do the same with Accumulo, or investigate something like YARN (really, HOYA