
Accumulo >> mail # dev >> loggers

On Mon, Jan 30, 2012 at 10:05 AM, Aaron Cordova wrote:
>> The big problem is that writing replicas in HDFS is done in a pipeline, rather than in parallel. There is a ticket to change this (HDFS-1783), but no movement on it since last summer.
> ugh - why would they change this? Pipelining maximizes bandwidth usage. It'd be cool if the log stream could be configured to return after being written to one, two, or more nodes, though.

The JIRA proposes to allow "star replication" instead of "pipeline
replication" on a per-stream basis. Pipelining trades off latency for
bandwidth -- multiple RTTs instead of 1 RTT.
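To make the latency trade-off concrete, here is a minimal back-of-the-envelope model of the two replication shapes for a 3-replica write. The function names and hop timings are illustrative assumptions, not anything from the HDFS source:

```python
# Illustrative latency model: pipeline vs. star replication for one
# write to three replicas. Timings are made-up numbers in milliseconds.

def pipeline_latency(hop_rtts):
    # Pipeline: client -> DN1 -> DN2 -> DN3, with the ack returning
    # along the chain, so latency is roughly the sum of per-hop RTTs.
    return sum(hop_rtts)

def star_latency(hop_rtts):
    # Star: the client writes to all replicas in parallel, so latency
    # is bounded by the slowest single replica (about one RTT).
    return max(hop_rtts)

hops = [2.0, 2.0, 2.0]  # illustrative per-hop RTTs, ms
print(pipeline_latency(hops))  # 6.0
print(star_latency(hops))      # 2.0
```

The flip side, as noted above, is bandwidth: in the star shape the client's uplink carries every replica's copy, while pipelining sends each byte from the client only once.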

A few other notes relevant to the discussion above (sorry for losing
the quote history):

Regarding HDFS being designed for large sequential writes rather
than small records: that was originally true, but it's now actually
fairly efficient. We have optimizations like HDFS-895 specifically for
the WAL use case, which approximate things like group commit, and when
you combine that with group commit at the tablet-server level you can
get very good throughput along with durability guarantees. I've never
benchmarked against Accumulo's Loggers, but I'd be surprised if the
difference were substantial - we tend to be network-bound on the WAL
unless the edits are really quite tiny.
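The group-commit idea mentioned above can be sketched in a few lines: many writers queue edits, and one expensive sync call makes the whole batch durable at once, amortizing the cost. This is a toy illustration, not the HDFS-895 or tablet-server code; the class and counter names are assumptions:

```python
# Toy group-commit sketch: the sync counter stands in for an expensive
# durability call (fsync/hflush); one sync covers every queued edit.
import threading

class GroupCommitLog:
    def __init__(self):
        self.lock = threading.Lock()
        self.pending = []
        self.syncs = 0  # number of expensive sync operations performed

    def append(self, edit):
        # Cheap: just queue the edit under the lock.
        with self.lock:
            self.pending.append(edit)

    def sync(self):
        # One sync makes every pending edit durable at once.
        with self.lock:
            batch, self.pending = self.pending, []
            if batch:
                self.syncs += 1  # stand-in for a real fsync/hflush
            return len(batch)

log = GroupCommitLog()
for i in range(100):
    log.append(("put", i))
print(log.sync())   # 100 edits made durable...
print(log.syncs)    # ...with a single sync
```

In a real server the batching happens under concurrency: threads that arrive while a sync is in flight get grouped into the next one, so throughput scales with load instead of with per-edit sync cost.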

We're also looking at making our WAL implementation pluggable: see
HBASE-4529. Maybe a similar approach could be taken in Accumulo such
that HBase could use Accumulo loggers, or Accumulo could use HBase's
existing WAL class?
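The pluggable-WAL idea amounts to having the server code against a small abstract interface, with different backends dropped in behind it. A minimal sketch of what such an interface might look like, in the spirit of HBASE-4529 (all class and method names here are hypothetical, not HBase's or Accumulo's actual API):

```python
# Hypothetical pluggable write-ahead-log interface. A real backend
# would write to HDFS files (HBase-style) or to dedicated logger
# processes (Accumulo-style); this in-memory one just counts edits.
from abc import ABC, abstractmethod

class WriteAheadLog(ABC):
    @abstractmethod
    def append(self, edit):
        """Queue an edit in the log."""

    @abstractmethod
    def sync(self):
        """Make all appended edits durable."""

class InMemoryWAL(WriteAheadLog):
    def __init__(self):
        self.edits = []
        self.durable = 0  # how many edits are "durable" so far

    def append(self, edit):
        self.edits.append(edit)

    def sync(self):
        self.durable = len(self.edits)

# The server only ever sees the abstract type, so swapping backends
# is a configuration choice rather than a code change.
wal: WriteAheadLog = InMemoryWAL()
wal.append("edit-1")
wal.sync()
print(wal.durable)  # 1
```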

Todd Lipcon
Software Engineer, Cloudera