[accumulo-dev] loggers


By default we roll every hour regardless of the number of edits. We
also roll every 0.95 x blocksize (fairly arbitrary, but works out to
around 120M usually). And same thing - we trigger flushes when we have
too many logs - but we're much less aggressive than you -- our default
here is 32 segments, I think. We'll also roll immediately any time
there are fewer than 3 DNs in the pipeline.

-Todd
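
(As a concrete illustration of the policy above, here is a minimal sketch
of those roll triggers. The class name, method names, and constants are
all invented -- this is not HBase's actual code, and a real deployment
would read these thresholds from configuration:)

    // Illustrative sketch only, not HBase's actual roll logic.
    public class WalRollPolicy {
        private static final long ROLL_INTERVAL_MS = 60L * 60 * 1000; // hourly roll
        private static final double SIZE_FACTOR = 0.95;               // of the HDFS blocksize
        private static final int MIN_PIPELINE_DNS = 3;

        private final long blockSizeBytes;

        public WalRollPolicy(long blockSizeBytes) {
            this.blockSizeBytes = blockSizeBytes;
        }

        /** Decide whether the current log segment should be rolled. */
        public boolean shouldRoll(long segmentAgeMs, long segmentBytes, int pipelineDns) {
            if (pipelineDns < MIN_PIPELINE_DNS) return true;   // degraded pipeline: roll now
            if (segmentAgeMs >= ROLL_INTERVAL_MS) return true; // time-based roll
            // 0.95 x blocksize works out to ~120M with 128M blocks
            return segmentBytes >= (long) (SIZE_FACTOR * blockSizeBytes);
        }
    }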

On Mon, Jan 30, 2012 at 11:15 AM, Eric Newton <[EMAIL PROTECTED]> wrote:
> What is "periodically" for closing the log segments?  Accumulo basis this
> decision based on the log size, with 1G segments by default.
>
> Accumulo marks the tablets with the log segment, and initiates flushes when
> the number of segments grows large (3 segments by default).  Otherwise,
> tablets with low write-rates, like the metadata table, can take forever to
> recover.
>
> Does HBase do similar things?  Have you experimented with larger/smaller
> segments?
>
> -Eric
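
(A sketch of the bookkeeping Eric describes: each tablet tracks which log
segments still hold its unflushed edits, and a flush is requested once that
set passes the threshold. All names here are invented for illustration, not
Accumulo's actual code:)

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical sketch: per-tablet segment tracking with a flush
    // trigger at 3 referenced segments (Accumulo's stated default).
    public class TabletLogTracker {
        private static final int MAX_REFERENCED_SEGMENTS = 3;
        private final Set<String> referencedSegments = new HashSet<>();

        /**
         * Record that this tablet has unflushed edits in the given segment.
         * Returns true if a flush should be requested; without this, a
         * slow-writing tablet (e.g. the metadata table) pins an unbounded
         * number of old segments and recovery drags.
         */
        public synchronized boolean recordSegment(String segmentId) {
            referencedSegments.add(segmentId);
            return referencedSegments.size() > MAX_REFERENCED_SEGMENTS;
        }

        /** After a successful flush, the old segments are no longer needed. */
        public synchronized void onFlushComplete() {
            referencedSegments.clear();
        }
    }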
>
> On Mon, Jan 30, 2012 at 1:59 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
>
>> On Mon, Jan 30, 2012 at 10:53 AM, Eric Newton <[EMAIL PROTECTED]>
>> wrote:
>> > I will definitely look at using HBase's WAL.  What did you guys do to
>> > distribute the log-sort/split?
>>
>> In 0.90 the log sorting was done serially by the master, which as you
>> can imagine was slow after a full outage.
>> In 0.92 we have distributed log splitting:
>> https://issues.apache.org/jira/browse/hbase-1364
>>
>> I don't think there is a single design doc for it anywhere, but
>> basically we use ZK as a distributed work queue. The master shoves
>> items in there, and the region servers try to claim them. Each item
>> corresponds to a log segment (the HBase WAL rolls periodically). After
>> the segments are split, they're moved into the appropriate region
>> directories, so when the region is next opened, they are replayed.
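
(The pattern Todd describes boils down to a claim-by-ephemeral-znode queue.
Below is a rough sketch of that pattern, not HBase's actual
SplitLogManager/SplitLogWorker code; the /splitlog znode layout and helper
names are invented:)

    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // Each region server polls the queue and claims tasks by creating
    // an ephemeral znode; only the winner splits that segment.
    public class SplitWorker {
        private final ZooKeeper zk;

        public SplitWorker(ZooKeeper zk) {
            this.zk = zk;
        }

        public void pollAndSplit() throws KeeperException, InterruptedException {
            List<String> tasks = zk.getChildren("/splitlog", false);
            for (String task : tasks) {
                try {
                    // Ephemeral: if this worker dies, the claim vanishes
                    // and another server can pick the task up again.
                    zk.create("/splitlog-claims/" + task, new byte[0],
                            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                } catch (KeeperException.NodeExistsException alreadyClaimed) {
                    continue; // another region server got there first
                }
                splitSegment(task);                 // sort edits by region
                zk.delete("/splitlog/" + task, -1); // mark the task done
            }
        }

        private void splitSegment(String segmentName) {
            // Placeholder: read the WAL segment, partition its edits by
            // region, and write them into each region's recovered-edits
            // directory for replay at open time.
        }
    }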
>>
>> -Todd
>>
>> >
>> > On Mon, Jan 30, 2012 at 1:37 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
>> >
>> >> On Mon, Jan 30, 2012 at 10:05 AM, Aaron Cordova
>> >> <[EMAIL PROTECTED]> wrote:
>> >> >> The big problem is that writing replicas in HDFS is done in a
>> >> >> pipeline, rather than in parallel. There is a ticket to change this
>> >> >> (HDFS-1783), but no movement on it since last summer.
>> >> >
>> >> > ugh - why would they change this? Pipelining maximizes bandwidth
>> >> > usage. It'd be cool if the log stream could be configured to return
>> >> > after being written to one, two, or more nodes though.
>> >> >
>> >>
>> >> The JIRA proposes to allow "star replication" instead of "pipeline
>> >> replication" on a per-stream basis. Pipelining trades off latency for
>> >> bandwidth -- multiple RTTs instead of 1 RTT.
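
(To make the tradeoff concrete, a back-of-the-envelope model with made-up
numbers -- the RTT value and replica count below are assumptions, not
measurements:)

    // Pipelining forwards each packet DN1 -> DN2 -> DN3, so the ack
    // needs roughly one RTT per hop; star replication writes all
    // replicas in parallel and waits for the slowest, roughly one RTT.
    public class ReplicationLatency {
        public static void main(String[] args) {
            double rttMs = 0.5;  // assumed intra-rack round trip
            int replicas = 3;

            double pipelineMs = replicas * rttMs; // ~1.5 ms before the ack returns
            double starMs = rttMs;                // ~0.5 ms, bounded by slowest replica

            System.out.printf("pipeline ~%.1f ms, star ~%.1f ms%n",
                    pipelineMs, starMs);
            // The price of star: the client sends each byte 3 times, so
            // its outbound bandwidth is cut to a third versus pipelining.
        }
    }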
>> >>
>> >> A few other notes relevant to the discussion above (sorry for losing
>> >> the quote history):
>> >>
>> >> Regarding HDFS's being designed for large sequential writes rather
>> >> than small records, that was originally true, but now it's actually
>> >> fairly efficient. We have optimizations like HDFS-895 specifically for
>> >> the WAL use case which approximate things like group commit, and when
>> >> you combine that with group commit at the tablet-server level you can
>> >> get very good throughput along with durability guarantees. I haven't
>> >> benchmarked vs Accumulo's Loggers ever, but I'd be surprised if the
>> >> difference were substantial - we tend to be network bound on the WAL
>> >> unless the edits are really quite tiny.
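
(A minimal sketch of the group-commit idea at the tablet/region-server
level -- not HDFS-895 or HBase's real implementation; only hflush() is a
real HDFS call here, everything else is invented:)

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;

    // Many writer threads append edits; a single hflush() then makes the
    // whole batch durable, amortizing the sync cost across the group.
    public class GroupCommitLog {
        private final FSDataOutputStream out;
        private long appended = 0; // highest sequence number written
        private long synced = 0;   // highest sequence number known durable

        public GroupCommitLog(FSDataOutputStream out) {
            this.out = out;
        }

        /** Append one edit and return only once it is durable. */
        public void append(byte[] edit) throws IOException {
            long seq;
            synchronized (this) {
                out.write(edit);
                seq = ++appended;
            }
            sync(seq);
        }

        private synchronized void sync(long seq) throws IOException {
            if (synced >= seq) {
                return;             // a concurrent caller's flush already covered us
            }
            long target = appended; // this flush covers everything appended so far
            out.hflush();           // one network round trip for the whole batch
            synced = target;
        }
    }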
>> >>
>> >> We're also looking at making our WAL implementation pluggable: see
>> >> HBASE-4529. Maybe a similar approach could be taken in Accumulo such
>> >> that HBase could use Accumulo loggers, or Accumulo could use HBase's
>> >> existing WAL class?
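
(HBASE-4529 is the real ticket; the interface below is purely a
hypothetical sketch of what such a pluggable boundary might look like,
with invented method names:)

    import java.io.IOException;

    // A tablet/region server would code against this, and a factory would
    // pick the HDFS-backed or logger-backed implementation from config.
    public interface WriteAheadLog {
        /** Append an edit; returns its sequence number. */
        long append(byte[] edit) throws IOException;

        /** Block until all edits up to seq are durable. */
        void sync(long seq) throws IOException;

        /** Close the current segment and start a new one. */
        void roll() throws IOException;

        void close() throws IOException;
    }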
>> >>
>> >> -Todd
>> >> --
>> >> Todd Lipcon
>> >> Software Engineer, Cloudera
>> >>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>

--
Todd Lipcon
Software Engineer, Cloudera