Mutations are written to the WALOG when they are inserted into a
TServer's in-memory map. The in-memory map gets flushed to disk
periodically, but there's a risk that the TServer will die after the
data has been ingested but before it is flushed to disk. The WALOG,
when enabled, protects against this data loss by first writing
incoming data out to disk. The WALOG is cheaper to write than an
RFile, because it does not contain sorted data or indexes. It's just a
playback file, so that in case of a failure, Mutations that the client
believed had been ingested aren't lost.
Putting the WALOG in memory would defeat its purpose, but the WALOG
can be disabled (per-table) if you care more about performance than
protection against data loss. Don't disable it for the !METADATA
table.
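For example, disabling the WALOG for a single table can be done from
the Accumulo shell by setting the table.walog.enabled property
(assuming a table named "mytable" here, just for illustration):

```shell
# In the Accumulo shell: turn off write-ahead logging for one table.
# New data in that table is unrecoverable if the TServer dies before a flush.
config -t mytable -s table.walog.enabled=false

# Verify the setting took effect:
config -t mytable -f table.walog.enabled
```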
You can also bypass the WALOG entirely by generating RFiles directly
(perhaps using a M/R job) and bulk importing them into Accumulo.
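As a rough sketch of the bulk-import side (the HDFS paths here are
placeholders; the failure directory must exist and be empty):

```shell
# In the Accumulo shell: bulk import pre-generated RFiles into the
# current table. Files that can't be imported are moved to the
# failure directory. The trailing flag controls whether Accumulo
# assigns timestamps to the imported data.
table mytable
importdirectory /bulk/rfiles /bulk/failures false
```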
Christopher L Tubbs II
On Wed, Sep 25, 2013 at 4:39 PM, Slater, David M.
<[EMAIL PROTECTED]> wrote:
> First, thank you all for the responses on my BatchWriter question, as I was
> able to increase my ingestion rate by a large factor. I am now hitting disk
> i/o limits, which is forcing me to look at reducing file copying. My primary
> thoughts concerning this are reducing the hadoop replication factor as well
> as reducing the number of major compactions.
> However, from what I understand about write ahead logs (in 1.4), even if you
> remove all major compactions, all data will essentially be written to disk
> twice: once to the WALOG in the local directory (HDFS in 1.5), then from the
> WALOG to an RFile on HDFS. Is this understanding correct?
> I’m trying to understand what the primary reasons are for having the WALOG.
> Is there any way to write directly to an RFile from the In-Memory Map (or
> have the WALOG in memory)?