The large-blocks issue is going away (or already has) with append support in HDFS. You will still be hurt if other things are doing I/O on the node, since you still need to spin the disk, but it won't be as terrible as it could be.
The big problem is that HDFS writes replicas in a pipeline rather than in parallel. There is a ticket to change this (HDFS-1783), but there has been no movement on it since last summer.
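To make the pipeline-vs-parallel point concrete, here is a rough back-of-the-envelope sketch in Python. It is a toy model, not a measurement of HDFS: the function names, the single uniform one-way link latency, and the single disk-write cost are all assumptions made up for illustration. It only shows why a small synchronous append pays for every hop in a pipeline but for just one round trip when replicas are written in parallel.

```python
def pipeline_sync_latency(replicas, link_ms, write_ms):
    """Toy model (assumption): the record is forwarded hop by hop through
    `replicas` nodes, written once at the tail, and the ack travels back
    along the same hops before the client may continue."""
    return 2 * replicas * link_ms + write_ms


def parallel_sync_latency(replicas, link_ms, write_ms):
    """Toy model (assumption): the client sends to every replica at once,
    so it waits for one round trip plus one write, however many replicas."""
    return 2 * link_ms + write_ms


if __name__ == "__main__":
    # Hypothetical numbers: 3 replicas, 0.5 ms per hop, 2 ms to hit disk.
    print(pipeline_sync_latency(3, 0.5, 2.0))
    print(parallel_sync_latency(3, 0.5, 2.0))
```

With large buffered writes the pipeline hides most of this because packets overlap in flight; the gap in this model only matters when each small append has to be acknowledged before the next one goes out.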
Just my two cents, but sticking with the current logging style makes the most sense, though maybe we should put it behind a really distinct interface so we can swap in an HDFS implementation when it's ready and people prefer it.
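As a sketch of what I mean by a distinct interface (Python only to keep it short; every name here, WriteAheadLog, LocalFileLog, log_mutation, is hypothetical and not taken from the Accumulo code): callers only see append/sync/close, so an HDFS-backed class could later be dropped in without touching them.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class WriteAheadLog(ABC):
    """Hypothetical interface: the only thing the rest of the server sees."""

    @abstractmethod
    def append(self, record: bytes) -> None: ...

    @abstractmethod
    def sync(self) -> None: ...

    @abstractmethod
    def close(self) -> None: ...


class LocalFileLog(WriteAheadLog):
    """Stand-in for the current logging style: a plain local file."""

    def __init__(self, path: str) -> None:
        self._file = open(path, "ab")

    def append(self, record: bytes) -> None:
        # Length-prefix each record so a reader can find record boundaries.
        self._file.write(len(record).to_bytes(4, "big") + record)

    def sync(self) -> None:
        # Flush Python's buffer, then ask the OS to push the data to disk.
        self._file.flush()
        os.fsync(self._file.fileno())

    def close(self) -> None:
        self._file.close()


def log_mutation(wal: WriteAheadLog, mutation: bytes) -> None:
    """Caller code depends only on the interface, not on the backend."""
    wal.append(mutation)
    wal.sync()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    wal = LocalFileLog(path)
    log_mutation(wal, b"row1")
    wal.close()
```

An HDFS version would be another subclass of the same interface, which would also make it easy to run the two side by side for the performance comparison.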
- Jesse Yates
On Jan 30, 2012, at 8:18 AM, Eric Newton <[EMAIL PROTECTED]> wrote:
> I wonder if the I/O model of HDFS might be different from logging. Logging
> consists of many small appends, whereas HDFS relies on large buffers to
> push around big blocks of data. This is awesome for performance, but not
> so great for synchronous appends.
> I imagine we'll need to implement it in order to measure the impact. I
> don't expect it will be that hard to add.
> I would also like to experiment with specialized log nodes that would not
> have to compete with HDFS seeks.
> And, of course, we should try the multi-queue logger discussed in the
> Google paper.
> As it is now, we can barely tell (by performance) if the walog is on, so
> I'm not sure it's worth spending much more time on the performance.
> On Mon, Jan 30, 2012 at 10:55 AM, Keith Turner <[EMAIL PROTECTED]> wrote:
>> I think it makes sense to move to HDFS if it is reliable (can survive
>> continuous ingest and the agitator) and performs well. Also, I am very
>> curious about what the performance differences are. It would be nice
>> to run some tests.
>> On Mon, Jan 30, 2012 at 10:48 AM, Aaron Cordova
>> <[EMAIL PROTECTED]> wrote:
>>> At the HBase vs Accumulo deathmatch the other night Todd elucidated that
>>> HBase's write-ahead log is in HDFS and benefits somewhat thereby. He
>>> neglected to mention that for years, until HDFS append() was available, HBase
>>> just LOST data while Accumulo didn't... but he was talking about the
>>> current state of affairs so, whatever.
>>> The question now is, does it make any sense to look at HDFS as a place
>>> to store Accumulo's write-ahead log? I remember that BigTable used two
>>> write streams (each of which is transparently replicated by GFS) and
>>> switched between them to avoid performance hiccups, so it does sound like a
>>> critical part of the overall performance. Such a big change would probably
>>> belong in 1.6 or later... But there may be reasons to never use HDFS and
>>> to always use a separately maintained subsystem.
>>> Anyone care to lay out the arguments for staying with a separate
>>> subsystem? I think we know the arguments for using HDFS.