Re: Accumulo design questions
Billie Rinaldi 2012-11-06, 20:30
On Tue, Nov 6, 2012 at 11:01 AM, Sukant Hajra <[EMAIL PROTECTED]> wrote:
> I've been trying to understand Accumulo more deeply as we use it more. To
> supplement the on-line documentation and source, I've been referencing some
> blog articles on HBase (Lars George has some), HBase docs, and the
> BigTable paper.
> But I'm curious about some of the deviations of Accumulo from BigTable and
> HBase. The questions I have right now are:
> 1. Is the format of an RFile close to HFile version 1, HFile version 2, or
> at this point is the format really its own thing? I found good
> documentation on the HFile, but I haven't yet found similar documentation
> on RFiles. There's the source code, but I haven't dug into that yet.
I think there is a different HFile for each column family, isn't there? An
RFile stores all columns, all locality groups in a single file, which is
another reason you don't get the same performance penalty for having lots
of column families in Accumulo.
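
As a rough illustration of that layout, here is a toy Python model. It is only a sketch and not Accumulo's actual RFile code: the function names, the tuple layout, and the "default" label are all made up for the example. It shows one file holding a section per locality group, with families that are not assigned to any group landing in a default section (the default locality group comes up again further down), and a scan reading only the sections that can hold the requested families.

```python
from collections import defaultdict

DEFAULT_GROUP = "default"  # hypothetical label for the default locality group


def family_to_group(locality_groups):
    """Invert {group: {families}} into {family: group}."""
    return {
        family: group
        for group, families in locality_groups.items()
        for family in families
    }


def build_file(entries, locality_groups):
    """Partition (row, family, qualifier, value) entries into sections of ONE file.

    Families with no configured group fall into DEFAULT_GROUP, so a new
    family can appear at write time without any schema change.
    """
    lookup = family_to_group(locality_groups)
    sections = defaultdict(list)
    for entry in entries:
        family = entry[1]
        sections[lookup.get(family, DEFAULT_GROUP)].append(entry)
    # Each section keeps its entries in sorted key order.
    return {group: sorted(items) for group, items in sections.items()}


def scan(file_sections, locality_groups, families):
    """Read only the sections that can contain the requested families."""
    lookup = family_to_group(locality_groups)
    needed = {lookup.get(f, DEFAULT_GROUP) for f in families}
    return sorted(
        entry
        for group in needed
        for entry in file_sections.get(group, [])
        if entry[1] in families
    )
```

In this model, adding more column families only changes which section an entry is written to; the number of files stays at one.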
> 2. I understand that HBase doesn't do well with too many column families.
> However, creating too many column families in HBase isn't likely anyway
> because you can't (I believe) create them dynamically. Accumulo allows you
> to create column families dynamically. But I wonder if this can come at a
> cost. Is there a benefit to using column families less frequently if
> possible in Accumulo? Or is the cost of using column families more or
> less the same as using column qualifiers?
> 3. I guess one way families might be different from qualifiers relates to
> HBase's recommendation to keep column family names short to avoid
> storage waste. That should apply to Accumulo as well, right?
> 4. In supporting dynamic column families, was there a design trade-off with
> respect to the original BigTable or current HBase design? What might be a
> benefit of doing it the other way?
The main thing Accumulo had to do differently from BigTable to allow
dynamic creation of column families was to create a default locality
group. That's the locality group that stores column families that aren't
specified for any other locality group. I recall Keith saying it was kind
of a pain to implement, but I don't see any obvious negative tradeoffs of