On Tue, Nov 6, 2012 at 3:30 PM, Billie Rinaldi <[EMAIL PROTECTED]> wrote:
> On Tue, Nov 6, 2012 at 11:01 AM, Sukant Hajra <[EMAIL PROTECTED]> wrote:
>> I've been trying to understand Accumulo more deeply as we use it more. To
>> supplement the on-line documentation and source, I've been referencing
>> blog articles on HBase (Lars George has some), HBase docs, and the
>> BigTable paper.
>> But I'm curious about some of the deviations of Accumulo from BigTable and
>> HBase.
>> The questions I have right now are:
>> 1. Is the format of an RFile close to HFile version 1, HFile version 2, or
>> at this point is the format really its own thing? I found good
>> documentation on the HFile, but I haven't yet found similar documentation
>> on RFiles. There's the source code, but I haven't dug into that yet.
> I think there is a different HFile for each column family, isn't there? An
> RFile stores all columns, all locality groups in a single file, which is
> another reason you don't get the same performance penalty for having lots of
> column families in Accumulo.
>> 2. I understand that HBase doesn't do well with too many column families.
>> However, creating too many column families in HBase isn't likely
>> because you can't (I believe) create them dynamically. Accumulo allows you
>> to create column families dynamically. But I wonder if this can come at a
>> cost. Is there a benefit to using column families less frequently if
>> possible in Accumulo? Or is the cost of using column families more or less
>> the same as using column qualifiers?
>> 3. I guess one way families might be different from qualifiers relates to
>> HBase's recommendation to keep column family names short to avoid
>> storage waste. That should apply to Accumulo as well, right?
>> 4. In supporting dynamic column families, was there a design trade-off with
>> respect to the original BigTable or current HBase design? What might be a
>> benefit of doing it the other way?
> The main thing Accumulo had to do differently from BigTable to allow dynamic
> creation of column families was to create a default locality group. That's
> the locality group that stores column families that aren't specified for any
> other locality group. I recall Keith saying it was kind of a pain to
> implement, but I don't see any obvious negative tradeoffs of the design.
It seems like with BigTable you can drop a locality group and all of
the data related to the locality group goes away. Even if BigTable
does not support this, that would be true of the BigTable model.
With default locality groups, if you drop a locality group, then that
data will end up in the default locality group. This is not a
negative, but a difference.
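
To make the default locality group behavior concrete, here is a toy sketch in
Python. It is not Accumulo code or its API; the class, method names, and the
"<default>" label are all made up for illustration. It only models the lookup
rule described above: a column family that is not assigned to any named
locality group falls into the default group, so dropping a group reassigns its
families instead of discarding them.

```python
# Toy model only: names here are hypothetical and do not mirror Accumulo's API.
DEFAULT_GROUP = "<default>"  # made-up label for the default locality group


class LocalityGroupConfig:
    """Maps column families to locality groups, with a default fallback."""

    def __init__(self):
        self._groups = {}  # group name -> set of column families

    def set_group(self, name, families):
        """Create or replace a named locality group."""
        self._groups[name] = set(families)

    def drop_group(self, name):
        """Remove a named group; its families fall back to the default."""
        self._groups.pop(name, None)

    def group_for(self, family):
        """Return the group holding this family, or the default group."""
        for name, families in self._groups.items():
            if family in families:
                return name
        return DEFAULT_GROUP


if __name__ == "__main__":
    config = LocalityGroupConfig()
    config.set_group("metadata", {"meta"})
    print(config.group_for("meta"))        # assigned to a named group
    print(config.group_for("new_family"))  # never configured: default group
    config.drop_group("metadata")
    print(config.group_for("meta"))        # after the drop: default group
```

In this model a family created on the fly needs no configuration at all, which
is the property that makes dynamic column families workable.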
One other point: Accumulo supports online changes to locality group
configuration. If you change the locality group config for a table,
then all files created after that point will use the new config. This
is easy to do because each RFile encapsulates a set of locality groups
as Billie mentioned. So the locality group config goes with the file,
nothing external is needed by RFile to make smart decisions when