1. I'm not too familiar with the different HFile versions, so I really can't
draw comparisons. RFiles were developed independently, but I think at a
high level they share the same idea.
2. Column families do not carry the performance penalty they have in HBase
because we do not have a 1:1 ratio of column families to locality groups.
The only real benefit I foresee from minimizing column families is slightly
better compression, due to the relative encoding in RFiles. For all intents
and purposes, you can expect them to perform about the same as column
qualifiers until you start creating locality groups from them.
3. Disk is cheap, and the second you start trying to shave a byte here and a
few bits there you're going against the natural design of the architecture.
You want the data to be usable first, which is something you can lose once
you start using elaborate space-saving schemes. Again, the relative
encoding plus the standard compression done in the RFiles should be more
than enough to keep your data small. Also, I would not be surprised if
HBase's advice against long column family names stems from the locality
group issue, which, as mentioned above, is not an issue for Accumulo.
4. I believe the BigTable paper specified 'a mapping from column families
to a locality group', which is more in line with our configuration.
However, BigTable also specified that all column families are defined in
advance, which is more in tune with HBase. We feel the dynamic nature of
our system provides enough flexibility to be convenient to use while still
providing mechanisms to harness the power of locality groups. In standard
Accumulo use cases, switching to the other way would probably be a
hindrance: because we don't try to minimize column families, there would
be more blocks needing to be merged together at scan time, creating a
significant performance hit.
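
To make the relative encoding point from 2 and 3 a bit more concrete, here
is a rough sketch in Python. It is not the actual RFile format, only an
illustration of the general idea: keys are stored sorted, so a field that
matches the previous key can be replaced with a small "same as previous"
marker. The function names and sample keys are made up for this example.

```python
def encode_relative(keys):
    """Encode (row, family, qualifier) tuples relative to the previous key.

    A field equal to the same field in the previous key is stored as None,
    standing in for a small "same as previous" flag. Otherwise the full
    string is stored. This is a simplified illustration, not RFile itself.
    """
    encoded = []
    prev = (None, None, None)
    for key in keys:
        encoded.append(tuple(
            None if field == prev_field else field
            for field, prev_field in zip(key, prev)
        ))
        prev = key
    return encoded


def decode_relative(encoded):
    """Invert encode_relative by filling each None from the previous key."""
    keys = []
    prev = (None, None, None)
    for entry in encoded:
        key = tuple(
            prev_field if field is None else field
            for field, prev_field in zip(entry, prev)
        )
        keys.append(key)
        prev = key
    return keys


def stored_chars(entries):
    """Count characters actually written, ignoring the per-field flags."""
    return sum(len(f) for entry in entries for f in entry if f is not None)


# Hypothetical sorted keys that share a long column family name.
keys = [
    ("row1", "a_rather_long_column_family", "q1"),
    ("row1", "a_rather_long_column_family", "q2"),
    ("row1", "a_rather_long_column_family", "q3"),
    ("row2", "a_rather_long_column_family", "q1"),
]
encoded = encode_relative(keys)
raw_size = stored_chars(keys)
encoded_size = stored_chars(encoded)
print(raw_size, encoded_size)
```

In this toy version the long family name is written once and then skipped
for every following key that repeats it, which is why its length matters
much less than you might expect.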
Hope this answers some questions.
On Tue, Nov 6, 2012 at 2:01 PM, Sukant Hajra <[EMAIL PROTECTED]> wrote:
> I've been trying to understand Accumulo more deeply as we use it more. To
> supplement the on-line documentation and source, I've been referencing some
> blog articles on HBase (Lars George has some ones), HBase docs, and the
> BigTable paper.
> But I'm curious about some of the deviations of Accumulo from BigTable and
> HBase. The questions I have right now are:
> 1. Is the format of an RFile close to HFile version 1, HFile version 2, or
> at this point is the format really its own thing? I found good
> documentation on the HFile, but I haven't yet found similar documentation
> on RFiles. There's the source code, but I haven't dug into that yet.
> 2. I understand that HBase doesn't do well with too many column families.
> However, creating too many column families in HBase isn't likely anyway
> because you can't (I believe) create them dynamically. Accumulo allows you
> to create column families dynamically. But I wonder if this can come at a
> cost. Is there a benefit to using column families less frequently if
> possible in Accumulo? Or is the cost of using column families more or
> the same as using column qualifiers?
> 3. I guess one way families might be different from qualifiers relates to
> HBase's recommendation to keep column family names short to avoid
> storage waste. That should apply to Accumulo as well, right?
> 4. In supporting dynamic column families, was there a design trade-off with
> respect to the original BigTable or current HBase design? What might be a
> benefit of doing it the other way?