It'd be nice to see some numbers, but I also think it's important to
account for use cases. Doing secondary indexing on records/files,
metadata extraction, and document storage will increase the raw storage
required by some factor. On top of that, everything is compressed in
various ways (i.e., at the RFile level and at the HDFS block level).

Could we try to define some rudimentary structure that we'd put the
data in? For example, just create a term index on it, since I know
HBase and Cassandra should be able to handle that as well.
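
To make the "term index" idea concrete, here is a rough sketch of what I
have in mind, written in plain Python rather than against any of the
stores. The function names, the whitespace tokenizer, the
"term,row_id" entry layout, and the two sample rows are all assumptions
for illustration; none of this is Accumulo, HBase, or Cassandra API. It
only estimates how much the index inflates the uncompressed data, so it
says nothing about what RFile or HDFS compression does afterwards.

```python
import csv
import io
from collections import defaultdict


def build_term_index(rows):
    """Map each lowercased term to the sorted ids of the rows containing it."""
    index = defaultdict(set)
    for row_id, fields in enumerate(rows):
        for field in fields:
            # Assumption: naive whitespace tokenization, no stemming or stop words.
            for term in field.lower().split():
                index[term].add(row_id)
    return {term: sorted(ids) for term, ids in index.items()}


def raw_size(rows):
    """Bytes needed to store the rows as newline-terminated CSV-like lines."""
    return sum(len(",".join(fields).encode("utf-8")) + 1 for fields in rows)


def index_size(index):
    """Bytes needed for one hypothetical 'term,row_id' line per index entry."""
    return sum(
        len(f"{term},{row_id}\n".encode("utf-8"))
        for term, ids in index.items()
        for row_id in ids
    )


# Hypothetical sample data standing in for a real dataset such as WordNet.
sample = "dog,a domesticated canine\ncat,a small domesticated feline\n"
rows = list(csv.reader(io.StringIO(sample)))
index = build_term_index(rows)

# Storage factor: (raw data + index) relative to raw data alone.
factor = (raw_size(rows) + index_size(index)) / raw_size(rows)
```

Agreeing on one layout like this for every system would at least mean
the comparison measures the same logical data everywhere.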
On Thu, Jul 12, 2012 at 6:42 AM, David Medinets
<[EMAIL PROTECTED]> wrote:
> Are there any published numbers for the amount of disk space used by
> Accumulo versus other products? I'm thinking some dataset like dbpedia
> or something from http://books.google.com/ngrams/datasets. If there is
> not such a comparison, what comparisons would you like to see? What
> about WordNet stored in CSV, MySQL, Cassandra, HBase, and Accumulo?
> WordNet is just a large set of CSV files so it would be a good
> candidate for this concept, I think.