Accumulo, mail # user - interesting

Re: interesting
Eric Newton 2013-05-03, 23:20
ngram == row
year == column family
count == column qualifier (zero-padded so the text sorts in numeric order)
book count == value

I even used ASCII text for the counts.
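
To make the mapping above concrete, here is a minimal sketch in Python. It is not the ingest code that was actually run. The function name and the padding width of 12 are assumptions made for illustration, and it assumes "count" means match_count and "book count" means volume_count from the original file format.

```python
def ngram_line_to_entry(line, width=12):
    """Map one 'ngram TAB year TAB match_count TAB volume_count' line
    to a (row, column family, column qualifier, value) tuple.

    `width` is a hypothetical padding width, not taken from the thread.
    """
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    row = ngram                            # ngram == row
    family = year                          # year == column family
    qualifier = match_count.zfill(width)   # zero-pad so ASCII order matches numeric order
    value = volume_count                   # book count == value, kept as ASCII text
    return row, family, qualifier, value


if __name__ == "__main__":
    # Made-up sample line, for illustration only.
    print(ngram_line_to_entry("example\t2008\t1234\t56\n"))
```

The padding matters because keys are compared as bytes: without it, "10" would sort before "9".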

I'm not sure if the Google entries are sorted, so the sort would help.

The RFile format does not repeat identical data from key to key, so in most
cases, the row is not repeated.  That gives gzip other things to work on.
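
As a rough illustration of that idea, here is a toy sketch in Python. It is not the real RFile on-disk encoding. The function names and the use of None as a "same as previous key" marker are assumptions made for this example. It only shows how sorted keys let each field be dropped when it matches the previous key.

```python
def relative_encode(keys):
    """Replace each field that equals the same field in the previous key with None."""
    prev = None
    encoded = []
    for key in keys:
        if prev is None:
            encoded.append(tuple(key))  # first key is stored in full
        else:
            encoded.append(tuple(None if f == p else f for f, p in zip(key, prev)))
        prev = key
    return encoded


def relative_decode(encoded):
    """Rebuild the full keys by copying omitted fields from the previous key."""
    prev = None
    keys = []
    for rec in encoded:
        key = rec if prev is None else tuple(p if f is None else f for f, p in zip(rec, prev))
        keys.append(key)
        prev = key
    return keys


if __name__ == "__main__":
    # Made-up (row, family, qualifier) keys in sorted order.
    sample = [
        ("example", "2007", "000000001000"),
        ("example", "2008", "000000001234"),
        ("examples", "2008", "000000001234"),
    ]
    for rec in relative_encode(sample):
        print(rec)
```

With one row per ngram and one entry per year, most consecutive keys share the row, so only the year and count fields are left for the general-purpose compressor to handle.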

I'll have to do more analysis to figure out why RFile did so well.  Perhaps
google used less aggressive settings for their compression.

I'm more interested in 2-grams to test our partial-row compression in 1.5.

On Fri, May 3, 2013 at 4:09 PM, Jared Winick <[EMAIL PROTECTED]> wrote:

> That is very interesting and sounds like a fun Friday project! Could you
> please elaborate on how you mapped the original format of
> ngram TAB year TAB match_count TAB volume_count NEWLINE
> into Accumulo key/values? Could you briefly explain what feature in
> Accumulo is responsible for this improvement in storage efficiency? This
> could be a helpful illustration for users to know how key/value design can
> take advantage of these Accumulo features. Thanks a lot!
> Jared
> On Fri, May 3, 2013 at 1:24 PM, Eric Newton <[EMAIL PROTECTED]> wrote:
>> I think David Medinets suggested some publicly available data sources
>> that could be used to compare the storage requirements of different
>> key/value stores.
>> Today I tried it out.
>> I took the Google 1-gram word lists and ingested them into Accumulo.
>> http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>> It took about 15 minutes to ingest on a 10 node cluster (4 drives each).
>> $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
>> running...
>> 5.2 G  /data/googlebooks/ngrams/1-grams
>> $ hadoop fs -du -s -h /accumulo/tables/4
>> running...
>> 4.1 G  /accumulo/tables/4
>> The storage format in Accumulo is about 20% more efficient than gzip'd
>> csv files.
>> I'll post the 2-gram results sometime next month when it's done
>> downloading. :-)
>> -Eric, which occurred 221K times in 34K books in 2008.