Accumulo >> mail # user >> interesting


I think David Medinets suggested some publicly available data sources that
could be used to compare the storage requirements of different key/value
stores.

Today I tried it out.

I took the Google 1-gram word lists and ingested them into Accumulo.

http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

It took about 15 minutes to ingest on a 10-node cluster (4 drives each).

$ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
running...
5.2 G  /data/googlebooks/ngrams/1-grams

$ hadoop fs -du -s -h /accumulo/tables/4
running...
4.1 G  /accumulo/tables/4

The storage format in Accumulo is about 20% more efficient than the gzip'd CSV
files.
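
For anyone who wants to check the arithmetic, here is a rough sketch of how
the ~20% figure falls out of the two du numbers above. The function name is
just something made up for illustration, and the sizes are the rounded
values du printed, so the result is only approximate.

```python
def relative_savings(baseline_gb: float, candidate_gb: float) -> float:
    """Return the fraction of space saved by candidate relative to baseline."""
    if baseline_gb <= 0:
        raise ValueError("baseline size must be positive")
    return (baseline_gb - candidate_gb) / baseline_gb


# Rounded sizes as reported by `hadoop fs -du -s -h` above.
gzip_csv_gb = 5.2   # /data/googlebooks/ngrams/1-grams
accumulo_gb = 4.1   # /accumulo/tables/4

savings = relative_savings(gzip_csv_gb, accumulo_gb)
print(f"Accumulo table is {savings:.1%} smaller than the gzip'd CSV input")
```

With the numbers above this comes out to roughly 21%, which is where "about
20%" comes from.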

I'll post the 2-gram results sometime next month when it's done downloading.
:-)

-Eric, which occurred 221K times in 34K books in 2008.