Home | About | Sematext search-lucene.com search-hadoop.com
Accumulo >> mail # user >> interesting


I think David Medinets suggested some publicly available data sources that
could be used to compare the storage requirements of different key/value
stores.

Today I tried it out.

I took the Google 1-gram word lists and ingested them into Accumulo.

http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

It took about 15 minutes to ingest on a 10 node cluster (4 drives each).

$ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
running...
5.2 G  /data/googlebooks/ngrams/1-grams

$ hadoop fs -du -s -h /accumulo/tables/4
running...
4.1 G  /accumulo/tables/4

The storage format in Accumulo is about 20% more compact than the gzip'd csv
files (4.1 G vs. 5.2 G).
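
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The function name `relative_savings` is made up for illustration; the two sizes are the rounded figures that `hadoop fs -du -s -h` printed above, so the result is only approximate.

```python
def relative_savings(baseline_gb: float, candidate_gb: float) -> float:
    """Return the fraction by which candidate is smaller than baseline."""
    if baseline_gb <= 0:
        raise ValueError("baseline size must be positive")
    return 1.0 - candidate_gb / baseline_gb


# Rounded sizes as reported by `hadoop fs -du -s -h` in this message.
gzip_csv_gb = 5.2   # /data/googlebooks/ngrams/1-grams
accumulo_gb = 4.1   # /accumulo/tables/4

savings = relative_savings(gzip_csv_gb, accumulo_gb)
print(f"Accumulo table is {savings:.0%} smaller than the gzip'd input")
```

With these inputs the saving works out to roughly 21%, which is where the "about 20%" comes from. Since both sizes are rounded to one decimal place, the real figure could be a point or so either way.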

I'll post the 2-gram results sometime next month when it's done downloading.
:-)

-Eric, which occurred 221K times in 34K books in 2008.
Jared Winick 2013-05-03, 20:09
Eric Newton 2013-05-03, 23:20
Eric Newton 2013-05-15, 14:58
Josh Elser 2013-05-15, 16:00
Eric Newton 2013-05-15, 16:11
Josh Elser 2013-05-15, 16:11
Eric Newton 2013-05-15, 18:52
Christopher 2013-05-15, 20:27
Josh Elser 2013-05-15, 21:20
Jim Klucar 2013-05-20, 02:13
Eric Newton 2013-05-20, 14:46