Accumulo, mail # user - interesting


interesting
Eric Newton 2013-05-03, 19:24
I think David Medinets suggested some publicly available data sources that
could be used to compare the storage requirements of different key/value
stores.

Today I tried it out.

I took the Google 1-gram word lists and ingested them into Accumulo.

http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

It took about 15 minutes to ingest on a 10 node cluster (4 drives each).

$ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
running...
5.2 G  /data/googlebooks/ngrams/1-grams

$ hadoop fs -du -s -h /accumulo/tables/4
running...
4.1 G  /accumulo/tables/4

The storage format in Accumulo takes about 20% less space than the gzip'd
csv files (4.1 G vs. 5.2 G).
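
For anyone who wants to check the arithmetic, here is a minimal sketch of how the "about 20%" figure falls out of the two du numbers above. The function name space_saving is just made up for illustration, and the sizes are the rounded values that hadoop printed, so the result is only approximate.

```python
def space_saving(baseline_gb, candidate_gb):
    """Return the fraction of space saved by candidate relative to baseline."""
    return (baseline_gb - candidate_gb) / baseline_gb


# Rounded sizes as reported by `hadoop fs -du -s -h` in this message.
gzip_csv_gb = 5.2   # /data/googlebooks/ngrams/1-grams
accumulo_gb = 4.1   # /accumulo/tables/4

saving = space_saving(gzip_csv_gb, accumulo_gb)
print(f"{saving:.1%}")  # prints "21.2%"
```

The saving is measured against the gzip'd input, so it says how much smaller the Accumulo table is, not how much larger the csv files are.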

I'll post the 2-gram results sometime next month when it's done downloading.
:-)

-Eric, which occurred 221K times in 34K books in 2008.