Accumulo >> mail # user >> interesting


Eric Newton 2013-05-03, 19:24
Jared Winick 2013-05-03, 20:09
Eric Newton 2013-05-03, 23:20
Eric Newton 2013-05-15, 14:58
Re: interesting
RFile... with gzip? Or did you use another compressor?

On 5/15/13 10:58 AM, Eric Newton wrote:
> I ingested the 2-gram data on a 10 node cluster.  It took just under 7
> hours.  For most of the job, Accumulo ingested at about 200K k-v/server.
>
> $ hadoop fs -dus /accumulo/tables/2 /data/n-grams/2-grams
> /accumulo/tables/2        74632273653
> /data/n-grams/2-grams     154271541304
>
> That's a very nice result.  RFile compressed the same data to half the
> size of the gzip'd CSV files.
>
> There are 37,582,158,107 entries in the 2-gram set, which means that
> Accumulo is using only about 2 bytes for each entry.
>
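The per-entry figure can be checked from the numbers quoted above. The sketch below is only that arithmetic. It assumes the `-dus` output reads as 74632273653 bytes for `/accumulo/tables/2` and 154271541304 bytes for `/data/n-grams/2-grams` (the path and the size run together in the archived text), and the constant and function names are made up for illustration.

```python
# Byte counts as read from the `hadoop fs -dus` output quoted in this thread.
# Assumption: the table path is /accumulo/tables/2 and the digits after it are the size.
RFILE_BYTES = 74_632_273_653      # /accumulo/tables/2
GZIP_CSV_BYTES = 154_271_541_304  # /data/n-grams/2-grams
ENTRIES = 37_582_158_107          # entries in the 2-gram set


def bytes_per_entry(total_bytes: int, entries: int) -> float:
    """Average on-disk bytes consumed per key/value entry."""
    return total_bytes / entries


rfile_bpe = bytes_per_entry(RFILE_BYTES, ENTRIES)
gzip_bpe = bytes_per_entry(GZIP_CSV_BYTES, ENTRIES)
size_ratio = RFILE_BYTES / GZIP_CSV_BYTES

print(f"RFile:      {rfile_bpe:.2f} bytes/entry")
print(f"gzip'd CSV: {gzip_bpe:.2f} bytes/entry")
print(f"RFile size / gzip'd CSV size: {size_ratio:.2f}")
```

Under that reading, RFile comes out just under 2 bytes per entry and the gzip'd CSV a little over 4, which matches the "half" claim.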
> -Eric Newton, which appeared 62 times in 37 books in 2008.
>
>
> On Fri, May 3, 2013 at 7:20 PM, Eric Newton <[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>> wrote:
>
>     ngram == row
>     year == column family
>     count == column qualifier (prepended with zeros for sort)
>     book count == value
>
>     I used ASCII text for the counts, even.
>
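For readers who want to try the mapping described above, here is a minimal sketch of turning one line of the tab-separated n-gram format (quoted further down in this thread) into a key/value. It uses plain Python tuples rather than the Accumulo client API. The pad width `COUNT_WIDTH` and the names `Entry` and `to_entry` are hypothetical; the thread does not say how many zeros were prepended.

```python
from typing import NamedTuple

COUNT_WIDTH = 12  # hypothetical pad width; the thread does not state one


class Entry(NamedTuple):
    row: str               # the n-gram
    column_family: str     # the year
    column_qualifier: str  # match count, zero-padded so text order equals numeric order
    value: str             # book (volume) count, kept as ASCII text


def to_entry(line: str) -> Entry:
    """Map one 'ngram TAB year TAB match_count TAB volume_count' line to a key/value."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return Entry(ngram, year, match_count.zfill(COUNT_WIDTH), volume_count)


# Sample line built from the signature quoted in this thread:
# "Eric Newton, which appeared 62 times in 37 books in 2008."
entry = to_entry("Eric Newton\t2008\t62\t37\n")
print(entry)
```

The zero padding matters because keys are compared as bytes: without it, a count of "9" would sort after "10".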
>     I'm not sure if the Google entries are sorted; if they are not, the
>     sort would help compression.
>
>     The RFile format does not repeat identical data from key to key, so
>     in most cases, the row is not repeated.  That gives gzip other
>     things to work on.
>
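The "does not repeat identical data from key to key" point can be illustrated with a toy encoder. This is not the RFile on-disk format, only a sketch of the idea: in a sorted run of keys, any field equal to the same field in the previous key is dropped before a general-purpose compressor sees the data. All names are hypothetical, and the 2006 and 2007 counts are invented sample values.

```python
from typing import Iterable, List, Optional, Tuple

Key = Tuple[str, str, str]  # (row, column family, column qualifier)
EncodedKey = Tuple[Optional[str], ...]


def relative_encode(keys: Iterable[Key]) -> List[EncodedKey]:
    """Replace each field that equals the same field in the previous key with None."""
    encoded: List[EncodedKey] = []
    previous: Optional[Key] = None
    for key in keys:
        if previous is None:
            encoded.append(key)  # first key is always stored in full
        else:
            encoded.append(tuple(
                None if field == prev_field else field
                for field, prev_field in zip(key, previous)
            ))
        previous = key
    return encoded


def stored_chars(keys: Iterable[EncodedKey]) -> int:
    """Count the characters that still have to be written out."""
    return sum(len(field) for key in keys for field in key if field is not None)


# Sorted sample keys; only the 2008 count comes from the thread, the rest are invented.
keys: List[Key] = [
    ("Eric Newton", "2006", "000000000050"),
    ("Eric Newton", "2007", "000000000050"),
    ("Eric Newton", "2008", "000000000062"),
]
encoded = relative_encode(keys)
print(encoded)
print(stored_chars(keys), "->", stored_chars(encoded))
```

Because every year of an n-gram shares the same row, the row text is written once per n-gram in this sketch instead of once per entry.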
>     I'll have to do more analysis to figure out why RFile did so well.
>     Perhaps Google used less aggressive settings for their compression.
>
>     I'm more interested in 2-grams to test our partial-row compression
>     in 1.5.
>
>     -Eric
>
>
>     On Fri, May 3, 2013 at 4:09 PM, Jared Winick <[EMAIL PROTECTED]
>     <mailto:[EMAIL PROTECTED]>> wrote:
>
>         That is very interesting and sounds like a fun Friday project!
>         Could you please elaborate on how you mapped the original format of
>
>         ngram TAB year TAB match_count TAB volume_count NEWLINE
>
>         into Accumulo key/values? Could you briefly explain what feature
>         in Accumulo is responsible for this improvement in storage
>         efficiency? This could be a helpful illustration for users to
>         know how key/value design can take advantage of these Accumulo
>         features. Thanks a lot!
>
>         Jared
>
>
>         On Fri, May 3, 2013 at 1:24 PM, Eric Newton
>         <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>> wrote:
>
>             I think David Medinets suggested some publicly available
>             data sources that could be used to compare the storage
>             requirements of different key/value stores.
>
>             Today I tried it out.
>
>             I took the Google 1-gram word lists and ingested them into
>             Accumulo.
>
>             http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>
>             It took about 15 minutes to ingest on a 10 node cluster (4
>             drives each).
>
>             $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
>             running...
>             5.2 G  /data/googlebooks/ngrams/1-grams
>
>             $ hadoop fs -du -s -h /accumulo/tables/4
>             running...
>             4.1 G  /accumulo/tables/4
>
>             The storage format in Accumulo is about 20% more efficient
>             than gzip'd CSV files.
>
>             I'll post the 2-gram results sometime next month when it's
>             done downloading. :-)
>
>             -Eric, which occurred 221K times in 34K books in 2008.
>
>
>
>
Eric Newton 2013-05-15, 16:11
Josh Elser 2013-05-15, 16:11
Eric Newton 2013-05-15, 18:52
Christopher 2013-05-15, 20:27
Josh Elser 2013-05-15, 21:20
Jim Klucar 2013-05-20, 02:13
Eric Newton 2013-05-20, 14:46