Re: interesting
RFile... with gzip? Or did you use another compressor?

On 5/15/13 10:58 AM, Eric Newton wrote:
> I ingested the 2-gram data on a 10 node cluster.  It took just under 7
> hours.  For most of the job, accumulo ingested at about 200K key-value
> pairs per second per server.
>
> $ hadoop fs -dus /accumulo/tables/2 /data/n-grams/2-grams
> /accumulo/tables/2        74632273653
> /data/n-grams/2-grams    154271541304
>
> That's a very nice result.  RFile compressed the same data to half the
> size of the gzip'd CSV files.
>
> There are 37,582,158,107 entries in the 2-gram set, which means that
> accumulo is using only 2 bytes for each entry.
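Working that figure out from the sizes above:

    74,632,273,653 bytes / 37,582,158,107 entries ≈ 1.99 bytes per entry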
>
> -Eric Newton, which appeared 62 times in 37 books in 2008.
>
>
> On Fri, May 3, 2013 at 7:20 PM, Eric Newton <[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>> wrote:
>
>     ngram == row
>     year == column family
>     count == column qualifier (prepended with zeros for sort)
>     book count == value
>
>     I used ascii text for the counts, even.
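A minimal sketch of that mapping against the Accumulo Java client API; the
zero-pad width and the use of a plain BatchWriter are assumptions here, since
the thread doesn't say how the ingest job was actually driven:

    import java.nio.charset.StandardCharsets;

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.MutationsRejectedException;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    public class NgramLoader {

      // One raw line: ngram TAB year TAB match_count TAB volume_count
      // Layout from the thread: row = ngram, column family = year,
      // column qualifier = zero-padded match_count, value = volume_count.
      static Mutation toMutation(String line) {
        String[] f = line.split("\t");
        String ngram = f[0], year = f[1], volumeCount = f[3];
        // Zero-pad so lexicographic order matches numeric order
        // (the pad width of 12 is a guess).
        String paddedCount = String.format("%012d", Long.parseLong(f[2]));
        Mutation m = new Mutation(ngram);
        m.put(year, paddedCount,
            new Value(volumeCount.getBytes(StandardCharsets.UTF_8)));
        return m;
      }

      // Feed parsed lines through an already-configured BatchWriter.
      static void load(BatchWriter writer, Iterable<String> lines)
          throws MutationsRejectedException {
        for (String line : lines) {
          writer.addMutation(toMutation(line));
        }
      }
    }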
>
>     I'm not sure whether the google entries are already sorted; if they
>     aren't, accumulo's sort order would itself help compression.
>
>     The RFile format does not repeat identical data from key to key, so
>     in most cases, the row is not repeated.  That gives gzip other
>     things to work on.
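To make that concrete with made-up entries for a hypothetical 2-gram: all the
years for one n-gram sort next to each other, so RFile writes the row once and
flags the following keys as sharing it, leaving only the parts that change
(and the values) for gzip to squeeze:

    row           family  qualifier     value
    ice cream     2006    000000001234  251    <- full key written
    (same row)    2007    000000001391  260    <- row not rewritten, only changed fields
    (same row)    2008    000000001522  274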
>
>     I'll have to do more analysis to figure out why RFile did so well.
>     Perhaps google used less aggressive settings for their compression.
>
>     I'm more interested in 2-grams to test our partial-row compression
>     in 1.5.
>
>     -Eric
>
>
>     On Fri, May 3, 2013 at 4:09 PM, Jared Winick <[EMAIL PROTECTED]
>     <mailto:[EMAIL PROTECTED]>> wrote:
>
>         That is very interesting and sounds like a fun Friday project!
>         Could you please elaborate on how you mapped the original format of
>
>         ngram TAB year TAB match_count TAB volume_count NEWLINE
>
>         into Accumulo key/values? Could you briefly explain what feature
>         in Accumulo is responsible for this improvement in storage
>         efficiency? This could be a helpful illustration for users to
>         know how key/value design can take advantage of these Accumulo
>         features. Thanks a lot!
>
>         Jared
>
>
>         On Fri, May 3, 2013 at 1:24 PM, Eric Newton
>         <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>> wrote:
>
>             I think David Medinets suggested some publicly available
>             data sources that could be used to compare the storage
>             requirements of different key/value stores.
>
>             Today I tried it out.
>
>             I took the google 1-gram word lists and ingested them into
>             accumulo.
>
>             http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>
>             It took about 15 minutes to ingest on a 10 node cluster (4
>             drives each).
>
>             $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
>             running...
>             5.2 G  /data/googlebooks/ngrams/1-grams
>
>             $ hadoop fs -du -s -h /accumulo/tables/4
>             running...
>             4.1 G  /accumulo/tables/4
>
>             The storage format in accumulo is about 20% more efficient
>             than gzip'd csv files.
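For reference, from the two sizes above: (5.2 - 4.1) / 5.2 ≈ 21%, consistent
with the "about 20%" figure.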
>
>             I'll post the 2-gram results sometime next month when it's
>             done downloading. :-)
>
>             -Eric, which occurred 221K times in 34K books in 2008.
>
>
>
>