Re: interesting
Definitely, with a note on the ingest job duration, too.

On 05/15/2013 04:27 PM, Christopher wrote:
> I'd be very curious how something faster, like Snappy, compared.
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
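[To actually run the comparison Christopher asks about, the codec is a per-table
setting. A minimal sketch using the Accumulo Java client API, assuming a release
and Hadoop native libraries that provide Snappy; the instance name, ZooKeeper
address, credentials, and table name "ngrams2" are placeholders, not details
from this thread:

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;

    public class SetCodec {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details -- replace with real ones.
            Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
                    .getConnector("root", new PasswordToken("secret"));
            // gz is the default; "snappy" is only accepted where the codec is available.
            conn.tableOperations().setProperty("ngrams2",
                    "table.file.compress.type", "snappy");
            // Newly written RFiles pick up the codec; a full compaction rewrites
            // existing files so sizes can then be compared with hadoop fs -du.
            conn.tableOperations().compact("ngrams2", null, null, true, true);
        }
    }
]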
> On Wed, May 15, 2013 at 2:52 PM, Eric Newton <[EMAIL PROTECTED]> wrote:
>> I don't intend to do that.
>>
>>
>> On Wed, May 15, 2013 at 12:11 PM, Josh Elser <[EMAIL PROTECTED]> wrote:
>>> Just kidding, re-read the rest of this. Let me try again:
>>>
>>> Any intents to retry this with different compression codecs?
>>>
>>>
>>> On 5/15/13 12:00 PM, Josh Elser wrote:
>>>> RFile... with gzip? Or did you use another compressor?
>>>>
>>>> On 5/15/13 10:58 AM, Eric Newton wrote:
>>>>> I ingested the 2-gram data on a 10 node cluster.  It took just under 7
>>>>> hours.  For most of the job, accumulo ingested at about 200K k-v/server.
>>>>>
>>>>> $ hadoop fs -dus /accumulo/tables/2 /data/n-grams/2-grams
>>>>> /accumulo/tables/2       74632273653
>>>>> /data/n-grams/2-grams    154271541304
>>>>>
>>>>> That's a very nice result.  RFile compressed the same data to half the
>>>>> size of the gzip'd CSV format.
>>>>>
>>>>> There are 37,582,158,107 entries in the 2-gram set, which means that
>>>>> accumulo is using only 2 bytes for each entry.
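[Working the arithmetic through on the sizes above: 74,632,273,653 bytes /
37,582,158,107 entries is about 1.99 bytes per key/value pair on disk, and
154,271,541,304 / 74,632,273,653 is about 2.07, which is where the "2 bytes
for each entry" and "half the gzip'd CSV" figures come from.]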
>>>>>
>>>>> -Eric Newton, which appeared 62 times in 37 books in 2008.
>>>>>
>>>>>
>>>>> On Fri, May 3, 2013 at 7:20 PM, Eric Newton <[EMAIL PROTECTED]
>>>>> <mailto:[EMAIL PROTECTED]>> wrote:
>>>>>
>>>>>      ngram == row
>>>>>      year == column family
>>>>>      count == column qualifier (prepended with zeros for sort)
>>>>>      book count == value
>>>>>
>>>>>      I used ascii text for the counts, even.
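[As a concrete reading of that mapping, here is a minimal sketch of turning one
source line (ngram TAB year TAB match_count TAB volume_count, per the format
quoted below) into an Accumulo Mutation with the Java client API. The table
name "ngrams2" and the 12-digit zero-pad width are assumptions for
illustration, not details from the thread:

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    public class NgramMapping {
        // One TSV line -> one key/value, all fields kept as ASCII text.
        static Mutation toMutation(String tsvLine) {
            String[] f = tsvLine.split("\t");
            String ngram = f[0], year = f[1], matchCount = f[2], volumeCount = f[3];
            Mutation m = new Mutation(ngram);                           // row
            m.put(year,                                                 // column family
                  String.format("%012d", Long.parseLong(matchCount)),  // zero-padded count sorts numerically
                  new Value(volumeCount.getBytes()));                  // book count as value
            return m;
        }

        static void write(Connector conn, Iterable<String> lines) throws Exception {
            BatchWriter bw = conn.createBatchWriter("ngrams2", new BatchWriterConfig());
            for (String line : lines)
                bw.addMutation(toMutation(line));
            bw.close();
        }
    }
]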
>>>>>
>>>>>      I'm not sure if the google entries are sorted; if they aren't, the
>>>>>      sort would help compression.
>>>>>
>>>>>      The RFile format does not repeat identical data from key to key, so
>>>>>      in most cases, the row is not repeated.  That gives gzip other
>>>>>      things to work on.
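[A toy illustration of that idea, deliberately not Accumulo's actual
RFile/RelativeKey on-disk encoding: when consecutive keys share the row, a
writer can emit a one-byte flag instead of repeating the row bytes, so the
general-purpose compressor only sees the fields that actually change:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class DeltaKeyWriter {
        private final DataOutputStream out;
        private String previousRow = null;

        DeltaKeyWriter(ByteArrayOutputStream buffer) {
            out = new DataOutputStream(buffer);
        }

        void append(String row, String family, String qualifier, String value)
                throws IOException {
            if (row.equals(previousRow)) {
                out.writeBoolean(true);   // same row as the previous key: no row bytes
            } else {
                out.writeBoolean(false);  // new row: write it once
                out.writeUTF(row);
                previousRow = row;
            }
            out.writeUTF(family);
            out.writeUTF(qualifier);
            out.writeUTF(value);
        }
    }
]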
>>>>>
>>>>>      I'll have to do more analysis to figure out why RFile did so well.
>>>>>      Perhaps google used less aggressive settings for their compression.
>>>>>
>>>>>      I'm more interested in 2-grams to test our partial-row compression
>>>>>      in 1.5.
>>>>>
>>>>>      -Eric
>>>>>
>>>>>
>>>>>      On Fri, May 3, 2013 at 4:09 PM, Jared Winick <[EMAIL PROTECTED]
>>>>>      <mailto:[EMAIL PROTECTED]>> wrote:
>>>>>
>>>>>          That is very interesting and sounds like a fun friday project!
>>>>>          Could you please elaborate on how you mapped the original
>>>>>          format of
>>>>>
>>>>>          ngram TAB year TAB match_count TAB volume_count NEWLINE
>>>>>
>>>>>          into Accumulo key/values? Could you briefly explain what feature
>>>>>          in Accumulo is responsible for this improvement in storage
>>>>>          efficiency? This could be a helpful illustration for users to
>>>>>          know how key/value design can take advantage of these Accumulo
>>>>>          features. Thanks a lot!
>>>>>
>>>>>          Jared
>>>>>
>>>>>
>>>>>          On Fri, May 3, 2013 at 1:24 PM, Eric Newton
>>>>>          <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>> wrote:
>>>>>
>>>>>              I think David Medinets suggested some publicly available
>>>>>              data sources that could be used to compare the storage
>>>>>              requirements of different key/value stores.
>>>>>
>>>>>              Today I tried it out.
>>>>>
>>>>>              I took the google 1-gram word lists and ingested them into
>>>>>              accumulo.
>>>>>
>>>>>
>>>>> http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>>>>>
>>>>>              It took about 15 minutes to ingest on a 10 node cluster (4
>>>>>              drives each).
>>>>>
>>>>>              $ hadoop fs -du -s -h /data/googlebooks/ngrams/1-grams
>>>>>              running...
>>>>>              5.2 G  /data/googlebooks/ngrams/1-grams
>>>>>
>>>>>              $ hadoop fs -du -s -h /accumulo/tables/4