Re: interesting
Eric, what version of Accumulo did you use? I'm assuming 1.5.0
On Wed, May 15, 2013 at 5:20 PM, Josh Elser <[EMAIL PROTECTED]> wrote:

> Definitely, with a note on the ingest job duration, too.
>
>
> On 05/15/2013 04:27 PM, Christopher wrote:
>
>> I'd be very curious how something faster, like Snappy, compared.
>>
>> --
>> Christopher L Tubbs II
>> http://gravatar.com/ctubbsii
>>
>>
>> On Wed, May 15, 2013 at 2:52 PM, Eric Newton <[EMAIL PROTECTED]>
>> wrote:
>>
>>> I don't intend to do that.
>>>
>>>
>>> On Wed, May 15, 2013 at 12:11 PM, Josh Elser <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Just kidding, re-read the rest of this. Let me try again:
>>>>
>>>> Any intent to retry this with different compression codecs?
>>>>
>>>>
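For context, the compression codec is a per-table setting in Accumulo, so a retry along those lines would just change table.file.compress.type before re-ingesting. A sketch of the shell command (the table name is illustrative, and snappy is only usable if the running Accumulo version and the native Hadoop libraries support it):

    config -t ngrams -s table.file.compress.type=snappy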
>>>> On 5/15/13 12:00 PM, Josh Elser wrote:
>>>>
>>>>> RFile... with gzip? Or did you use another compressor?
>>>>>
>>>>> On 5/15/13 10:58 AM, Eric Newton wrote:
>>>>>
>>>>>> I ingested the 2-gram data on a 10-node cluster.  It took just under 7
>>>>>> hours.  For most of the job, accumulo ingested at about 200K
>>>>>> k-v/server.
>>>>>>
>>>>>> $ hadoop fs -dus /accumulo/tables/2 /data/n-grams/2-grams
>>>>>> /accumulo/tables/2       74632273653
>>>>>> /data/n-grams/2-grams    154271541304
>>>>>>
>>>>>> That's a very nice result.  RFile compressed the same data to half
>>>>>> the size of the gzip'd CSV format.
>>>>>>
>>>>>> There are 37,582,158,107 entries in the 2-gram set, which means that
>>>>>> accumulo is using only 2 bytes for each entry.
>>>>>>
>>>>>> -Eric Newton, which appeared 62 times in 37 books in 2008.
>>>>>>
>>>>>>
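As a quick check on those figures (assuming the dus output above is in bytes, and that the ~200K rate is per second):

    154,271,541,304 / 74,632,273,653 ≈ 2.07    -- RFile is about half the gzip'd CSV
    74,632,273,653 / 37,582,158,107 ≈ 1.99     -- roughly 2 bytes per entry
    37,582,158,107 / (10 servers x 7 h x 3600 s) ≈ 149K key-values per second per server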
>>>>>> On Fri, May 3, 2013 at 7:20 PM, Eric Newton <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>      ngram == row
>>>>>>      year == column family
>>>>>>      count == column qualifier (prepended with zeros for sort)
>>>>>>      book count == value
>>>>>>
>>>>>>      I used ascii text for the counts, even.
>>>>>>
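A minimal sketch of that mapping against the Accumulo client API (the pad width and the BatchWriter setup are illustrative assumptions, not details from the thread):

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.MutationsRejectedException;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    // One input line: ngram TAB year TAB match_count TAB volume_count
    static void writeNgram(BatchWriter writer, String line)
            throws MutationsRejectedException {
        String[] f = line.split("\t");
        Mutation m = new Mutation(f[0]);              // ngram == row
        // Zero-pad the match count so it sorts correctly as ascii text
        // (pad width of 12 is an assumption).
        String count = String.format("%012d", Long.parseLong(f[2]));
        // year == column family, padded count == column qualifier,
        // book (volume) count == ascii value
        m.put(f[1], count, new Value(f[3].getBytes()));
        writer.addMutation(m);
    }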
>>>>>>      I'm not sure if the google entries are sorted, so the sort would
>>>>>>      help compression.
>>>>>>
>>>>>>      The RFile format does not repeat identical data from key to key, so
>>>>>>      in most cases, the row is not repeated.  That gives gzip other
>>>>>>      things to work on.
>>>>>>
>>>>>>      I'll have to do more analysis to figure out why RFile did so well.
>>>>>>      Perhaps google used less aggressive settings for their compression.
>>>>>>
>>>>>>      I'm more interested in 2-grams to test our partial-row compression
>>>>>>      in 1.5.
>>>>>>
>>>>>>      -Eric
>>>>>>
>>>>>>
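To illustrate the point about not repeating identical data: a toy version of relative-key encoding (not the actual RFile on-disk format) could write only the length of the prefix shared with the previous row plus the differing suffix:

    // Toy relative-key encoding, NOT the real RFile format: emit the
    // length of the prefix shared with the previous row, then the suffix.
    static String encodeRow(String prev, String row) {
        int common = 0;
        int max = Math.min(prev.length(), row.length());
        while (common < max && prev.charAt(common) == row.charAt(common)) {
            common++;
        }
        // e.g. prev = "zebra crossing", row = "zebra herd"  ->  "6|herd"
        return common + "|" + row.substring(common);
    }

Because sorted 2-gram rows usually share their whole first word, most of each row is never written at all, and gzip then only has the remaining suffixes and values to compress.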
>>>>>>      On Fri, May 3, 2013 at 4:09 PM, Jared Winick <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>          That is very interesting and sounds like a fun Friday project!
>>>>>>          Could you please elaborate on how you mapped the original format of
>>>>>>
>>>>>>          ngram TAB year TAB match_count TAB volume_count NEWLINE
>>>>>>
>>>>>>          into Accumulo key/values? Could you briefly explain what feature
>>>>>>          in Accumulo is responsible for this improvement in storage
>>>>>>          efficiency? This could be a helpful illustration for users to
>>>>>>          know how key/value design can take advantage of these Accumulo
>>>>>>          features. Thanks a lot!
>>>>>>
>>>>>>          Jared
>>>>>>
>>>>>>
>>>>>>          On Fri, May 3, 2013 at 1:24 PM, Eric Newton <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>              I think David Medinets suggested some publicly available
>>>>>>              data sources that could be used to compare the storage
>>>>>>              requirements of different key/value stores.
>>>>>>
>>>>>>              Today I tried it out.
>>>>>>
>>>>>>              I took the google 1-gram word lists and ingested them into
>>>>>>              accumulo.
>>>>>>
>>>>>>
>>>>>>              http://storage.googleapis.com/books/ngrams/books/datasetsv2.html