Accumulo, mail # user - interesting

Re: interesting
Jim Klucar 2013-05-20, 02:13
Eric, what version of Accumulo did you use? I'm assuming 1.5.0
On Wed, May 15, 2013 at 5:20 PM, Josh Elser <[EMAIL PROTECTED]> wrote:

> Definitely, with a note on the ingest job duration, too.
>
>
> On 05/15/2013 04:27 PM, Christopher wrote:
>
>> I'd be very curious how something faster, like Snappy, compared.
>>
>> --
>> Christopher L Tubbs II
>> http://gravatar.com/ctubbsii
>>
>>
>> On Wed, May 15, 2013 at 2:52 PM, Eric Newton <[EMAIL PROTECTED]>
>> wrote:
>>
>>> I don't intend to do that.
>>>
>>>
>>> On Wed, May 15, 2013 at 12:11 PM, Josh Elser <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Just kidding, re-read the rest of this. Let me try again:
>>>>
>>>> Any intents to retry this with different compression codecs?
>>>>
>>>>
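For reference on retrying with other codecs: the RFile block codec is a per-table setting ("table.file.compress.type", gz by default), and changing it only affects files written afterward. Below is a minimal, untested sketch using the Accumulo 1.5 client API; the instance name, zookeepers, credentials, and table name are placeholders, and it assumes a Snappy codec is available on the cluster.

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;

    public class SetSnappyExample {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details; substitute a real instance, zookeepers, and user.
            Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
                    .getConnector("root", new PasswordToken("secret"));
            // table.file.compress.type selects the RFile block codec (default is gz).
            // Only RFiles written after this change (flushes/compactions) use the new codec.
            conn.tableOperations().setProperty("ngrams", "table.file.compress.type", "snappy");
        }
    }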
>>>> On 5/15/13 12:00 PM, Josh Elser wrote:
>>>>
>>>>> RFile... with gzip? Or did you use another compressor?
>>>>>
>>>>> On 5/15/13 10:58 AM, Eric Newton wrote:
>>>>>
>>>>>> I ingested the 2-gram data on a 10 node cluster.  It took just under 7
>>>>>> hours.  For most of the job, accumulo ingested at about 200K
>>>>>> k-v/server.
>>>>>>
>>>>>> $ hadoop fs -dus /accumulo/tables/2 /data/n-grams/2-grams
>>>>>> /accumulo/tables/2      74632273653
>>>>>> /data/n-grams/2-grams   154271541304
>>>>>>
>>>>>> That's a very nice result.  RFile compressed the same data to half the
>>>>>> size of the gzip'd CSV format.
>>>>>>
>>>>>> There are 37,582,158,107 entries in the 2-gram set, which means that
>>>>>> accumulo is using only 2 bytes for each entry.
>>>>>>
>>>>>> -Eric Newton, which appeared 62 times in 37 books in 2008.
>>>>>>
>>>>>>
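A quick sanity check of those figures, using the byte counts from the dus output above:

    public class SizeCheck {
        public static void main(String[] args) {
            long rfileBytes   = 74_632_273_653L;   // /accumulo/tables/2 (Accumulo RFiles)
            long gzipCsvBytes = 154_271_541_304L;  // /data/n-grams/2-grams (gzip'd CSV)
            long entries      = 37_582_158_107L;   // key-value pairs in the 2-gram set

            // RFile comes out at roughly half the gzip'd CSV, and about 2 bytes per entry.
            System.out.printf("RFile / gzip CSV size ratio: %.2f%n", (double) rfileBytes / gzipCsvBytes); // ~0.48
            System.out.printf("RFile bytes per entry:       %.2f%n", (double) rfileBytes / entries);      // ~1.99
        }
    }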
>>>>>> On Fri, May 3, 2013 at 7:20 PM, Eric Newton <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>      ngram == row
>>>>>>      year == column family
>>>>>>      count == column qualifier (prepended with zeros for sort)
>>>>>>      book count == value
>>>>>>
>>>>>>      I used ascii text for the counts, even.
>>>>>>
>>>>>>      I'm not sure if the google entries are sorted, so the sort would
>>>>>>      help compression.
>>>>>>
>>>>>>      The RFile format does not repeat identical data from key to key,
>>>>>> so
>>>>>>      in most cases, the row is not repeated.  That gives gzip other
>>>>>>      things to work on.
>>>>>>
>>>>>>      I'll have to do more analysis to figure out why RFile did so
>>>>>> well.
>>>>>>        Perhaps google used less aggressive settings for their
>>>>>> compression.
>>>>>>
>>>>>>      I'm more interested in 2-grams to test our partial-row
>>>>>> compression
>>>>>>      in 1.5.
>>>>>>
>>>>>>      -Eric
>>>>>>
>>>>>>
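To make the layout concrete, here is a minimal sketch of how one input line could be turned into an Accumulo Mutation under the mapping Eric lists above (the class name and the zero-padding width are illustrative assumptions, not details of the actual ingest job):

    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;

    public class NgramMapping {
        // Input line format: ngram TAB year TAB match_count TAB volume_count
        static Mutation toMutation(String line) {
            String[] f = line.split("\t");
            // Zero-pad the count so lexicographic order matches numeric order.
            String paddedCount = String.format("%012d", Long.parseLong(f[2]));
            Mutation m = new Mutation(new Text(f[0]));          // ngram      -> row
            m.put(new Text(f[1]),                               // year       -> column family
                  new Text(paddedCount),                        // count      -> column qualifier
                  new Value(f[3].getBytes()));                  // book count -> value (ASCII text)
            return m;
        }
    }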
>>>>>>      On Fri, May 3, 2013 at 4:09 PM, Jared Winick <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>          That is very interesting and sounds like a fun friday
>>>>>> project!
>>>>>>          Could you please elaborate on how you mapped the original
>>>>>> format of
>>>>>>
>>>>>>          ngram TAB year TAB match_count TAB volume_count NEWLINE
>>>>>>
>>>>>>          into Accumulo key/values? Could you briefly explain what
>>>>>> feature
>>>>>>          in Accumulo is responsible for this improvement in storage
>>>>>>          efficiency. This could be a helpful illustration for users to
>>>>>>          know how key/value design can take advantage of these
>>>>>> Accumulo
>>>>>>          features. Thanks a lot!
>>>>>>
>>>>>>          Jared
>>>>>>
>>>>>>
>>>>>>          On Fri, May 3, 2013 at 1:24 PM, Eric Newton <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>              I think David Medinets suggested some publicly available
>>>>>>              data sources that could be used to compare the storage
>>>>>>              requirements of different key/value stores.
>>>>>>
>>>>>>              Today I tried it out.
>>>>>>
>>>>>>              I took the google 1-gram word lists and ingested them
>>>>>> into
>>>>>>              accumulo.
>>>>>>
>>>>>>
>>>>>>              http://storage.googleapis.com/books/ngrams/books/datasetsv2.html