Accumulo >> mail # user >> sum of mutation.numBytes() significantly different from rfile size


RE: sum of mutation.numBytes() significantly different from rfile size
Comparing the rfiles with compressed CSV files, the results do make sense now.

Thanks,
David

-----Original Message-----
From: Eric Newton [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, October 29, 2013 11:05 PM
To: [EMAIL PROTECTED]
Subject: Re: sum of mutation.numBytes() significantly different from rfile size

For comparison, I posted this some time ago:

http://tinyurl.com/k28bkbg

I was surprised that RFile was smaller than a gzip'd CSV file, too.

On Tue, Oct 29, 2013 at 6:35 PM, Keith Turner <[EMAIL PROTECTED]> wrote:
>
>
>
> On Tue, Oct 29, 2013 at 5:50 PM, Slater, David M. <[EMAIL PROTECTED]> wrote:
>>
>> Hello,
>>
>>
>>
>> I'm seeing about an order of magnitude difference between the number
>> of bytes returned by mutation.numBytes() and the size of the rfiles
>> on disk (Accumulo 1.4.2). Note that all of my mutations are new
>> entries, and there are no combiners running.
>>
>>
>>
>> While I understand that there is some compression on the rfile, I
>> would be really surprised if it was 10:1.
>>
>>
>>
>> My entries are composed of a row ID (most of which is equivalent to
>> the previous row ID), an empty column family, a nonempty column
>> qualifier (which likely shares a lot with the previous qualifier),
>> and an empty value. An example of the rowID and column qualifier might be:
>
>
> In 1.4, if a field (row, col fam, etc.) in a key is the same as in the
> previous key, then it's not written again.  So if the row is the same in 10
> consecutive keys, it's only written once.  Maybe this explains the difference.
> Scan the table to make sure all of the data you expect to be there is there.
>
>>
>>
>>
>> (forward table)
>>
>> 0000000000000|9|fa19                 IP|127.000.000.001
>>
>> 0000000000000|9|fa19                  PORT|00080
>>
>> ...
>>
>> 0000000000000|9|fa22                  IP|128.032.144.139
>>
>> ...
>>
>> <timeblock>|<hash>|<uid>       <index>|<textual value>
>>
>>
>>
>> OR
>>
>> (reverse table)
>>
>> 0000000000000|IP|127.000.000.001         fa19
>>
>> 0000000000000|IP|127.000.000.001         fd02
>>
>> 0000000000000|IP|127.000.000.002         123
>>
>> ...
>>
>> 0000000000000|PORT|00080                      fa19
>>
>>
>>
>> The numBytes() method appears to return a number of bytes equal to
>> the string length of the row ID and column qualifiers, plus 26 * # of
>> column qualifiers.
>>
>>
>>
>> Is there something else that I'm missing, or would this possibly
>> compress by that much?
>>
>>
>>
>> Thanks,
>>
>> David
>
>
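
To make the two size estimates discussed in this thread concrete, here is a minimal Python sketch. It is not Accumulo code and calls no Accumulo API. It assumes David's reported formula for numBytes() (row length, plus qualifier lengths, plus 26 per qualifier) with one mutation per distinct row, and it models only the behavior Keith describes for 1.4 (a key field equal to the same field of the previous key is not written again). Block compression and any other RFile overhead are ignored. The names mutation_bytes_estimate, elided_key_bytes, SAMPLE_KEYS, and TIMEBLOCK are hypothetical, and the timeblock is assumed to be 13 zero digits.

```python
from itertools import groupby
from typing import List, Optional, Tuple

# (row, column family, column qualifier); values are empty in the thread's tables.
Key = Tuple[str, str, str]

# Per-qualifier constant observed by David in the thread; not an Accumulo API value.
PER_UPDATE_OVERHEAD = 26

# Assumption: a zero-padded timeblock of 13 digits, modeled on the forward-table examples.
TIMEBLOCK = "0" * 13

SAMPLE_KEYS: List[Key] = [
    (TIMEBLOCK + "|9|fa19", "", "IP|127.000.000.001"),
    (TIMEBLOCK + "|9|fa19", "", "PORT|00080"),
    (TIMEBLOCK + "|9|fa22", "", "IP|128.032.144.139"),
]


def mutation_bytes_estimate(keys: List[Key]) -> int:
    """Apply the formula reported in the thread, assuming one mutation per row.

    Each mutation counts its row length once, each qualifier length, and
    26 bytes per qualifier. len() counts characters, which equals bytes
    only because the sample data is ASCII.
    """
    total = 0
    for row, group in groupby(sorted(keys), key=lambda k: k[0]):
        qualifiers = [cq for _, _, cq in group]
        total += len(row)
        total += sum(len(cq) for cq in qualifiers)
        total += PER_UPDATE_OVERHEAD * len(qualifiers)
    return total


def elided_key_bytes(keys: List[Key]) -> int:
    """Count key field bytes, skipping fields identical to the previous key's.

    This models only the "same field is not written again" behavior and
    ignores compression and all other on-disk overhead.
    """
    total = 0
    previous: Optional[Key] = None
    for key in sorted(keys):
        for i, field in enumerate(key):
            if previous is None or field != previous[i]:
                total += len(field)
        previous = key
    return total


if __name__ == "__main__":
    estimate = mutation_bytes_estimate(SAMPLE_KEYS)
    elided = elided_key_bytes(SAMPLE_KEYS)
    print("numBytes-style estimate:", estimate)
    print("bytes after eliding repeated fields:", elided)
    print("ratio: %.2f" % (estimate / elided))
```

On three keys the gap is modest. It would presumably widen on real tables where many consecutive keys share a row, and widen further once compression is applied to the remaining bytes, which is consistent with the direction of the difference reported above but does not by itself confirm a 10:1 ratio.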