Accumulo, mail # user - sum of mutation.numBytes() significantly different from rfile size


Slater, David M. 2013-10-29, 21:50
Keith Turner 2013-10-29, 22:35
Re: sum of mutation.numBytes() significantly different from rfile size
Eric Newton 2013-10-30, 03:05
For comparison, I posted this some time ago:

http://tinyurl.com/k28bkbg

I was surprised that RFile was smaller than a gzip'd CSV file, too.

On Tue, Oct 29, 2013 at 6:35 PM, Keith Turner <[EMAIL PROTECTED]> wrote:
>
>
>
> On Tue, Oct 29, 2013 at 5:50 PM, Slater, David M. <[EMAIL PROTECTED]>
> wrote:
>>
>> Hello,
>>
>>
>>
>> I’m seeing about an order of magnitude difference between the number of
>> bytes returned by mutation.numBytes() and the size of the rfiles on disk
>> (Accumulo 1.4.2). Note that all of my mutations are new entries, and there
>> are no combiners running.
>>
>>
>>
>> While I understand that there is some compression on the rfile, I would be
>> really surprised if it was 10:1.
>>
>>
>>
>> My entries are composed of a row ID (most of which is equivalent to the
>> previous row ID), an empty column family, a nonempty column qualifier (which
>> likely shares a lot with the previous qualifier), and an empty value. An
>> example of the rowID and column qualifier might be:
>
>
> In 1.4, if a field (row, col fam, etc.) in a key is the same as in the
> previous key, then it's not written again.  So if the row is the same in 10
> consecutive keys, it's only written once.  Maybe this explains the difference.
> Scan the table to make sure all of the data you expect to be there is there.
>
>>
>>
>>
>> (forward table)
>>
>> 0000000000000|9|fa19                 IP|127.000.000.001
>>
>> 0000000000000|9|fa19                  PORT|00080
>>
>> …
>>
>> 0000000000000|9|fa22                  IP|128.032.144.139
>>
>> …
>>
>> <timeblock>|<hash>|<uid>       <index>|<textual value>
>>
>>
>>
>> OR
>>
>> (reverse table)
>>
>> 0000000000000|IP|127.000.000.001         fa19
>>
>> 0000000000000|IP|127.000.000.001         fd02
>>
>> 0000000000000|IP|127.000.000.002         123
>>
>> …
>>
>> 0000000000000|PORT|00080                      fa19
>>
>>
>>
>> The numBytes() method appears to return a number of bytes equal to the
>> string length of the row ID and column qualifiers, plus 26 * # of column
>> qualifiers.
>>
>>
>>
>> Is there something else that I’m missing, or would this possibly compress
>> by that much?
>>
>>
>>
>> Thanks,
>>
>> David
>
>
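
A rough way to see how much of the gap the two effects discussed above could account for is sketched below. It is not the RFile format: it only mimics the idea Keith describes (a row is written once when consecutive keys share it) and then applies zlib as a stand-in for the file's block compression. The 26-byte overhead is David's observation from this thread, and the row and qualifier values, the helper names, and the separator bytes are all hypothetical.

```python
import zlib

# Assumption from the thread: numBytes() looked like len(row) + len(qualifiers)
# plus 26 bytes per column qualifier. This is David's observation, not a spec.
OVERHEAD_PER_QUALIFIER = 26


def make_rows(n_values=200, uids_per_value=10):
    """Build hypothetical reverse-table rows: {row id: [uid qualifiers]}."""
    rows = {}
    for v in range(n_values):
        row = "0000000000000|IP|127.000.000.%03d" % v
        rows[row] = ["%04x" % (v * uids_per_value + u) for u in range(uids_per_value)]
    return rows


def estimated_num_bytes(rows):
    """Sum the per-mutation estimate, assuming one mutation per row."""
    total = 0
    for row, quals in rows.items():
        total += len(row) + sum(len(q) for q in quals)
        total += OVERHEAD_PER_QUALIFIER * len(quals)
    return total


def elided_encoding(rows):
    """Serialize sorted keys, writing the row only when it changes."""
    out = bytearray()
    prev_row = None
    for row in sorted(rows):
        for qual in sorted(rows[row]):
            if row != prev_row:
                # Row differs from the previous key, so it has to be written.
                out += row.encode("ascii") + b"\x00"
                prev_row = row
            out += qual.encode("ascii") + b"\n"
    return bytes(out)


rows = make_rows()
estimate = estimated_num_bytes(rows)
elided = elided_encoding(rows)
compressed = zlib.compress(elided)

print("estimated numBytes total:", estimate)
print("after eliding repeated rows:", len(elided))
print("after zlib compression:", len(compressed))
print("estimate / compressed: %.1f" % (estimate / len(compressed)))
```

With short qualifiers and empty values, the fixed per-entry overhead dominates the numBytes() total, so the ratio printed here depends heavily on how regular the synthetic keys are. Running the same comparison on a sample of real keys would give a more meaningful number.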
Slater, David M. 2013-10-30, 15:47
Josh Elser 2013-10-29, 22:02