Accumulo >> mail # user >> sum of mutation.numBytes() significantly different from rfile size


RE: sum of mutation.numBytes() significantly different from rfile size
Comparing the rfiles with compressed CSV files, the results do make sense now.

Thanks,
David

-----Original Message-----
From: Eric Newton [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, October 29, 2013 11:05 PM
To: [EMAIL PROTECTED]
Subject: Re: sum of mutation.numBytes() significantly different from rfile size

For comparison, I posted this some time ago:

http://tinyurl.com/k28bkbg

I was surprised that RFile was smaller than a gzip'd CSV file, too.
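
One way to sanity-check a ratio like this is to gzip a CSV-style dump of similar keys and compare sizes. Below is a minimal Python sketch, assuming synthetic keys shaped like the forward-table example quoted further down. The `synthetic_rows` helper and all of its data are made up for illustration, so the ratio it prints says nothing about any real table or about RFile itself.

```python
import gzip


def synthetic_rows(n_uids=2000):
    """Generate made-up, sorted CSV lines shaped like the forward table.

    Each line is row,family,qualifier,value with an empty family and value.
    """
    lines = []
    for uid in range(n_uids):
        row = "0000000000000|9|%04x" % uid
        ip = "IP|127.000.%03d.%03d" % (uid // 256 % 256, uid % 256)
        lines.append("%s,,%s," % (row, ip))
        lines.append("%s,,PORT|00080," % row)
    return lines


raw = "\n".join(synthetic_rows()).encode("utf-8")
packed = gzip.compress(raw)
ratio = len(raw) / len(packed)

# Report the uncompressed size, the gzip size, and their ratio.
print("raw=%d gzip=%d ratio=%.1f:1" % (len(raw), len(packed), ratio))
```

Sorted keys that share long prefixes give the compressor many nearby repeats, which is why a large ratio on data like this is plausible.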

On Tue, Oct 29, 2013 at 6:35 PM, Keith Turner <[EMAIL PROTECTED]> wrote:
>
>
>
> On Tue, Oct 29, 2013 at 5:50 PM, Slater, David M. <[EMAIL PROTECTED]> wrote:
>>
>> Hello,
>>
>>
>>
>> I'm seeing about an order of magnitude difference between the number
>> of bytes returned by mutation.numBytes() and the size of the rfiles
>> on disk (Accumulo 1.4.2). Note that all of my mutations are new
>> entries, and there are no combiners running.
>>
>>
>>
>> While I understand that there is some compression on the rfile, I
>> would be really surprised if it was 10:1.
>>
>>
>>
>> My entries are composed of a row ID (most of which is equivalent to
>> the previous row ID), an empty column family, a nonempty column
>> qualifier (which likely shares a lot with the previous qualifier),
>> and an empty value. An example of the rowID and column qualifier might be:
>
>
> In 1.4, if a field (row, column family, etc.) in a key is the same as in
> the previous key, then it's not written again.  So if the row is the same
> in 10 consecutive keys, it's only written once.  Maybe this explains the
> difference.  Scan the table to make sure all of the data you expect to be
> there is there.
>
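
A rough sketch of the effect Keith describes, written in Python rather than Accumulo's Java. The `encoded_size` and `raw_size` helpers and the one-byte-per-field overhead are assumptions made for illustration; this is not RFile's actual on-disk format. It only counts bytes when a field identical to the same field in the previous key is skipped.

```python
def encoded_size(keys, per_field_overhead=1):
    """Count bytes for sorted (row, family, qualifier) keys when a field
    equal to the same field in the previous key is not written again.

    per_field_overhead is a hypothetical fixed cost per written field
    (for example a length prefix); it is not taken from the RFile format.
    """
    total = 0
    previous = None
    for key in keys:
        for i, field in enumerate(key):
            if previous is not None and previous[i] == field:
                continue  # same as the previous key's field: skip it
            total += len(field.encode("utf-8")) + per_field_overhead
        previous = key
    return total


def raw_size(keys):
    """Bytes needed if every field of every key is written in full."""
    return sum(len(f.encode("utf-8")) for key in keys for f in key)


# Sample keys modeled on the forward table in this thread.
keys = [
    ("0000000000000|9|fa19", "", "IP|127.000.000.001"),
    ("0000000000000|9|fa19", "", "PORT|00080"),
    ("0000000000000|9|fa22", "", "IP|128.032.144.139"),
]

print(raw_size(keys), encoded_size(keys))
```

With only three keys the saving is small; it grows with the number of consecutive keys that share a row, and block compression is applied on top of it.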
>>
>>
>>
>> (forward table)
>>
>> 0000000000000|9|fa19                 IP|127.000.000.001
>>
>> 0000000000000|9|fa19                  PORT|00080
>>
>> ...
>>
>> 0000000000000|9|fa22                  IP|128.032.144.139
>>
>> ...
>>
>> <timeblock>|<hash>|<uid>       <index>|<textual value>
>>
>>
>>
>> OR
>>
>> (reverse table)
>>
>> 0000000000000|IP|127.000.000.001         fa19
>>
>> 0000000000000|IP|127.000.000.001         fd02
>>
>> 0000000000000|IP|127.000.000.002         123
>>
>> ...
>>
>> 0000000000000|PORT|00080                      fa19
>>
>>
>>
>> The numBytes() method appears to return a number of bytes equal to
>> the string length of the row ID plus the string lengths of the column
>> qualifiers, plus 26 times the number of column qualifiers.
>>
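
David's observation can be written as a small formula. Below is a minimal Python sketch, assuming the observed pattern holds: row length, plus qualifier lengths, plus 26 bytes per column update. `estimate_num_bytes` is a hypothetical helper, not part of the Accumulo API, and the constant 26 is David's empirical figure rather than a documented value.

```python
def estimate_num_bytes(row, qualifiers, per_update_overhead=26):
    """Estimate per the observation in this thread:
    len(row) + sum(len(qualifier)) + 26 * number_of_qualifiers.

    Column family and value are assumed empty, as in David's data.
    """
    row_bytes = len(row.encode("utf-8"))
    qualifier_bytes = sum(len(q.encode("utf-8")) for q in qualifiers)
    return row_bytes + qualifier_bytes + per_update_overhead * len(qualifiers)


# One mutation for a single row with two column updates.
estimate = estimate_num_bytes(
    "0000000000000|9|fa19",
    ["IP|127.000.000.001", "PORT|00080"],
)
print(estimate)
```

For this sample mutation, 52 of the estimated bytes are the fixed per-update overhead, so the in-memory figure is expected to run well above the size of the key text alone.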
>>
>>
>> Is there something else that I'm missing, or would this possibly
>> compress by that much?
>>
>>
>>
>> Thanks,
>>
>> David
>
>