
Slater, David M. 20131029, 21:50
Keith Turner 20131029, 22:35

Re: sum of mutation.numBytes() significantly different from rfile size
For comparison, I posted this some time ago:
http://tinyurl.com/k28bkbg

I was surprised that RFile was smaller than a gzip'd CSV file, too.

On Tue, Oct 29, 2013 at 6:35 PM, Keith Turner <[EMAIL PROTECTED]> wrote:
>
> On Tue, Oct 29, 2013 at 5:50 PM, Slater, David M. <[EMAIL PROTECTED]> wrote:
>>
>> Hello,
>>
>> I’m seeing about an order of magnitude difference between the number of
>> bytes returned by mutation.numBytes() and the size of the rfiles on disk
>> (Accumulo 1.4.2). Note that all of my mutations are new entries, and there
>> are no combiners running.
>>
>> While I understand that there is some compression on the rfile, I would be
>> really surprised if it was 10:1.
>>
>> My entries are composed of a row ID (most of which is equivalent to the
>> previous row ID), an empty column family, a nonempty column qualifier (which
>> likely shares a lot with the previous qualifier), and an empty value. An
>> example of the rowID and column qualifier might be:
>
> In 1.4 if a field (row, col fam, etc) in key is the same as the previous,
> then its not written again. So if the row is the same in 10 consecutive
> keys, its only written once. Maybe this explains the difference. Scan the
> table to make sure all of the data you expect to be there is there.
>
>> (forward table)
>>
>> 00000000000009fa19 IP127.000.000.001
>>
>> 00000000000009fa19 PORT00080
>>
>> …
>>
>> 00000000000009fa22 IP128.032.144.139
>>
>> …
>>
>> <timeblock><hash><uid> <index><textual value>
>>
>> OR
>>
>> (reverse table)
>>
>> 0000000000000IP127.000.000.001 fa19
>>
>> 0000000000000IP127.000.000.001 fd02
>>
>> 0000000000000IP127.000.000.002 123
>>
>> …
>>
>> 0000000000000PORT00080 fa19
>>
>> The numBytes() method appears to return a number of bytes equal to the
>> string length of the row ID and column qualifiers, plus 26 * # of column
>> qualifiers.
>>
>> Is there something else that I’m missing, or would this possibly compress
>> by that much?
>>
>> Thanks,
>>
>> David
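
To get a rough feel for the effect Keith describes (a key field that matches the previous key's field is not written again), here is a minimal Java sketch. It is only an illustration, not Accumulo's actual RFile encoding: it ignores flag bytes, length prefixes, timestamps, and block compression, and the class and method names (RepeatedFieldSketch, naiveBytes, elidedBytes) are made up for this example. It compares the byte count when every field of every key is counted against the count when a repeated field is skipped.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class RepeatedFieldSketch {

    // Byte length of one field when encoded as UTF-8.
    static int len(String field) {
        return field.getBytes(StandardCharsets.UTF_8).length;
    }

    // Counts every field of every key, whether or not it repeats.
    static long naiveBytes(List<String[]> keys) {
        long total = 0;
        for (String[] key : keys) {
            for (String field : key) {
                total += len(field);
            }
        }
        return total;
    }

    // Counts a field only when it differs from the same field in the previous key.
    // This is a simplification (an assumption for illustration), not the real on-disk format.
    static long elidedBytes(List<String[]> keys) {
        long total = 0;
        String[] prev = null;
        for (String[] key : keys) {
            for (int i = 0; i < key.length; i++) {
                boolean same = prev != null && i < prev.length && key[i].equals(prev[i]);
                if (!same) {
                    total += len(key[i]);
                }
            }
            prev = key;
        }
        return total;
    }

    public static void main(String[] args) {
        // {row ID, column qualifier} pairs taken from the forward-table example in the thread.
        List<String[]> keys = Arrays.asList(
            new String[] {"00000000000009fa19", "IP127.000.000.001"},
            new String[] {"00000000000009fa19", "PORT00080"},
            new String[] {"00000000000009fa22", "IP128.032.144.139"});
        System.out.println("naive bytes:  " + naiveBytes(keys));
        System.out.println("elided bytes: " + elidedBytes(keys));
    }
}
```

The more consecutive keys share a row ID, the larger the gap between the two counts, and any block compression applied afterwards would widen it further. Scanning the table, as Keith suggests, is still the way to confirm that nothing is actually missing.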
Slater, David M. 20131030, 15:47
Josh Elser 20131029, 22:02


