-
compressing values returned to scanner
ameet kini 2012-10-01, 19:03
My understanding of compression in Accumulo 1.4.1 is that it is on by default and that data is decompressed by the tablet server, so data on the wire between server/client is decompressed. Is there a way to shift the decompression from happening on the server to the client? I have a use case where each Value in my table is relatively large (~ 8MB) and I can benefit from compression over the wire. I don't have any server side iterators, so the values don't need to be decompressed by the tablet server. Also, each scan returns a few rows, so client-side decompression can be fast.
The only way I can think of now is to disable compression on that table, and handle compression/decompression in the application. But if there is a way to do this in Accumulo, I'd prefer that.
Thanks, Ameet
-
Re: compressing values returned to scanner
Marc Parisi 2012-10-01, 19:19
You could compress the data in the value, and decompress the data upon receipt by the scanner.
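A minimal sketch of that application-level approach, written in Python for brevity (Accumulo's client API is Java). The helper names `pack_value` and `unpack_value` are hypothetical; the bytes returned by `pack_value` are what the application would store as the Value, and therefore what would travel over the wire.

```python
import zlib

def pack_value(raw: bytes, level: int = 6) -> bytes:
    """Compress a value in the application before it is written."""
    return zlib.compress(raw, level)

def unpack_value(stored: bytes) -> bytes:
    """Decompress a value after the scanner hands it back."""
    return zlib.decompress(stored)

# Hypothetical payload: repetitive text standing in for a large value.
payload = b"sensor=42;status=OK;" * 50_000  # about 1 MB
stored = pack_value(payload)

print(len(payload), len(stored))
```

How much this saves depends entirely on how compressible the real values are.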
-
Re: compressing values returned to scanner
ameet kini 2012-10-01, 19:27
In other words, "handle compression/decompression in the application" :)
I'm looking to see if there's a way to do this in Accumulo. Maybe a table level config parameter. There's already the "table.file.compress.type", which when set to NONE disables compression. Instead, I would like to keep compression on, and defer the decompression to the client.
Ameet
-
Re: compressing values returned to scanner
William Slacum 2012-10-01, 19:32
If you aren't often looking at the data in the value on the tablet server (like in an iterator), you can also pre-compress your values on ingest.
-
Re: compressing values returned to scanner
ameet kini 2012-10-01, 19:40
That is exactly my use case (ingest once, serve often, no server-side iterators).
And I'm doing pre-compression on ingest. I was just looking to do away with app-level compression code. Not a biggie.
Ameet
-
Re: compressing values returned to scanner
William Slacum 2012-10-01, 20:00
Someone can correct me if I'm wrong, but I believe the file compression option you quoted is for the RFiles in HDFS. You can enable compression there and will still see some benefit even if you compress the values on ingest.
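A toy illustration of why file-level compression could still help when values are already compressed: the keys around the values remain redundant. This is only a sketch under stated assumptions. Random bytes stand in for pre-compressed values, the key layout is made up, and `zlib` stands in for the RFile block compression.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

def fake_precompressed_value(n: int) -> bytes:
    """Random bytes stand in for an already-compressed value."""
    return bytes(rng.getrandbits(8) for _ in range(n))

# Hypothetical records: redundant keys, incompressible values.
records = [
    (f"row_{i // 10:08d}|attributes|field_{i % 10:02d}".encode(),
     fake_precompressed_value(64))
    for i in range(200)
]

raw_block = b"".join(key + value for key, value in records)
file_block = zlib.compress(raw_block)  # stands in for file-level compression

print(len(raw_block), len(file_block))
```

The gain here comes from the keys only; with much larger values relative to keys, the benefit would shrink accordingly.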
-
Re: compressing values returned to scanner
Marc Parisi 2012-10-01, 20:26
Ameet, keys and values (relative keys) are extracted from a decompressor stream. In the case of block compression (i.e. gz), you would need to return a block so the receiver can decompress it. Therefore, keeping the existing compression, as Slacum mentioned, and then decompressing the value on the client is likely the best method.
-
Re: compressing values returned to scanner
Marc Parisi 2012-10-01, 20:44
I'm sorry, I wasn't clear. Blame my sickness. When I typed block compression I was referring to the blocks within the BCFile (block compressed), not gz. But the point still remains. You couldn't return the stream through Thrift (you could return the whole block), so you would need to decompress the keys and values. You could delay decompression of the value, but you need to decompress to find the size of the value after the relative key, whereas double compression would get you what you want.
Hope that's clear.
-
Re: compressing values returned to scanner
David Medinets 2012-10-02, 00:35
+1 for double compression. CPU time is cheap. In theory, you can apply domain-specific compression in your application.
-
Re: compressing values returned to scanner
Keith Turner 2012-10-02, 18:24
There are two levels of compression in Accumulo. First, redundant parts of the key are not stored. If the row in a key is the same as the previous row, then it's not stored again. The same is done for columns and time stamps. After the relative encoding is done, a block of key values is then compressed with gzip.
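The two levels can be sketched roughly as follows. This is a simplified illustration, not Accumulo's actual on-disk format: keys are reduced to a hypothetical (row, family, qualifier, timestamp) tuple, `None` marks a field that repeats the previous key, and JSON is used only as a convenient serialization before gzip.

```python
import gzip
import json

def encode_relative(keys):
    """Replace each field that equals the previous key's field with None."""
    encoded, prev = [], None
    for key in keys:
        if prev is None:
            encoded.append(list(key))
        else:
            encoded.append([None if cur == old else cur
                            for cur, old in zip(key, prev)])
        prev = key
    return encoded

def decode_relative(encoded):
    """Rebuild full keys by copying omitted fields from the previous key."""
    keys, prev = [], None
    for entry in encoded:
        if prev is None:
            key = tuple(entry)
        else:
            key = tuple(old if cur is None else cur
                        for cur, old in zip(entry, prev))
        keys.append(key)
        prev = key
    return keys

keys = [
    ("row1", "cf", "a", 100),
    ("row1", "cf", "b", 100),
    ("row2", "cf", "a", 101),
]
relative = encode_relative(keys)                       # first level
block = gzip.compress(json.dumps(relative).encode())   # second level
```

Reading the block back means gunzipping it first and then resolving each key against its predecessor, which is why a reader cannot skip straight to one entry inside a compressed block.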
As data is read from an RFile, when the row of a key is the same as the previous key, it will just point to the previous key's row. This is carried forward over the wire. As keys are transferred, duplicate fields in the key are not transferred.
As far as decompressing on the client side vs server side, the server at least needs to decompress keys. On the server side you usually need to read from multiple sorted files and order the result. So you need to decompress keys on the server side to compare them. Also iterators on the server side need the keys and values decompressed.
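A toy version of the merge step, assuming two hypothetical sorted files held as Python lists. It only illustrates that ordering the result requires readable keys, while the values can stay opaque; the real tablet server merges RFile readers and in-memory data rather than lists.

```python
import heapq
import zlib

# Two hypothetical sorted files; values are opaque, pre-compressed bytes.
file_a = [("row1", zlib.compress(b"a" * 1000)),
          ("row4", zlib.compress(b"d" * 1000))]
file_b = [("row2", zlib.compress(b"b" * 1000)),
          ("row3", zlib.compress(b"c" * 1000))]

# The merge compares keys only, so the keys must be readable on the server.
merged = list(heapq.merge(file_a, file_b, key=lambda kv: kv[0]))
merged_rows = [row for row, _ in merged]

print(merged_rows)
```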
-
Re: compressing values returned to scanner
ameet kini 2012-10-02, 18:30
> need to decompress keys on the server side to compare them. Also
> iterators on the server side need the keys and values decompressed.
Keys, I understand, but why do values need to be decompressed if no user iterators are installed on the server? Are there system iterators that look inside the value?
Ameet
-
Re: compressing values returned to scanner
ameet kini 2012-10-02, 18:48
In re-reading your response, I may have overlooked one key point.
>> columns and time stamps. After the relative encoding is done a block
>> of key values is then compressed with gzip.
Are the keys+values compressed together as one block? If that's the case, I can see why it's not possible to only decompress keys and leave values compressed.
Also, I've switched to double compression as per previous posts and it's working nicely. I see about 10-15% more compression over just application-level Value compression.
Thanks for your responses, Ameet
-
Re: compressing values returned to scanner
Keith Turner 2012-10-02, 18:55
On Tue, Oct 2, 2012 at 2:30 PM, ameet kini <[EMAIL PROTECTED]> wrote:
>> need to decompress keys on the server side to compare them. Also
>> iterators on the server side need the keys and values decompressed.
>
> keys, I understand, but why do values need to be decompressed if there were
> no user iterators installed on the server? Are there system iterators that
> look inside the value?
I do not think any of the default iterators look at the value. You could possibly compress the value and lazily decompress it as it's needed by iterators. It seems like each value would need to be compressed individually and you would not be able to compress groups of values. I say this because values need to be interleaved as they are read from multiple files and ordered. So you lose the ability to pass back a group of compressed values w/o ever decompressing them. Compressing each value separately may incur a lot of overhead for smaller values. For larger values it would be great.
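A small sketch of the lazy idea, assuming each value is compressed on its own with `zlib`. The `LazyValue` class is hypothetical and not part of Accumulo; the counter exists only to make the laziness visible.

```python
import zlib

class LazyValue:
    """Holds an individually compressed value; decompresses on first access."""

    def __init__(self, compressed: bytes):
        self.compressed = compressed
        self._raw = None
        self.decompress_count = 0  # for illustration only

    def get(self) -> bytes:
        if self._raw is None:
            self._raw = zlib.decompress(self.compressed)
            self.decompress_count += 1
        return self._raw

# A large, repetitive value compresses well and is never inflated
# unless something actually calls get().
big = LazyValue(zlib.compress(b"x" * 1_000_000))

# A tiny value gets bigger, because of the per-value header and checksum.
small_raw = b"ok"
small_compressed = zlib.compress(small_raw)

print(len(big.compressed), len(small_raw), len(small_compressed))
```

The small-value case is the overhead mentioned above: per-value compression pays a fixed cost that only large values amortize.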
Other than iterators, compressing values individually could all be done at the client side with a wrapper around the APIs for reading and writing. Iterators that operate on a table w/ compressed values could possibly extend an iterator that decompresses it when used.
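One way such a client-side wrapper might look, sketched against a hypothetical in-memory table rather than the real Accumulo writer and scanner classes. `InMemoryTable` and `CompressingTable` are made-up names for illustration.

```python
import zlib

class InMemoryTable:
    """Hypothetical stand-in for a table: keys mapped to stored bytes."""

    def __init__(self):
        self._cells = {}

    def put(self, key: str, value: bytes) -> None:
        self._cells[key] = value

    def scan(self):
        # Return entries in sorted key order, as a scan would.
        for key in sorted(self._cells):
            yield key, self._cells[key]

class CompressingTable:
    """Wraps a table so callers read and write uncompressed values."""

    def __init__(self, table):
        self._table = table

    def put(self, key: str, value: bytes) -> None:
        self._table.put(key, zlib.compress(value))

    def scan(self):
        for key, stored in self._table.scan():
            yield key, zlib.decompress(stored)

backing = InMemoryTable()
table = CompressingTable(backing)
table.put("row2", b"world" * 1000)
table.put("row1", b"hello" * 1000)
results = list(table.scan())
```

With this shape the application code never sees compressed bytes, while the backing table (and anything between it and the client) only ever holds the compressed form.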
-
Re: compressing values returned to scanner
Keith Turner 2012-10-02, 20:34
On Tue, Oct 2, 2012 at 2:48 PM, ameet kini <[EMAIL PROTECTED]> wrote:
> In re-reading your response, I may have overlooked one key point.
>
>>> columns and time stamps. After the relative encoding is done a block
>>> of key values is then compressed with gzip.
>
> Are the keys+values compressed together as one block? If thats the
> case, I can see why its not possible to only decompress keys and leave
> values compressed.
Yes, it currently compresses a sequence of key values into a single block.