Split control in Lzo index
Shi Yu 2011-06-23, 20:59
Hi,
My specific question is: is it possible to control how Lzo files are split by customizing the Lzo index files?
The background of the problem is:
I have a file which has the following format
key1 value1
key1 value2
key2 value3
key2 value4
...
Its size in plain text before compression is 11 M. After Lzo compression, the size is 681 K. I tried this on two formats: Text format and Sequence format with block compression. They are almost the same.
However, when I join the same keys together and reformat the file as
key1 value1 value2
key2 value3 value4
...
The size before compression is of course more or less the same, 11 M. But after Lzo compression, the size is 4.8 M. My guess is that the Lzo compression algorithm can compress the many similar values in the first format, whereas in the second format the concatenations of multiple values are less likely to be identical, so the compression ratio drops.
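One way to test that guess is to build both layouts from the same records and compare compressed sizes. The following is only a rough sketch: it uses synthetic data and Python's standard-library `zlib` and `bz2` as stand-ins (LZO bindings are not in the standard library), so it will not reproduce the 681 K or 4.8 M figures. The helper names `build_layouts` and `compressed_sizes` are made up for illustration.

```python
import bz2
import zlib


def build_layouts(num_keys, values_per_key):
    """Return the same synthetic records in both layouts, as bytes."""
    one_per_line = []
    joined = []
    for k in range(num_keys):
        key = f"key{k}"
        values = [f"value{k}_{v}" for v in range(values_per_key)]
        # First layout: one "key value" pair per line.
        one_per_line.extend(f"{key} {value}" for value in values)
        # Second layout: the key once, followed by all of its values.
        joined.append(" ".join([key] + values))
    return "\n".join(one_per_line).encode(), "\n".join(joined).encode()


def compressed_sizes(data):
    """Report raw and compressed byte counts for one layout."""
    return {
        "raw": len(data),
        "zlib": len(zlib.compress(data)),
        "bz2": len(bz2.compress(data)),
    }


first_layout, second_layout = build_layouts(num_keys=200, values_per_key=20)
for name, data in (("one pair per line", first_layout), ("joined", second_layout)):
    print(name, compressed_sizes(data))
```

Running the same comparison on a sample of the real file, with the real codec, would show whether the layout alone explains the difference.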
So, again, my question is: if I want to keep the file in the first format, I need to prevent the mapper from splitting the file within a key. For example, all "key1" records should go to the same mapper. Is that doable on an Lzo file? Since the split behavior of Lzo files relies on the index files, is there any way to control the splits by customizing the Lzo index files?
BTW, when using the second format, I found that bzip2 has a better compression ratio than Lzo (2.1 M). Did I make a mistake when using Lzo compression?
Thanks!
Best Regards,
Shi
Re: Split control in Lzo index
Dmitriy Ryaboy 2011-06-23, 21:35
Shi, bzip compresses much better than lzo. It is also significantly more expensive (we are talking orders of magnitude) than LZO, both on compression and decompression.
As for your question regarding custom splits -- LzoIndex does not support this kind of logic, as it's written to be generic and doesn't know how to read individual records, but you can certainly customize it to fit your use case.
D
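A minimal sketch of the kind of customization meant here, written in Python for brevity rather than as a patch to the actual indexer. It assumes the file is sorted by key and that, while indexing, you decompress each block far enough to learn its first and last key. The `(offset, first_key, last_key)` tuples and the function `key_aligned_offsets` are hypothetical and not part of any Lzo library. The idea is to write to the index only those block offsets where a new key begins, so that no split can start in the middle of a key's run.

```python
from typing import List, Optional, Tuple

# Hypothetical per-block summary: (compressed_offset, first_key, last_key).
Block = Tuple[int, str, str]


def key_aligned_offsets(blocks: List[Block]) -> List[int]:
    """Keep only offsets of blocks that do not continue the previous block's key."""
    offsets: List[int] = []
    prev_last_key: Optional[str] = None
    for offset, first_key, last_key in blocks:
        # A block is a safe split point if it opens with a different key
        # than the one the previous block ended on.
        if prev_last_key is None or first_key != prev_last_key:
            offsets.append(offset)
        prev_last_key = last_key
    return offsets


blocks = [
    (0, "key1", "key1"),
    (4096, "key1", "key2"),   # continues key1, so not a safe split point
    (8192, "key3", "key4"),
    (12288, "key4", "key5"),  # continues key4, so not a safe split point
]
print(key_aligned_offsets(blocks))  # [0, 8192]
```

The trade-off is fewer candidate split points: a key whose records span many blocks forces all of those blocks into one split.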
Re: Split control in Lzo index
Shi Yu 2011-06-23, 21:52
Thanks Dmitriy!
Not sure how much work it will be. I guess I should customize the InputFormat class in this case, right?
Shi
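On the InputFormat side, the split logic can be sketched as follows. Since the split behavior relies on the index, one approach is to take the default byte-range splits and move each boundary forward to the next offset listed in the index. If the index only contains offsets where a new key starts, every resulting split then begins on a key boundary. This is an illustrative Python sketch under those assumptions, not the real Hadoop API; `snap_to_index` and `align_splits` are hypothetical names, and `index_offsets` must be sorted in ascending order.

```python
import bisect


def snap_to_index(position, index_offsets, file_length):
    """Return the first indexed offset at or after position, else file_length."""
    i = bisect.bisect_left(index_offsets, position)
    return index_offsets[i] if i < len(index_offsets) else file_length


def align_splits(raw_splits, index_offsets, file_length):
    """Move the start and end of each (start, length) split to indexed offsets."""
    aligned = []
    for start, length in raw_splits:
        new_start = snap_to_index(start, index_offsets, file_length)
        new_end = snap_to_index(start + length, index_offsets, file_length)
        # Splits that collapse to nothing after snapping are dropped.
        if new_start < new_end:
            aligned.append((new_start, new_end - new_start))
    return aligned


index_offsets = [0, 8192]  # offsets where a new key begins (assumed)
raw_splits = [(0, 5000), (5000, 5000), (10000, 6384)]
print(align_splits(raw_splits, index_offsets, file_length=16384))
# [(0, 8192), (8192, 8192)]
```

In a real job this logic would live in a Java InputFormat subclass; the sketch only shows the boundary arithmetic.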
Re: Split control in Lzo index
Bharath Mundlapudi 2011-06-24, 07:00
>> BTW, when using the second format, I found that bzip2 has better compression rate than Lzo (2.1 M). Did I made any mistake when using Lzo compression?
It depends on your requirements, for example whether you prefer a high compression ratio over performance. bzip2 is orders of magnitude slower than Lzo.
-Bharath