More specifically, I mean seeking to a known location in the uncompressed data: not just seeking to “the nearest record boundary”, but seeking to “position 100000000 in the uncompressed data”. I can see that if the writer kept track of this information on the side, it would be available; my question is more about whether the standard formats (e.g. LZO compression in SequenceFile) support this without additional work.
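To make the "on the side" idea concrete, here is a rough sketch of what I have in mind. It is not anything SequenceFile or LZO does for you: zlib stands in for the codec, and the names write_blocks, read_at and BLOCK_SIZE are made up for illustration. The writer compresses fixed-size blocks independently and records an (uncompressed offset, compressed offset) pair per block; the reader uses that index to jump to an absolute uncompressed position.

```python
import bisect
import zlib

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block (arbitrary, hypothetical)


def write_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress data in independent blocks; return (compressed_bytes, index).

    index is a list of (uncompressed_offset, compressed_offset) pairs,
    one per block: the side information the writer has to keep.
    """
    out = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        index.append((start, len(out)))
        out += zlib.compress(data[start:start + block_size])
    return bytes(out), index


def read_at(compressed: bytes, index, position: int, length: int) -> bytes:
    """Return up to `length` uncompressed bytes starting at absolute `position`."""
    if not index or position < 0:
        return b""
    # Find the last block whose uncompressed start is <= position.
    starts = [u for u, _ in index]
    i = bisect.bisect_right(starts, position) - 1
    skip = position - index[i][0]  # bytes to discard inside the first block
    result = bytearray()
    while i < len(index) and len(result) < length:
        c_start = index[i][1]
        c_end = index[i + 1][1] if i + 1 < len(index) else len(compressed)
        block = zlib.decompress(compressed[c_start:c_end])
        result += block[skip:skip + length - len(result)]
        skip = 0  # later blocks are read from their beginning
        i += 1
    return bytes(result)
```

Only the blocks overlapping the requested range get decompressed, so the cost of a seek is bounded by the block size rather than by the position. The price is that the index has to be stored and kept in step with the data, which is exactly the "additional work" I would like to avoid if a standard format already covers it.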
From: Rahul Bhattacharjee [mailto:[EMAIL PROTECTED]]
Sent: Friday, May 24, 2013 1:00 AM
To: [EMAIL PROTECTED]
Subject: Re: splittable vs seekable compressed formats
Yeah, I think John meant seeking to record boundaries.
On Fri, May 24, 2013 at 12:22 PM, Harsh J <[EMAIL PROTECTED]> wrote:
SequenceFiles should be seekable, I think, provided you know or manage
their sync points during writes. With LZO this may be non-trivial.
On Thu, May 23, 2013 at 11:01 PM, John Lilley <[EMAIL PROTECTED]> wrote:
> I’ve read about splittable compressed formats in Hadoop. Are any of these
> formats also “seekable” (in other words, able to seek to an absolute
> location in the uncompressed data)?