Re: Seeks with DataFileReader in C++
Thiruvalluvan MG 2013-01-30, 13:38
I think your approach will work. Currently, for the DataFileReader to work, the input just needs to be streamable. sizeBytes() will add the additional constraint that we be able to compute the size a priori. I think that is okay.
Please go ahead.
PS. Instead of discussing this over e-mail, it is better to do it in a JIRA ticket. People will have ready access to the discussion in the future. Please open a ticket as soon as you can. Thank you.
From: Daniel Russel <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]; Thiruvalluvan MG <[EMAIL PROTECTED]>
Sent: Tuesday, 29 January 2013 1:01 AM
Subject: Re: Seeks with DataFileReader in C++
On Jan 24, 2013, at 8:46 AM, Thiruvalluvan MG <[EMAIL PROTECTED]> wrote:
> I think it is a good use case. One way to achieve what you want is to:
> 1. Expose the existing members objectCount_ and byteCount_ of DataFileReaderBase as size_t objectsRemainingInBlock() and size_t bytesRemainingInBlock() in DataFileReader class.
> 2. Add a new method in DataFileReader class void skip(size_t n), which skips n objects.
> 3. If you prefer you can add skipBlock() which is a shorthand for skip(objectsRemainingInBlock()).
> Does it work for you?
Quite possibly. My main concern with the above API is that (from what I understand) it still forces the Reader to go through and inspect each block sequentially. I had been thinking of an API more like:
- void seekBytes(size_t offset); // seek to the start of the first block that does not start before offset by seeking there and then scanning for a sync mark
- size_t offsetBytes() const; // get the current offset in the file
- size_t sizeBytes() const; // get the size of the file
That would provide (I think):
- constant-time access to objects deep in the file
- the ability to construct indexes for the data file by, for example, seeking to each i/1000 of the file and saving the resulting offset (and an identifier extracted from the object)
The cost would be lower precision: finding the nth record requires that you be able to identify it from its content alone and, possibly, do a search, since determining its index in the file would still require a linear scan. It also requires that Avro be able to find a sync mark starting from an arbitrary point in the file, but, based on my understanding, that is a valid assumption (please correct me if I'm wrong).