-
Seeks with DataFileReader in C++
Daniel Russel 2013-01-23, 05:50
From what I can tell, there is no way to do any sort of random access with the C++ DataFileReader API. Is this correct? Is someone working on that? If not, and people think this would be a generally interesting capability, I'd consider implementing it as I'd kind of like to have it. Thanks. --Daniel
-
Re: Seeks with DataFileReader in C++
Thiruvalluvan MG 2013-01-23, 12:56
Hi Daniel,
It would be nice if you could describe your use case. Yes, we'd be interested in seeing your implementation. Since this would be an added feature, it harms no one who doesn't use it. Please go ahead and create a ticket and submit a patch.
Thanks
Thiru
-
Re: Seeks with DataFileReader in C++
Daniel Russel 2013-01-23, 17:03
In our case, we have files created from large numbers of frames stored sequentially as records in a data file. Currently, finding the i-th frame requires going to the beginning and reading all records until the appropriate one is found. Doing binary search or some sort of index-based search would significantly decrease load times for many operations. It would also make implementing map-reduce sorts of operations on the data files easier, since currently there is no reliable way to shard the files.
I'll work on the patch, nothing written yet :-) --Daniel
-
Re: Seeks with DataFileReader in C++
Thiruvalluvan MG 2013-01-24, 16:46
Daniel,
I think it is a good use case. One way to achieve what you want is to:
1. Expose the existing members objectCount_ and byteCount_ of DataFileReaderBase as size_t objectsRemainingInBlock() and size_t bytesRemainingInBlock() in the DataFileReader class.
2. Add a new method in the DataFileReader class, void skip(size_t n), which skips n objects.
3. If you prefer, you can add skipBlock(), which is shorthand for skip(objectsRemainingInBlock()).
Does it work for you?
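To make the proposed semantics concrete, here is a minimal sketch of skip(n) and skipBlock() on a toy reader. It does not use the Avro library: ToyBlockReader, blockIndex() and the vector of per-block object counts are hypothetical stand-ins, and skip() returns the number of objects actually skipped, which is an assumption beyond the void signature suggested above.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a block-structured reader. Each entry of
// blockSizes is the number of objects in one block; nothing is decoded.
class ToyBlockReader {
public:
    explicit ToyBlockReader(std::vector<std::size_t> blockSizes)
        : blocks_(std::move(blockSizes)),
          block_(0),
          objectCount_(blocks_.empty() ? 0 : blocks_[0]) {}

    // Objects left in the current block (mirrors the proposed accessor).
    std::size_t objectsRemainingInBlock() const { return objectCount_; }

    // Index of the block the reader is currently positioned in.
    std::size_t blockIndex() const { return block_; }

    // Skip up to n objects, crossing block boundaries as needed.
    // Returns how many objects were skipped (fewer than n at end of data).
    std::size_t skip(std::size_t n) {
        std::size_t skipped = 0;
        while (n > 0 && advanceToNonEmptyBlock()) {
            const std::size_t step = n < objectCount_ ? n : objectCount_;
            objectCount_ -= step;
            n -= step;
            skipped += step;
        }
        return skipped;
    }

    // Shorthand for skipping the rest of the current block.
    std::size_t skipBlock() { return skip(objectsRemainingInBlock()); }

private:
    // Move to the next block that still has objects; false at end of data.
    bool advanceToNonEmptyBlock() {
        while (objectCount_ == 0) {
            if (block_ + 1 >= blocks_.size()) return false;
            ++block_;
            objectCount_ = blocks_[block_];
        }
        return true;
    }

    std::vector<std::size_t> blocks_;
    std::size_t block_;
    std::size_t objectCount_;
};
```

Note that the loop still visits every block between the current position and the target, one after another; it only avoids handling the objects inside them.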
Thanks
Thiru
-
Re: Seeks with DataFileReader in C++
Daniel Russel 2013-01-28, 19:31
On Jan 24, 2013, at 8:46 AM, Thiruvalluvan MG <[EMAIL PROTECTED]> wrote:
> Does it work for you?
Quite possibly. My main concern with the above API is that, from what I understand, it still forces the reader to go through and inspect each block sequentially. I had been thinking of an API more like:
- void seekBytes(size_t offset); // seek to the start of the first block that does not start before offset, by seeking there and then scanning for a sync mark
- size_t offsetBytes() const; // get the current offset in the file
- size_t sizeBytes() const; // get the size of the file
That would, I think:
- provide constant-time access to objects deep in the file
- allow the construction of indexes for the data file by, for example, seeking to each i/1000 of the file and saving the resulting offset (and an identifier extracted from the object)
The cost would be lower precision: finding the nth record requires that you be able to identify it and, possibly, do a search, and that you be able to identify objects based solely on the context (as determining an object's index in the file would still require a linear scan). It also requires that Avro be able to find a sync mark starting from an arbitrary point in the file but, based on my understanding, that is a valid assumption (please correct me if I'm wrong).
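The seek-then-scan idea and the sampled index can be sketched on an in-memory byte buffer. This is only an illustration under stated assumptions, not Avro code: seekToNextBlock and sampleBlockStarts are hypothetical names, the file is modelled as a std::vector of bytes, and the one format fact relied on is that every block (and the file header) is followed by the file's 16-byte sync marker.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using SyncMarker = std::array<std::uint8_t, 16>;

// Hypothetical seekBytes(): returns the start of the first block that does
// not start before `offset`, i.e. the byte just past the first sync marker
// that ends at or after `offset`. Returns data.size() if there is none.
std::size_t seekToNextBlock(const std::vector<std::uint8_t>& data,
                            const SyncMarker& sync, std::size_t offset) {
    if (offset >= data.size()) return data.size();
    // A block starting exactly at `offset` is preceded by a marker that
    // begins sync.size() bytes earlier, so back up before scanning.
    const std::size_t from = offset >= sync.size() ? offset - sync.size() : 0;
    const auto first = data.begin() + static_cast<std::ptrdiff_t>(from);
    const auto hit = std::search(first, data.end(), sync.begin(), sync.end());
    if (hit == data.end()) return data.size();
    return static_cast<std::size_t>(hit - data.begin()) + sync.size();
}

// Hypothetical index builder: probe the file at i/samples of its length and
// record the distinct block starts found. Values are in increasing order.
std::vector<std::size_t> sampleBlockStarts(const std::vector<std::uint8_t>& data,
                                           const SyncMarker& sync,
                                           std::size_t samples) {
    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i < samples; ++i) {
        const std::size_t probe = data.size() / samples * i;
        const std::size_t pos = seekToNextBlock(data, sync, probe);
        // pos == data.size() means no block begins at or after the probe.
        if (pos < data.size() && (starts.empty() || starts.back() != pos)) {
            starts.push_back(pos);
        }
    }
    return starts;
}
```

Each probe costs a scan of at most about one block's worth of bytes, independent of how deep into the file it lands. The scan trusts that the 16 marker bytes do not also occur inside block data; for a randomly generated marker that is extremely unlikely, but it is an assumption of this sketch.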
--Daniel
-
Re: Seeks with DataFileReader in C++
Thiruvalluvan MG 2013-01-30, 13:38
Daniel,
I think your approach will work. Currently, for the DataFileReader to work, the input just needs to be a stream. sizeBytes() will add an additional constraint: that we are able to compute the size a priori. I think that is okay.
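The extra constraint can be illustrated with a standard C++ stream. This is a sketch only: Avro C++ uses its own input stream abstraction rather than std::istream, and the free function sizeBytes below is a hypothetical helper. It shows that the size is available only when the underlying input can seek.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <sstream>

// Hypothetical helper: total size of a seekable stream in bytes, or -1 if
// the stream cannot report a position (for example a pipe). The read
// position is restored before returning.
std::streamoff sizeBytes(std::istream& in) {
    const std::istream::pos_type here = in.tellg();
    if (here == std::istream::pos_type(-1)) return -1;  // not seekable
    in.seekg(0, std::ios::end);
    const std::istream::pos_type end = in.tellg();
    in.seekg(here);  // put the reader back where it was
    if (!in || end == std::istream::pos_type(-1)) return -1;
    return std::streamoff(end);
}
```

For a std::istringstream holding "hello world", sizeBytes returns 11 and leaves the read position unchanged.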
Please go ahead.
Thanks
Thiru
PS. Instead of discussing this over e-mail, it is better to do it in a JIRA ticket. People will have ready access to the discussion in the future. Please open a ticket as soon as you can. Thank you.