Re: Seeks with DataFileReader in C++

On Jan 24, 2013, at 8:46 AM, Thiruvalluvan MG <[EMAIL PROTECTED]> wrote:

> Daniel,
>
> I think it is a good use case. One way to achieve what you want is to:
>
> 1. Expose the existing members objectCount_ and byteCount_ of DataFileReaderBase as size_t objectsRemainingInBlock() and size_t bytesRemainingInBlock() in the DataFileReader class.
> 2. Add a new method void skip(size_t n) to the DataFileReader class, which skips n objects.
> 3. If you prefer, you can add skipBlock(), which is a shorthand for skip(objectsRemainingInBlock()).
>
> Does it work for you?
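
For concreteness, a minimal sketch of how those proposed methods might be used to skip ahead. None of them exist in the current C++ DataFileReader, and skipObjects() is just an illustrative helper, not part of Avro:

    // Sketch only: objectsRemainingInBlock(), skip() and skipBlock() are the
    // proposed additions from above, not part of the current C++ API.
    #include <avro/DataFile.hh>
    #include <cstddef>

    template <typename T>
    void skipObjects(avro::DataFileReader<T>& reader, size_t n) {
        // Jump over whole blocks while at least a block's worth remains to skip.
        while (n > 0) {
            size_t inBlock = reader.objectsRemainingInBlock();
            if (inBlock == 0 || n < inBlock)
                break;
            n -= inBlock;
            reader.skipBlock();      // proposed shorthand for skip(objectsRemainingInBlock())
        }
        if (n > 0)
            reader.skip(n);          // skip the remaining objects inside the current block
    }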
Quite possibly. My main concern with the above API is that (from what I understand) it still forces the Reader to go through and inspect each block sequentially. I had been thinking of an API more like:
- void seekBytes(size_t offset); // seek to the start of the first block that begins at or after offset, by seeking to offset and then scanning forward for a sync mark
- size_t offsetBytes() const; // get the current offset in the file
- size_t sizeBytes() const; // get the size of the file

That would (I think) provide:
- constant-time access to objects deep in the file
- a way to construct indexes for the data file by, for example, seeking to each i/1000 of the file and saving the resulting offset (and an identifier extracted from the object), as sketched below
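
Roughly, assuming the proposed seekBytes()/offsetBytes()/sizeBytes() existed on DataFileReader (they do not today), the index construction could look like this; keyOf is a caller-supplied callable that pulls an identifier out of a decoded record:

    // Sketch only: seekBytes(), offsetBytes() and sizeBytes() are the proposed
    // methods, not part of the current C++ API.
    #include <avro/DataFile.hh>
    #include <cstddef>
    #include <map>
    #include <string>

    template <typename T, typename KeyFn>
    std::map<std::string, size_t> buildSparseIndex(avro::DataFileReader<T>& reader,
                                                   KeyFn keyOf,       // record -> identifier
                                                   size_t samples = 1000) {
        std::map<std::string, size_t> index;          // identifier -> block byte offset
        const size_t size = reader.sizeBytes();
        for (size_t i = 0; i < samples; ++i) {
            reader.seekBytes(i * size / samples);     // land on the block at/after this offset
            const size_t blockStart = reader.offsetBytes();
            T record;
            if (!reader.read(record))                 // decode the first object of that block
                break;
            index[keyOf(record)] = blockStart;
        }
        return index;
    }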
 
The cost would be lower precision: finding the nth record requires that you be able to identify it and, possibly, do a search, and you must be able to identify objects based solely on their content (since determining an object's index in the file would still require a linear scan). It also requires that Avro be able to find a sync mark starting from an arbitrary point in the file, but, based on my understanding, that is a valid assumption (please correct me if I'm wrong).
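
For what it's worth, the scan itself seems straightforward in principle: the file header records a 16-byte sync marker that is written after every data block, so finding the next block from an arbitrary offset is a forward search for that marker. A stream-level sketch, not tied to the real reader API and with buffering simplified for clarity:

    // `sync` is the 16-byte marker read from the file header.
    #include <algorithm>
    #include <array>
    #include <istream>
    #include <iterator>
    #include <vector>

    // Returns the offset of the first block boundary at or after `offset`,
    // or -1 if no sync marker is found before the end of the stream.
    long long nextBlockStart(std::istream& in, long long offset,
                             const std::array<char, 16>& sync) {
        in.clear();
        in.seekg(offset);
        std::vector<char> tail((std::istreambuf_iterator<char>(in)),
                               std::istreambuf_iterator<char>());
        auto hit = std::search(tail.begin(), tail.end(), sync.begin(), sync.end());
        if (hit == tail.end())
            return -1;
        return offset + (hit - tail.begin()) + 16;    // next block starts right after the marker
    }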

         --Daniel