Avro >> mail # user >> Random access in an avro file


Re: Random access in an avro file
I guess I will answer this question myself. It seems the file expects
records to be appended in sorted order rather than doing the sorting
internally [1]. I don't think it should hurt us, but honestly it was a little
surprising. It feels like this should be javadoc'ed somewhere: it is
the responsibility of consumers to sort the records themselves by the
given key before appending to the file. Otherwise, a very useful addition
to the avro library! :)

[1]
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.avro/avro-mapred/1.7.2/org/apache/avro/hadoop/file/SortedKeyValueFile.java#536
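
To make the consequence for callers concrete, here is a minimal sketch using only the Java standard library. It does not use the Avro API: SortedAppendSketch, kv and writeSorted are hypothetical names, and the class is only a stand-in for a writer that rejects out-of-order keys instead of sorting them. The assumption is that the caller can hold the records in memory long enough to sort them by key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a writer that expects keys in ascending order
// and refuses anything else. This is NOT the Avro class itself.
class SortedAppendSketch {
    private final List<Map.Entry<String, String>> written = new ArrayList<>();
    private String previousKey = null;

    // Small helper to build a key/value pair.
    static Map.Entry<String, String> kv(String key, String value) {
        return new SimpleEntry<>(key, value);
    }

    // Appends one record; fails if the key sorts before the previous key.
    void append(String key, String value) {
        if (previousKey != null && key.compareTo(previousKey) < 0) {
            throw new IllegalArgumentException(
                "Key '" + key + "' sorts before previous key '" + previousKey + "'");
        }
        written.add(kv(key, value));
        previousKey = key;
    }

    List<Map.Entry<String, String>> written() {
        return written;
    }

    // Caller-side responsibility: sort a copy by key first, then append in order.
    static SortedAppendSketch writeSorted(List<Map.Entry<String, String>> records) {
        List<Map.Entry<String, String>> sorted = new ArrayList<>(records);
        sorted.sort(Map.Entry.comparingByKey());
        SortedAppendSketch writer = new SortedAppendSketch();
        for (Map.Entry<String, String> record : sorted) {
            writer.append(record.getKey(), record.getValue());
        }
        return writer;
    }
}
```

Calling writeSorted with records keyed "b", "a", "c" appends them as "a", "b", "c"; a later append("a", ...) on the same writer is rejected.
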
On Mon, Jul 1, 2013 at 1:50 PM, [EMAIL PROTECTED] <
[EMAIL PROTECTED]> wrote:

> Thanks again Doug. SortedKeyValueFile looks really promising and seems to
> fit our use case well.
>
> One last thing I was concerned about was the performance
> of maintaining the sorted order in the file, especially because in our case
> the file might get pretty large (hundreds of thousands to a million). If there
> is a limit on the file size to achieve maximum performance, we can possibly
> think about closing the file and starting to write to another file once we
> start to hit that limit.
>
>
> On Mon, Jul 1, 2013 at 12:51 PM, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
>> On Mon, Jul 1, 2013 at 10:26 AM, [EMAIL PROTECTED]
>> <[EMAIL PROTECTED]> wrote:
>> > Out of curiosity, is maintaining sync markers while writing the file and
>> > then passing these markers to the readers while reading not a good way to
>> > achieve random access in avro?
>>
>> Yes, seeking to the position of a sync marker is possible.  This is
>> what SortedKeyValueFile does.  You need to store the list of positions
>> of sync markers, and if seek is to a column value rather than a row
>> number, then you need to store these values (keys) with the positions.
>>  Those are what's in SortedKeyValueFile's "index" file.
>>
>> Doug
>>
>
>
>
> --
> Swarnim
>

--
Swarnim
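
Doug's description of the index quoted above (keys stored alongside sync-marker positions) can be sketched with a sorted map. This is an illustration using only the Java standard library, not the actual layout of SortedKeyValueFile's index file; SyncIndexSketch, recordSync and seekPositionFor are hypothetical names. It assumes each index entry holds the first key written after a sync marker and that marker's byte position. With the real Avro classes, the position would typically come from DataFileWriter.sync() and be passed to DataFileReader.seek(long).

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.TreeMap;

// Hypothetical in-memory index: first key after a sync marker -> marker position.
class SyncIndexSketch {
    private final TreeMap<String, Long> index = new TreeMap<>();

    // Remember that the block starting at syncPosition begins with firstKeyInBlock.
    void recordSync(String firstKeyInBlock, long syncPosition) {
        index.put(firstKeyInBlock, syncPosition);
    }

    // Returns the position to seek to before scanning forward for the key:
    // the block whose first key is the greatest one <= the requested key.
    // Empty when the key sorts before every indexed key.
    OptionalLong seekPositionFor(String key) {
        Map.Entry<String, Long> floor = index.floorEntry(key);
        return floor == null
            ? OptionalLong.empty()
            : OptionalLong.of(floor.getValue());
    }
}
```

Because the records are in key order, a reader only needs to seek to the returned position and scan forward until it finds the key or passes it.
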