Hi all,

I'm looking for suggestions on how to optimize a number of Hadoop jobs (written using Cascading) that only need a fraction of the records stored in Avro files.

I have a small number (let's say 10K) of essentially random keys out of a total of 100M unique values, and I need to select and process all and only those records in my Avro files where the key field matches one of them. The set of keys of interest changes with each run.

I have about 1TB of compressed data to scan through, stored as about 200 files of roughly 5GB each. This represents about 10B records.

The data format has to stay as Avro, for interchange with various groups.

As I'm building the Avro files, I could sort by the key field.

I'm wondering if it's feasible to build a skip table that would let me seek to a sync position in the Avro file and start reading from there. If the default sync interval is 16K, then I'd have roughly 65M sync points (1TB / 16K) that I could use. Even if every key of interest had 100 records that were each in a separate block, that would be about 1M blocks out of 65M, so this would still dramatically cut down on the amount of data I'd have to scan.
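
To make the idea concrete, here is a minimal sketch of the lookup side of such a skip table, assuming the files are sorted by key and that the first key and byte offset of each block were recorded while writing. The SkipTable class and its method names are hypothetical, keys are assumed to be strings compared with String.compareTo, and nothing here touches Avro itself. In a real job the offsets would presumably come from something like DataFileWriter.sync() and be handed to DataFileReader.seek(long), but that should be checked against the Avro version in use.

```java
import java.util.Collection;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical skip table: first key of each block -> byte offset of that block. */
class SkipTable {
    private final TreeMap<String, Long> firstKeyToOffset = new TreeMap<>();

    /** Call once per block, in file order, while writing the key-sorted file. */
    void addBlock(String firstKey, long offset) {
        // If several consecutive blocks start with the same key, keep the earliest one.
        firstKeyToOffset.putIfAbsent(firstKey, offset);
    }

    /**
     * Returns the de-duplicated offsets at which a reader should start scanning.
     * From each offset the reader scans forward until it sees a key greater
     * than the wanted key.
     */
    SortedSet<Long> startOffsets(Collection<String> wantedKeys) {
        SortedSet<Long> offsets = new TreeSet<>();
        for (String key : wantedKeys) {
            // Records for this key may begin in the block whose first key is
            // strictly smaller, so start there when such a block exists.
            Map.Entry<String, Long> entry = firstKeyToOffset.lowerEntry(key);
            if (entry == null) {
                // No earlier block: the key can only be at the very start of the file.
                entry = firstKeyToOffset.floorEntry(key);
            }
            if (entry != null) {
                offsets.add(entry.getValue());
            }
            // entry == null means the key sorts before everything in the file.
        }
        return offsets;
    }
}
```

This sketch trades a little extra reading for simplicity: a wanted key that falls between two block boundaries may cause one block to be scanned that contains no matches. With about 65M blocks the table itself would also be sizable, so it would likely need to be sharded per file or sampled more coarsely.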

But is that possible? Any input would be appreciated.

Ken Krugler
+1 530-210-6378
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr