Avro >> mail # user >> Generating snappy compressed avro files as hadoop map reduce input files


Re: Generating snappy compressed avro files as hadoop map reduce input files
If you're using Hadoop, why not use AvroSequenceFileOutputFormat? It works fine with Snappy (block-level compression may be best, depending on your data).
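A minimal sketch of the DataFileWriter/setCodec approach suggested below in this thread: the codec is set before create(), and the result is a standard Avro container file with Snappy-compressed blocks, readable by any Avro input format (MR/EMR included). The schema and file name here are made up for illustration, and snappy-java must be on the classpath alongside Avro.

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class SnappyAvroWriter {
    // Hypothetical record schema, just for this sketch.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Event\","
      + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}";

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        File out = new File("events.avro"); // hypothetical output path

        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.setCodec(CodecFactory.snappyCodec()); // must be called before create()
        writer.create(schema, out);
        GenericRecord rec = new GenericData.Record(schema);
        rec.put("id", 42L);
        writer.append(rec);
        writer.close();

        // Round-trip check: the file is a normal Avro container, block-compressed
        // with Snappy, so a plain DataFileReader can read it back.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(out, new GenericDatumReader<GenericRecord>(schema))) {
            GenericRecord back = reader.next();
            System.out.println(back.get("id"));
        }
    }
}
```

Because the compression happens per Avro block inside the container (not on the file as a whole), this sidesteps the externally-compressed-file problem mentioned in the Stack Overflow link below.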

On Oct 13, 2013, at 10:58 AM, David Ginzburg <[EMAIL PROTECTED]> wrote:

> As mentioned in http://stackoverflow.com/a/15821136 Hadoop's snappy codec just doesn't work with externally generated files.
>
> Can files generated by DataFileWriter serve as input files for a MapReduce job, especially EMR jobs?
> From: Bertrand Dechoux <[EMAIL PROTECTED]>
> Sent: Sunday, October 13, 2013 6:36 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Generating snappy compressed avro files as hadoop map reduce input files
>  
> I am not sure I understand the relation between your problem and the way the temporary data is stored after the map phase.
>
> However, I guess you are looking for a DataFileWriter and its setCodec function.
> http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#setCodec%28org.apache.avro.file.CodecFactory%29
>
> Regards
>
> Bertrand
>
> PS : A snappy-compressed Avro file is not a regular file that has been compressed afterwards, but a specific file format containing compressed blocks. The principle is similar to the SequenceFile's. Maybe that's what you mean by a different snappy codec?
>
> On Sun, Oct 13, 2013 at 5:16 PM, David Ginzburg <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am writing an application that produces Avro record files, to be stored on AWS S3 as possible input to EMR.
> I would like to compress them with the snappy codec before storing them on S3.
> It is my understanding that Hadoop currently uses a different snappy codec, mostly used as an intermediate map output format.
> My question is: how can I generate snappy-compressed Avro files within my application logic (not MR)?
