Avro, mail # user - Generating snappy compressed avro files as hadoop map reduce input files


Re: Generating snappy compressed avro files as hadoop map reduce input files
Bertrand Dechoux 2013-10-14, 07:49
David wants to generate those files from outside Hadoop, so InputFormat and
OutputFormat may not be the most appropriate tools.

Avro files are meant to be easy to read and write without Hadoop
MapReduce. The stackoverflow link, in short, only describes a limitation
of most compression algorithms: a file compressed as a whole is not
splittable (by Hadoop or by anything else). That is the case for Snappy.

It is a known limitation, and it is the reason the avro file format
exists: the file as a whole is not compressed, but parts of the file
(blocks) are. That way there is no issue with "splittability".
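
To illustrate the principle only (this is not the Avro container format,
which also stores a schema header, record counts and sync markers), here is
a minimal Java sketch. It assumes java.util.zip's Deflater as a stand-in
for Snappy, since Snappy is not part of the JDK, and all class and method
names are made up for the example. Each block is compressed on its own, so
any block can be decompressed without reading the ones before it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class BlockCompressionSketch {

    // Compress one block on its own; no compressor state is shared between blocks.
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a single block without needing any other block.
    static byte[] decompress(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    throw new IllegalStateException("truncated block");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Group records into blocks of recordsPerBlock and compress each block separately.
    static List<byte[]> writeBlocks(List<String> records, int recordsPerBlock) {
        List<byte[]> blocks = new ArrayList<>();
        for (int i = 0; i < records.size(); i += recordsPerBlock) {
            int end = Math.min(i + recordsPerBlock, records.size());
            String joined = String.join("\n", records.subList(i, end));
            blocks.add(compress(joined.getBytes(StandardCharsets.UTF_8)));
        }
        return blocks;
    }

    // Read the records of one block; a "split" would start at a block boundary like this.
    static List<String> readBlock(byte[] block) {
        String text = new String(decompress(block), StandardCharsets.UTF_8);
        return Arrays.asList(text.split("\n"));
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("r0", "r1", "r2", "r3", "r4");
        List<byte[]> blocks = writeBlocks(records, 2);
        // Read the middle block directly, without touching block 0.
        System.out.println(readBlock(blocks.get(1))); // prints [r2, r3]
    }
}
```

A whole-file compressed log has no such boundaries, which is why a reader
cannot start in the middle of it.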

The stackoverflow answer is about a text file (logs) which has been snappy
compressed as a whole. That's not a good practice. Once again, there is a
specific file format for this, which is the avro file.

http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#setCodec%28org.apache.avro.file.CodecFactory%29
should be what you need. If you are not using a Java program to generate
your data, other languages are also supported.
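
As a rough sketch of what that could look like from a plain Java process
(no Hadoop involved): the schema, the field name and the class name below
are hypothetical, and the snappy codec assumes the Avro and snappy-java
jars are on the classpath.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class SnappyAvroWriterSketch {

    // Hypothetical one-field schema, only for illustration.
    static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"LogLine\","
        + "\"fields\":[{\"name\":\"message\",\"type\":\"string\"}]}";

    // Write the messages to an avro data file whose blocks are snappy compressed.
    static void write(File file, String... messages) {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            // The codec has to be set before create() opens the file.
            writer.setCodec(CodecFactory.snappyCodec());
            writer.create(schema, file);
            for (String message : messages) {
                GenericRecord record = new GenericData.Record(schema);
                record.put("message", message);
                writer.append(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read back the codec name stored in the file header metadata.
    static String codecOf(File file) {
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            return reader.getMetaString("avro.codec");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File file = new File("logs.avro"); // hypothetical output path
        write(file, "first line", "second line");
        System.out.println(codecOf(file));
    }
}
```

The codec name ends up in the file header, so a reader does not need to be
told which compression was used.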

Then, if and when you want to process those files with MapReduce, there
are AvroInputFormat and AvroOutputFormat.

Regards

Bertrand

PS : from the Cloudera blog post linked by stackoverflow

"One thing to note is that Snappy is intended to be used with a container
format, like Sequence Files or Avro Data Files, rather than being used
directly on plain text, for example, since the latter is not splittable and
can’t be processed in parallel using MapReduce. "

On Sun, Oct 13, 2013 at 11:31 PM, graham sanderson <[EMAIL PROTECTED]> wrote:

> I haven't actually tried writing, but look at AvroSequenceFileOutputFormat
> (and obviously have native snappy libraries on your box)
>
> Also, the javadoc is IMHO a bit ambiguous on AvroJob setup - you can
> totally use NullWritable (or any other hadoop supported Serializable) as a
> key.
>
> On Oct 13, 2013, at 2:23 PM, David Ginzburg <[EMAIL PROTECTED]>
> wrote:
>
> Thanks,
> I am not generating the avro files with hadoop MR, but a different process.
> I plan to just store the files on s3 for potential archive processing with
> EMR.
> Can I use AvroSequenceFile from a non M/R process to generate the sequence
> files having my avro records as the values, and null keys ?
> ------------------------------
> *From:* graham sanderson <[EMAIL PROTECTED]>
> *Sent:* Sunday, October 13, 2013 9:16 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* Re: Generating snappy compressed avro files as hadoop map
> reduce input files
>
> If you're using hadoop, why not use AvroSequenceFileOutputFormat - this
> works fine with snappy (block level compression may be best depending on
> your data)
>
> On Oct 13, 2013, at 10:58 AM, David Ginzburg <[EMAIL PROTECTED]>
> wrote:
>
> As mentioned in http://stackoverflow.com/a/15821136 Hadoop's snappy codec
> just doesn't work with externally generated files.
>
> Can files generated by DataFileWriter<http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#setCodec%28org.apache.avro.file.CodecFactory%29>
> serve as input files for a map reduce job, especially EMR jobs ?
> ------------------------------
> *From:* Bertrand Dechoux <[EMAIL PROTECTED]>
> *Sent:* Sunday, October 13, 2013 6:36 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* Re: Generating snappy compressed avro files as hadoop map
> reduce input files
>
> I am not sure I understand the relation between your problem and the way
> the temporary data are stored after the map phase.
>
> However, I guess you are looking for a DataFileWriter and its setCodec
> function.
>
> http://avro.apache.org/docs/current/api/java/org/apache/avro/file/DataFileWriter.html#setCodec%28org.apache.avro.file.CodecFactory%29
>
> Regards
>
> Bertrand
>
> PS : A snappy-compressed avro file is not a standard file which has been
> compressed afterwards but really a specific file containing compressed
> blocks. This principle is similar to the SequenceFile's. Maybe that's what
> you mean by different snappy codec?