Avro, mail # user - Can spill to disk be in compressed Avro format to reduce I/O?

Re: Can spill to disk be in compressed Avro format to reduce I/O?
Scott Carey 2012-01-12, 20:53

On 1/12/12 12:35 PM, "Frank Grimes" <[EMAIL PROTECTED]> wrote:

> So I decided to try writing my own AvroStreamCombiner utility and it seems to
> choke when passing multiple input files:
>> hadoop dfs -cat hdfs://hadoop/machine1.log.avro
>> hdfs://hadoop/machine2.log.avro | ./deliveryLogAvroStreamCombiner.sh >
>> combined.log.avro
>> Exception in thread "main" java.io.IOException: Invalid sync!
>> at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:293)
>> at org.apache.avro.file.DataFileWriter.appendAllFrom(DataFileWriter.java:329)
>> at DeliveryLogAvroStreamCombiner.main(Unknown Source)
> Here's the code in question:
> public class DeliveryLogAvroStreamCombiner {
>     /**
>      * @param args
>      */
>     public static void main(String[] args) throws Exception {
>         DataFileStream<DeliveryLogEvent> dfs = null;
>         DataFileWriter<DeliveryLogEvent> dfw = null;
>         try {
>             dfs = new DataFileStream<DeliveryLogEvent>(System.in,
>                     new SpecificDatumReader<DeliveryLogEvent>());
>             OutputStream stdout = System.out;
>             dfw = new DataFileWriter<DeliveryLogEvent>(
>                     new SpecificDatumWriter<DeliveryLogEvent>());
>             dfw.setCodec(CodecFactory.deflateCodec(9));
>             dfw.setSyncInterval(1024 * 256);
>             dfw.create(DeliveryLogEvent.SCHEMA$, stdout);
>             dfw.appendAllFrom(dfs, false);

dfs reads from System.in, which here carries multiple files concatenated one
after the other.  Each file needs its own DataFileStream, because each has
its own header and metadata (including its own sync marker).  A single
DataFileStream reads only the first header, so it fails with "Invalid sync!"
once it reaches the second file.

In Java you could get the list of files and, for each file, use HDFS's API to
open the file stream and append it to your one output file.
In bash you could loop over all the source files and append one at a time.
However, to append to the end of a pre-existing file, the only API right now
takes a File, not a seekable stream, so Avro would need a patch to allow that
in HDFS (also, only an HDFS version that supports appends would work).
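
A minimal sketch of the per-file approach, under some assumptions: it uses
GenericRecord rather than the generated DeliveryLogEvent class, it assumes all
inputs share exactly the same schema (appendAllFrom rejects a mismatch), and
the class name AvroCombineSketch and the deflate level are only illustrative.
It writes a new output file rather than appending to an existing one, so it
does not need the File-based append API mentioned above.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Hypothetical helper class, not part of Avro.
public class AvroCombineSketch {

    /** Copies the blocks of each input container file into one output file. */
    public static void combine(List<InputStream> inputs, OutputStream out) throws IOException {
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>());
        writer.setCodec(CodecFactory.deflateCodec(6)); // must be set before create()
        boolean created = false;
        try {
            for (InputStream in : inputs) {
                // One DataFileStream per file: each file has its own header,
                // metadata and sync marker.
                DataFileStream<GenericRecord> reader = new DataFileStream<GenericRecord>(
                        in, new GenericDatumReader<GenericRecord>());
                try {
                    if (!created) {
                        // Take the output schema from the first input file.
                        writer.create(reader.getSchema(), out);
                        created = true;
                    }
                    // false: do not force recompression of the copied blocks.
                    writer.appendAllFrom(reader, false);
                } finally {
                    reader.close();
                }
            }
        } finally {
            if (created) {
                writer.close(); // also closes the underlying output stream
            }
        }
    }

    /** Usage: java AvroCombineSketch a.avro b.avro > combined.avro (local paths). */
    public static void main(String[] args) throws IOException {
        List<InputStream> inputs = new ArrayList<InputStream>();
        for (String path : args) {
            inputs.add(new FileInputStream(path));
        }
        combine(inputs, System.out);
    }
}
```

The main method here reads local files only to keep the sketch free of Hadoop
dependencies.  For files in HDFS, each InputStream could instead come from
FileSystem.open(Path), which is not shown here.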

Other things of note:
You will probably get better total file size compression by using a larger
sync interval (1M to 4M) than by using deflate level 9.  Deflate 9 is VERY
slow and almost never compresses more than 1% better than deflate 6, which is
much faster.  I suggest experimenting with the 'recodec' option on some of
your files to see what the best size / performance tradeoff is.  I doubt that
256K (pre-compression) blocks with level 9 compression are the way to go.

For reference: http://tukaani.org/lzma/benchmarks.html
(gzip uses deflate compression)
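
A rough way to try the level and block size tradeoff outside of Avro is
sketched below.  It is only an approximation under stated assumptions: it
compresses each block independently with java.util.zip.Deflater (raw deflate),
the sample data is made up, and the class and method names are hypothetical.
Real numbers depend on your data, so run it on bytes from your own files
before drawing conclusions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical measurement helper, not part of Avro.
public class DeflateTradeoffSketch {

    /** Total compressed size when data is split into independently compressed blocks. */
    public static long compressedSize(byte[] data, int level, int blockSize) {
        Deflater deflater = new Deflater(level, true); // true: raw deflate, no zlib header
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        try {
            for (int off = 0; off < data.length; off += blockSize) {
                int len = Math.min(blockSize, data.length - off);
                deflater.reset(); // each block starts with a fresh dictionary
                deflater.setInput(data, off, len);
                deflater.finish();
                while (!deflater.finished()) {
                    total += deflater.deflate(buf);
                }
            }
        } finally {
            deflater.end();
        }
        return total;
    }

    /** Made-up, log-like sample data; replace with bytes from a real file. */
    public static byte[] sampleData(int lines) {
        Random rnd = new Random(42);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("2012-01-12 host=machine").append(rnd.nextInt(8))
              .append(" status=").append(rnd.nextBoolean() ? "delivered" : "failed")
              .append(" id=").append(rnd.nextInt(1000000)).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData(200000);
        int[] levels = {6, 9};
        int[] blockSizes = {256 * 1024, 1024 * 1024, 4 * 1024 * 1024};
        for (int level : levels) {
            for (int blockSize : blockSizes) {
                long start = System.nanoTime();
                long size = compressedSize(data, level, blockSize);
                long millis = (System.nanoTime() - start) / 1000000L;
                System.out.println("level=" + level + " block=" + blockSize
                        + " compressed=" + size + " bytes, " + millis + " ms");
            }
        }
    }
}
```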

>         }
>         finally {
>             if (dfs != null) try {dfs.close();} catch (Exception e) {e.printStackTrace();}
>             if (dfw != null) try {dfw.close();} catch (Exception e) {e.printStackTrace();}
>         }
>     }
> }
> Is there any way this could be made to work without needing to download the
> individual files to disk and calling append for each of them?
> Thanks,
> Frank Grimes
> On 2012-01-12, at 2:24 PM, Frank Grimes wrote:
>> Hi Scott,
>> If I have a map-only job, would I want only one mapper running to pull all
>> the records from the source input files and stream/append them to the target
>> avro file?
>> Would that be no different (or more efficient) than doing "hadoop dfs -cat
>> file1 file2 file3" and piping the output to append to a "hadoop dfs -put
>> combinedFile"?
>> In that case, my only question is how would I combine the avro files into a
>> new file without deserializing them?
>> Thanks,
>> Frank Grimes
>> On 2012-01-12, at 1:14 PM, Scott Carey wrote:
>>> On 1/12/12 8:27 AM, "Frank Grimes" <[EMAIL PROTECTED]> wrote:
>>>> Hi All,
>>>> We have Avro data files in HDFS which are compressed using the Deflate
>>>> codec.
>>>> We have written an M/R job using the Avro Mapred API to combine those
>>>> files.
>>>> It seems to be working fine; however, when we run it we notice that the
>>>> temporary work area (spills, etc.) seems to be uncompressed.
>>>> We're thinking we might see a speedup due to reduced I/O if the temporary