Avro >> mail # user >> Can spill to disk be in compressed Avro format to reduce I/O?


Frank Grimes 2012-01-12, 16:27
Scott Carey 2012-01-12, 18:14
Frank Grimes 2012-01-12, 19:24
Scott Carey 2012-01-12, 20:36
Frank Grimes 2012-01-12, 20:35
Scott Carey 2012-01-12, 20:53
Re: Can spill to disk be in compressed Avro format to reduce I/O?
The Recodec tool may be useful, and the source code is a good reference.

java -jar avro-tools-<VERSION>.jar
http://svn.apache.org/viewvc/avro/tags/release-1.6.1/lang/java/tools/src/main/java/org/apache/avro/tool/RecodecTool.java?view=co

https://issues.apache.org/jira/browse/AVRO-684
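
In essence it re-reads a container file and rewrites its blocks under a
different codec.  A rough, untested sketch of the same idea with the generic
API (the file names and target codec below are placeholders; the real tool
also handles codec options and stdin/stdout plumbing):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class RecodecSketch {

    public static void main(String[] args) throws Exception {
        // Read an existing Avro container file...
        DataFileStream<GenericRecord> in = new DataFileStream<GenericRecord>(
            new BufferedInputStream(new FileInputStream("in.avro")),
            new GenericDatumReader<GenericRecord>());

        // ...and rewrite its blocks under a different codec.
        DataFileWriter<GenericRecord> out = new DataFileWriter<GenericRecord>(
            new GenericDatumWriter<GenericRecord>());
        out.setCodec(CodecFactory.deflateCodec(6));
        out.create(in.getSchema(),
            new BufferedOutputStream(new FileOutputStream("out.avro")));

        // true = decompress each block and recompress it with the new codec
        out.appendAllFrom(in, true);

        out.close();
        in.close();
    }
}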

On 1/12/12 12:53 PM, "Scott Carey" <[EMAIL PROTECTED]> wrote:
>
>
>On 1/12/12 12:35 PM, "Frank Grimes" <[EMAIL PROTECTED]> wrote:
>
>
>>So I decided to try writing my own AvroStreamCombiner utility and it
>>seems to choke when passing multiple input files:
>>
>>
>>hadoop dfs -cat hdfs://hadoop/machine1.log.avro hdfs://hadoop/machine2.log.avro | ./deliveryLogAvroStreamCombiner.sh > combined.log.avro
>>
>>
>>
>>
>>Exception in thread "main" java.io.IOException: Invalid sync!
>>    at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:293)
>>    at org.apache.avro.file.DataFileWriter.appendAllFrom(DataFileWriter.java:329)
>>    at DeliveryLogAvroStreamCombiner.main(Unknown Source)
>>
>>
>>
>>
>>Here's the code in question:
>>
>>public class DeliveryLogAvroStreamCombiner {
>>
>>    /**
>>     * @param args
>>     */
>>    public static void main(String[] args) throws Exception {
>>        DataFileStream<DeliveryLogEvent> dfs = null;
>>        DataFileWriter<DeliveryLogEvent> dfw = null;
>>
>>        try {
>>            dfs = new DataFileStream<DeliveryLogEvent>(System.in,
>>                new SpecificDatumReader<DeliveryLogEvent>());
>>
>>            OutputStream stdout = System.out;
>>
>>            dfw = new DataFileWriter<DeliveryLogEvent>(
>>                new SpecificDatumWriter<DeliveryLogEvent>());
>>            dfw.setCodec(CodecFactory.deflateCodec(9));
>>            dfw.setSyncInterval(1024 * 256);
>>            dfw.create(DeliveryLogEvent.SCHEMA$, stdout);
>>
>>            dfw.appendAllFrom(dfs, false);
>>
>
>dfs is from System.in, which has multiple files concatenated one after the
>other.  Each file will need its own DataFileStream, since each file has its
>own header and metadata.
>
>In Java you could get the list of files, and for each file use HDFS's API
>to open the file stream and append its contents to your one output file.
>In bash you could loop over the source files and append one at a time (the
>above fails on the second file).  However, in order to append to the end of
>a pre-existing file, the only API currently takes a File, not a seekable
>stream, so Avro would need a patch to allow that against HDFS (also, only an
>HDFS version that supports appends would work).
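>
>For instance, a rough, untested sketch of the Java approach (it reuses the
>DeliveryLogEvent classes from your code; the class name and the use of
>program arguments for the input paths are just placeholders):
>
>import java.io.OutputStream;
>
>import org.apache.avro.file.CodecFactory;
>import org.apache.avro.file.DataFileStream;
>import org.apache.avro.file.DataFileWriter;
>import org.apache.avro.specific.SpecificDatumReader;
>import org.apache.avro.specific.SpecificDatumWriter;
>import org.apache.hadoop.conf.Configuration;
>import org.apache.hadoop.fs.FileSystem;
>import org.apache.hadoop.fs.Path;
>
>public class DeliveryLogAvroFileCombiner {
>
>    public static void main(String[] args) throws Exception {
>        FileSystem fs = FileSystem.get(new Configuration());
>        OutputStream stdout = System.out;
>
>        DataFileWriter<DeliveryLogEvent> dfw = new DataFileWriter<DeliveryLogEvent>(
>            new SpecificDatumWriter<DeliveryLogEvent>());
>        dfw.setCodec(CodecFactory.deflateCodec(9));
>        dfw.setSyncInterval(1024 * 256);
>        dfw.create(DeliveryLogEvent.SCHEMA$, stdout);
>
>        // One DataFileStream per input file: each container file has its own
>        // header and sync marker, so a single stream over concatenated files
>        // fails with "Invalid sync!" when it reaches the second file's header.
>        for (String arg : args) {
>            DataFileStream<DeliveryLogEvent> dfs = new DataFileStream<DeliveryLogEvent>(
>                fs.open(new Path(arg)),
>                new SpecificDatumReader<DeliveryLogEvent>());
>            try {
>                // false = copy blocks without recompressing when the codecs match
>                dfw.appendAllFrom(dfs, false);
>            } finally {
>                dfs.close();
>            }
>        }
>        dfw.close();
>    }
>}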
>
>Other things of note:
>You will probably get better total compression from a larger sync interval
>(1 MB to 4 MB) than from deflate level 9.  Deflate 9 is VERY slow and almost
>never compresses more than 1% better than deflate 6, which is much faster.
>I suggest experimenting with the 'recodec' tool on some of your files to see
>what the best size / performance tradeoff is.  I doubt that 256K
>(pre-compression) blocks with level 9 compression is the way to go.
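>
>In the combiner, that would mean something along these lines (the exact
>numbers are only a starting point to experiment with):
>
>    dfw.setCodec(CodecFactory.deflateCodec(6));   // much faster, nearly the same size as level 9
>    dfw.setSyncInterval(4 * 1024 * 1024);         // larger pre-compression blocks compress better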
>
>For reference: http://tukaani.org/lzma/benchmarks.html
>(gzip uses deflate compression)
>
>-Scott
>
>
>
>>        }
>>        finally {
>>            if (dfs != null) try { dfs.close(); } catch (Exception e) { e.printStackTrace(); }
>>            if (dfw != null) try { dfw.close(); } catch (Exception e) { e.printStackTrace(); }
>>        }
>>    }
>>}
>>
>>
>>Is there any way this could be made to work without needing to download
>>the individual files to disk and calling append for each of them?
>>
>>Thanks,
>>
>>Frank Grimes
>>
>>
>>On 2012-01-12, at 2:24 PM, Frank Grimes wrote:
>>
>>
>>Hi Scott,
>>
>>If I have a map-only job, would I want only one mapper running to pull
>>all the records from the source input files and stream/append them to
>>the target avro file?
>>Would that be no different (or more efficient) than doing "hadoop dfs
>>-cat file1 file2 file3" and piping the output to append to a "hadoop dfs
>>-put combinedFile"?
>>In that case, my only question is how would I combine the avro files
>>into a new file without deserializing them?
>>
>>Thanks,
>>
>>Frank Grimes
>>
>>
>>On 2012-01-12, at 1:14 PM, Scott Carey wrote:
>>
>>
>>
>>
>>On 1/12/12 8:27 AM, "Frank Grimes" <[EMAIL PROTECTED]> wrote:
Frank Grimes 2012-01-13, 01:52
Scott Carey 2012-01-13, 03:31
Frank Grimes 2012-01-13, 15:07