Avro user mailing list: Can spill to disk be in compressed Avro format to reduce I/O?


Thread:
  Frank Grimes   2012-01-12, 16:27
  Scott Carey    2012-01-12, 18:14
  Frank Grimes   2012-01-12, 19:24
  Scott Carey    2012-01-12, 20:36
  Frank Grimes   2012-01-12, 20:35
  Scott Carey    2012-01-12, 20:53
Re: Can spill to disk be in compressed Avro format to reduce I/O?
The Recodec tool may be useful, and the source code is a good reference.

java -jar avro-tools-<VERSION>.jar
http://svn.apache.org/viewvc/avro/tags/release-1.6.1/lang/java/tools/src/main/java/org/apache/avro/tool/RecodecTool.java?view=co

https://issues.apache.org/jira/browse/AVRO-684
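
For a sense of what a recodec pass amounts to, the following is a minimal, untested sketch against the 1.6 generic API: it opens an existing container file and rewrites its blocks under a different codec, without deserializing any records. The class name, the file arguments, and the choice of deflate level 6 are illustrative, not part of the tool.

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class RecodecSketch {
  public static void main(String[] args) throws IOException {
    // args[0] = input .avro file, args[1] = output .avro file (hypothetical layout)
    DataFileStream<GenericRecord> reader = new DataFileStream<GenericRecord>(
        new BufferedInputStream(new FileInputStream(args[0])),
        new GenericDatumReader<GenericRecord>());

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>());
    writer.setCodec(CodecFactory.deflateCodec(6));   // target codec for the copy
    writer.create(reader.getSchema(), new File(args[1]));

    // Copy the file block by block; 'true' asks for the blocks to be
    // re-compressed with the writer's codec instead of copied verbatim.
    writer.appendAllFrom(reader, true);

    writer.close();
    reader.close();
  }
}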

On 1/12/12 12:53 PM, "Scott Carey" <[EMAIL PROTECTED]> wrote:
>
>
>On 1/12/12 12:35 PM, "Frank Grimes" <[EMAIL PROTECTED]> wrote:
>
>
>>So I decided to try writing my own AvroStreamCombiner utility and it
>>seems to choke when passing multiple input files:
>>
>>hadoop dfs -cat hdfs://hadoop/machine1.log.avro hdfs://hadoop/machine2.log.avro | ./deliveryLogAvroStreamCombiner.sh > combined.log.avro
>>
>>Exception in thread "main" java.io.IOException: Invalid sync!
>>    at org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:293)
>>    at org.apache.avro.file.DataFileWriter.appendAllFrom(DataFileWriter.java:329)
>>    at DeliveryLogAvroStreamCombiner.main(Unknown Source)
>>
>>Here's the code in question:
>>
>>public class DeliveryLogAvroStreamCombiner {
>>
>>    /**
>>     * @param args
>>     */
>>    public static void main(String[] args) throws Exception {
>>        DataFileStream<DeliveryLogEvent> dfs = null;
>>        DataFileWriter<DeliveryLogEvent> dfw = null;
>>
>>        try {
>>            dfs = new DataFileStream<DeliveryLogEvent>(System.in,
>>                new SpecificDatumReader<DeliveryLogEvent>());
>>
>>            OutputStream stdout = System.out;
>>
>>            dfw = new DataFileWriter<DeliveryLogEvent>(
>>                new SpecificDatumWriter<DeliveryLogEvent>());
>>            dfw.setCodec(CodecFactory.deflateCodec(9));
>>            dfw.setSyncInterval(1024 * 256);
>>            dfw.create(DeliveryLogEvent.SCHEMA$, stdout);
>>
>>            dfw.appendAllFrom(dfs, false);
>>
>
>dfs is from System.in, which has multiple files one after the other.
>Each file will need its own DataFileStream (has its own header and
>metadata).  
>
>In Java you could get the list of files, and for each file use HDFS's API
>to open the file stream, and append that to your one file.
>In bash you could loop over all the source files and append one at a time
>(the above fails on the second file).  However, in order to append to the
>end of a pre-existing file the only API now takes a File, not a seekable
>stream, so Avro would need a patch to allow that in HDFS (also, only an
>HDFS version that supports appends would work).
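
To make the first suggestion concrete (list the files, open each one through the HDFS API, and append its blocks to a single writer), here is a rough, untested sketch against the Hadoop FileSystem and Avro 1.6 generic APIs. It uses generic records rather than the DeliveryLogEvent specific class so it stays self-contained; the class name, paths, codec level, and sync interval are placeholders, not recommendations.

import java.io.OutputStream;

import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAvroConcat {
  public static void main(String[] args) throws Exception {
    // args[0] = HDFS directory of input .avro files, args[1] = HDFS output file
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>());
    writer.setCodec(CodecFactory.deflateCodec(6));
    writer.setSyncInterval(1024 * 1024);

    boolean created = false;
    for (FileStatus status : fs.listStatus(new Path(args[0]))) {
      if (!status.getPath().getName().endsWith(".avro")) {
        continue;                        // skip anything that is not a container file
      }

      // Each source file gets its own DataFileStream, since each file has
      // its own header and metadata.
      DataFileStream<GenericRecord> reader = new DataFileStream<GenericRecord>(
          fs.open(status.getPath()), new GenericDatumReader<GenericRecord>());
      try {
        if (!created) {
          // Create the single output file using the schema of the first input.
          OutputStream out = fs.create(new Path(args[1]));
          writer.create(reader.getSchema(), out);
          created = true;
        }
        // Append block by block without deserializing records; passing 'true'
        // instead would re-compress each block with the writer's codec.
        writer.appendAllFrom(reader, false);
      } finally {
        reader.close();
      }
    }

    if (created) {
      writer.close();
    }
  }
}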
>
>Other things of note:
>You will probably get better total file size compression by using a
>larger sync interval (1M to 4 M) than deflate level 9.  Deflate 9 is VERY
>slow and almost never compresses more than 1% better than deflate 6,
>which is much faster.  I suggest experimenting with the 'recodec' option
>on some of your files to see what the best size / performance tradeoff
>is.  I doubt that 256K (pre-compression) blocks with level 9 compression
>is the way to go.
>
>For reference: http://tukaani.org/lzma/benchmarks.html
>(gzip uses deflate compression)
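
One rough way to run that experiment on a local sample file, untested and with purely illustrative settings: re-encode the records under a few deflate levels and sync intervals and compare the resulting sizes and times. Records are re-serialized with append() here (rather than appendAllFrom) so that the new sync interval actually determines the block sizes.

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class CompressionExperiment {
  public static void main(String[] args) throws IOException {
    File sample = new File(args[0]);                 // a representative .avro file
    int[] levels = { 6, 9 };
    int[] syncIntervals = { 256 * 1024, 1024 * 1024, 4 * 1024 * 1024 };

    for (int level : levels) {
      for (int interval : syncIntervals) {
        File out = File.createTempFile("recodec-test", ".avro");
        long start = System.currentTimeMillis();

        DataFileStream<GenericRecord> reader = new DataFileStream<GenericRecord>(
            new BufferedInputStream(new FileInputStream(sample)),
            new GenericDatumReader<GenericRecord>());
        DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
            new GenericDatumWriter<GenericRecord>(reader.getSchema()));
        writer.setCodec(CodecFactory.deflateCodec(level));
        writer.setSyncInterval(interval);
        writer.create(reader.getSchema(), out);

        for (GenericRecord record : reader) {        // re-serialize every record
          writer.append(record);
        }
        writer.close();
        reader.close();

        System.out.printf("deflate %d, sync %d KB -> %d bytes in %d ms%n",
            level, interval / 1024, out.length(), System.currentTimeMillis() - start);
        out.delete();
      }
    }
  }
}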
>
>-Scott
>
>
>
>>        }
>>        finally {
>>            if (dfs != null) try { dfs.close(); } catch (Exception e) { e.printStackTrace(); }
>>            if (dfw != null) try { dfw.close(); } catch (Exception e) { e.printStackTrace(); }
>>        }
>>    }
>>
>>}
>>
>>
>>Is there any way this could be made to work without needing to download
>>the individual files to disk and calling append for each of them?
>>
>>Thanks,
>>
>>Frank Grimes
>>
>>
>>On 2012-01-12, at 2:24 PM, Frank Grimes wrote:
>>
>>
>>Hi Scott,
>>
>>If I have a map-only job, would I want only one mapper running to pull
>>all the records from the source input files and stream/append them to
>>the target avro file?
>>Would that be no different (or more efficient) than doing "hadoop dfs
>>-cat file1 file2 file3" and piping the output to append to a "hadoop dfs
>>-put combinedFile"?
>>In that case, my only question is how would I combine the avro files
>>into a new file without deserializing them?
>>
>>Thanks,
>>
>>Frank Grimes
>>
>>
>>On 2012-01-12, at 1:14 PM, Scott Carey wrote:
>>
>>
>>
>>
>>On 1/12/12 8:27 AM, "Frank Grimes" <[EMAIL PROTECTED]> wrote:
Later in this thread:
  Frank Grimes   2012-01-13, 01:52
  Scott Carey    2012-01-13, 03:31
  Frank Grimes   2012-01-13, 15:07