Avro, mail # user - Can spill to disk be in compressed Avro format to reduce I/O?


Frank Grimes 2012-01-12, 16:27
Scott Carey 2012-01-12, 18:14
Frank Grimes 2012-01-12, 19:24
Scott Carey 2012-01-12, 20:36
Frank Grimes 2012-01-12, 20:35
Scott Carey 2012-01-12, 20:53
Scott Carey 2012-01-12, 22:09
Frank Grimes 2012-01-13, 01:52

Re: Can spill to disk be in compressed Avro format to reduce I/O?
Scott Carey 2012-01-13, 03:31


On 1/12/12 5:52 PM, "Frank Grimes" <[EMAIL PROTECTED]> wrote:

> Hi Scott,
>
> I've looked into this some more and I now see what you mean about appending to
> HDFS directly not being possible with the current DataFileWriter API.
>
> That's unfortunate because we really would like to avoid needing to hit disk
> just to write temporary files. (and the associated cleanup)
>
> I kinda like the notion of not requiring HDFS APIs to achieve this merging of
> Avro files/streams.
>
> Assuming we wanted to be able to stream multiple files as in my example...
> could DataFileStream easily be changed to support that use case?
> i.e. allow it to skip/ignore subsequent header and metadata in the stream or
> not error out with "Invalid sync!"?

That may be possible; please open a JIRA to discuss it further.  DataFileStream
would have to 'reset' at the start of each new file or stream and continue
from there: it needs to read the new header, pick up the new sync value, and
validate that the schemas match and the codec is compatible.  It may be
possible to detect the end of one file and the start of another when the
files are streamed back to back, but perhaps not reliably.
Alternatively, avro-tools could be extended with a command-line tool that
takes a list of files (HDFS or local) and writes a new file (HDFS or local),
concatenated and possibly "recodec'd".
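
A minimal Java sketch of the per-file approach discussed in this thread: one
DataFileStream per input, each appended to a single DataFileWriter with
appendAllFrom. The class name AvroConcat, the concat helper, the use of
GenericRecord, and the deflate level are illustrative assumptions, not part of
Avro or of the original code; the inputs here are local files, and an HDFS
version would need to open its streams through the HDFS API instead.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Hypothetical helper class, for illustration only.
class AvroConcat {

    /** Appends every Avro container stream in inputs to one new container written to out. */
    static void concat(Schema schema, List<InputStream> inputs, OutputStream out)
            throws IOException {
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>())) {
            writer.setCodec(CodecFactory.deflateCodec(6)); // assumed level, tune as needed
            writer.create(schema, out);
            for (InputStream in : inputs) {
                // Each input gets its own DataFileStream so that its header,
                // metadata, and sync marker are read before any blocks.
                try (DataFileStream<GenericRecord> reader =
                         new DataFileStream<>(in, new GenericDatumReader<GenericRecord>())) {
                    // Throws IOException if the input schema differs from the writer's.
                    // With recompress=false, blocks are copied as-is when the codecs
                    // match and re-encoded with the writer's codec otherwise.
                    writer.appendAllFrom(reader, false);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: AvroConcat <file.avro>... > combined.avro");
            return;
        }
        // Take the output schema from the first input file.
        Schema schema;
        try (DataFileStream<GenericRecord> first = new DataFileStream<>(
                 new FileInputStream(args[0]), new GenericDatumReader<GenericRecord>())) {
            schema = first.getSchema();
        }
        List<InputStream> inputs = new ArrayList<>();
        for (String path : args) {
            inputs.add(new FileInputStream(path));
        }
        concat(schema, inputs, System.out);
    }
}
```

Because appendAllFrom copies whole blocks, the writer's sync interval does not
re-block the appended data; the block boundaries of the source files are kept.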

>
> Thanks,
>
> Frank Grimes
>
>
> On 2012-01-12, at 3:53 PM, Scott Carey wrote:
>
>>
>>
>> On 1/12/12 12:35 PM, "Frank Grimes" <[EMAIL PROTECTED]> wrote:
>>
>>> So I decided to try writing my own AvroStreamCombiner utility and it seems
>>> to choke when passing multiple input files:
>>>
>>>> hadoop dfs -cat hdfs://hadoop/machine1.log.avro
>>>> hdfs://hadoop/machine2.log.avro | ./deliveryLogAvroStreamCombiner.sh >
>>>> combined.log.avro
>>>
>>>> Exception in thread "main" java.io.IOException: Invalid sync!
>>>> at
>>>> org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:293)
>>>> at
>>>> org.apache.avro.file.DataFileWriter.appendAllFrom(DataFileWriter.java:329)
>>>> at DeliveryLogAvroStreamCombiner.main(Unknown Source)
>>>
>>>
>>> Here's the code in question:
>>>
>>> public class DeliveryLogAvroStreamCombiner {
>>>
>>>     /**
>>>      * @param args
>>>      */
>>>     public static void main(String[] args) throws Exception {
>>>         DataFileStream<DeliveryLogEvent> dfs = null;
>>>         DataFileWriter<DeliveryLogEvent> dfw = null;
>>>
>>>         try {
>>>             dfs = new DataFileStream<DeliveryLogEvent>(System.in,
>>>                     new SpecificDatumReader<DeliveryLogEvent>());
>>>
>>>             OutputStream stdout = System.out;
>>>
>>>             dfw = new DataFileWriter<DeliveryLogEvent>(
>>>                     new SpecificDatumWriter<DeliveryLogEvent>());
>>>             dfw.setCodec(CodecFactory.deflateCodec(9));
>>>             dfw.setSyncInterval(1024 * 256);
>>>             dfw.create(DeliveryLogEvent.SCHEMA$, stdout);
>>>
>>>             dfw.appendAllFrom(dfs, false);
>>
>> dfs reads from System.in, which here contains multiple files one after the
>> other.  Each file needs its own DataFileStream, because each has its own
>> header and metadata.
>>
>> In Java you could get the list of files and, for each file, use HDFS's API
>> to open the file stream and append it to your one output file.
>> In bash you could loop over all the source files and append one at a time
>> (the command above fails on the second file).  However, the only API for
>> appending to the end of a pre-existing file currently takes a File, not a
>> seekable stream, so Avro would need a patch to allow that in HDFS (also,
>> only an HDFS version that supports appends would work).
>>
>> Other things of note:
>> You will probably get a smaller total file size from a larger sync interval
>> (1 MB to 4 MB) than from deflate level 9.  Deflate 9 is VERY slow and
>> almost never compresses more than 1% better than deflate 6, which is much
>> faster.  I suggest experimenting with the 'recodec' option on some of your
>> files to see what the best size/performance tradeoff is.  I doubt that 256K
>> (pre-compression) blocks with level 9 compression are the way to go.
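
One rough way to try the block-size and deflate-level tradeoff described above
without touching real files is to compress sample data in independent blocks
with java.util.zip.Deflater. This is only an approximation of what an Avro
container file does (it ignores headers, sync markers, and record encoding),
and the class name, the helper names, and the synthetic log-like data are all
hypothetical. Measuring with the 'recodec' option on real files remains the
better test.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical experiment class, for illustration only.
class BlockCompressionExperiment {

    /** Compresses data in independent blocks of blockSize bytes; returns total compressed bytes. */
    static long compressedSize(byte[] data, int blockSize, int level) {
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            // A fresh Deflater per block, so no history is shared across blocks.
            Deflater deflater = new Deflater(level, true); // true = raw deflate, no zlib header
            deflater.setInput(data, off, len);
            deflater.finish();
            while (!deflater.finished()) {
                total += deflater.deflate(buf);
            }
            deflater.end();
        }
        return total;
    }

    /** Builds repetitive, log-like ASCII text as a stand-in for real records (assumption). */
    static byte[] sampleData(int approxBytes) {
        Random rnd = new Random(42);
        StringBuilder sb = new StringBuilder();
        while (sb.length() < approxBytes) {
            sb.append("machine").append(rnd.nextInt(50))
              .append(" GET /item/").append(rnd.nextInt(10000))
              .append(" status=").append(rnd.nextBoolean() ? 200 : 404)
              .append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] data = sampleData(4 * 1024 * 1024);
        int[] blockSizes = {256 * 1024, 1024 * 1024, 4 * 1024 * 1024};
        int[] levels = {6, 9};
        for (int blockSize : blockSizes) {
            for (int level : levels) {
                long start = System.nanoTime();
                long size = compressedSize(data, blockSize, level);
                long millis = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("block=%dK level=%d -> %d bytes in %d ms%n",
                        blockSize / 1024, level, size, millis);
            }
        }
    }
}
```

The numbers depend heavily on the data, so the synthetic input should be
replaced with a sample of real records before drawing any conclusion.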
Frank Grimes 2012-01-13, 15:07