Flume, mail # user - Can HDFSSink write headers as well?


Re: Can HDFSSink write headers as well?
Bhaskar V. Karambelkar 2012-08-21, 15:22
On Tue, Aug 21, 2012 at 2:25 AM, バーチャル クリストファー
<[EMAIL PROTECTED]> wrote:

> Hi David,
>
> Currently there is no way to write headers to HDFS using the built-in
> Flume functionality.
>

This is not entirely true: the following combination will write headers to
HDFS in the Avro data file format (binary).

agent.sinks.hdfsBinarySink.hdfs.fileType = DataStream
agent.sinks.hdfsBinarySink.serializer = avro_event
agent.sinks.hdfsBinarySink.hdfs.writeFormat = writable
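
For completeness, a fuller sink definition along those lines might look
like this (the sink type, channel name, and HDFS path here are hypothetical
additions, not from this thread):

agent.sinks.hdfsBinarySink.type = hdfs
agent.sinks.hdfsBinarySink.channel = ch1
agent.sinks.hdfsBinarySink.hdfs.path = hdfs://namenode/flume/events
agent.sinks.hdfsBinarySink.hdfs.fileType = DataStream
agent.sinks.hdfsBinarySink.serializer = avro_event
agent.sinks.hdfsBinarySink.hdfs.writeFormat = writable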

The serializer used is part of the Flume distribution, viz.
flume-ng-core/src/main/java/org/apache/flume/serialization/FlumeEventAvroEventSerializer.java

A file written this way can be processed with the Avro MapReduce API found
in the Avro distribution.
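
For example, here is a minimal sketch of reading such a file back with the
plain Avro file API (the class name is mine, and it assumes the file has
already been copied out of HDFS to a local path):

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class DumpFlumeAvroFile {
  public static void main(String[] args) throws IOException {
    // Assumes the Avro container file was copied to a local path first.
    File input = new File(args[0]);
    DataFileReader<GenericRecord> reader =
        new DataFileReader<GenericRecord>(input,
            new GenericDatumReader<GenericRecord>());
    try {
      for (GenericRecord event : reader) {
        // Each record follows the Flume event Avro schema: a "headers"
        // map plus a "body" bytes field.
        System.out.println("headers = " + event.get("headers"));
        System.out.println("body    = " + event.get("body"));
      }
    } finally {
      reader.close();
    }
  }
}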

Also note that simply using DataStream doesn't mean the file is text; the
serializer and hdfs.writeFormat settings also determine whether the file is
text or binary.

I've read the entire HDFS sink code and experimented with it a lot, so if
you want more details, let me know.

>
> If you are writing to text or binary files on HDFS (i.e. you have set
> hdfs.fileType = DataStream or CompressedStream in your config), then you
> can supply your own custom serializer, which will allow you to write
> headers to HDFS. You will need to write a serializer that implements
> org.apache.flume.serialization.EventSerializer.
>
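
To make that concrete, here is a minimal sketch of such a custom serializer
(the package, class name, and one-headers-plus-body-per-line output format
are my own illustration, not something shipped with Flume):

package com.example.flume; // hypothetical package

import java.io.IOException;
import java.io.OutputStream;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.serialization.EventSerializer;

public class HeaderAndBodyTextSerializer implements EventSerializer {

  private final OutputStream out;

  private HeaderAndBodyTextSerializer(Context context, OutputStream out) {
    this.out = out;
  }

  @Override
  public void afterCreate() throws IOException { /* no file header needed */ }

  @Override
  public void afterReopen() throws IOException { /* nothing to re-read */ }

  @Override
  public void write(Event event) throws IOException {
    // Write the header map, then the body, as one text line per event.
    out.write(event.getHeaders().toString().getBytes("UTF-8"));
    out.write(' ');
    out.write(event.getBody());
    out.write('\n');
  }

  @Override
  public void flush() throws IOException {
    // The HDFS sink flushes the underlying stream itself.
  }

  @Override
  public void beforeClose() throws IOException { /* no trailer needed */ }

  @Override
  public boolean supportsReopen() {
    return true; // plain text can safely be appended to after a reopen
  }

  public static class Builder implements EventSerializer.Builder {
    @Override
    public EventSerializer build(Context context, OutputStream out) {
      return new HeaderAndBodyTextSerializer(context, out);
    }
  }
}

You would then point the sink at the builder, e.g.
serializer = com.example.flume.HeaderAndBodyTextSerializer$Builder, since
Flume treats an unrecognized serializer name as the fully qualified class
name of a Builder.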
> If, on the other hand, you are writing to HDFS SequenceFiles, then
> unfortunately there is no way to customize the way that events are
> serialized, so you cannot write event headers to HDFS. This is a known
> issue (FLUME-1100) and I have supplied a patch to fix it.
>
> Chris.
>
>
>
> On 2012/08/21 11:36, David Capwell wrote:
>
>> I was wondering if I pass random data to an event's header, can the
>> HDFSSink write it to HDFS?  I know it can use the headers to split the data
>> into different paths, but what about writing the data to HDFS itself?
>>
>> thanks for your time reading this email.
>>