Re: HDFS Sink Config Help
Hi Jeff,

Thanks for your suggestions.  My only Flume experience so far is with the
Elasticsearch sink, which serializes (headers and body) to JSON
automatically.  I was expecting something similar from the HDFS sink and
when it didn't do that I started questioning the file format when I should
have been looking at the serializer.  A misunderstanding on my part.

I just finished serializing to JSON when I saw you suggested Avro.  I'll
look into that.  Is that what you would use if you were going to query with
Hive external tables?

Thanks again!

-- Jeremy
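
(For what it's worth, a minimal sketch of a Hive external table over avro_event output, assuming the files carry Flume's default event schema of a "headers" map plus a "body" of bytes; the table name and location are placeholders, and the explicit SerDe/InputFormat form is used so it also works on older Hive versions:)

    CREATE EXTERNAL TABLE flume_events
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      LOCATION '/flume/events'
      -- safer in practice: extract the real schema from a data file
      -- (avro-tools getschema) and point avro.schema.url at it
      TBLPROPERTIES ('avro.schema.literal' = '{
        "type": "record", "name": "Event",
        "fields": [
          {"name": "headers", "type": {"type": "map", "values": "string"}},
          {"name": "body", "type": "bytes"}
        ]
      }');
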
On Thu, Oct 31, 2013 at 4:42 PM, Jeff Lord <[EMAIL PROTECTED]> wrote:

> Jeremy,
>
> DataStream fileType will let you write text files.
> CompressedStream will do just that.
> SequenceFile will create sequence files as you have guessed, and you can
> use either Text or Writable (bytes) for your data here.
>
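(A minimal sketch of that fileType/writeFormat pairing, showing only the relevant keys; the agent name a1 and sink name k1 are placeholders:)

    # write sequence files; writeFormat picks Text or Writable for the record values
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /flume/events
    a1.sinks.k1.hdfs.fileType = SequenceFile
    a1.sinks.k1.hdfs.writeFormat = Text
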
> So Flume is configurable out of the box with regard to the size of your
> files. Yes, you are correct that it is better to create files that are at
> least the size of a full block.
> You can roll your files based on time, size, or number of events. Rolling
> on an hourly basis makes perfect sense.
>
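(For example, to roll strictly by time, hourly, and disable the size and event-count triggers; the values are illustrative, not recommendations:)

    # roll a new file every 3600 seconds; 0 disables a trigger
    a1.sinks.k1.hdfs.rollInterval = 3600
    a1.sinks.k1.hdfs.rollSize = 0
    a1.sinks.k1.hdfs.rollCount = 0
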
> With all that said, we recommend writing to Avro container files, as that
> format is the best suited to the Hadoop ecosystem.
> Avro has many benefits, including support for compression, code
> generation, versioning, and schema evolution.
> You can do this with Flume by specifying the avro_event type for the
> serializer property in your HDFS sink.
>
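(A minimal sketch of that serializer setting; DataStream is assumed here so the serializer controls the on-disk format, and the file suffix and compression codec are optional:)

    # Avro container files carrying both the headers and the body of each event
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.fileSuffix = .avro
    a1.sinks.k1.serializer = avro_event
    # optional block compression inside the Avro container file
    a1.sinks.k1.serializer.compressionCodec = snappy
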
> Hope this helps.
>
> -Jeff
>
>
> On Wed, Oct 30, 2013 at 4:15 PM, Jeremy Karlson <[EMAIL PROTECTED]> wrote:
>
>> Hi everyone.
>>
>> I'm trying to set up Flume to log into HDFS.  Along the way, Flume
>> attaches a number of headers (environment, hostname, etc) that I would also
>> like to store with my log messages.  Ideally, I'd like to be able to use
>> Hive to query all of this later.  I must also admit to knowing next to
>> nothing about HDFS.  That probably doesn't help.  :-P
>>
>> I'm confused about the HDFS sink configuration.  Specifically, I'm trying
>> to understand what these two options do (and how they interact):
>>
>> hdfs.fileType
>> hdfs.writeFormat
>>
>> File Type:
>>
>> DataStream - This appears to write the event body, and loses all headers.
>> Correct?
>> CompressedStream - I assume just a compressed data stream.
>> SequenceFile - I think this is what I want, since it seems to be a
>> key/value-based thing, which I assume means it will include headers.
>>
>> Write Format: This seems to only apply for SequenceFile above, but lots
>> of Internet examples seem to state otherwise.  I'm also unclear on the
>> difference here.  Isn't "Text" just a specific type of "Writable" in HDFS?
>>
>> Also, I'm unclear on why Flume, by default, seems to be set up to make
>> such small HDFS files.  Isn't HDFS designed for (and more efficient at)
>> storing larger files that are closer to the size of a full block?  I was
>> thinking it made more sense to write all log data to a single file, and
>> roll that file hourly (or whatever, depending on volume).  Thoughts here?
>>
>> Thanks a lot.
>>
>> -- Jeremy
>>
>>
>>
>