Flume >> mail # user >> HDFS Sink Config Help


Jeremy Karlson 2013-10-30, 23:15
Jeff Lord 2013-10-31, 23:42
Jeremy Karlson 2013-11-01, 16:50

Re: HDFS Sink Config Help
Yes, definitely use Avro instead of JSON if you can.
HIVE-895 added Avro support to Hive, and pretty much the entire Hadoop
ecosystem supports Avro at this point. The ability to evolve/version the
schema is one of the main benefits.
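For reference, once Flume has landed Avro container files in HDFS, a Hive external table can be declared over them with the Avro SerDe. This is only a sketch: the HDFS path, table name, and schema are examples (the record shape here mirrors a Flume event, a headers map plus a body):

```sql
-- Sketch only: external table over Flume-written Avro files.
-- Table name, LOCATION, and schema literal are illustrative.
CREATE EXTERNAL TABLE flume_logs
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/flume/events'
TBLPROPERTIES ('avro.schema.literal' = '{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "headers", "type": {"type": "map", "values": "string"}},
    {"name": "body", "type": "bytes"}
  ]
}');
```

In more recent Hive versions, `STORED AS AVRO` is shorthand for the SerDe and input/output format clauses above.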
On Fri, Nov 1, 2013 at 9:50 AM, Jeremy Karlson <[EMAIL PROTECTED]> wrote:

> Hi Jeff,
>
> Thanks for your suggestions.  My only Flume experience so far is with the
> Elasticsearch sink, which serializes (headers and body) to JSON
> automatically.  I was expecting something similar from the HDFS sink and
> when it didn't do that I started questioning the file format when I should
> have been looking at the serializer.  A misunderstanding on my part.
>
> I just finished serializing to JSON when I saw you suggested Avro.  I'll
> look into that.  Is that what you would use if you were going to query with
> Hive external tables?
>
> Thanks again!
>
> -- Jeremy
>
>
>
> On Thu, Oct 31, 2013 at 4:42 PM, Jeff Lord <[EMAIL PROTECTED]> wrote:
>
>> Jeremy,
>>
>> The DataStream fileType will let you write plain text files.
>> CompressedStream will do just that: write a compressed stream.
>> SequenceFile will create sequence files, as you guessed, and you can
>> use either Text or Writable (bytes) for your data there.
>>
>> Flume's file sizes are configurable out of the box. Yes, you are
>> correct that it is better to create files that are at least the size of
>> a full block.
>> You can roll your files based on time, size, or number of events. Rolling
>> on an hourly basis makes perfect sense.
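The rolling behavior described above maps onto three HDFS sink properties, where setting a property to 0 disables that trigger. A sketch for hourly rolling (the agent and sink names are placeholders):

```properties
# Example only: roll HDFS files hourly, never by size or event count
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = /flume/events/%Y-%m-%d
agent.sinks.hdfsSink.hdfs.rollInterval = 3600
agent.sinks.hdfsSink.hdfs.rollSize = 0
agent.sinks.hdfsSink.hdfs.rollCount = 0
```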
>>
>> With all that said, we recommend writing to Avro container files, as
>> that format is best suited to the Hadoop ecosystem.
>> Avro has many benefits, including support for compression, code
>> generation, versioning, and schema evolution.
>> You can do this in Flume by specifying the avro_event type for the
>> serializer property on your HDFS sink.
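Concretely, the avro_event serializer setting might look like the fragment below (sink name is a placeholder; fileType is left as DataStream so the Avro container itself provides the file structure, and the compressionCodec line is optional):

```properties
# Example only: serialize Flume events as Avro container files
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.fileType = DataStream
agent.sinks.hdfsSink.hdfs.fileSuffix = .avro
agent.sinks.hdfsSink.serializer = avro_event
agent.sinks.hdfsSink.serializer.compressionCodec = snappy
```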
>>
>> Hope this helps.
>>
>> -Jeff
>>
>>
>> On Wed, Oct 30, 2013 at 4:15 PM, Jeremy Karlson <[EMAIL PROTECTED]> wrote:
>>
>>> Hi everyone.
>>>
>>> I'm trying to set up Flume to write logs into HDFS.  Along the way, Flume
>>> attaches a number of headers (environment, hostname, etc.) that I would also
>>> like to store with my log messages.  Ideally, I'd like to be able to use
>>> Hive to query all of this later.  I must also admit to knowing next to
>>> nothing about HDFS.  That probably doesn't help.  :-P
>>>
>>> I'm confused about the HDFS sink configuration.  Specifically, I'm
>>> trying to understand what these two options do (and how they interact):
>>>
>>> hdfs.fileType
>>> hdfs.writeFormat
>>>
>>> File Type:
>>>
>>> DataStream - This appears to write the event body, and loses all
>>> headers.  Correct?
>>> CompressedStream - I assume just a compressed data stream.
>>> SequenceFile - I think this is what I want, since it seems to be a
>>> key/value based thing, which I assume means it will include headers.
>>>
>>> Write Format: This seems to only apply for SequenceFile above, but lots
>>> of Internet examples seem to state otherwise.  I'm also unclear on the
>>> difference here.  Isn't "Text" just a specific type of "Writable" in HDFS?
>>>
>>> Also, I'm unclear on why Flume, by default, seems to be set up to make
>>> such small HDFS files.  Isn't HDFS designed for (and more efficient at)
>>> storing larger files that are closer to the size of a full block?  I was
>>> thinking it made more sense to write all log data to a single file, and
>>> roll that file hourly (or whatever, depending on volume).  Thoughts here?
>>>
>>> Thanks a lot.
>>>
>>> -- Jeremy
>>>
>>>
>>>
>>
>