Flume >> mail # user >> best way to make all hdfs records in one file under a folder?


Re: best way to make all hdfs records in one file under a folder?
If you don't intend to roll based on the number of events, then you will want to
set rollCount to 0.
MyAgent.sinks.HDFS.hdfs.rollCount = 0
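
As a rough sketch (not a tested configuration), the roll settings quoted below would then look something like this. The agent and sink names and the size and interval values are taken from the original message; a value of 0 disables the corresponding roll trigger in the HDFS sink.

```properties
# Sketch only: every value except rollCount is copied from the original question.
MyAgent.sinks.HDFS.hdfs.batchSize = 10000
# Roll when the file reaches about 15 MB (value is in bytes); 0 would disable size-based rolling.
MyAgent.sinks.HDFS.hdfs.rollSize = 15000000
# 0 = never roll based on the number of events.
MyAgent.sinks.HDFS.hdfs.rollCount = 0
# Roll after 360 seconds (6 minutes); 0 would disable time-based rolling.
MyAgent.sinks.HDFS.hdfs.rollInterval = 360
```

With these values a file is still closed early if it reaches 15 MB before the 6 minutes are up, so a partition can still end up with more than one file.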
On Mon, Jan 20, 2014 at 12:35 PM, Jimmy <[EMAIL PROTECTED]> wrote:

> Seems like the only reason is the "too many files" issue, correct?
>
> Running File Crusher regularly might be a better option than trying to tune
> it in Flume.
>
> http://www.jointhegrid.com/hadoop_filecrush/index.jsp
>
>
>
> ---------- Forwarded message ----------
> From: Chen Wang <[EMAIL PROTECTED]>
> Date: Mon, Jan 20, 2014 at 11:21 AM
> Subject: Re: best way to make all hdfs records in one file under a folder?
> To: [EMAIL PROTECTED]
>
>
> Chris,
> It's by every 6 minutes (that's why I set the roll time to 60*6=360). The
> data size is around 15M, so I want it all in one file.
> Chen
>
>
> On Mon, Jan 20, 2014 at 10:57 AM, Christopher Shannon <
> [EMAIL PROTECTED]> wrote:
>
>> How is your data partitioned, by date?
>>
>>
>> On Monday, January 20, 2014, Chen Wang <[EMAIL PROTECTED]>
>> wrote:
>>
>>> Guys,
>>> I have Flume set up to flow partitioned data to HDFS; each partition has
>>> its own folder. Is there a way to specify that all the data under one
>>> partition should be in one file?
>>> I am currently using
>>> MyAgent.sinks.HDFS.hdfs.batchSize = 10000
>>> MyAgent.sinks.HDFS.hdfs.rollSize = 15000000
>>> MyAgent.sinks.HDFS.hdfs.rollCount = 10000
>>> MyAgent.sinks.HDFS.hdfs.rollInterval = 360
>>>
>>> to make the file roll at 15 MB of data or after 6 minutes.
>>>
>>> Is this the best way to achieve my goal?
>>> Thanks,
>>> Chen
>>>
>>>
>
>
>