Hadoop, mail # user - Ideal file size


Re: Ideal file size
Mohit Anchlia 2012-06-06, 17:14
On Wed, Jun 6, 2012 at 9:48 AM, M. C. Srivas <[EMAIL PROTECTED]> wrote:

> There are more factors to consider than just the size of the file. How long can
> you wait before you *have to* process the data? 5 minutes? 5 hours? 5
> days? If you want good timeliness, you need to roll over faster. The
> longer you wait:
>
> 1.  the lower the load on the NN,
> 2.  but the poorer the timeliness,
> 3.  and the larger the chance of lost data (i.e., the data is not saved until
> the file is closed and rolled over, unless you want to sync() after every
> write).
>
To begin with, I was going to use Flume and specify a rollover file size. I
understand the above parameters; I just want to ensure that too many small
files don't cause problems on the NameNode. For instance, there will be
times when we get GBs of data in an hour and times when we get only a few
hundred MB. From what Harsh, Edward and you have described, too many small
files don't cause issues with the NameNode so much as they increase
processing times. Looks like I need to find that balance.
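
To put rough numbers on that balance, here is a minimal back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file or block) and a 64 MB block size; both are assumptions to be replaced with your own cluster's values, and the function name is made up for illustration.

```python
import math

# Assumption: rough rule of thumb, not a measured value.
BYTES_PER_OBJECT = 150


def namenode_cost(bytes_per_hour, roll_size_bytes,
                  block_size_bytes=64 * 1024**2, hours=24):
    """Estimate (file count, NameNode heap bytes) for `hours` of ingest.

    Treats every file as a full roll_size_bytes file, so the block
    count is an upper bound.
    """
    total_bytes = bytes_per_hour * hours
    files = math.ceil(total_bytes / roll_size_bytes)
    blocks_per_file = math.ceil(roll_size_bytes / block_size_bytes)
    # One object for the file itself plus one per block.
    objects = files * (1 + blocks_per_file)
    return files, objects * BYTES_PER_OBJECT


GIB = 1024**3
MIB = 1024**2

# 1 GiB/hour for a day, rolled at 128 MiB versus 1 MiB.
print(namenode_cost(GIB, 128 * MIB))
print(namenode_cost(GIB, 1 * MIB))
```

Under these assumptions the heap cost per day stays small either way; what changes by two orders of magnitude is the number of files a later job has to open, which matches the point about processing time.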

It would also be interesting to see how others solve this problem when not
using Flume.
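
For what it's worth, one way to handle the uneven rate without Flume would be to roll on whichever comes first, a size limit or an age limit. The sketch below is only an illustration of that idea: `RollPolicy` and its methods are hypothetical names, not part of Hadoop or Flume, and the actual file handling is left out.

```python
import time


class RollPolicy:
    """Hypothetical helper: decides when the current file should be rolled."""

    def __init__(self, max_bytes, max_age_seconds, clock=time.monotonic):
        self.max_bytes = max_bytes
        self.max_age_seconds = max_age_seconds
        self.clock = clock  # injectable so the policy can be tested
        self.reset()

    def reset(self):
        """Call after opening a new file."""
        self.bytes_written = 0
        self.opened_at = self.clock()

    def record_write(self, nbytes):
        """Call after each write to the current file."""
        self.bytes_written += nbytes

    def should_roll(self):
        """True once the file is large enough or old enough."""
        if self.bytes_written == 0:
            return False  # never roll an empty file
        too_big = self.bytes_written >= self.max_bytes
        too_old = self.clock() - self.opened_at >= self.max_age_seconds
        return too_big or too_old
```

The size limit keeps files large during the busy hours, and the age limit bounds how long data sits in an unclosed file during the quiet ones.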
>
>
> On Wed, Jun 6, 2012 at 7:00 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>
> > We have a continuous flow of data into the sequence file. I am wondering
> > what would be the ideal file size before the file gets rolled over. I know
> > too many small files are not good, but could someone tell me what would be
> > the ideal size such that it doesn't overload the NameNode?
> >
>