Kunsheng Chen 2009-10-19, 02:57
Amogh Vasekar 2009-10-19, 13:01
Ashutosh Chauhan 2009-10-19, 15:30
Kunsheng Chen 2009-10-20, 01:21
Re: Hadoop dfs can't allocate memory with enough hard disk space when data gets huge
Dmitriy Ryaboy 2009-10-20, 02:01
For searching (grepping) mailing list archives, I like MarkMail:
http://hadoop.markmail.org/ (try searching for "small files").

For concatenating files -- cat works, if you don't care about provenance; as an alternative, you can also write a simple MR program that creates a SequenceFile by reading in all the little files and producing (filePath, fileContents) records.

The Cloudera post Ashutosh referred you to has a brief overview of all the "standard" ideas.

-Dmitriy

On Mon, Oct 19, 2009 at 9:21 PM, Kunsheng Chen <[EMAIL PROTECTED]> wrote:
> I guess this is exactly the problem is!
>
> Is there any way I could do "Greping archives" inside the MP program ? Or some hadoop command that could combine all small pieces files into a big one ?
>
> Thanks,
>
> -Kun
>
> --- On Mon, 10/19/09, Ashutosh Chauhan <[EMAIL PROTECTED]> wrote:
>
>> From: Ashutosh Chauhan <[EMAIL PROTECTED]>
>> Subject: Re: Hadoop dfs can't allocate memory with enough hard disk space when data gets huge
>> To: [EMAIL PROTECTED]
>> Date: Monday, October 19, 2009, 3:30 PM
>>
>> You might be hitting into the problem of "small-files". This has been discussed multiple times on the list. Greping through archives will help.
>> Also http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
>>
>> Ashutosh
>>
>> On Sun, Oct 18, 2009 at 22:57, Kunsheng Chen <[EMAIL PROTECTED]> wrote:
>>
>> > I and running a hadoop program to perform MapReduce work on files inside a folder.
>> >
>> > My program is basically doing Map and Reduce work, each line of any file is a pair of string, and the result is a string associate with occurence inside all files.
>> >
>> > The program works fine until the number of files grow to about 80,000,then the 'cannot allocate memory' error occur for some reason.
>> >
>> > Each of the file contains around 50 lines, but the total size of all files is no more than 1.5 GB. There are 3 datanodes performing calculation,each of them have more than 10GB hd left.
>> >
>> > I am wondering if that is normal for Hadoop because the data is too large ?
>> > Or it might be my programs problem ?
>> >
>> > It is really not supposed to be since Hadoop was developed for processing large data sets.
>> >
>> > Any idea is well appreciated
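
To make the (filePath, fileContents) idea from the reply above more concrete, here is a rough sketch in plain Java. It is an illustration only, not the MR job described in the thread and not Hadoop's SequenceFile format: it uses a simple length-prefixed container written with `DataOutputStream`. The class name `SmallFilePacker` and the methods `pack` and `unpack` are made up for this example. On a real cluster you would presumably write the same records through Hadoop's `SequenceFile.Writer` instead (for instance with the path as a `Text` key and the bytes as a `BytesWritable` value), but that choice of key and value types is an assumption, not something stated in the thread.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a plain length-prefixed container, NOT Hadoop's SequenceFile format.
class SmallFilePacker {

    // Writes one (filePath, fileContents) record per input file into a single output file.
    // Returns the number of records written.
    static int pack(List<Path> smallFiles, Path packed) throws IOException {
        int records = 0;
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(packed))) {
            for (Path file : smallFiles) {
                byte[] contents = Files.readAllBytes(file);
                out.writeUTF(file.toString());  // key: original path, so provenance is kept
                out.writeInt(contents.length);  // length prefix for the value
                out.write(contents);            // value: raw bytes of the small file
                records++;
            }
        }
        return records;
    }

    // Reads the records back in the order they were written.
    static Map<String, byte[]> unpack(Path packed) throws IOException {
        Map<String, byte[]> records = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(Files.newInputStream(packed))) {
            while (true) {
                String key;
                try {
                    key = in.readUTF();
                } catch (EOFException endOfFile) {
                    break;  // no more records
                }
                byte[] contents = new byte[in.readInt()];
                in.readFully(contents);
                records.put(key, contents);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Two tiny stand-in input files created in a temp directory.
        Path dir = Files.createTempDirectory("small-files");
        Path a = Files.write(dir.resolve("a.txt"), "foo bar\n".getBytes(StandardCharsets.UTF_8));
        Path b = Files.write(dir.resolve("b.txt"), "baz qux\n".getBytes(StandardCharsets.UTF_8));
        Path packed = dir.resolve("packed.bin");
        int n = pack(List.of(a, b), packed);
        System.out.println(n + " records, keys: " + unpack(packed).keySet());
    }
}
```

Unlike plain cat, each record keeps the path it came from, which is the provenance point made in the reply. The job then reads one large file instead of roughly 80,000 tiny ones.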