Hadoop user mailing list: Hadoop dfs can't allocate memory with enough hard disk space when data gets huge


Thread:
  Kunsheng Chen       2009-10-19, 02:57
  Amogh Vasekar       2009-10-19, 13:01
  Ashutosh Chauhan    2009-10-19, 15:30
  Kunsheng Chen       2009-10-20, 01:21

Re: Hadoop dfs can't allocate memory with enough hard disk space when data gets huge
For searching (grepping) mailing list archives, I like MarkMail:
http://hadoop.markmail.org/ (try searching for "small files").

For concatenating files -- cat works, if you don't care about
provenance; as an alternative, you can also write a simple MR program
that creates a SequenceFile by reading in all the little files and
producing (filePath, fileContents) records.
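
Something along these lines would do it (an untested, single-process sketch
rather than a full MR job; the class name and argument handling are made up,
but the FileSystem/SequenceFile calls are the standard org.apache.hadoop ones):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Untested sketch: pack every file under one directory into a single
// SequenceFile of (filePath, fileContents) records.
public class SmallFilesToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);    // directory of small files
    Path outputFile = new Path(args[1]);  // SequenceFile to create

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, outputFile, Text.class, BytesWritable.class);
    try {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (status.isDir()) continue;      // skip subdirectories
        byte[] contents = new byte[(int) status.getLen()];
        InputStream in = fs.open(status.getPath());
        try {
          IOUtils.readFully(in, contents, 0, contents.length);
        } finally {
          IOUtils.closeStream(in);
        }
        // key = file path, value = raw file bytes
        writer.append(new Text(status.getPath().toString()),
                      new BytesWritable(contents));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}

Once the data is in a SequenceFile, a job can read it with
SequenceFileInputFormat and get one (path, contents) record per small file,
instead of paying the overhead of one input split per tiny file.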

The Cloudera post Ashutosh referred you to has a brief overview of all
the "standard" ideas.

-Dmitriy

On Mon, Oct 19, 2009 at 9:21 PM, Kunsheng Chen <[EMAIL PROTECTED]> wrote:
> I guess this is exactly what the problem is!
>
> Is there any way I could do the "grepping archives" inside the MR program? Or is there some hadoop command that could combine all the small files into a big one?
>
>
>
> Thanks,
>
> -Kun
>
>
> --- On Mon, 10/19/09, Ashutosh Chauhan <[EMAIL PROTECTED]> wrote:
>
>> From: Ashutosh Chauhan <[EMAIL PROTECTED]>
>> Subject: Re: Hadoop dfs can't allocate memory with enough hard disk space when data gets huge
>> To: [EMAIL PROTECTED]
>> Date: Monday, October 19, 2009, 3:30 PM
>> You might be hitting the "small files" problem. This has been
>> discussed multiple times on the list. Grepping through the
>> archives will help.
>> Also http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
>>
>> Ashutosh
>>
>> On Sun, Oct 18, 2009 at 22:57, Kunsheng Chen <[EMAIL PROTECTED]>
>> wrote:
>>
>> > I am running a hadoop program to perform MapReduce
>> > work on files inside a folder.
>> >
>> > My program is basically doing Map and Reduce work:
>> > each line of any file is a pair of strings, and the
>> > result is each string associated with its occurrence
>> > count across all files.
>> >
>> > The program works fine until the number of files grows
>> > to about 80,000; then a 'cannot allocate memory' error
>> > occurs for some reason.
>> >
>> > Each of the files contains around 50 lines, but the
>> > total size of all files is no more than 1.5 GB. There
>> > are 3 datanodes performing the calculation, and each of
>> > them has more than 10 GB of disk space left.
>> >
>> > I am wondering if that is normal for Hadoop because the
>> > data is too large? Or might it be my program's problem?
>> >
>> > It is really not supposed to be, since Hadoop was
>> > developed for processing large data sets.
>> >
>> > Any idea is well appreciated.
>>
>