Re: Hadoop dfs can't allocate memory with enough hard disk space when data gets huge
For searching (grepping) mailing list archives, I like MarkMail:
http://hadoop.markmail.org/ (try searching for "small files").

For concatenating files -- cat works, if you don't care about
provenance; as an alternative, you can also write a simple MR program
that creates a SequenceFile by reading in all the little files and
producing (filePath, fileContents) records.
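
Something like this works (rough and untested; it uses the old-style
SequenceFile.Writer API directly from a client process rather than
from inside a map task, and the class name, paths, and argument
handling are just placeholders):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical helper: pack a directory of small text files into one
// SequenceFile of (filePath, fileContents) records.
public class SmallFilesToSequenceFile {
  public static void main(String[] args) throws Exception {
    Path inputDir = new Path(args[0]);    // directory full of small files
    Path outputFile = new Path(args[1]);  // single SequenceFile to create

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, outputFile, Text.class, Text.class);
    try {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (status.isDir()) continue;                  // skip subdirectories
        byte[] buf = new byte[(int) status.getLen()];  // files are tiny, so read each whole
        InputStream in = fs.open(status.getPath());
        try {
          IOUtils.readFully(in, buf, 0, buf.length);
        } finally {
          in.close();
        }
        // key = original file path, value = file contents (assumes UTF-8 text)
        writer.append(new Text(status.getPath().toString()), new Text(buf));
      }
    } finally {
      writer.close();
    }
  }
}

Once the data is in one SequenceFile, your MR job reads records
instead of opening 80,000 separate files. If all you want is plain
concatenation and your Hadoop version ships it, 'hadoop fs -getmerge
<hdfs-dir> <local-file>' will also pull a directory down into a single
local file, though you lose the per-file provenance.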

The Cloudera post Ashutosh referred you to has a brief overview of all
the "standard" ideas.

-Dmitriy

On Mon, Oct 19, 2009 at 9:21 PM, Kunsheng Chen <[EMAIL PROTECTED]> wrote:
> I guess this is exactly what the problem is!
>
> Is there any way I could do "grepping archives" inside the MR program? Or is there some Hadoop command that could combine all the small files into a big one?
>
>
>
> Thanks,
>
> -Kun
>
>
> --- On Mon, 10/19/09, Ashutosh Chauhan <[EMAIL PROTECTED]> wrote:
>
>> From: Ashutosh Chauhan <[EMAIL PROTECTED]>
>> Subject: Re: Hadoop dfs can't allocate memory with enough hard disk space when data gets huge
>> To: [EMAIL PROTECTED]
>> Date: Monday, October 19, 2009, 3:30 PM
>> You might be running into the
>> "small files" problem. This has been
>> discussed multiple times on the list; grepping through
>> the archives will help.
>> See also http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
>>
>> Ashutosh
>>
>> On Sun, Oct 18, 2009 at 22:57, Kunsheng Chen <[EMAIL PROTECTED]>
>> wrote:
>>
>> > I am running a Hadoop program to perform MapReduce
>> > work on files inside a folder.
>> >
>> > My program is basically doing Map and Reduce work:
>> > each line of any file is a pair of strings, and the
>> > result is a string associated with its occurrence
>> > count across all files.
>> >
>> > The program works fine until the number of files grows
>> > to about 80,000; then the 'cannot allocate memory'
>> > error occurs for some reason.
>> >
>> > Each of the files contains around 50 lines, but the
>> > total size of all files is no more than 1.5 GB. There
>> > are 3 datanodes performing the calculation, and each of
>> > them has more than 10 GB of hard disk left.
>> >
>> > I am wondering if that is normal for Hadoop because
>> > the data is too large, or if it might be my program's
>> > problem?
>> >
>> > It is really not supposed to be, since Hadoop was
>> > developed for processing large data sets.
>> >
>> > Any idea is well appreciated.