Re: produce a large sequencefile (1TB)
Hi Jerry,

Would it be acceptable to use multiple reducers, so that the job generates
several MapFiles (each with its own index file and data file) instead of one?

I would like to understand what actually makes multiple reducers difficult
for your post-processing. Is there some constraint in the application?
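
To make the multiple-reducer idea concrete, here is a minimal sketch of how
random access could still work across several MapFiles. It assumes keys are
range-partitioned, so that part file i only holds keys smaller than those in
part file i+1. The class name, the split points, and the "part-r-NNNNN"
naming are illustrative assumptions, not taken from your job. This is plain
Java with no Hadoop dependency; in a real job, Hadoop's TotalOrderPartitioner
and MapFileOutputFormat.getEntry are the pieces that play this role.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper: maps a key to the reducer (and part file) that owns it.
final class RangePartitionSketch {
    // Sorted split points, e.g. obtained by sampling the input keys.
    // N split points define N + 1 partitions, one per reducer.
    private final String[] splitPoints;

    RangePartitionSketch(String[] sortedSplitPoints) {
        this.splitPoints = sortedSplitPoints.clone();
    }

    int numPartitions() {
        return splitPoints.length + 1;
    }

    // Keys below splitPoints[0] go to partition 0; a key k with
    // splitPoints[i-1] <= k < splitPoints[i] goes to partition i.
    int partitionFor(String key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // Exact match: the key starts the next partition.
        // No match: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    // Name of the MapFile directory expected to contain the key
    // (assumed naming, following the usual reducer output pattern).
    String partFileFor(String key) {
        return String.format(Locale.ROOT, "part-r-%05d", partitionFor(key));
    }

    public static void main(String[] args) {
        RangePartitionSketch p =
            new RangePartitionSketch(new String[] {"g", "n", "t"});
        for (String key : new String[] {"apple", "g", "melon", "zebra"}) {
            System.out.println(key + " -> " + p.partFileFor(key));
        }
    }
}
```

With a scheme like this, a lookup opens only the one MapFile that can contain
the key, and each reducer writes only its own slice of the 1TB, so no single
node has to hold the whole output.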

2013/8/20 Jerry Lam <[EMAIL PROTECTED]>

> Hi Bing,
>
> you are correct. The local storage does not have enough capacity to hold
> the temporary files generated by the mappers. Since we want a single
> sequence file at the end, we are forced to use 1 reducer.
>
> The use case is that we want to generate an index for the 1TB sequence
> file so that we can randomly access each row in the sequence file. In
> practice, this is simply a MapFile.
>
> Any ideas on how to resolve this dilemma are greatly appreciated.
>
> Jerry
>
>
>
> On Mon, Aug 19, 2013 at 8:14 PM, Bing Jiang <[EMAIL PROTECTED]> wrote:
>
>> Hi Jerry,
>> I think you are worried about the volume of MapReduce local files, but
>> could you give us more details about your application?
>>  On Aug 20, 2013 6:09 AM, "Jerry Lam" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi Hadoop users and developers,
>>>
>>> I have a use case where I need to produce a single large sequence file,
>>> 1 TB in size. Each datanode has only 200GB of storage, although I have 30
>>> datanodes.
>>>
>>> The problem is that no single reducer can hold 1TB of data during the
>>> reduce phase to generate a single sequence file, even if I use aggressive
>>> compression. Any datanode will run out of space since this is a
>>> single-reducer job.
>>>
>>> Any comments and help are appreciated.
>>>
>>> Jerry
>>>
>>
>
--
Bing Jiang
Tel:(86)134-2619-1361
weibo: http://weibo.com/jiangbinglover
BLOG: www.binospace.com
BLOG: http://blog.sina.com.cn/jiangbinglover
Focus on distributed computing, HDFS/HBase