Re: produce a large sequencefile (1TB)

Thread:
  Jerry Lam    2013-08-20, 02:25
  Bing Jiang   2013-08-20, 02:55
  Harsh J      2013-08-20, 05:38

Hi Harsh,

Thank you for the reply. It really answers my question and provides
practical advice.

Best Regards,

Jerry
On Tue, Aug 20, 2013 at 1:38 AM, Harsh J <[EMAIL PROTECTED]> wrote:

> Unfortunately, given the way Reducers work today, you wouldn't be able
> to do this. They are designed to fetch all the data before they merge,
> sort, and process it through the reducer implementation. For that to
> work, as you've deduced yourself, you will need as much space
> available locally.
>
> What you could do, however, is run a map-only job, let it produce
> smaller files, and then run a non-MR Java app that reads them all one
> by one and appends to a single HDFS SequenceFile. This is like a
> reducer, but without the local sort phase. If the sort is important
> to you as well, the tweaking will have to go further: use multiple
> reducers with total order partitioning, and then run this external
> Java app.
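>
> A rough sketch of such a concatenator app (not from this thread; the
> class name and argument handling are just placeholders, and it assumes
> the map outputs are SequenceFiles with matching key/value classes):
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileStatus;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.io.SequenceFile;
>     import org.apache.hadoop.io.Writable;
>     import org.apache.hadoop.util.ReflectionUtils;
>
>     public class SeqFileConcat {
>       public static void main(String[] args) throws Exception {
>         Configuration conf = new Configuration();
>         FileSystem fs = FileSystem.get(conf);
>         Path inDir = new Path(args[0]);   // output dir of the map-only job
>         Path outFile = new Path(args[1]); // the single big SequenceFile
>
>         SequenceFile.Writer writer = null;
>         for (FileStatus part : fs.listStatus(inDir)) {
>           if (!part.getPath().getName().startsWith("part-")) continue;
>           SequenceFile.Reader reader =
>               new SequenceFile.Reader(fs, part.getPath(), conf);
>           if (writer == null) {
>             // create the writer lazily so key/value classes match the input
>             writer = SequenceFile.createWriter(fs, conf, outFile,
>                 reader.getKeyClass(), reader.getValueClass(),
>                 SequenceFile.CompressionType.BLOCK);
>           }
>           Writable key = (Writable)
>               ReflectionUtils.newInstance(reader.getKeyClass(), conf);
>           Writable val = (Writable)
>               ReflectionUtils.newInstance(reader.getValueClass(), conf);
>           while (reader.next(key, val)) {
>             writer.append(key, val); // streams record by record, no local spill
>           }
>           reader.close();
>         }
>         if (writer != null) writer.close();
>       }
>     }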
>
> On Tue, Aug 20, 2013 at 8:25 AM, Bing Jiang <[EMAIL PROTECTED]>
> wrote:
> > Hi Jerry,
> >
> > I wonder whether it would be acceptable to use multiple reducers and
> > generate several MapFiles (each one an index file plus a data file).
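> >
> > If the concern is random access afterwards, a lookup can still span
> > several reducer outputs: the old mapred API can route a key to the
> > right MapFile through the job's partitioner. A rough sketch (paths
> > and the Text types here are just placeholders):
> >
> >     import java.io.IOException;
> >     import org.apache.hadoop.fs.FileSystem;
> >     import org.apache.hadoop.fs.Path;
> >     import org.apache.hadoop.io.MapFile;
> >     import org.apache.hadoop.io.Text;
> >     import org.apache.hadoop.io.Writable;
> >     import org.apache.hadoop.mapred.JobConf;
> >     import org.apache.hadoop.mapred.MapFileOutputFormat;
> >     import org.apache.hadoop.mapred.lib.HashPartitioner;
> >
> >     public class MapFileLookup {
> >       public static void main(String[] args) throws IOException {
> >         JobConf conf = new JobConf();
> >         FileSystem fs = FileSystem.get(conf);
> >         // directory with one MapFile per reducer (part-00000, ...)
> >         Path outDir = new Path(args[0]);
> >
> >         MapFile.Reader[] readers =
> >             MapFileOutputFormat.getReaders(fs, outDir, conf);
> >         // must match the partitioner the job used, so the lookup
> >         // hits the MapFile that received this key
> >         HashPartitioner<Text, Text> partitioner =
> >             new HashPartitioner<Text, Text>();
> >         Text key = new Text(args[1]);
> >         Text value = new Text();
> >         Writable hit =
> >             MapFileOutputFormat.getEntry(readers, partitioner, key, value);
> >         System.out.println(hit == null ? "not found" : value.toString());
> >         for (MapFile.Reader r : readers) r.close();
> >       }
> >     }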
> >
> > I would like to understand the real difficulty with post-processing
> > the output of multiple reducers. Maybe there are some constraints on
> > the application side?
> >
> >
> >
> > 2013/8/20 Jerry Lam <[EMAIL PROTECTED]>
> >>
> >> Hi Bing,
> >>
> >> You are correct: the local storage does not have enough capacity to
> >> hold the temporary files generated by the mappers. Since we want a
> >> single sequence file at the end, we are forced to use one reducer.
> >>
> >> The use case is that we want to generate an index for the 1TB
> >> sequence file so that we can randomly access each row in it. In
> >> practice, this is simply a MapFile.
> >>
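> >> For what it's worth, a MapFile is just a directory containing a
> >> key-sorted SequenceFile named "data" plus an "index" file; if we can
> >> produce the big file already sorted by key, rebuilding the index
> >> with MapFile.fix() would give us random access. A sketch only, with
> >> illustrative paths and assuming Text keys and values:
> >>
> >>     import org.apache.hadoop.conf.Configuration;
> >>     import org.apache.hadoop.fs.FileSystem;
> >>     import org.apache.hadoop.fs.Path;
> >>     import org.apache.hadoop.io.MapFile;
> >>     import org.apache.hadoop.io.Text;
> >>
> >>     public class RebuildIndex {
> >>       public static void main(String[] args) throws Exception {
> >>         Configuration conf = new Configuration();
> >>         FileSystem fs = FileSystem.get(conf);
> >>         // a MapFile is a dir: sorted "data" SequenceFile + "index"
> >>         Path mapDir = new Path("/user/jerry/big.map");  // illustrative
> >>         fs.mkdirs(mapDir);
> >>         fs.rename(new Path("/user/jerry/big.seq"),      // the 1TB file
> >>                   new Path(mapDir, MapFile.DATA_FILE_NAME));
> >>         // rebuilds the index; requires the data to be key-sorted
> >>         long n = MapFile.fix(fs, mapDir, Text.class, Text.class,
> >>                              false, conf);
> >>         System.out.println("indexed " + n + " entries");
> >>       }
> >>     }
> >>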
> >> Any ideas on how to resolve this dilemma are greatly appreciated.
> >>
> >> Jerry
> >>
> >>
> >>
> >> On Mon, Aug 19, 2013 at 8:14 PM, Bing Jiang <[EMAIL PROTECTED]>
> >> wrote:
> >>>
> >>> Hi, Jerry.
> >>> I think you are worried about the volume of the MapReduce local
> >>> files, but would you give us more details about your app?
> >>>
> >>> On Aug 20, 2013 6:09 AM, "Jerry Lam" <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>> Hi Hadoop users and developers,
> >>>>
> >>>> I have a use case where I need to produce a large sequence file of
> >>>> 1 TB in size; each datanode has 200 GB of storage, but I have 30
> >>>> datanodes.
> >>>>
> >>>> The problem is that no single reducer can hold 1 TB of data during
> >>>> the reduce phase to generate a single sequence file, even if I use
> >>>> aggressive compression. Any datanode will run out of space, since
> >>>> this is a single-reducer job.
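> >>>>
> >>>> (For scale: 30 datanodes x 200 GB = 6 TB of raw capacity, so HDFS
> >>>> can store the 1 TB result, but the reduce-side merge needs roughly
> >>>> 1 TB of local disk on the one node running the reducer, about five
> >>>> times what any single node has.)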
> >>>>
> >>>> Any comment and help is appreciated.
> >>>>
> >>>> Jerry
> >>
> >>
> >
> >
> >
> > --
> > Bing Jiang
> > Tel:(86)134-2619-1361
> > weibo: http://weibo.com/jiangbinglover
> > BLOG: www.binospace.com
> > BLOG: http://blog.sina.com.cn/jiangbinglover
> > Focus on distributed computing, HDFS/HBase
>
>
>
> --
> Harsh J
>