Re: Too many maps?
Thank you, Sonal,

at least that big job I was looking at just finished :)

Mark

On Tue, Sep 6, 2011 at 11:56 PM, Sonal Goyal <[EMAIL PROTECTED]> wrote:

> Mark,
>
> Having a large number of emitted key/value pairs from the mapper should not
> be a problem. Just make sure that you have enough reducers to handle the data
> so that the reduce stage does not become a bottleneck.
>
> Best Regards,
> Sonal
> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
> Nube Technologies <http://www.nubetech.co>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
>
>
>
>
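A minimal sketch, assuming a standard Job-based driver, of the reducer-count knob Sonal refers to; the class name, job name, and the value 20 are placeholders, not recommendations from the thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "freeeed-processing"); // placeholder job name
            // More reducers spread the reduce-side load so a large volume of
            // mapper output does not funnel through a single reduce task.
            job.setNumReduceTasks(20); // placeholder value; size it to the cluster
            // ... mapper, reducer, input/output formats as in the real job ...
        }
    }
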
> On Wed, Sep 7, 2011 at 8:44 AM, Mark Kerzner <[EMAIL PROTECTED]>
> wrote:
>
> > Harsh,
> >
> > I read one PST file, which contains many emails. But then I emit many
> > maps, like this:
> >
> >        MapWritable mapWritable = createMapWritable(metadata, fileName);
> >        // use MD5 of the input file as Hadoop key
> >        FileInputStream fileInputStream = new FileInputStream(fileName);
> >        MD5Hash key = MD5Hash.digest(fileInputStream);
> >        fileInputStream.close();
> >        // emit map
> >        context.write(key, mapWritable);
> >
> > and it is these context.write calls that I have a great number of. Is that
> > a problem?
> >
> > Mark
> >
> > On Tue, Sep 6, 2011 at 10:06 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> >
> > > You can use an input format that lets you read multiple files per map
> > > (like, say, all local files; see CombineFileInputFormat for one
> > > implementation that does this). This way you get a reduced number of
> > > maps and you don't really have to clump your files. One record reader
> > > would be initialized per file, so I believe you should be free to
> > > generate unique identities per file/email with this approach (whenever
> > > a new record reader is initialized)?
> > >
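As a hedged illustration of Harsh's suggestion (not code from the thread), a CombineFileInputFormat subclass along these lines lets many small files share one map task while each file still gets its own record reader; WholeFilePerEmailReader is an assumed custom RecordReader, not a Hadoop class:

    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

    public class CombinedEmailInputFormat
            extends CombineFileInputFormat<Text, BytesWritable> {
        @Override
        public RecordReader<Text, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) throws IOException {
            // CombineFileRecordReader opens one WholeFilePerEmailReader (the
            // assumed custom reader) per file in the combined split, so
            // per-file identities such as MD5 keys can still be generated there.
            return new CombineFileRecordReader<Text, BytesWritable>(
                    (CombineFileSplit) split, context, WholeFilePerEmailReader.class);
        }
    }

The older org.apache.hadoop.mapred.lib package has an equivalent class if the job still uses the old API.
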
> > > On Wed, Sep 7, 2011 at 7:12 AM, Mark Kerzner <[EMAIL PROTECTED]>
> > > wrote:
> > > > Hi,
> > > >
> > > > I am testing my Hadoop-based FreeEed <http://freeeed.org/>, an open
> > > > source tool for eDiscovery, and I am using the Enron data set
> > > > <http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2>
> > > > for that. In my processing, each email with its attachments becomes a
> > > > map, and it is later collected by a reducer and written to the output.
> > > > With the (PST) mailboxes of around 2-5 Gigs, I begin to see numbers of
> > > > emails of about 50,000. I remember from the Yahoo best practices that
> > > > the number of maps should not exceed 75,000, and I can see that I will
> > > > break this barrier soon.
> > > >
> > > > I could, potentially, combine a few emails into one map, but I would
> > > > be doing it only to circumvent the size problem, not because my
> > > > processing requires it. Besides, my keys are the MD5 hashes of the
> > > > files, and I use them to find duplicates. If I combine a few emails
> > > > into a map, I cannot use the hashes as keys in a meaningful way
> > > > anymore.
> > > >
> > > > So my question is, can't I have millions of maps, if that's how many
> > > > artifacts I need to process, and why not?
> > > >
> > > > Thank you. Sincerely,
> > > > Mark
> > > >
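A hedged sketch (not FreeEed's actual reducer) of how MD5 keys keep de-duplication simple on the reduce side when every email stays its own record: identical files hash to the same key, so the reducer can emit just one value per key.

    import java.io.IOException;
    import org.apache.hadoop.io.MD5Hash;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    public class DedupReducer
            extends Reducer<MD5Hash, MapWritable, MD5Hash, MapWritable> {
        @Override
        protected void reduce(MD5Hash key, Iterable<MapWritable> values,
                Context context) throws IOException, InterruptedException {
            // All copies of a file arrive under the same MD5 key; writing only
            // the first value drops the duplicates.
            for (MapWritable value : values) {
                context.write(key, value);
                break;
            }
        }
    }
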
> > >
> > >
> > >
> > > --
> > > Harsh J
> > >
> >
>