Hadoop, mail # user - Too many maps?


Re: Too many maps?
Mark Kerzner 2011-09-07, 03:14
Harsh,

I read one PST file, which contains many emails. But then I emit many maps,
like this:

        MapWritable mapWritable = createMapWritable(metadata, fileName);
        // use MD5 of the input file as Hadoop key
        MD5Hash key;
        FileInputStream fileInputStream = new FileInputStream(fileName);
        try {
            key = MD5Hash.digest(fileInputStream);
        } finally {
            // close the stream even if hashing throws
            fileInputStream.close();
        }
        // emit map
        context.write(key, mapWritable);

and it is these context.write calls that I have a great number of. Is that a
problem?
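
For reference, the duplicate-detection idea behind these keys can be sketched
in plain Java, without Hadoop. This is only an illustration under assumptions:
Md5DedupSketch, md5Hex, and groupByHash are hypothetical names,
java.security.MessageDigest stands in for Hadoop's MD5Hash, and an in-memory
map stands in for the shuffle that delivers all values with the same key to
one reduce call.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not FreeEed code and not the Hadoop API.
class Md5DedupSketch {

    // Hex-encoded MD5 of the given bytes (stand-in for MD5Hash.digest).
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Groups item names by the MD5 of their content, the way values that
    // share a key end up together in a single reduce call.
    static Map<String, List<String>> groupByHash(Map<String, String> items) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : items.entrySet()) {
            String key = md5Hex(e.getValue().getBytes(StandardCharsets.UTF_8));
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }
}
```

Given three items of which two have identical content, groupByHash returns
two groups, and the group holding two names is the set of duplicates.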

Mark

On Tue, Sep 6, 2011 at 10:06 PM, Harsh J <[EMAIL PROTECTED]> wrote:

> You can use an input format that lets you read multiple files per map
> (like say, all local files. See CombineFileInputFormat for one
> implementation that does this). This way you get reduced map #s and
> you don't really have to clump your files. One record reader would be
> initialized per file, so I believe you should be free to generate
> unique identities per file/email with this approach (whenever a new
> record reader is initialized)?
>
> On Wed, Sep 7, 2011 at 7:12 AM, Mark Kerzner <[EMAIL PROTECTED]>
> wrote:
> > Hi,
> >
> > I am testing my Hadoop-based FreeEed <http://freeeed.org/>, an open source
> > tool for eDiscovery, and I am using the Enron data set
> > <http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2>
> > for that. In my processing, each email with its attachments becomes a map,
> > and it is later collected by a reducer and written to the output. With the
> > (PST) mailboxes of around 2-5 Gigs, I begin to see numbers of emails
> > of about 50,000. I remember in Yahoo best practices that the number of maps
> > should not exceed 75,000, and I can see that I can break this barrier soon.
> >
> > I could, potentially, combine a few emails into one map, but I would be
> > doing it only to circumvent the size problem, not because my processing
> > requires it. Besides, my keys are the MD5 hashes of the files, and I use
> > them to find duplicates. If I combine a few emails into a map, I cannot use
> > the hashes as keys in a meaningful way anymore.
> >
> > So my question is, can't I have millions of maps, if that's how many
> > artifacts I need to process, and why not?
> >
> > Thank you. Sincerely,
> > Mark
> >
>
>
>
> --
> Harsh J
>