Hadoop >> mail # user >> Too many maps?


Too many maps?
Hi,

I am testing my Hadoop-based FreeEed <http://freeeed.org/>, an open source
tool for eDiscovery, and I am using the Enron data set
<http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2> for
that. In my processing, each email with its attachments becomes a map,
and it is later collected by a reducer and written to the output. With
(PST) mailboxes of around 2-5 GB, I am beginning to see email counts
of about 50,000. I remember from the Yahoo best practices that the number of
maps should not exceed 75,000, and I can see that I could break this barrier soon.

I could, potentially, combine a few emails into one map, but I would be
doing it only to circumvent the size problem, not because my processing
requires it. Besides, my keys are the MD5 hashes of the files, and I use
them to find duplicates. If I combine a few emails into a map, I cannot use
the hashes as keys in a meaningful way anymore.
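
For concreteness, the keying scheme described above can be sketched in plain
Java. This is only an illustration under stated assumptions, not FreeEed's
actual code: the class and method names (Md5DedupSketch, md5Hex, groupByHash)
are hypothetical, and an in-memory map stands in for Hadoop's map and shuffle
phases. Each file is emitted under the MD5 hash of its content, so files with
identical content end up grouped under the same key.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: duplicate detection keyed by MD5, without Hadoop.
public class Md5DedupSketch {

    // Hex-encoded MD5 digest of a file's bytes; plays the role of the map output key.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Stand-in for map + shuffle: emit (md5, name) per file, then group names by key.
    static Map<String, List<String>> groupByHash(Map<String, byte[]> files) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            String key = md5Hex(e.getValue());
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up sample data: two files share the same content.
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.eml", "hello".getBytes(StandardCharsets.UTF_8));
        files.put("b.eml", "hello".getBytes(StandardCharsets.UTF_8));
        files.put("c.eml", "other".getBytes(StandardCharsets.UTF_8));

        // A "reducer" would see each key once, with all files that share it.
        for (Map.Entry<String, List<String>> g : groupByHash(files).entrySet()) {
            String label = g.getValue().size() > 1 ? "duplicates" : "unique";
            System.out.println(g.getKey() + " " + label + " " + g.getValue());
        }
    }
}
```

The point of the sketch is that the hash is computed per file, so the key is
only meaningful while each emitted record corresponds to a single file.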

So my question is: can I have millions of maps, if that's how many
artifacts I need to process? And if not, why not?

Thank you. Sincerely,
Mark