Rough estimation: since word count requires very little computation, it is
io centric, we can do estimation based on disk speed.
Assume 10 disk with each 100MBps for each node, that is about 1GBps per
node; assume 70% utilization in mapper, we have 700MBps for each node. For
30 nodes, it is total about 20GBps, so we need about 500 seconds for 10 TB
Adding some map reduce overhead and the final merging, say 20% overhead, we
can expect about 10 minutes here.
On Tuesday, April 15, 2014, Shashidhar Rao <[EMAIL PROTECTED]>