MapReduce >> mail # user >> Improving MR job disk IO


Re: Improving MR job disk IO
I don't think it necessarily means that the job is a bad candidate for MR.
It's just a different type of workload. Hortonworks has a great article on the
different types of workloads you might see and how they affect your
provisioning choices at
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html

I have not looked at the Grep code, so I'm not sure why it's behaving the
way it is. It is still curious that streaming has higher IO throughput and
lower CPU usage. It may have to do with the fact that /bin/grep is a native
implementation, whereas Grep (Hadoop) is probably using the Java
Pattern/Matcher API.
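
To make the Pattern/Matcher guess concrete, here is a minimal sketch of what
per-line regex matching looks like in plain Java. It is not the Hadoop Grep
source (which I have not read); the class name LineGrepSketch, the method
names and the sample regex are all made up for illustration. The point is
only that every input line goes through a java.util.regex Matcher, which is
CPU work done inside the JVM for each record.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: NOT the Hadoop Grep implementation.
public class LineGrepSketch {
    // Compile the regex once and reuse it for every line.
    private final Pattern pattern;

    public LineGrepSketch(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    // Counts non-overlapping matches of the regex in a single line.
    public int countMatches(String line) {
        Matcher matcher = pattern.matcher(line);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Sums the match counts over a batch of lines, roughly what a
    // map task would do over the records of its input split.
    public long countAll(Iterable<String> lines) {
        long total = 0;
        for (String line : lines) {
            total += countMatches(line);
        }
        return total;
    }

    public static void main(String[] args) {
        LineGrepSketch grep = new LineGrepSketch("err(or)?");
        List<String> lines = Arrays.asList("error: disk full", "ok", "err and error");
        System.out.println(grep.countAll(lines));
    }
}
```

Timing something like this against /bin/grep on the same file, outside of
Hadoop, would be one way to check whether the regex engine alone accounts
for the difference.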
On Thu, Oct 10, 2013 at 12:29 PM, Xuri Nagarin <[EMAIL PROTECTED]> wrote:

> Thanks Pradeep. Does it mean this job is a bad candidate for MR?
>
> Interestingly, running the cmdline '/bin/grep' under a streaming job
> provides (1) Much better disk throughput and, (2) CPU load is almost evenly
> spread across all cores/threads (no CPU gets pegged to 100%).
>
>
>
>
> On Thu, Oct 10, 2013 at 11:15 AM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:
>
>> Actually... I believe that is expected behavior. Since your CPU is pegged
>> at 100% you're not going to be IO bound. Typically jobs tend to be CPU
>> bound or IO bound. If you're CPU bound you expect to see low IO throughput.
>> If you're IO bound, you expect to see low CPU usage.
>>
>>
>> On Thu, Oct 10, 2013 at 11:05 AM, Xuri Nagarin <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> I have a simple Grep job (from the bundled examples) that I am running on
>>> an 11-node cluster. Each node has 2 x 8-core Intel Xeons (shows 32 CPUs
>>> with HT on), 64GB RAM and 8 x 1TB disks. I have mappers set to 20 per node.
>>>
>>> When I run the Grep job, I notice that the CPU gets pegged at 100% on
>>> multiple cores but disk throughput remains a dismal 1-2 Mbytes/sec on a
>>> single disk on each node. So I guess the cluster is performing poorly in
>>> terms of disk IO. Running Terasort, I see each disk put out 25-35
>>> Mbytes/sec, with a total cluster throughput above 1.5 Gbytes/sec.
>>>
>>> How do I go about re-configuring or re-writing the job to utilize
>>> maximum disk IO?
>>>
>>> TIA,
>>>
>>> Xuri
>>>
>>>
>>>
>>
>