Re: Improving MR job disk IO
I don't think it necessarily means that the job is a bad candidate for MR.
It's just a different type of workload. Hortonworks has a great article on
the different types of workloads you might see and how they affect your
provisioning choices at
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html

I have not looked at the Grep code, so I'm not sure why it's behaving the
way it is. I'm still curious why streaming has higher IO throughput and
lower CPU usage. It may have to do with the fact that /bin/grep is a native
implementation, while Grep (Hadoop) is probably using the Java
Pattern/Matcher API.
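
To make the Pattern/Matcher guess concrete, here is a minimal sketch of a
per-line regex scan in plain Java. It is an assumption about the general
approach, not the actual Grep (Hadoop) code, which I have not read. The class
RegexLineScan and the method countMatches are hypothetical names made up for
this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not taken from the Hadoop examples.
class RegexLineScan {

    // Counts every regex match across the given lines.
    static long countMatches(Pattern pattern, List<String> lines) {
        // Reuse a single Matcher via reset() instead of allocating one per line.
        Matcher matcher = pattern.matcher("");
        long count = 0;
        for (String line : lines) {
            matcher.reset(line);
            while (matcher.find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Compile the pattern once, outside the per-line loop.
        Pattern pattern = Pattern.compile("fo+");
        List<String> lines = Arrays.asList("foo bar foo", "no match here", "fooo");
        System.out.println(countMatches(pattern, lines));
    }
}
```

If the job does something along these lines for every input record, the
regex work on the JVM, rather than the disks, could be what sets the pace.
That would be consistent with pegged cores and low disk throughput, but it
is only a guess until someone profiles the job.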

On Thu, Oct 10, 2013 at 12:29 PM, Xuri Nagarin <[EMAIL PROTECTED]> wrote:

> Thanks Pradeep. Does it mean this job is a bad candidate for MR?
>
> Interestingly, running the cmdline '/bin/grep' under a streaming job
> provides (1) Much better disk throughput and, (2) CPU load is almost evenly
> spread across all cores/threads (no CPU gets pegged to 100%).
>
> On Thu, Oct 10, 2013 at 11:15 AM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:
>
>> Actually... I believe that is expected behavior. Since your CPU is pegged
>> at 100%, you're not going to be IO bound. Typically, jobs tend to be either
>> CPU bound or IO bound. If you're CPU bound, you expect to see low IO
>> throughput. If you're IO bound, you expect to see low CPU usage.
>>
>>
>> On Thu, Oct 10, 2013 at 11:05 AM, Xuri Nagarin <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> I have a simple Grep job (from the bundled examples) that I am running on
>>> an 11-node cluster. Each node has 2 x 8-core Intel Xeons (shows 32 CPUs
>>> with HT on), 64GB RAM and 8 x 1TB disks. I have mappers set to 20 per node.
>>>
>>> When I run the Grep job, I notice that CPU gets pegged to 100% on
>>> multiple cores, but disk throughput remains a dismal 1-2 Mbytes/sec on a
>>> single disk on each node. So I guess the cluster is performing poorly in
>>> terms of disk IO. Running Terasort, I see each disk put out 25-35
>>> Mbytes/sec, with a total cluster throughput of above 1.5 Gbytes/sec.
>>>
>>> How do I go about re-configuring or re-writing the job to utilize
>>> maximum disk IO?
>>>
>>> TIA,
>>>
>>> Xuri
>>>
>>>
>>>
>>
>