-Re: Improving MR job disk IO
DSuiter RDX 2013-10-11, 11:48
So, perhaps this has been thought of, but perhaps not.
It is my understanding that grep is usually sorting things one line at a
time. As I am currently experimenting with Avro, I am finding that the
local grep function does not handle it well at all, because it is one long
line essentially, so working from local Avro, grep does not do well at
pattern matching, it just returns the whole file as a match, and it takes a
long time to view it in vi editor as well since there are no EOL markers.
If you have modified for sequence file, are you reading a sequence file
that has newline characters? If not, perhaps the file is being read as one
whole line, causing some unexpected effects.
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
On Thu, Oct 10, 2013 at 4:50 PM, Xuri Nagarin <[EMAIL PROTECTED]> wrote:
> On Thu, Oct 10, 2013 at 1:27 PM, Pradeep Gollakota <[EMAIL PROTECTED]>wrote:
>> I don't think it necessarily means that the job is a bad candidate for
>> MR. It's a different type of a workload. Hortonworks has a great article on
>> the different types of workloads you might see and how that affects your
>> provisioning choices at
> One statement that stood out to me in the link above is "For these
> reasons, Hortonworks recommends that you either use the Balanced workload
> configuration or invest in a pilot Hadoop cluster and plan to evolve as you
> analyze the workload patterns in your environment."
> Now, this is not a critique/concern of HW but rather of hadoop. Well, what
> if my workloads can be both CPU and IO intensive? Do I take the approach of
>> I have not looked at the Grep code so I'm not sure why it's behaving the
>> way it is. Still curious that streaming has a higher IO throughput and
>> lower CPU usage. It may have to do with the fact that /bin/grep is a native
>> implementation and Grep (Hadoop) is probably using Java Pattern/Matcher api.
> The Grep code is from the bundled examples in CDH. I made one line
> modification for it to read Sequence files. The streaming job probably does
> not have lower CPU utilization but I see that it does even out the CPU
> utilization among all the available processors. I guess the native grep
> binary threads better than the java MR job?
> Which brings me to ask - If you have the mapper/reducer functionality
> built into a platform specific binary, then won't it always be more
> efficient than a java MR job? And, in such cases, am I better off with
> streaming than Java MR?
> Thanks for your responses.
>> On Thu, Oct 10, 2013 at 12:29 PM, Xuri Nagarin <[EMAIL PROTECTED]> wrote:
>>> Thanks Pradeep. Does it mean this job is a bad candidate for MR?
>>> Interestingly, running the cmdline '/bin/grep' under a streaming job
>>> provides (1) Much better disk throughput and, (2) CPU load is almost evenly
>>> spread across all cores/threads (no CPU gets pegged to 100%).
>>> On Thu, Oct 10, 2013 at 11:15 AM, Pradeep Gollakota <
>>> [EMAIL PROTECTED]> wrote:
>>>> Actually... I believe that is expected behavior. Since your CPU is
>>>> pegged at 100% you're not going to be IO bound. Typically jobs tend to be
>>>> CPU bound or IO bound. If you're CPU bound you expect to see low IO
>>>> throughput. If you're IO bound, you expect to see low CPU usage.
>>>> On Thu, Oct 10, 2013 at 11:05 AM, Xuri Nagarin <[EMAIL PROTECTED]>wrote:
>>>>> I have a simple Grep job (from bundled examples) that I am running on
>>>>> a 11-node cluster. Each node is 2x8-core Intel Xeons (shows 32 CPUs with HT
>>>>> on), 64GB RAM and 8 x 1TB disks. I have mappers set to 20 per node.
>>>>> When I run the Grep job, I notice that CPU gets pegged to 100% on