
Xuri Nagarin 2013-10-10, 18:05
Pradeep Gollakota 2013-10-10, 18:15
Xuri Nagarin 2013-10-10, 19:29
Pradeep Gollakota 2013-10-10, 20:27
Xuri Nagarin 2013-10-10, 20:50
DSuiter RDX 2013-10-11, 11:48
Xuri Nagarin 2013-10-15, 03:02
Re: Improving MR job disk IO
There are a few reasons to use map/reduce, or just map-only or
reduce-only jobs.
1) You want to run parallel algorithms where data from multiple machines
has to be cross-checked. Map/Reduce allows this.
2) You want to run several instances of a job. Hadoop does this reliably
by monitoring all instances, restarting failed ones, etc.
3) You have way too much data to fit on one computer. Same as #2.

You might not need Hadoop if you can run your programs without it.
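
For the line-at-a-time matching discussed in the quoted messages, here is a
minimal plain-Java sketch. It is not the Grep example bundled with Hadoop; the
class name LineGrepSketch, the method grepLines, and the sample strings are
made up for illustration, and splitting on "\n" only stands in for a
line-oriented record reader. It shows why input without newline characters
comes back as one whole match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class LineGrepSketch {
    // Returns every line of `input` that contains a match for `regex`.
    // Splitting on '\n' is an assumption standing in for a line-oriented reader.
    static List<String> grepLines(String input, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        for (String line : input.split("\n")) {
            if (pattern.matcher(line).find()) {
                matches.add(line);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: the same three records, with and without EOL markers.
        String withNewlines = "alpha error one\nbeta ok\ngamma error two\n";
        String withoutNewlines = "alpha error one|beta ok|gamma error two";

        // Each record is tested separately, so only the matching lines are returned.
        System.out.println(grepLines(withNewlines, "error"));
        // There is only one "line", so the whole input is returned as a single match.
        System.out.println(grepLines(withoutNewlines, "error"));
    }
}
```

With newlines, only the two matching records are printed; without them, the
entire input is printed as one match, which is the symptom described for the
local Avro file.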


On 10/14/2013 08:02 PM, Xuri Nagarin wrote:
> Yes, I tested with smaller data sets and the MR job correctly
> reads/matches one line at a time.
> On Fri, Oct 11, 2013 at 4:48 AM, DSuiter RDX <[EMAIL PROTECTED]> wrote:
>     So, perhaps this has been thought of, but perhaps not.
>     It is my understanding that grep usually matches things one
>     line at a time. As I am currently experimenting with Avro, I am
>     finding that the local grep does not handle it well at all,
>     because the file is essentially one long line. Working from
>     local Avro, grep does poorly at pattern matching: it just
>     returns the whole file as a match. The file also takes a long
>     time to view in the vi editor, since there are no EOL markers.
>     If you have modified the job to read sequence files, are you
>     reading a sequence file that has newline characters? If not,
>     perhaps the file is being read as one whole line, causing some
>     unexpected effects.
>     Thanks,
>     *Devin Suiter*
>     Jr. Data Solutions Software Engineer
>     100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>     Google Voice: 412-256-8556 | www.rdx.com
>     On Thu, Oct 10, 2013 at 4:50 PM, Xuri Nagarin <[EMAIL PROTECTED]> wrote:
>         On Thu, Oct 10, 2013 at 1:27 PM, Pradeep Gollakota
>         <[EMAIL PROTECTED]> wrote:
>             I don't think it necessarily means that the job is a bad
>             candidate for MR. It's a different type of a workload.
>             Hortonworks has a great article on the different types of
>             workloads you might see and how that affects your
>             provisioning choices at
>             http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html
>         One statement that stood out to me in the link above is "For
>         these reasons, Hortonworks recommends that you either use the
>         Balanced workload configuration or invest in a pilot Hadoop
>         cluster and plan to evolve as you analyze the workload
>         patterns in your environment."
>         Now, this is not a critique/concern of HW but rather of
>         Hadoop. Well, what if my workloads can be both CPU and IO
>         intensive? Do I take the approach of
>         throw-enough-excess-hardware-just-in-case?
>             I have not looked at the Grep code so I'm not sure why
>             it's behaving the way it is. Still curious that streaming
>             has a higher IO throughput and lower CPU usage. It may
>             have to do with the fact that /bin/grep is a native
>             implementation and Grep (Hadoop) is probably using Java
>             Pattern/Matcher API.
>         The Grep code is from the bundled examples in CDH. I made one
>         line modification for it to read Sequence files. The streaming
>         job probably does not have lower CPU utilization but I see
>         that it does even out the CPU utilization among all the
>         available processors. I guess the native grep binary threads
>         better than the Java MR job?
>         Which brings me to ask: if you have the mapper/reducer
>         functionality built into a platform-specific binary, then
>         won't it always be more efficient than a Java MR job? And, in
>         such cases, am I better off with streaming than Java MR?
Xuri Nagarin 2013-10-15, 03:50