MapReduce >> mail # user >> Improving MR job disk IO


+
Xuri Nagarin 2013-10-10, 18:05
+
Pradeep Gollakota 2013-10-10, 18:15
+
Xuri Nagarin 2013-10-10, 19:29
+
Pradeep Gollakota 2013-10-10, 20:27
+
Xuri Nagarin 2013-10-10, 20:50
+
DSuiter RDX 2013-10-11, 11:48
+
Xuri Nagarin 2013-10-15, 03:02
Re: Improving MR job disk IO
There are a few reasons to use map/reduce, or just map-only or
reduce-only jobs.
1) You want to run parallel algorithms where data from multiple machines
has to be cross-checked. Map/Reduce allows this.
2) You want to run several instances of a job. Hadoop does this reliably
by monitoring all instances, restarting failed ones, etc.
3) You have way too much data to fit on one computer. Same as #2.

You might not need Hadoop if you can run your programs without it.

Lance

On 10/14/2013 08:02 PM, Xuri Nagarin wrote:
> Yes, I tested with smaller data sets and the MR job correctly
> reads/matches one line at a time.
>
>
>
>
> On Fri, Oct 11, 2013 at 4:48 AM, DSuiter RDX <[EMAIL PROTECTED]> wrote:
>
>     So, perhaps this has been thought of, but perhaps not.
>
>     It is my understanding that grep usually matches things one
>     line at a time. As I am currently experimenting with Avro, I am
>     finding that the local grep command does not handle it well at
>     all, because the file is essentially one long line. Working from
>     local Avro, grep does not do well at pattern matching: it just
>     returns the whole file as a match. It also takes a long time to
>     view it in the vi editor, since there are no EOL markers.
>
>     If you have modified the job to read sequence files, are you
>     reading a sequence file that has newline characters? If not,
>     perhaps the file is being read as one whole line, causing some
>     unexpected effects.
>
>     Thanks,
>     *Devin Suiter*
>     Jr. Data Solutions Software Engineer
>     100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>     Google Voice: 412-256-8556 | www.rdx.com
>
>
>     On Thu, Oct 10, 2013 at 4:50 PM, Xuri Nagarin <[EMAIL PROTECTED]> wrote:
>
>         On Thu, Oct 10, 2013 at 1:27 PM, Pradeep Gollakota
>         <[EMAIL PROTECTED]> wrote:
>
>             I don't think it necessarily means that the job is a bad
>             candidate for MR. It's a different type of workload.
>             Hortonworks has a great article on the different types of
>             workloads you might see and how that affects your
>             provisioning choices at
>             http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html
>
>
>         One statement that stood out to me in the link above is "For
>         these reasons, Hortonworks recommends that you either use the
>         Balanced workload configuration or invest in a pilot Hadoop
>         cluster and plan to evolve as you analyze the workload
>         patterns in your environment."
>
>         Now, this is not a critique/concern of HW but rather of
>         hadoop. Well, what if my workloads can be both CPU and IO
>         intensive? Do I take the approach of
>         throw-enough-excess-hardware-just-in-case?
>
>
>             I have not looked at the Grep code so I'm not sure why
>             it's behaving the way it is. Still curious that streaming
>             has a higher IO throughput and lower CPU usage. It may
>             have to do with the fact that /bin/grep is a native
>             implementation and Grep (Hadoop) is probably using the
>             Java Pattern/Matcher API.
>
>
>         The Grep code is from the bundled examples in CDH. I made a
>         one-line modification for it to read Sequence files. The
>         streaming job probably does not have lower CPU utilization,
>         but I see that it does even out the CPU utilization among all
>         the available processors. I guess the native grep binary
>         threads better than the Java MR job?
>
>         Which brings me to ask: if you have the mapper/reducer
>         functionality built into a platform-specific binary, then
>         won't it always be more efficient than a Java MR job? And, in
>         such cases, am I better off with streaming than Java MR?
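
The line-at-a-time matching discussed in this thread can be illustrated with a minimal Python sketch, written in the style of a streaming mapper. This is not the CDH Grep example or /bin/grep; the function name grep_lines and the sample strings are hypothetical and only show why input without newline characters comes back as a single match.

```python
import io
import re


def grep_lines(stream, pattern):
    """Yield each line of `stream` that contains a match for `pattern`."""
    regex = re.compile(pattern)
    for line in stream:
        # Each iteration sees one record, delimited by a newline.
        if regex.search(line):
            yield line.rstrip("\n")


# Newline-delimited input: only the matching line is emitted.
with_newlines = io.StringIO("alpha\nbeta error\ngamma\n")
print(list(grep_lines(with_newlines, "error")))  # ['beta error']

# Same content without newlines: the whole input is one record,
# so a single match returns everything.
without_newlines = io.StringIO("alpha beta error gamma")
print(list(grep_lines(without_newlines, "error")))  # ['alpha beta error gamma']
```

Used as a streaming mapper, the same function would presumably be fed sys.stdin instead of an in-memory buffer.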
+
Xuri Nagarin 2013-10-15, 03:50