Hadoop >> mail # user >> 100x slower mapreduce compared to pig


Re: 100x slower mapreduce compared to pig
I can't seem to find what's causing this slowness. There is nothing in the
logs; it's just painfully slow. The Pig job, which has the same logic,
performs very well. Here is the mapper code and the Pig code:
public static class Map extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

    public void map(Text key, Text value,
            OutputCollector<Text, Text> output,
            Reporter reporter) throws IOException {
        String line = value.toString();
        //log.info("output key:" + key + "value " + value + "value " + line);
        FormMLType f;
        try {
            f = FormMLUtils.convertToRows(line);
            FormMLStack fm = new FormMLStack(f, key.toString());
            fm.parseFormML();
            for (String row : fm.getFormattedRecords(false)) {
                output.collect(key, value);
            }
        } catch (JAXBException e) {
            log.error("Error processing record " + key, e);
        }
    }
}

And here is the pig udf:
public DataBag exec(Tuple input) throws IOException {
    try {
        DataBag output = mBagFactory.newDefaultBag();
        Object o = input.get(1);
        if (!(o instanceof String)) {
            throw new IOException(
                    "Expected document input to be chararray, but got "
                    + o.getClass().getName());
        }
        Object o1 = input.get(0);
        if (!(o1 instanceof String)) {
            // Report the type of o1 (the value being checked), not o.
            throw new IOException(
                    "Expected input to be chararray, but got "
                    + o1.getClass().getName());
        }
        String document = (String) o;
        String filename = (String) o1;
        FormMLType f = FormMLUtils.convertToRows(document);
        FormMLStack fm = new FormMLStack(f, filename);
        fm.parseFormML();
        for (String row : fm.getFormattedRecords(false)) {
            output.add(mTupleFactory.newTuple(row));
        }
        return output;
    } catch (ExecException ee) {
        log.error("Failed to Process ", ee);
        throw ee;
    } catch (JAXBException e) {
        // TODO Auto-generated catch block
        log.error("Invalid xml", e);
        throw new IllegalArgumentException("invalid xml " +
                e.getCause().getMessage());
    }
}
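
The earlier message quoted below guesses that the JAXBContext is being rebuilt on every call. Whether that is the actual cause is not established in this thread, but the difference between building an expensive object per record and building it once per JVM can be sketched as follows. This is a minimal illustration only: ExpensiveContext is a hypothetical stand-in for JAXBContext (so the example runs without JAXB on the classpath), and ContextHolder and CachedContextDemo are likewise made-up names, not part of the posted job.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an expensive-to-build object such as a JAXBContext. */
final class ExpensiveContext {
    // Counts how many times a context has been constructed.
    static final AtomicInteger BUILD_COUNT = new AtomicInteger();

    ExpensiveContext() {
        BUILD_COUNT.incrementAndGet();
    }
}

/** Builds the context at most once per JVM using the initialization-on-demand holder idiom. */
final class ContextHolder {
    private ContextHolder() {}

    private static final class Lazy {
        // The JVM initializes this field once, thread-safely, on first access.
        static final ExpensiveContext INSTANCE = new ExpensiveContext();
    }

    static ExpensiveContext get() {
        return Lazy.INSTANCE;
    }
}

class CachedContextDemo {
    /** Anti-pattern: a fresh context for every record. */
    static ExpensiveContext perCall() {
        return new ExpensiveContext();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            perCall();
        }
        System.out.println("per-call builds: " + ExpensiveContext.BUILD_COUNT.get());

        ExpensiveContext.BUILD_COUNT.set(0);
        for (int i = 0; i < 1000; i++) {
            ContextHolder.get();
        }
        System.out.println("cached builds: " + ExpensiveContext.BUILD_COUNT.get());
    }
}
```

Logging a counter like BUILD_COUNT from inside the mapper is one cheap way to confirm or rule out the guess before restructuring anything.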

On Wed, Feb 29, 2012 at 9:27 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

> I am going to try a few things today. I have a JAXBContext object that
> marshals the XML. It is a static instance, but my guess at this point is
> that, since it lives in a separate jar from the one where the job runs and I
> used DistributeCache.addClassPath, this context is being created on every
> call for some reason. I don't know why that would be. I am going to create
> this instance as a static in the mapper class itself and see if that helps.
> I will also add debug logging and post the results after I try it out.
>
>
> On Tue, Feb 28, 2012 at 4:18 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
>
>> It would be great if we could take a look at what you are doing in the UDF
>> vs. the Mapper.
>>
>> 100x slower does not make sense for the same job/logic. It's either the
>> Mapper code, or maybe the cluster was busy at the time you scheduled the
>> MapReduce job?
>>
>> Thanks,
>> Prashant
>>
>> On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>>
>> > I am comparing the runtime of similar logic. The logic is exactly the
>> > same, but surprisingly the MapReduce job that I submit is 100x slower.
>> > For Pig I use a UDF, and for Hadoop I use a mapper only, with the same
>> > logic as the Pig UDF. Even the splits on the admin page are the same.
>> > Not sure why it's so slow. I am submitting the job like this:
>> >
>> > java -classpath
>> > .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar
>> > com.services.dp.analytics.hadoop.mapred.FormMLProcessor
>> > /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq
>> > /examples/output1/
>> >
>> > How should I go about finding the root cause of why it's so slow? Any
>> > suggestions would be really appreciated.
>> >
>> >
>> >
>> > One of the things I noticed is that on the admin page of map task list I