Re: 100x slower mapreduce compared to pig
I think I've found the problem. There was one line of code that caused this
issue :) and that was output.collect(key, value);

I had to add more logging to the code to get to it. For some reason kill
-QUIT didn't send the stacktrace to userLogs/<job>/<attempt>/syslog; I
searched all the logs and couldn't find one. Does anyone know where
stacktraces are generally sent?
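
For reference, a minimal sketch of what the fix presumably looks like, assuming
the mapper was meant to emit each formatted row (as the Pig UDF quoted below
does with output.add) rather than the full input value once per row:

    for (String row : fm.getFormattedRecords(false)) {
        // emit the formatted row; the original line collected the entire
        // 'value' once per row, which would multiply the map output size
        output.collect(key, new Text(row));
    }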

On Wed, Feb 29, 2012 at 1:08 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

> I can't seem to find what's causing this slowness. Nothing in the logs.
> It's just painfully slow. However, the Pig job with the same logic performs
> great. Here is the mapper code and the pig code:
>
>
> public static class Map extends MapReduceBase
>     implements Mapper<Text, Text, Text, Text> {
>
>   public void map(Text key, Text value,
>       OutputCollector<Text, Text> output,
>       Reporter reporter) throws IOException {
>     String line = value.toString();
>     // log.info("output key:" + key + "value " + value + "value " + line);
>     FormMLType f;
>     try {
>       f = FormMLUtils.convertToRows(line);
>       FormMLStack fm = new FormMLStack(f, key.toString());
>       fm.parseFormML();
>       for (String row : fm.getFormattedRecords(false)) {
>         output.collect(key, value);
>       }
>     } catch (JAXBException e) {
>       log.error("Error processing record " + key, e);
>     }
>   }
> }
>
> And here is the pig udf:
>
>
> public DataBag exec(Tuple input) throws IOException {
>   try {
>     DataBag output = mBagFactory.newDefaultBag();
>     Object o = input.get(1);
>     if (!(o instanceof String)) {
>       throw new IOException(
>           "Expected document input to be chararray, but got "
>               + o.getClass().getName());
>     }
>     Object o1 = input.get(0);
>     if (!(o1 instanceof String)) {
>       throw new IOException(
>           "Expected input to be chararray, but got "
>               + o1.getClass().getName());
>     }
>     String document = (String) o;
>     String filename = (String) o1;
>     FormMLType f = FormMLUtils.convertToRows(document);
>     FormMLStack fm = new FormMLStack(f, filename);
>     fm.parseFormML();
>     for (String row : fm.getFormattedRecords(false)) {
>       output.add(mTupleFactory.newTuple(row));
>     }
>     return output;
>   } catch (ExecException ee) {
>     log.error("Failed to Process ", ee);
>     throw ee;
>   } catch (JAXBException e) {
>     // TODO Auto-generated catch block
>     log.error("Invalid xml", e);
>     throw new IllegalArgumentException("invalid xml "
>         + e.getCause().getMessage());
>   }
> }
>
> On Wed, Feb 29, 2012 at 9:27 AM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>
>> I am going to try a few things today. I have a JAXBContext object that
>> marshals the XML. It's a static instance, but my guess at this point is
>> that since it lives in a separate jar from the one the job runs in, and I
>> used DistributedCache.addClassPath, this context is being created on every
>> call for some reason. I don't know why that would be. I am going to create
>> this instance as static in the mapper class itself and see if that helps.
>> I'll also add debug logging and will post the results after trying it out.
>>
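
A minimal sketch of the change described above: hold the JAXBContext in a
static field of the mapper class so it is built once per JVM rather than on
every map() call. JAXBContext.newInstance() is comparatively expensive, so
rebuilding it per record would be consistent with the slowdown. FormMLType.class
is an assumption about what FormMLUtils actually binds, and the javax.xml.bind
and Hadoop imports are omitted since the class mirrors the mapper above.

    public static class Map extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

      // Built once when the class is loaded, then reused by every map() call
      // (e.g. via JAXB_CONTEXT.createUnmarshaller()) instead of creating a
      // new JAXBContext per record.
      private static final JAXBContext JAXB_CONTEXT;
      static {
        try {
          // FormMLType.class is assumed to be the JAXB-bound class
          JAXB_CONTEXT = JAXBContext.newInstance(FormMLType.class);
        } catch (JAXBException e) {
          throw new ExceptionInInitializerError(e);
        }
      }

      // map(...) otherwise unchanged
    }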
>>
>> On Tue, Feb 28, 2012 at 4:18 PM, Prashant Kommireddi <[EMAIL PROTECTED]> wrote:
>>
>>> It would be great if we could take a look at what you are doing in the
>>> UDF vs. the Mapper.
>>>
>>> 100x slower does not make sense for the same job/logic; it's either the
>>> Mapper code, or maybe the cluster was busy at the time you scheduled the
>>> MapReduce job?
>>>
>>> Thanks,
>>> Prashant
>>>
>>> On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>>>
>>> > I am comparing the runtime of similar logic. The entire logic is exactly
>>> > the same, but surprisingly the MapReduce job I submit is 100x slower. For
>>> > Pig I use a UDF, and for Hadoop I use a mapper only with the same logic
>>> > as the Pig version. Even the splits on the admin page are the same. Not
>>> > sure why it's so slow. I am submitting the job like:
>>> >
>>