Re: Job setup for a pig run takes ages
It is now in trunk (future 0.11)
Julien

On Jun 19, 2012, at 9:35 AM, Thejas Nair wrote:

> Yes, pig should cache the results of getSchema(). That is something to be fixed.
> Julien is working on some patches to fix the issues related to UDFContext, including reducing the number of times the load function gets instantiated.
>
> -Thejas
>
>
> On 6/12/12 4:16 PM, Alex Rovner wrote:
>> Thejas,
>>
>> Why can't Pig cache the results of getSchema() instead of repeatedly calling the load function? This is forcing us to put lots of unnecessary workarounds in place, since during those calls one can't rely on the UDFContext as it gets cleared somewhere in between those calls.
>>
>> Sent from my iPhone
>>
>> On Jun 1, 2012, at 12:45 PM, Thejas Nair <[EMAIL PROTECTED]> wrote:
>>
>>> Can you run jstack <pid> on the pig client process id a few times while it is taking so long, and send the results?
>>> With that we should be able to easily identify what might be happening and suggest a fix.
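>>>
>>> Something along these lines should capture a few samples (the pid lookup and file names below are just placeholders, adjust as needed):
>>>
>>>   PIG_PID=$(jps | awk '$2 == "Main" {print $1}')   # pid of the pig client JVM
>>>   for i in 1 2 3; do
>>>     jstack $PIG_PID > pig-jstack-$i.txt            # one thread dump per sample
>>>     sleep 30                                       # space the samples out while setup is "stuck"
>>>   done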
>>>
>>> I think pig calls getSchema() on load functions more times than actually necessary during the query optimization phase (each time after a transform). One theory I have is that your avro load function does a stat on all the 50k files each time to determine the common schema. This might be causing the delay in generating the query plan.
>>>
>>> The second large lag, before the job gets queued, seems to be time spent in the input split calculation, not within pig logic.
>>>
>>> These are just theories I have. The jstack output would help determine if this is actually the case.
>>>
>>> Also, I think this delay is likely caused by the number of input files, not the actual data size. You might also want to look at ways to reduce the actual number of input files. Is the data generated by a previous MR job? If yes, given your cluster size, it works against you to have so many maps or reducers in that previous MR job. If pig is generating the data, you should consider setting the parallelism (reducers) or the number of maps (see http://pig.apache.org/docs/r0.10.0/perf.html#combine-files and the example below).
>>> While reducing the number of input files will most likely solve your problem, I am still interested in knowing where the delays are coming from and whether we can fix something in pig to improve the situation.
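>>>
>>> With pig 0.10, an invocation along these lines (untested; the split size and script name are just placeholders) should combine small input files into ~256MB splits before the maps are launched:
>>>
>>>   # 268435456 bytes = 256MB; tune the value for your data
>>>   pig -Dpig.splitCombination=true \
>>>       -Dpig.maxCombinedSplitSize=268435456 \
>>>       myscript.pig
>>>
>>> The reduce parallelism of the job that writes the data can be set in the script itself with "set default_parallel <n>;".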
>>>
>>> Thanks,
>>> Thejas
>>>
>>>
>>> On 6/1/12 6:34 AM, Markus Resch wrote:
>>>> Hi Prashant, Hi Thejas,
>>>>
>>>> thanks for your very quick answer.
>>>> No, this is not a typo. Those time stamps are correct and, as I said,
>>>> the machines are not very busy during this time.
>>>>
>>>> As this is our test cluster I am sure I am the only one who is running
>>>> jobs on it. Another issue we have is that we are currently only able to
>>>> run one job at a time but this shouldn't be the topic of this request.
>>>> We don't even have a continuous input stream to that cluster; we just
>>>> copied a bunch of data to it some time ago.
>>>> From my perspective, the 464 GB of input data you mentioned is the
>>>> uncompressed amount of the 160 GB of compressed files, which is what I
>>>> get when I run hadoop fs -dus on that folder.
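>>>>
>>>> If the number of files matters more than their size, I can check that
>>>> as well, e.g. with something like:
>>>>
>>>>   hadoop fs -count /path/to/input   # prints dir count, file count, content size; the path is a placeholder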
>>>>
>>>> Another interesting fact for you could be that we're running the
>>>> Cloudera CDH3 Update 3 version on our systems.
>>>>
>>>> I suspect this could be due to some fancy avro schema validation
>>>> implicitly executed by the avro storage. If so, can it be avoided?
>>>>
>>>> Sadly, I'm not able to provide the actual script at the moment as it
>>>> contains confidential information, but I will try to provide a version
>>>> as soon as possible. I'd rather suspect a configuration problem in
>>>> hadoop or pig anyway, as the script works fine with a smaller amount
>>>> of input data.
>>>>
>>>> I would ask the hadoop mailing list if this issue occurred during the
>>>> actual mapred run, but since it occurs even before a single mapred job
>>>> is launched, I suspect pig has a problem.
>>>>
>>>> Thanks
>>>> Markus
>>