Re: Problem when executionengine.util.MapRedUtil combine input paths
Hi Charles,
Which load function are you using? Is it the default (PigStorage)?
In the Hadoop counters for the job in the JobTracker UI, do you see the expected number of input records being read?
-Thejas

On 2/28/11 10:57 AM, "Charles Gonçalves" <[EMAIL PROTECTED]> wrote:

I'm not using any filtering in the script.
I just want to see the total traffic per day across all the logs.

If I combine 1000 log files into one and run the script on that file, I
get the correct answer for those logs.
But when I run it on all *43458* log files, I get an incorrect output.
The correct output would be a histogram for each day of 2010-10, but the result
contains only data from 2010-10-21.
And if I process all the logs with an awk script, I get the correct answer.
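
An independent cross-check like the awk run mentioned above could be sketched in Python as follows. This is only a minimal sketch: the log layout (date in the first field, byte count in the last field) and the name `traffic_per_day` are assumptions made for illustration, since the thread does not show the actual log format.

```python
from collections import defaultdict


def traffic_per_day(lines):
    """Sum the byte count per day.

    Assumed (hypothetical) line layout: '<YYYY-MM-DD> ... <bytes>',
    i.e. the day is the first field and the size is the last field.
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        day, size = fields[0], fields[-1]
        if not size.isdigit():
            continue  # skip lines without a numeric byte count
        totals[day] += int(size)
    return dict(totals)


# Made-up sample lines, not data from the thread.
sample = [
    "2010-10-20 /a.html 100",
    "2010-10-21 /b.html 250",
    "2010-10-21 /c.html 50",
    "malformed",
]
per_day = traffic_per_day(sample)
for day in sorted(per_day):
    print(day, per_day[day])
```

If a check of this kind reports every day of the month while the Pig job reports a single day, the difference is more likely in how the input is read than in the aggregation itself.
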
On Mon, Feb 28, 2011 at 3:29 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:

> Not sure I understand your question. In 0.8, Pig combines small files into one
> map, so it is possible you get fewer output files.

This is not the problem.
But thanks anyway!

> If that is your concern, you can try to disable split combination using
> "-Dpig.splitCombination=false"
>
> Daniel
>
>
> Charles Gonçalves wrote:
>
>> I tried to process a large number of small files in Pig and ran into a strange
>> problem.
>>
>> 2011-02-27 00:00:58,746 [Thread-15] INFO
>>  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
>> to process : *43458*
>> 2011-02-27 00:00:58,755 [Thread-15] INFO
>>  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
>> paths to process : *43458*
>> 2011-02-27 00:01:14,173 [Thread-15] INFO
>>  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
>> paths (combined) to process : *329*
>>
>> When the script finishes, the result covers only a subset of
>> the input files.
>> These are logs from a whole month, but the results are only from
>> day 21.
>>
>>
>> Maybe I'm missing something.
>> Any ideas?
>>
>>
>>
>
>
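
To illustrate the combining step reported in the log above (43458 input paths combined into 329), here is a minimal Python sketch of greedy split combination. It is a simplified illustration, not Pig's actual implementation: the function name `combine_splits`, the `max_combined_size` limit, and the sample sizes are all hypothetical, and a real implementation would also take things like data locality into account.

```python
def combine_splits(file_sizes, max_combined_size):
    """Greedily pack consecutive files into groups.

    A group is closed as soon as adding the next file would push its
    total size above max_combined_size. Returns a list of groups, each
    group being a list of file indexes.
    """
    groups = []
    current, current_size = [], 0
    for index, size in enumerate(file_sizes):
        if current and current_size + size > max_combined_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Made-up file sizes in arbitrary units, not data from the thread.
sizes = [40, 30, 50, 20, 90, 10]
groups = combine_splits(sizes, max_combined_size=100)
print(len(sizes), "files ->", len(groups), "combined splits")

# Combining changes how many maps run, not which files are read:
# every file index appears in exactly one group.
covered = sorted(i for group in groups for i in group)
print(covered == list(range(len(sizes))))
```

In a scheme like this, combination only reduces the number of map tasks, so by design it should not drop input files. Running the job once with "-Dpig.splitCombination=false", as suggested above, is one way to check whether combination is involved in the missing days.
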
--
*Charles Ferreira Gonçalves *
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - Dcc
Cel.: 55 31 87741485
Tel.:  55 31 34741485
Lab.: 55 31 34095840