Pig >> mail # user >> Problem when executionengine.util.MapRedUtil combine input paths


Re: Problem when executionengine.util.MapRedUtil combine input paths
Hi Charles,
Which load function are you using? Is it the default (PigStorage)?
In the Hadoop counters for the job in the JobTracker UI, do you see the expected number of input records being read?
-Thejas

On 2/28/11 10:57 AM, "Charles Gonçalves" <[EMAIL PROTECTED]> wrote:

I'm not using any filtering in the script.
I just want to see the total traffic per day across all logs.

If I combine 1000 log files into one and run the script on that file, I
get the correct answer for those logs.
But when I run it on all *43458* log files, I get incorrect output.
The correct output would be a histogram for each day of 2010-10, but the result
contains only data from 2010-10-21.
And if I process all the logs with an awk script, I get the correct answer.
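
A per-day cross-check like the awk run mentioned above could be sketched in Python as follows. The log format is not shown in this thread, so the layout used here (whitespace-separated fields, an ISO date first, a byte count last) and the name `daily_traffic` are assumptions made only for illustration.

```python
from collections import defaultdict


def daily_traffic(lines):
    """Sum a byte count per day from an iterable of log lines.

    ASSUMPTION (not from the thread): each line is whitespace-separated,
    with a YYYY-MM-DD date as the first field and a byte count as the
    last field. Lines that do not match are skipped.
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # too short to hold both a date and a size
        day, size = fields[0], fields[-1]
        if not size.isdigit():
            continue  # last field is not a plain byte count
        totals[day] += int(size)
    return dict(totals)


# Hypothetical sample lines, only to show the shape of the result.
sample = [
    "2010-10-20 GET /a 100",
    "2010-10-21 GET /b 250",
    "2010-10-21 GET /c 50",
]
print(sorted(daily_traffic(sample).items()))
```

Comparing the set of days this produces with the days in the Pig output would show which dates are missing, not just that the totals differ.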
On Mon, Feb 28, 2011 at 3:29 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:

> Not sure if I get your question. In 0.8, Pig combines small files into one
> map, so it is possible you get fewer output files.

This is not the problem.
But thanks anyway!

> If that is your concern, you can try to disable split combine using
> "-Dpig.splitCombination=false"
>
> Daniel
>
>
> Charles Gonçalves wrote:
>
>> I tried to process a large number of small files with Pig and ran into a
>> strange problem.
>>
>> 2011-02-27 00:00:58,746 [Thread-15] INFO
>>  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
>> to process : *43458*
>> 2011-02-27 00:00:58,755 [Thread-15] INFO
>>  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
>> input
>> paths to process : *43458*
>> 2011-02-27 00:01:14,173 [Thread-15] INFO
>>  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
>> input
>> paths (combined) to process : *329*
>>
>> When the script finishes, the result covers only a subset of
>> the input files.
>> These are logs from a whole month, but the results are only from
>> day 21.
>>
>>
>> Maybe I'm missing something.
>> Any ideas?
>>
>>
>>
>
>
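
The log lines quoted above report 43458 input paths combined into 329 splits, roughly 132 files per map. The idea of split combination can be illustrated with a simple greedy packer. This is only a sketch under stated assumptions, not Pig's actual algorithm: `combine_splits` and `max_combined_size` are hypothetical names, and the sizes are made up.

```python
def combine_splits(file_sizes, max_combined_size):
    """Greedily pack consecutive files into combined splits.

    Illustration only: a new split is started whenever adding the next
    file would push the current split past max_combined_size. Every
    file lands in exactly one split, so combining changes the number
    of map tasks but not the set of files that get read.
    """
    splits, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > max_combined_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits


# Hypothetical sizes: five 40-unit files with a 100-unit cap.
splits = combine_splits([40, 40, 40, 40, 40], 100)
print(len(splits))                  # number of combined splits
print(sum(len(s) for s in splits))  # files covered by all splits

# Ratio taken from the log lines in this thread.
print(43458 / 329)  # average input paths per combined split, about 132
```

If combination behaves like this, the record counts read by the maps should still add up to the full input, which is what the counter check suggested at the top of the thread would confirm or rule out.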
--
*Charles Ferreira Gonçalves *
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - Dcc
Cel.: 55 31 87741485
Tel.:  55 31 34741485
Lab.: 55 31 34095840