Charles Gonçalves
2011-02-27, 03:25
Romain Rigaux
2011-02-28, 18:11
Daniel Dai
2011-02-28, 18:29
Charles Gonçalves
2011-02-28, 18:57
Thejas M Nair
2011-02-28, 22:39
Charles Gonçalves
2011-02-28, 23:47
Charles Gonçalves
2011-03-01, 01:40
Daniel Dai
2011-03-01, 21:44
Charles Gonçalves
2011-03-01, 22:02
Dmitriy Ryaboy
2011-03-01, 22:07
-
Problem when executionengine.util.MapRedUtil combine input paths
Charles Gonçalves 2011-02-27, 03:25
I tried to process a large number of small files in Pig and ran into a strange problem.

2011-02-27 00:00:58,746 [Thread-15] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : *43458*
2011-02-27 00:00:58,755 [Thread-15] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : *43458*
2011-02-27 00:01:14,173 [Thread-15] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : *329*

When the script finishes, the result covers only a subset of the input files. These are logs from a whole month, but the results are only from day 21.

Maybe I'm missing something. Any ideas?

--
*Charles Ferreira Gonçalves *
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - Dcc
Cel.: 55 31 87741485
Tel.: 55 31 34741485
Lab.: 55 31 34095840
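As a small aid for reading logs like the ones above, here is a minimal Python sketch that pulls the raw and combined path counts out of the MapRedUtil lines. The log text is copied from the message (with the emphasis asterisks removed); the function name `split_counts` and the regular expression are illustrative assumptions, not part of Pig.

```python
import re

# Log lines copied from the message above, without the email emphasis marks.
LOG = """\
2011-02-27 00:00:58,755 [Thread-15] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 43458
2011-02-27 00:01:14,173 [Thread-15] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 329
"""

# Hypothetical pattern: group 1 is present only on the "(combined)" line.
PATTERN = re.compile(
    r"MapRedUtil - Total input paths( \(combined\))? to process : (\d+)"
)


def split_counts(log_text):
    """Return (raw_paths, combined_splits) found in the MapRedUtil log lines."""
    raw = combined = None
    for match in PATTERN.finditer(log_text):
        if match.group(1):
            combined = int(match.group(2))
        else:
            raw = int(match.group(2))
    return raw, combined


raw, combined = split_counts(LOG)
# Average number of input files packed into each combined split.
print(raw, combined, round(raw / combined, 1))
```

The numbers only say how many files were grouped per map task on average; they do not by themselves show whether every file in a group was read.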
-
Re: Problem when executionengine.util.MapRedUtil combine input paths
Romain Rigaux 2011-02-28, 18:11
Normally Pig 0.8 just combines the small files <http://pig.apache.org/docs/r0.8.0/cookbook.html#Combine+Small+Input+Files> into bigger ones, so you should not lose any records. You might be filtering out or limiting some records in your script. You can try just a LOAD and STORE and check that the output is the same as the input data.

Romain
-
Re: Problem when executionengine.util.MapRedUtil combine input paths
Daniel Dai 2011-02-28, 18:29
Not sure I understand your question. In 0.8, Pig combines small files into one map, so it is possible that you get fewer output files. If that is your concern, you can try disabling split combination with "-Dpig.splitCombination=false".

Daniel
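For reference, a minimal sketch of how the flag mentioned above might be passed when launching Pig from a Python wrapper. The property name comes from the message; the `pig` executable being on the PATH, the script name `traffic.pig`, and the helper `pig_command` are assumptions made for illustration. The command is only built and printed here, not executed.

```python
def pig_command(script, split_combination=True):
    """Build an argv list for running a Pig script.

    When split_combination is False, the -Dpig.splitCombination=false
    property from the message above is placed before the script name.
    """
    argv = ["pig"]
    if not split_combination:
        argv.append("-Dpig.splitCombination=false")
    argv.append(script)
    return argv


# "traffic.pig" is a hypothetical script name.
command = pig_command("traffic.pig", split_combination=False)
print(" ".join(command))
```

Running the same script once with and once without the property is a cheap way to see whether split combination changes the result.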
-
Re: Problem when executionengine.util.MapRedUtil combine input paths
Charles Gonçalves 2011-02-28, 18:57
I'm not using any filtering in the script. I just want to see the total traffic per day across all logs.

If I combine 1000 log files into one and run the script on that file, I get the correct answer for those logs. But when I run it with all *43458* log files, I get incorrect output. The correct result would be a histogram for each day of 2010-10, but the result contains only data from 2010-10-21. And if I process all the logs with an awk script, I get the correct answer.

On Mon, Feb 28, 2011 at 3:29 PM, Daniel Dai <[EMAIL PROTECTED]> wrote:
> Not sure if I get your question. In 0.8, Pig combine small files into one
> map, so it is possible you get less output files.

This is not the problem. But thanks anyway!
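The comparison described here (a per-day traffic histogram checked against an independent tool) can be sketched in Python. The actual log format is not shown in the thread, so the two-field layout `YYYY-MM-DD bytes` below is purely an assumption, as are the function names and the sample data.

```python
from collections import Counter
from datetime import date, timedelta


def daily_traffic(lines):
    """Sum bytes per day from lines of the assumed form 'YYYY-MM-DD bytes'."""
    totals = Counter()
    for line in lines:
        day, nbytes = line.split()
        totals[day] += int(nbytes)
    return totals


def days_in_month(year, month):
    """Return every day of the month as an ISO date string."""
    day = date(year, month, 1)
    days = []
    while day.month == month:
        days.append(day.isoformat())
        day += timedelta(days=1)
    return days


def missing_days(totals, year, month):
    """List the days of the month that have no entry in the histogram."""
    return [d for d in days_in_month(year, month) if d not in totals]


# Made-up sample records, standing in for a reference run over the raw logs.
reference = daily_traffic(["2010-10-20 100", "2010-10-21 250", "2010-10-21 50"])
# A result that, like the one reported above, contains only 2010-10-21.
pig_result = {"2010-10-21": 300}

print(len(missing_days(pig_result, 2010, 10)), "days missing from the result")
```

Listing the missing days, rather than only comparing totals, makes it easier to see whether whole files were skipped.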
-
Re: Problem when executionengine.util.MapRedUtil combine input paths
Thejas M Nair 2011-02-28, 22:39
Hi Charles,
Which load function are you using? Is it the default (PigStorage)? In the Hadoop counters for the job in the JobTracker UI, do you see the expected number of input records being read?

-Thejas
-
Re: Problem when executionengine.util.MapRedUtil combine input paths
Charles Gonçalves 2011-02-28, 23:47
On Mon, Feb 28, 2011 at 7:39 PM, Thejas M Nair <[EMAIL PROTECTED]> wrote:
> Hi Charles,
> Which load function are you using ?

I'm using a user-defined load function.

> Is the default (PigStorage?).

Nope.

> In the hadoop counters for the job in the jobtracker ui, do you see the
> expected number of input records being read?

Is it possible to see the counters in the history interface on the JobTracker? I will run the jobs again to compare the counters, but my guess is probably not!
-
Re: Problem when executionengine.util.MapRedUtil combine input paths
Charles Gonçalves 2011-03-01, 01:40
Guys,

The amount of data in the source dir:
hdfs://hydra1:57810/user/cdh-hadoop/mscdata/201010_raw 22567369111

What I did was: I ran with all 43458 logs, and the counters are:

Counter              Map             Reduce         Total
FILE_BYTES_READ      253,905,706     372,708,857    626,614,563
HDFS_BYTES_READ      2,553,123,734   0              2,553,123,734
FILE_BYTES_WRITTEN   619,877,917     372,708,857    992,586,774
HDFS_BYTES_WRITTEN   0               535            535

I did a manual join of the files and ran again on the resulting 336 files (the merge of all those files). The job hasn't finished yet, and the counters so far are:

Counter              Map              Reduce           Total
FILE_BYTES_READ      21,054,970,818   0                21,054,970,818
HDFS_BYTES_READ      16,772,063,486   0                16,772,063,486
FILE_BYTES_WRITTEN   39,797,038,008   10,404,287,551   50,201,325,55

I think the problem could be in the combination of the input files. Is the combination class aware of compression? *All my files are compressed*. Maybe the class performs a concatenation and we hit the hdfs limitation with concatenated gzip files.
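A quick piece of arithmetic on the figures reported above shows how much of the source directory each run read from HDFS. This is only a sketch: it assumes the directory size and HDFS_BYTES_READ both count compressed on-disk bytes, and the merged-file run was still in progress when its counter was taken.

```python
# Figures copied from the message above.
SOURCE_DIR_BYTES = 22_567_369_111        # size reported for 201010_raw
HDFS_READ_ALL_FILES = 2_553_123_734      # run over all 43458 files
HDFS_READ_MERGED = 16_772_063_486        # run over the 336 merged files (unfinished)


def fraction_read(bytes_read, total_bytes):
    """Fraction of the input that the job reports having read from HDFS."""
    return bytes_read / total_bytes


for label, value in [
    ("all 43458 files", HDFS_READ_ALL_FILES),
    ("336 merged files", HDFS_READ_MERGED),
]:
    share = fraction_read(value, SOURCE_DIR_BYTES)
    print(f"{label}: {share:.1%} of the source directory size")
```

Under those assumptions, the run over all 43458 files read roughly 11% of the directory's bytes, which fits the report that most of the month is missing from the output.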
-
Re: Problem when executionengine.util.MapRedUtil combine input paths
Daniel Dai 2011-03-01, 21:44
Combined input splits should be able to handle compressed files. Pig creates a separate RecordReader for each file within one input split, so gzip concatenation should not be the cause. I am not sure what is happening with your script. If possible, give us more information (script, UDF, data, version).

Daniel
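To make the gzip concern from this exchange concrete, here is a small Python sketch using only the standard library. It is a toy model, not Pig or Hadoop code: it shows that a byte-level concatenation of two gzip files loses data if the reader stops after the first gzip member, while reading each file with its own reader (the per-file approach described above) avoids the issue. The sample records are made up.

```python
import gzip
import zlib

# Two made-up compressed log files, one record each.
day20 = gzip.compress(b"2010-10-20 100\n")
day21 = gzip.compress(b"2010-10-21 250\n")

# Byte-level concatenation yields a multi-member gzip stream.
concatenated = day20 + day21

# A multi-member-aware reader returns the records of both files.
all_records = gzip.decompress(concatenated)

# A reader that handles a single gzip member stops at the end of the first
# one; the second file's bytes are left over, undecoded.
single_member = zlib.decompressobj(16 + zlib.MAX_WBITS)
first_member = single_member.decompress(concatenated)
leftover = single_member.unused_data

# One reader per file, with no concatenation involved at all.
per_file = [gzip.decompress(blob) for blob in (day20, day21)]

print(all_records)
print(first_member, len(leftover), "bytes left unread")
print(per_file)
```

If each file really gets its own reader, the single-member limitation cannot explain the missing days, which matches the reply above.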
-
Re: Problem when executionengine.util.MapRedUtil combine input paths
Charles Gonçalves 2011-03-01, 22:02
Ok ...
I'm sending both.

Versions:
Apache Pig version 0.8.0 (r1043805)
compiled Dec 08 2010, 17:26:09
Hadoop 0.20.2
-
Re: Problem when executionengine.util.MapRedUtil combine input paths
Dmitriy Ryaboy 2011-03-01, 22:07
fwiw, something similar happened with the HBase loader in 0.8 -- only the first of the combined splits was read in (I worked around this by turning off split combination in the loader's setLocation, see PIG-1680)

D
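The failure mode described here (only the first of the combined splits being read) can be illustrated with a toy model in Python. Nothing below is Pig code: grouping by a fixed number of files is a simplification chosen for the example, and the file names, records, and function names are all hypothetical.

```python
def combine_splits(paths, files_per_split):
    """Group input paths into combined splits of a fixed number of files."""
    return [
        paths[i:i + files_per_split]
        for i in range(0, len(paths), files_per_split)
    ]


def read_every_file(split, storage):
    """Intended behaviour: read the records of every file in the split."""
    records = []
    for path in split:
        records.extend(storage[path])
    return records


def read_first_file_only(split, storage):
    """Faulty behaviour: read the first file of the split and stop."""
    return list(storage[split[0]])


# Six made-up one-record log files, one per day.
storage = {
    f"log-2010-10-{day:02d}": [f"2010-10-{day:02d} 100"] for day in range(1, 7)
}
splits = combine_splits(sorted(storage), files_per_split=3)

correct = [rec for split in splits for rec in read_every_file(split, storage)]
lossy = [rec for split in splits for rec in read_first_file_only(split, storage)]

print(len(splits), "combined splits")
print(len(correct), "records when every file is read")
print(len(lossy), "records when only the first file per split is read")
```

A custom loader with this kind of fault would silently return a subset of the input, which is one way a month of logs could shrink to far fewer days.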