-
ORDER Issue (repost to avoid spam filters)
Matthew Smith 2010-08-19, 18:35
All,
I am running pig-0.7.0 and have hit an issue with the ORDER command. I have run Pig out of the box on two separate Linux distributions (Ubuntu 10.04 and openSUSE 11.2), and the same issue occurs on both. I run these commands in a script file:
start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray, dip:chararray, sport:int, dport:int, protocol:int, packets:int, bytes:int, flags:chararray, startTime:long, endTime:long);
target = FILTER start BY sip matches '51.37.8.63';
fail = ORDER target BY bytes DESC;
not_reached = LIMIT fail 10;
dump not_reached;
The error is listed below. I then run:
start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray, dip:chararray, sport:int, dport:int, protocol:int, packets:int, bytes:int, flags:chararray, startTime:long, endTime:long);
target = FILTER start BY sip matches '51.37.8.63';
dump target;
This script produces a large list of sips matching the filter. What am I doing wrong that causes pig to not want to ORDER these properly? I have been wrestling with this issue for a week now. Any help would be greatly appreciated.
Best,
Matthew
/ERROR
java.lang.RuntimeException: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/user/matt/pigsample_24118161_1282155871461
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:135)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:527)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/user/matt/pigsample_24118161_1282155871461
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigFileInputFormat.listStatus(PigFileInputFormat.java:37)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
    at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:153)
    at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:115)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:108)
    ... 6 more
-
Re: ORDER Issue (repost to avoid spam filters)
Thejas M Nair 2010-08-19, 21:34
I think 0.7 had an issue where order-by used to fail if the input was empty, but that does not seem to be the case here. I am wondering if there is a parsing/data-format issue that is causing the bytes column to be empty, though I am not aware of an empty/null value in the sort column causing issues. Can you try dumping just the bytes column? Another thing you can try is to store the output of the filter and load the data again before doing the order-by.
Please let us know what you find.
Thanks,
Thejas
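One way to test the parsing hypothesis outside Pig is to scan the raw input file directly. The following is a minimal Python sketch, assuming the input is a pipe-delimited text file with `bytes` as the seventh field (index 6), as in the LOAD schema in the original post. The function name `count_unparseable_bytes` and the sample rows are illustrative only; they are not part of Pig or of the original data set.

```python
import io

BYTES_FIELD = 6  # zero-based position of `bytes` in the LOAD schema (assumed)

def count_unparseable_bytes(lines, delimiter="|"):
    """Return (total_rows, bad_rows).

    A row is "bad" when its bytes field is missing or is not an integer.
    Pig typically yields null for a field it cannot cast to the declared type,
    so these are the rows that would be expected to show up as null.
    """
    total = bad = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        total += 1
        fields = line.split(delimiter)
        try:
            int(fields[BYTES_FIELD])
        except (IndexError, ValueError):
            bad += 1
    return total, bad

# Hypothetical sample rows, made up for illustration only.
sample = io.StringIO(
    "51.37.8.63|10.0.0.1|80|1234|6|3|1500|S|100|200\n"
    "51.37.8.63|10.0.0.2|80|1235|6|1||S|100|200\n"
    "51.37.8.63|10.0.0.3|80\n"
)
total, bad = count_unparseable_bytes(sample)
print(total, bad)  # 3 rows, 2 of them with an unusable bytes field
```

To run it against a real file, pass an open file object instead of the in-memory sample, for example `count_unparseable_bytes(open("inputData"))`, assuming `inputData` is a single local file.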
-
Re: ORDER Issue (repost to avoid spam filters)
Mridul Muralidharan 2010-08-19, 23:44
Are you using Pig in local mode? If yes, does this work with Hadoop?
Regards, Mridul
-
RE: ORDER Issue (repost to avoid spam filters)
Matthew Smith 2010-08-20, 20:56
UPDATE: I ran my code in the Amazon cloud (aws.amazon.com) and the script worked as intended over the data set. This leads me to believe that the issue is with pig-0.7.0 or my configuration. I would, however, like to not pay for something that is free :D. Any other ideas would be most welcome.
@Thejas
I changed the script to:
start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray, dip:chararray, sport:int, dport:int, protocol:int, packets:int, bytes:int, flags:chararray, startTime:long, endTime:long);
target = FILTER start BY sip matches '51.37.8.63';
just_bytes = FOREACH target GENERATE bytes;
fail = ORDER just_bytes BY bytes DESC;
not_reached = LIMIT fail 10;
dump not_reached;
and received the same error as before. I then changed the script to:
start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray, dip:chararray, sport:int, dport:int, protocol:int, packets:int, bytes:int, flags:chararray, startTime:long, endTime:long);
target = FILTER start BY sip matches '51.37.8.63';
STORE target INTO 'myoutput';
second_start = LOAD 'myoutput/part-m-00000' USING PigStorage('\t') AS (sip:chararray, dip:chararray, sport:int, dport:int, protocol:int, packets:int, bytes:int, flags:chararray, startTime:long, endTime:long);
fail = ORDER second_start BY bytes DESC;
not_reached = LIMIT fail 10;
dump not_reached;
and received the same error.
@Mridul
I am using local mode at the moment. I don't understand the second question.
Thanks,
Matt
-
Re: ORDER Issue (repost to avoid spam filters)
Thejas M Nair 2010-08-20, 21:23
I was wondering whether the bytes column has all null values (probably because the input has formatting issues).
Can you check whether the following query gives any output?
start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray, dip:chararray, sport:int, dport:int, protocol:int, packets:int, bytes:int, flags:chararray, startTime:long, endTime:long);
target = FILTER start BY sip matches '51.37.8.63';
non_null_bytes = FILTER target BY bytes is not null;
dump non_null_bytes;
-Thejas
-
RE: ORDER Issue (repost to avoid spam filters)
Matthew Smith 2010-08-23, 15:39
Changed the script to:
start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray, dip:chararray, sport:int, dport:int, protocol:int, packets:int, bytes:int, flags:chararray, startTime:long, endTime:long);
target = FILTER start BY sip matches '51.37.8.63';
not_null_bytes = FILTER target BY bytes is not null;
dump not_null_bytes;
and dumped the expected tuples. There were plenty of valid records. I will revert everything to pig-0.6.0 and rerun the scripts to determine whether the issue is in pig-0.7.0.
Matt
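Since the filtered records look valid, an independent reference result can help when comparing Pig versions. The Python sketch below mimics the FILTER, ORDER BY bytes DESC, LIMIT 10 pipeline on a pipe-delimited file. It is a cross-check under stated assumptions, not a replacement for the Pig script: it uses exact string equality on `sip` (Pig's `matches` treats the pattern as a regular expression), it drops rows whose bytes field is not an integer, and the name `top_by_bytes` and the sample rows are hypothetical.

```python
import heapq

def top_by_bytes(lines, sip, n=10, delimiter="|"):
    """Mimic FILTER (exact sip match) -> ORDER BY bytes DESC -> LIMIT n."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) < 7 or fields[0] != sip:
            continue  # too short to hold a bytes field, or a different source IP
        try:
            byte_count = int(fields[6])
        except ValueError:
            continue  # this sketch drops rows whose bytes field is not an integer
        rows.append((byte_count, fields))
    # nlargest returns the n rows with the largest byte counts, largest first
    return [fields for _, fields in heapq.nlargest(n, rows, key=lambda r: r[0])]

# Hypothetical sample rows, made up for illustration only.
sample = [
    "51.37.8.63|10.0.0.1|80|1234|6|3|1500|S|100|200",
    "51.37.8.63|10.0.0.2|80|1235|6|1|40|S|100|200",
    "10.9.9.9|10.0.0.3|80|1236|6|9|9000|S|100|200",
    "51.37.8.63|10.0.0.4|80|1237|6|5|7200|S|100|200",
]
result = top_by_bytes(sample, "51.37.8.63", n=2)
for row in result:
    print("|".join(row))
```

With the sample rows, the two printed records are the ones with 7200 and 1500 bytes, in that order; the 9000-byte record is excluded because its `sip` differs.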
-
RE: ORDER Issue (repost to avoid spam filters)
Matthew Smith 2010-08-23, 18:13
Update: After downloading and installing pig-0.6.0, I ran the script again over the same data set, and it produced the desired results. I don't know what I am doing wrong in 0.7.0, but I will be reverting to 0.6.0 until I can sort out what went wrong. Thoughts are still welcome and wanted :D
Thanks, Matt
-
Re: ORDER Issue (repost to avoid spam filters)
Thejas M Nair 2010-08-25, 16:03
Can you check whether the initial MR jobs in the order-by query failed because of some other error (specifically, the sampling MR job that is part of order-by)? Maybe, for some reason (a bug?), Pig did not capture or log that error.
-Thejas
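For context on why a failed sampling job could surface as this error: the stack trace in the original post shows WeightedRangePartitioner trying to read a `pigsample_...` file while the map task is being set up. A total-order sort on MapReduce generally works by sampling the sort key first, deriving partition boundaries from the sample, and then routing each record to a reducer by key range. The Python sketch below illustrates that general idea only. It is not Pig's implementation, and the names `boundaries_from_sample` and `partition_for` are hypothetical.

```python
import bisect

def boundaries_from_sample(sample_keys, num_partitions):
    """Pick num_partitions - 1 cut points from a sorted sample of sort keys."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]

def partition_for(key, boundaries):
    """Return the index of the partition whose key range contains `key`."""
    return bisect.bisect_right(boundaries, key)

# Made-up sample of `bytes` values, standing in for the sampling job's output.
sample_keys = [40, 1500, 7200, 300, 90, 5000, 12, 800]
cuts = boundaries_from_sample(sample_keys, 4)
partitions = [partition_for(k, cuts) for k in (10, 300, 6000)]
print(cuts)        # [90, 800, 5000]
print(partitions)  # [0, 1, 3]
```

The point of the sketch is the dependency: the partitioning step cannot start without the sample, so if the sampling job failed or wrote its output somewhere the later job cannot see, a missing-input error at partitioner setup is a plausible symptom.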