-
Reduce output is strange
Pedro Costa 2011-12-19, 13:51
Hi,
In Hadoop MapReduce, I've executed the webdatascan example, and the reduce output is in a SequenceFile. The result is shown here (http://paste.lisp.org/display/126572). What's the trash (random characters), like "u 265 0000100 330 320 252 " \n # ; 374 5 211 V ' 340 376" in the output? Is the output correct?

0000000 S E Q 006 031 o r g . a p a c h e .
0000020 h a d o o p . i o . T e x t 031 o
0000040 r g . a p a c h e . h a d o o p
0000060 . i o . T e x t \0 \0 \0 \0 \0 \0 u 265
0000100 330 320 252 " \n # ; 374 5 211 V ' 340 376 \0 \0
0000120 \0 X \0 \0 \0 037 a p p l e a p p
0000140 l e b a n a n a a p p l e
0000160 a p p l e 7 c a r r o t c a
0000200 r r o t c a r r o t c a r r
0000220 o t a p p l e b a n a n a
0000240 c a r r o t b a n a n a
0000256

--
Thanks,
-
Re: Reduce output is strange
Robert Evans 2011-12-19, 14:41
It looks mostly correct to me. I am not an expert on sequence files, and I have not checked the text against the spec, nor have I checked the binary numbers in it to be sure they add up to the correct lengths, etc., but it looks good at first glance. I can see the SEQ tag at the beginning that marks it as a sequence file, and org.apache.hadoop.io.Text as the type for both the keys and the values.
--Bobby Evans
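
For readers who want to check the header by hand, here is a minimal Python sketch that decodes the opening bytes of the dump. It is not Hadoop code: the helper names (read_prefixed_string, read_header_start) are made up for illustration, the bytes are transcribed from the od output in the original post, and the sketch assumes each class name is preceded by a single length byte (031 octal = 25, which matches the length of "org.apache.hadoop.io.Text").

```python
# Opening bytes of the dump in the original post, transcribed from od -c output.
HEADER_START = (
    b"SEQ\x06"                        # magic "SEQ" followed by version byte 006
    b"\x19org.apache.hadoop.io.Text"  # 031 octal = 25 = length of the class name
    b"\x19org.apache.hadoop.io.Text"
)


def read_prefixed_string(buf, pos):
    """Read a length-prefixed UTF-8 string starting at pos.

    Assumption: the length fits in one byte (< 128). Longer strings use a
    multi-byte prefix that this sketch does not handle.
    """
    length = buf[pos]
    if length > 127:
        raise ValueError("multi-byte length prefix not handled in this sketch")
    start = pos + 1
    end = start + length
    return buf[start:end].decode("utf-8"), end


def read_header_start(buf):
    """Return (version, key_class, value_class) from the start of a header."""
    if buf[:3] != b"SEQ":
        raise ValueError("missing SEQ magic, probably not a sequence file")
    version = buf[3]
    key_class, pos = read_prefixed_string(buf, 4)
    value_class, pos = read_prefixed_string(buf, pos)
    return version, key_class, value_class


print(read_header_start(HEADER_START))
# → (6, 'org.apache.hadoop.io.Text', 'org.apache.hadoop.io.Text')
```

The result agrees with the reading above: version 6, with Text for both keys and values.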
-
Re: Reduce output is strange
Robert Evans 2011-12-19, 14:43
Oh, I forgot to say that some of the random characters really are random characters. Sequence files store a set of random bytes as sync points within the file. This allows the file to be split easily, with little risk that the same byte sequence appears inside the data itself just by chance.
--Bobby Evans
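
To illustrate the idea, here is a rough Python sketch of how a reader that starts at an arbitrary offset could resynchronize on the marker. The 16 bytes are the ones from the dump in the original post (u 265 330 320 252 " \n # ; 374 5 211 V ' 340 376); the stream contents and the next_sync helper are invented for the example, and the real reader does more bookkeeping than a plain byte search.

```python
# The 16 "trash" bytes from the dump, written as hex instead of octal.
SYNC = bytes.fromhex("75b5d8d0aa220a233bfc35895627e0fe")


def next_sync(data, marker, start):
    """Return the offset of the first marker at or after start, or -1 if none."""
    return data.find(marker, start)


# Hypothetical stream: three chunks of record data separated by the marker.
stream = b"record-one" + SYNC + b"record-two" + SYNC + b"record-three"

# A split that begins in the middle of the first chunk skips ahead to the marker.
print(next_sync(stream, SYNC, 3))   # → 10
# A split that begins just past that marker finds the following one.
print(next_sync(stream, SYNC, 11))  # → 36
```

With 16 random bytes, the chance that a given position in ordinary data matches the marker by accident is about 1 in 2^128, which is why the search is considered safe.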
-
Re: Reduce output is strange
Pedro Costa 2012-04-03, 15:01
If I want to compare 2 sequence files to see if they are the same, how do I compare?

--
Best regards,
-
Re: Reduce output is strange
Pedro Costa 2012-04-03, 15:25
What I want to ask is:

- how do I read the values from sequence files that are block compressed, record compressed, or uncompressed?
- how do I know if the sequence file is block compressed, record compressed, or uncompressed?
- how do I know if it's a sequence file or a Textfile?

--
Best regards,
-
Re: Reduce output is strange
Owen O'Malley 2012-04-03, 15:26
On Tue, Apr 3, 2012 at 8:01 AM, Pedro Costa <[EMAIL PROTECTED]> wrote:
> If I want to compare 2 sequence files to see if they are the same, how do I compare?

From the command line, you can "textify" the files with:

    hadoop fs -text myfile.seq

Of course, if you are using the API you can iterate through the two sequence files and compare them row by row.

-- Owen
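
The row-by-row comparison can be sketched in Python as follows. This is only an illustration: it does not read sequence files itself, the first_difference helper is hypothetical, and the records are assumed to have been obtained some other way, for example as the lines printed by hadoop fs -text for each file. Comparing records rather than raw bytes matters because the sync marker is random, so two files holding the same records would likely still differ byte for byte.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel for "this file ran out of records"


def first_difference(records_a, records_b):
    """Compare two record sequences in order.

    Returns None if they match, otherwise (index, record_a, record_b) for the
    first mismatch, with None standing in for a record that is absent because
    one sequence is shorter than the other.
    """
    pairs = zip_longest(records_a, records_b, fillvalue=_MISSING)
    for index, (a, b) in enumerate(pairs):
        if a != b:
            return (
                index,
                None if a is _MISSING else a,
                None if b is _MISSING else b,
            )
    return None


# Hypothetical (key, value) records, as they might come out of two files.
left = [("apple", "apple"), ("banana", "apple"), ("carrot", "carrot")]
right = [("apple", "apple"), ("banana", "banana"), ("carrot", "carrot")]

print(first_difference(left, left))   # → None
print(first_difference(left, right))  # → (1, ('banana', 'apple'), ('banana', 'banana'))
```

Because the function only pulls one record at a time from each side, it also works on generators, so neither file has to fit in memory.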
-
Re: Reduce output is strange
Owen O'Malley 2012-04-03, 15:29
On Tue, Apr 3, 2012 at 8:25 AM, Pedro Costa <[EMAIL PROTECTED]> wrote:
> What I want to ask is:
>
> - how do I read the values from sequence files that are block, or record compressed, or uncompressed?

You use the SequenceFile.Reader class.

> - how do I know if the sequence file is block compressed, record compressed, or uncompressed?

You use the SequenceFile.Reader class.

> - how do I know if it's a sequence file or a Textfile?

SequenceFiles always have "SEQ" followed by the version in the first 4 bytes.

-- Owen
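
As a rough illustration of the last two answers, the Python sketch below looks at the start of a file and reports whether it carries the "SEQ" magic and, if so, which compression mode the header flags indicate. It is not a substitute for SequenceFile.Reader. The describe helper is hypothetical, and the layout it relies on is an assumption based on the version 6 dump earlier in the thread: two class names, each with a one-byte length prefix, followed by one byte for "compressed" and one for "block compressed". Only the first sample below comes from that dump; the other two are fabricated by flipping the flag bytes.

```python
def describe(buf):
    """Classify the leading bytes of a file (sketch, assumptions noted above)."""
    if len(buf) < 4 or buf[:3] != b"SEQ":
        return "not a SequenceFile"
    pos = 4  # skip the magic and the version byte
    for _ in range(2):
        # Skip the key and value class names (assumed one-byte length prefix).
        pos += 1 + buf[pos]
    compressed = buf[pos] != 0
    block_compressed = buf[pos + 1] != 0
    if not compressed:
        return "uncompressed"
    return "block-compressed" if block_compressed else "record-compressed"


# Header prefix transcribed from the dump earlier in the thread.
PREFIX = b"SEQ\x06" + 2 * b"\x19org.apache.hadoop.io.Text"

print(describe(PREFIX + b"\x00\x00"))  # flags as in the dump → uncompressed
print(describe(PREFIX + b"\x01\x00"))  # fabricated flags → record-compressed
print(describe(PREFIX + b"\x01\x01"))  # fabricated flags → block-compressed
print(describe(b"apple\tbanana\n"))    # plain text → not a SequenceFile
```

In practice only the first few dozen bytes of the file are needed, so reading a small prefix with open(path, "rb").read(256) on a local copy would be enough for this check.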