-
Sort with customized input/output !!
Matthew John 2010-09-08, 03:13
Hey, I'm pretty new to Hadoop.
I need to sort a metafile (TBs in size) and thought of using the Hadoop Sort (in examples) for it. My input metafile is a raw binary stream (only 1's and 0's). It basically contains fixed-size records of 40 bytes. Every record looks like this:
long a; <key> --> 8 bytes
The rest of the structure is the <value> --> 32 bytes:
long b;
int c;
int d;
int e;
int unprocessed;
int compress_attempted;
int gatherer;
I have created a *FpMetaId.java (extends BytesWritable)* corresponding to the <value> and *FpMetadata.java (extends BytesWritable)* corresponding to the <key>.
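A minimal sketch of the 40-byte record layout described above, using only java.nio. The class name MetaRecord, the field names and the parse method are hypothetical and not taken from the poster's code. The byte order is an assumption as well: a C program on an x86 machine writes little-endian values, whereas Java's DataInput and Hadoop's Writables read big-endian by default, so the order is passed in explicitly.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical view of one 40-byte metafile record (names are illustrative). */
final class MetaRecord {
    static final int KEY_BYTES = 8;
    static final int VALUE_BYTES = 32;
    static final int RECORD_BYTES = KEY_BYTES + VALUE_BYTES; // 40

    final long a;                 // key (8 bytes)
    final long b;                 // value starts here (32 bytes in total)
    final int c, d, e;
    final int unprocessed, compressAttempted, gatherer;

    private MetaRecord(ByteBuffer buf) {
        // Fields are read in the order they appear in the record.
        a = buf.getLong();
        b = buf.getLong();
        c = buf.getInt();
        d = buf.getInt();
        e = buf.getInt();
        unprocessed = buf.getInt();
        compressAttempted = buf.getInt();
        gatherer = buf.getInt();
    }

    /** Decodes one record starting at offset, using the given byte order. */
    static MetaRecord parse(byte[] raw, int offset, ByteOrder order) {
        if (offset < 0 || raw.length - offset < RECORD_BYTES) {
            throw new IllegalArgumentException("need " + RECORD_BYTES + " bytes at offset " + offset);
        }
        return new MetaRecord(ByteBuffer.wrap(raw, offset, RECORD_BYTES).order(order));
    }

    public static void main(String[] args) {
        // Build one little-endian record in memory, as a C writer on x86 might.
        ByteBuffer buf = ByteBuffer.allocate(RECORD_BYTES).order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(42L).putLong(7L);
        for (int i = 1; i <= 6; i++) {
            buf.putInt(i);
        }
        MetaRecord r = parse(buf.array(), 0, ByteOrder.LITTLE_ENDIAN);
        System.out.println(r.a + " " + r.b + " " + r.gatherer); // prints "42 7 6"
    }
}
```

If the file was written on a little-endian machine and is read through big-endian Java streams, the keys would decode to different numbers and the sort order would not match the original values, so the byte order is worth confirming before sorting.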
My sole aim is to get these records (40 bytes) sorted with the fp (double) as the key. And I need to write these sorted records back into a metafile (exactly my old metafile but with sorted records, binary only). I also implemented:
*MetafileInputFormat.java (extends SequenceFileAsBinaryInputFormat)* ---> makes the input file format compatible with my record.
*MetafileOutputFormat<K, V> extends SequenceFileOutputFormat* ---> makes the output file format compatible with my record.
*MetafileRecordReader.java (extends SequenceFileAsBinaryInputFormat.SequenceFileAsBinaryRecordReader)* ---> implements the record reader compatible with my record.
The MetafileRecordWriter class has been implemented within my MetafileOutputFormat.java file.
Let me walk you through the sequence of events that followed:
1) I resolved all the errors in the Writable classes (FpMetaId, FpMetadata), the in/out formats (MetafileInputFormat, MetafileOutputFormat) and the RecordReaders I implemented.
2) I copied the Writables to the /io folder and the other new files to the /mapred folder. It built successfully.
3) I modified the Sort file (the function I want to run with FpMetaId as key and FpMetadata as value) and imported these new classes in the file. I changed the default conf settings to the required Writables and RecordReaders. I built Hadoop using the ant command after this. It built successfully.
*Q) Does this ensure that all the new changes are reflected in the jar? (Am I ready to go and execute the sort function?)*
4) As I mentioned before, I am working with a sequential (binary) file format in which a (key,value) data structure repeats. So I wrote a C program that generates random values for my data structure and populates a file, sequentially writing my (key,value) data structure in binary. I gave this file as input to the sort, which should sort my (key,value) records by key. I got the error: fp_input not a SequenceFile (fp_input is my input file). I thought SequenceFiles were just a stream of binary data. Do they have a specific format?
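A small sketch of why a raw record dump is rejected: a SequenceFile starts with a header whose first three bytes are the magic characters 'S', 'E', 'Q', followed by a version byte, and a file of bare 40-byte records does not begin that way. The class SeqMagicCheck and its method are hypothetical helpers written with plain java.io for a local file. They are not part of Hadoop, and they check only the magic bytes, not the rest of the header.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper: checks only the leading magic bytes of a local file. */
final class SeqMagicCheck {

    /** True if the given leading bytes start with 'S', 'E', 'Q'. */
    static boolean hasSeqMagic(byte[] head) {
        return head.length >= 3
                && head[0] == 'S'
                && head[1] == 'E'
                && head[2] == 'Q';
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: SeqMagicCheck <local-file>");
            return;
        }
        byte[] head = new byte[4];
        int n;
        try (InputStream in = new FileInputStream(args[0])) {
            // Read up to 4 bytes; n is smaller if the file is shorter than that.
            n = in.readNBytes(head, 0, head.length);
        }
        boolean ok = hasSeqMagic(Arrays.copyOf(head, n));
        System.out.println(args[0] + (ok ? " starts with SEQ" : " does not start with SEQ"));
    }
}
```

A file produced by a C program that writes records back to back would fail this check unless its first record happened to begin with those three bytes.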
*Command used : bin/hadoop jar hadoop-0.20.2-examples.jar sort fp_input fp_output*
*Q) What does this imply? I have no clue how to proceed further. Again, is it because the jar file I used doesn't have the latest libraries? I could not find any good tutorials on this.*
It would be great if someone could offer a helping hand to this noob.
Thanks, Matthew John
-
Re: Sort with customized input/output !!
Ted Yu 2010-09-08, 03:59
Please get the Hadoop source code and read the comment at the beginning of SequenceFile.java: "Essentially there are 3 different formats for SequenceFiles ..."
-
Re: Sort with customized input/output !!
Matthew John 2010-09-08, 15:02
Thanks for the reply Ted !!
What I understand is that a SequenceFile has a header followed by the records, each in the format: record length, key length, key, value, with a sync marker appearing at some regular interval.
It would be great if someone could take a look at the following.
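A rough sketch of the per-record framing described above (record length, key length, key, value), assuming the uncompressed format and big-endian 4-byte lengths. RecordFrameSketch and frame are hypothetical names. The sketch leaves out the file header and the sync markers entirely, and in a real SequenceFile the key and value bytes are whatever the Writable classes serialize (BytesWritable, for instance, adds its own 4-byte length prefix), so this shows the shape only.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration of one framed record; not a SequenceFile writer. */
final class RecordFrameSketch {

    /** Lays out [record length][key length][key bytes][value bytes]. */
    static byte[] frame(byte[] key, byte[] value) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 + key.length + value.length);
        buf.putInt(key.length + value.length); // record length: key plus value bytes
        buf.putInt(key.length);                // key length
        buf.put(key).put(value);
        return buf.array();
    }

    public static void main(String[] args) {
        // An 8-byte key and a 32-byte value, matching the 40-byte metafile record.
        byte[] framed = frame(new byte[8], new byte[32]);
        System.out.println(framed.length); // prints 48
    }
}
```

Under these assumptions each 40-byte record would occupy 48 bytes once framed, which is one reason a file of bare fixed-size records cannot be read as a SequenceFile without conversion.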
Q 1) My file is basically in the format: header (a different one) followed by records (key, value). In this case the sizes of the record and the key are fixed. I would like to know *if I can modify the core code to make the SequenceFile format like this*. If yes, what code should I look at?
Q 2) *What is a sync marker (can we define it)?* Obviously my file does not have one. Can someone suggest a way around this obstacle? My final aim is to take this file in, sort it by key and write out the sorted file.
Thanks, Matthew