|
Matthew John
2010-09-13, 09:15
Jeff Zhang
2010-09-13, 10:06
Matthew John
2010-09-13, 12:02
Jeff Zhang
2010-09-13, 12:41
Owen O'Malley
2010-09-13, 15:11
Matthew John
2010-09-13, 19:11
Owen O'Malley
2010-09-13, 20:42
Matthew John
2010-09-14, 04:19
Owen O'Malley
2010-09-14, 14:41
|
-
changing SequenceFile format
Matthew John, 2010-09-13, 09:15
Hi guys,
I want to use a binary file laid out as <key1><value1><key2><value2>... (key and value lengths are constant) as input for the Sort example. As I understand it, a standard Hadoop SequenceFile stores its records as <Recordlength><Keylength><Key><Value>... Where should I modify the code so that my file can be used as input to the RecordReader?

Please share your views.

Matthew
-
Re: changing SequenceFile format
Jeff Zhang, 2010-09-13, 10:06
I think you can modify the Writer's append method and the Reader's next method.
--
Best Regards
Jeff Zhang
-
Re: changing SequenceFile format
Matthew John, 2010-09-13, 12:02
For the Writer, I can see the append and appendRaw methods. But the
many next methods in the Reader are confusing. Can you give further info on them?

Matthew
-
Re: changing SequenceFile format
Jeff Zhang, 2010-09-13, 12:41
The next method has three versions, one for each of the three types of sequence file:
uncompressed, record-compressed and block-compressed.

I think you should write a new class for your data format rather than modify SequenceFile, because SequenceFile may be more complex than you need (it has lots of features that you may not use).

--
Best Regards
Jeff Zhang
-
Re: changing SequenceFile format
Owen O'Malley, 2010-09-13, 15:11
On Sep 13, 2010, at 2:15 AM, Matthew John wrote:

> Hi guys,
>
> I wanted to take in file with input : <key1><value1><key2><value2>......
> binary sequence file (key and value length are constant) as input for the
> Sort (examples) . But as I understand the data in a standard Sequencefile of
> hadoop is in the format : <Recordlength><Keylength><Key><Value>..... . Where
> should I modify the code so as to use my inputfile as input to the
> recordreader.

Instead of modifying SequenceFile, I'd suggest that you create a new FixedRecordFile that has a fixed width for keys and values. In the terasort example in MapReduce I create an InputFormat that has 10 byte keys and 90 byte values with no markers. See http://bit.ly/9RybHw . The terasort example's InputFormat also does sampling, which you probably don't need. You will need to pay attention to the getSplits to ensure that you cut on record boundaries.

-- Owen
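The point about cutting on record boundaries can be illustrated without any Hadoop classes. The following is only a sketch of the arithmetic a getSplits implementation would need, not the terasort code: the class name FixedRecordSplits, the Range type and the computeSplits method are hypothetical, and the 40-byte record size assumes the 8-byte key and 32-byte value described elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: cut a fixed-width record file into byte ranges that never break a record. */
final class FixedRecordSplits {

    /** A byte range [start, start + length) within the file. */
    static final class Range {
        final long start;
        final long length;

        Range(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    static List<Range> computeSplits(long fileLength, int recordSize, long targetSplitSize) {
        if (recordSize <= 0 || fileLength % recordSize != 0) {
            throw new IllegalArgumentException("file length must be a multiple of the record size");
        }
        // Round the target size down to whole records, keeping at least one record per split.
        long recordsPerSplit = Math.max(1, targetSplitSize / recordSize);
        long splitSize = recordsPerSplit * recordSize;

        List<Range> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            // The last split may be shorter, but it still ends on a record boundary.
            splits.add(new Range(start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Ten 40-byte records with a target split size of 100 bytes.
        for (Range r : computeSplits(400, 40, 100)) {
            System.out.println("start=" + r.start + " length=" + r.length);
        }
    }
}
```

In a real InputFormat each range would presumably become a FileSplit for the same path; rounding the split size down to a multiple of the record size is what keeps every split aligned.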
-
Re: changing SequenceFile format
Matthew John, 2010-09-13, 19:11
Thanks Owen for your reply!

The terasort input you have implemented is text type, and the input is in line format, whereas I am dealing with a binary sequence file. For my requirement I have created two Writable implementations, for the key and the value respectively.

FpMetaId (key, 8 bytes):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;

    public class FpMetaId extends BytesWritable {
        public long fp; // 8-byte key

        public FpMetaId(long fp) { this.fp = fp; }
        public FpMetaId() { this(0); }

        public void readFields(DataInput in) throws IOException {
            setSize(0); // clear the old data
            setSize(8); // fixed width, no length prefix in the file
            in.readFully(getBytes(), 0, getSize());
            fp = byteArrayToLong(getBytes(), 0);
        }

        public void write(DataOutput out) throws IOException {
            // Serialize the field itself: the inherited buffer is not kept in sync with fp.
            byte[] bytes = longToByteArray(fp);
            out.write(bytes, 0, bytes.length);
        }

        // Big-endian encoding of a long into 8 bytes.
        public static byte[] longToByteArray(long value) {
            byte[] b = new byte[8];
            for (int i = 0; i < 8; i++) {
                int offset = (b.length - 1 - i) * 8;
                b[i] = (byte) ((value >>> offset) & 0xFF);
            }
            return b;
        }

        // Big-endian decoding of all 8 bytes starting at offset.
        public static long byteArrayToLong(byte[] b, int offset) {
            long value = 0;
            for (int i = 0; i < 8; i++) {
                value = (value << 8) | (b[offset + i] & 0xFF);
            }
            return value;
        }
    }

FpMetadata (value, 32 bytes):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;

    public class FpMetadata extends BytesWritable {
        public long fbn;
        public int ino;
        public int wi_gen;
        public int cp_count;
        public int unprocessed;
        public int compress_attempted;
        public int gatherer;

        public FpMetadata(long fbn, int ino, int wi_gen, int cp_count,
                          int unprocessed, int compress_attempted, int gatherer) {
            super();
            this.fbn = fbn;
            this.ino = ino;
            this.wi_gen = wi_gen;
            this.cp_count = cp_count;
            this.unprocessed = unprocessed;
            this.compress_attempted = compress_attempted;
            this.gatherer = gatherer;
        }

        public FpMetadata() { this(0, 0, 0, 0, 1, 1, 1); }

        public void readFields(DataInput in) throws IOException {
            setSize(0); // clear the old data
            setSize(32); // 8-byte long followed by six 4-byte ints
            in.readFully(getBytes(), 0, getSize());
            fbn = byteArrayToLong(getBytes(), 0);
            ino = byteArrayToInt(getBytes(), 8);
            wi_gen = byteArrayToInt(getBytes(), 12);
            cp_count = byteArrayToInt(getBytes(), 16);
            unprocessed = byteArrayToInt(getBytes(), 20);
            compress_attempted = byteArrayToInt(getBytes(), 24);
            gatherer = byteArrayToInt(getBytes(), 28);
        }

        public void write(DataOutput out) throws IOException {
            // Build the 32-byte record from the fields and write that array.
            byte[] bytes = longToByteArray(fbn);
            bytes = concat(bytes, intToByteArray(ino));
            bytes = concat(bytes, intToByteArray(wi_gen));
            bytes = concat(bytes, intToByteArray(cp_count));
            bytes = concat(bytes, intToByteArray(unprocessed));
            bytes = concat(bytes, intToByteArray(compress_attempted));
            bytes = concat(bytes, intToByteArray(gatherer));
            out.write(bytes, 0, bytes.length);
        }

        public static byte[] intToByteArray(int value) {
            byte[] b = new byte[4];
            for (int i = 0; i < 4; i++) {
                int offset = (b.length - 1 - i) * 8;
                b[i] = (byte) ((value >>> offset) & 0xFF);
            }
            return b;
        }

        public static byte[] longToByteArray(long value) {
            byte[] b = new byte[8];
            for (int i = 0; i < 8; i++) {
                int offset = (b.length - 1 - i) * 8;
                b[i] = (byte) ((value >>> offset) & 0xFF);
            }
            return b;
        }

        public static int byteArrayToInt(byte[] b, int offset) {
            return (b[offset] << 24)
                 + ((b[offset + 1] & 0xFF) << 16)
                 + ((b[offset + 2] & 0xFF) << 8)
                 + (b[offset + 3] & 0xFF);
        }

        // Big-endian decoding of all 8 bytes starting at offset.
        public static long byteArrayToLong(byte[] b, int offset) {
            long value = 0;
            for (int i = 0; i < 8; i++) {
                value = (value << 8) | (b[offset + i] & 0xFF);
            }
            return value;
        }

        public static byte[] concat(byte[] a, byte[] b) {
            byte[] result = new byte[a.length + b.length];
            System.arraycopy(a, 0, result, 0, a.length);
            System.arraycopy(b, 0, result, a.length, b.length);
            return result;
        }
    }

I assume I should also implement an InputFormat and an OutputFormat along with these, but I cannot figure out how to provide the corresponding FileSplit and RecordReader/RecordWriter. Also, SequenceFile, which is apt for binary sequence files, has the record structure <record len><key len><key><value>, and that is what all the sequence-file record readers implement. How can I use any of these record readers when my records are in the format <key><value>...?

Please reply.

Thanks,
Matthew
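As an aside, the hand-written shift helpers in the two classes can be replaced by java.nio.ByteBuffer, which is big-endian by default. The sketch below is one possible way to encode the same 8-byte key and 32-byte value layout; the class name FpRecordCodec and its method names are hypothetical and are not part of the posted code.

```java
import java.nio.ByteBuffer;

/** Sketch: big-endian encoding of the 8-byte key and 32-byte value with ByteBuffer. */
final class FpRecordCodec {
    static final int KEY_SIZE = 8;    // one long
    static final int VALUE_SIZE = 32; // one long followed by six ints

    static byte[] encodeKey(long fp) {
        return ByteBuffer.allocate(KEY_SIZE).putLong(fp).array();
    }

    static long decodeKey(byte[] bytes, int offset) {
        // wrap(array, offset, length) positions the buffer at offset.
        return ByteBuffer.wrap(bytes, offset, KEY_SIZE).getLong();
    }

    static byte[] encodeValue(long fbn, int ino, int wiGen, int cpCount,
                              int unprocessed, int compressAttempted, int gatherer) {
        return ByteBuffer.allocate(VALUE_SIZE)
                .putLong(fbn)
                .putInt(ino)
                .putInt(wiGen)
                .putInt(cpCount)
                .putInt(unprocessed)
                .putInt(compressAttempted)
                .putInt(gatherer)
                .array();
    }

    /** Reads the int stored at the given byte offset of an encoded value. */
    static int intAt(byte[] value, int offset) {
        return ByteBuffer.wrap(value).getInt(offset);
    }

    public static void main(String[] args) {
        byte[] key = encodeKey(0x0102030405060708L);
        System.out.println(Long.toHexString(decodeKey(key, 0)));
    }
}
```

Reading all eight bytes back through getLong avoids the easy mistake of decoding a long with only four shifts.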
-
Re: changing SequenceFile format
Owen O'Malley, 2010-09-13, 20:42
On Sep 13, 2010, at 12:11 PM, Matthew John wrote:

> The terasort input you have implemented is text type. And the input is line
> format where as I am dealing with sequence binary file. For my requirement I
> have created two writable implementables for the key and value respectively

I would just use BytesWritable directly. The reader/writer should insist on the fixed lengths, not the types. The only restriction is that you can't use the BytesWritable readFields and write methods. You'll need to implement them in the file reader and writer.

> I assume I should also implement a inputformat and outputformat along with
> these. But I am not able to figure out how to provide the respective
> filesplit and recordreader/writer.

To implement InputFormat, you'll need to implement getSplits and createRecordReader. You'll need to create a RecordReader class that understands your file's reader class. Once you implement an InputFormat, just set the class as the InputFormat for your job.

-- Owen
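To make the idea of a file reader that insists on the lengths more concrete, here is a minimal sketch over a plain java.io.InputStream. It is an assumption about how such a reader might look, not Hadoop's API: the class FixedRecordReader and its next, key and value methods are hypothetical, and a real RecordReader would wrap something like this and copy the bytes into BytesWritable objects.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Sketch: reads <key><value> records of fixed widths with no length markers. */
final class FixedRecordReader implements AutoCloseable {
    private final DataInputStream in;
    private final byte[] key;
    private final byte[] value;

    FixedRecordReader(InputStream in, int keySize, int valueSize) {
        if (keySize < 1 || valueSize < 0) {
            throw new IllegalArgumentException("key size must be >= 1 and value size >= 0");
        }
        this.in = new DataInputStream(in);
        this.key = new byte[keySize];
        this.value = new byte[valueSize];
    }

    /** Reads the next record; returns false at a clean end of stream. */
    boolean next() throws IOException {
        int first = in.read();
        if (first < 0) {
            return false; // no more records
        }
        key[0] = (byte) first;
        // readFully throws EOFException if the file ends in the middle of a record.
        in.readFully(key, 1, key.length - 1);
        in.readFully(value);
        return true;
    }

    /** The key buffer; it is reused and overwritten by each call to next(). */
    byte[] key() {
        return key;
    }

    /** The value buffer; it is reused and overwritten by each call to next(). */
    byte[] value() {
        return value;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

Because the widths are constructor parameters, the same reader would serve an 8-byte key with a 32-byte value or the 10/90 layout mentioned for terasort.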
-
Re: changing SequenceFile format
Matthew John, 2010-09-14, 04:19
Hey Owen,
To sum it up, I should write an InputFormat and an OutputFormat in which I define my RecordReader/RecordWriter and InputSplits. Now, why can't I use the FpMetadata and FpMetaId classes I implemented as the value and key classes? Wouldn't that solve a lot of problems, since I have already defined readFields and write there?

Matthew
-
Re: changing SequenceFile format
Owen O'Malley, 2010-09-14, 14:41
On Sep 13, 2010, at 9:19 PM, Matthew John wrote:

> To sum it up, I should be writing InputFormat , OutputFormat where I will be
> defining my RecordReader/Writer and InputSplits. Now, why cant I use the
> FpMetadata and FpMetaId I implemented as the value and key classes. Would
> not that solve a lot of problem since I have defined in.readfields and
> out.write there itself.

You could, it just isn't very reusable. If you use BytesWritable, it is easy to make the input format parameterizable so that it handles keys and values of different sizes. It would work either way...

-- Owen
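The reusability argument applies to the output side as well. As a rough sketch under the same assumptions as before (plain java.io, no Hadoop classes), a writer can take the key and value sizes as parameters and reject anything else; the class FixedRecordWriter and its append method are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;

/** Sketch: writes <key><value> records and insists on the configured widths. */
final class FixedRecordWriter implements AutoCloseable {
    private final OutputStream out;
    private final int keySize;
    private final int valueSize;

    FixedRecordWriter(OutputStream out, int keySize, int valueSize) {
        this.out = out;
        this.keySize = keySize;
        this.valueSize = valueSize;
    }

    /** Appends one record with no length prefix; the sizes are checked instead. */
    void append(byte[] key, byte[] value) throws IOException {
        if (key.length != keySize || value.length != valueSize) {
            throw new IllegalArgumentException(
                    "expected " + keySize + "-byte key and " + valueSize + "-byte value");
        }
        out.write(key);
        out.write(value);
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

In a job, the two sizes would most likely come from the job configuration, so one format class could cover several record layouts.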