Hadoop, mail # user - Extra 4 bytes at beginning of serialized file


Re: Extra 4 bytes at beginning of serialized file
Chris Douglas 2009-08-12, 18:00
Rather than calling key.write(out), use out.write(key.getBytes(), 0,
key.getLength()) in your OutputFormat. You'll need to declare the key
type as BytesWritable or BinaryComparable rather than Writable, since
getBytes() and getLength() are not part of the Writable interface. (For
maintainers: "generic" may not be the best way to describe this output
format, btw.)

As Todd points out, whether the output data are legible is entirely up  
to your application. -C
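
To make the size difference concrete, here is a minimal, Hadoop-free sketch of
the two write paths discussed in this thread. It assumes, as Todd describes
below, that BytesWritable.write emits a 4-byte length followed by the bytes.
The class and method names (LengthPrefixDemo, withLengthHeader, rawBytes) are
made up for illustration and are not part of Hadoop; a real RecordWriter would
write to the FSDataOutputStream instead of an in-memory buffer.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class LengthPrefixDemo {

    // Stand-in for key.write(out) on a BytesWritable: a 4-byte int length
    // header, then the payload bytes (assumed layout, per the thread).
    static byte[] withLengthHeader(byte[] payload, int length) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(length);
        out.write(payload, 0, length);
        out.flush();
        return buf.toByteArray();
    }

    // Stand-in for out.write(key.getBytes(), 0, key.getLength()):
    // only the payload bytes reach the stream, with no header.
    static byte[] rawBytes(byte[] payload, int length) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.write(payload, 0, length);
        out.flush();
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = {10, 20, 30, 40, 50};
        System.out.println("with header: " + withLengthHeader(payload, payload.length).length);
        System.out.println("raw bytes:   " + rawBytes(payload, payload.length).length);
    }
}
```

With a 5-byte payload the first layout comes to 9 bytes and the second to 5,
which matches the 4-byte difference Kris observed.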

On Aug 11, 2009, at 7:42 PM, Todd Lipcon wrote:

> If you know you'll only have one object in the file, you could write your
> own Writable implementation which doesn't write its length. The problem is
> that you'll never be able to *read* it, since writables only get an input
> stream and thus don't know the file size.
>
> If you choose to do this, just model it after BytesWritable but drop the
> 4-byte length header.
>
> -Todd
>
> On Tue, Aug 11, 2009 at 7:23 PM, Kris Jirapinyo <[EMAIL PROTECTED]> wrote:
>
>> Ah, that explains it, thanks Todd.  Is there a way to serialize an object
>> without using BytesWritable, or some way I can have a "perfect" serialized
>> file so I won't have to keep discarding the first 4 bytes of the files?
>>
>> -- Kris.
>>
>> On Tue, Aug 11, 2009 at 7:03 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
>>
>>> BytesWritable serializes itself by first outputting the array length, and
>>> then outputting the array itself. The 4 bytes at the top of the file are
>>> the length of the value itself.
>>>
>>> Hope that helps
>>> -Todd
>>>
>>> On Tue, Aug 11, 2009 at 6:33 PM, Kris Jirapinyo <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi all,
>>>>  I was wondering if anyone's encountered 4 extra bytes at the beginning
>>>> of the serialized object file using MultipleOutputFormat. Basically, I
>>>> am using BytesWritable to write the serialized byte arrays in the
>>>> reducer phase.  My writer is a generic one:
>>>>
>>>> public class GenericOutputFormat extends FileOutputFormat<Writable, Writable> {
>>>>
>>>>   @Override
>>>>   public RecordWriter<Writable, Writable> getRecordWriter(FileSystem ignored,
>>>>       JobConf job, String name, Progressable progress) throws IOException {
>>>>     Path file = FileOutputFormat.getTaskOutputPath(job, name);
>>>>     FileSystem fs = file.getFileSystem(job);
>>>>     FSDataOutputStream fileOut = fs.create(file, progress);
>>>>     return new GenericWriter(fileOut);
>>>>   }
>>>>
>>>>   static class GenericWriter implements RecordWriter<Writable, Writable> {
>>>>     protected DataOutputStream out;
>>>>
>>>>     GenericWriter(DataOutputStream out) {
>>>>       this.out = out;
>>>>     }
>>>>
>>>>     @Override
>>>>     public synchronized void close(Reporter reporter) throws IOException {
>>>>       out.close();
>>>>     }
>>>>
>>>>     @Override
>>>>     public synchronized void write(Writable key, Writable value) throws IOException {
>>>>       key.write(out);
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> Basically, it'll just write out whatever is in the DataOutputStream.
>>>> When I debugged, I printed out the size of the byte array in the
>>>> BytesWritable, and the resulting file is always 4 bytes larger than that
>>>> number.  Any ideas?
>>>>
>>>> -- Kris.
>>>>
>>>
>>