HDFS, mail # user - Re: Streaming value of (200MB) from a SequenceFile

Sandy Ryza 2013-04-01, 05:59
Hi Rahul,

I don't think saving the stream for later use would work - I was just
suggesting that if only some aggregate statistics needed to be calculated,
they could be calculated at read time instead of in the mapper.  Nothing
requires a Writable to contain all the data that it reads.
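As a rough illustration of computing statistics at read time, here is a minimal sketch. It assumes the large value was serialized the way BytesWritable does it (a 4-byte length followed by the raw bytes); the class name ValueStats and the choice of a CRC32 checksum are hypothetical. In a real job the class would declare "implements org.apache.hadoop.io.Writable" and would also need a write(DataOutput) method; it is left free of Hadoop imports here so it compiles on its own.

```java
import java.io.DataInput;
import java.io.IOException;
import java.util.zip.CRC32;

/**
 * Hypothetical value type that summarizes a large record while reading it,
 * so the full payload is never held in memory at once.
 */
final class ValueStats {
    // Size of the reusable read buffer; 64 KB is an arbitrary choice.
    private static final int CHUNK_SIZE = 64 * 1024;

    private long length;
    private long checksum;

    // Same signature as Writable.readFields(DataInput).
    // Assumed layout: int length, then that many payload bytes.
    public void readFields(DataInput in) throws IOException {
        int remaining = in.readInt();
        length = remaining;

        CRC32 crc = new CRC32();
        byte[] buffer = new byte[CHUNK_SIZE];
        while (remaining > 0) {
            int n = Math.min(buffer.length, remaining);
            in.readFully(buffer, 0, n);   // consume one chunk of the value
            crc.update(buffer, 0, n);     // fold it into the running statistic
            remaining -= n;
        }
        checksum = crc.getValue();
    }

    public long getLength() {
        return length;
    }

    public long getChecksum() {
        return checksum;
    }
}
```

After readFields returns, the object holds only the two long fields, however large the value was. Because it reads one layout and would have to write another, a type like this is only suitable as an input value; the mapper would copy the statistics into an ordinary Writable before emitting them.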

That's a good point that you can pass the locations of the files.  A
drawback of this is that Hadoop attempts to co-locate mappers with the
nodes where their input data is stored, and this approach would negate
that locality.

200 MB is not too small a file for Hadoop.  A typical HDFS block size is 64
MB or 128 MB, so a file that's larger than that is not unreasonable.
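To put numbers on that, here is a small sketch of the block arithmetic. The class and method names are hypothetical, and the block sizes are just the two typical values mentioned above; a cluster's actual block size depends on its configuration.

```java
/** Hypothetical helper: how many HDFS blocks a file of a given size occupies. */
final class BlockMath {
    static final long MB = 1024L * 1024L;

    // Ceiling division: a final partial block still counts as one block.
    static long blocksFor(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        System.out.println(blocksFor(200 * MB, 64 * MB));   // 4
        System.out.println(blocksFor(200 * MB, 128 * MB));  // 2
    }
}
```

So a 200 MB file spans only a handful of blocks, and the NameNode tracks correspondingly few block entries per file.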


On Sun, Mar 31, 2013 at 8:56 PM, Rahul Bhattacharjee <

> Sorry for the multiple replies.
> There is one more thing that can be done (I guess) for streaming the
> values rather than constructing the whole object itself. We can store the
> value in HDFS as a file and pass its location as the value to the mapper.
> The mapper can open a stream using the location specified.
> Not sure if a 200 MB file would qualify as a small file wrt Hadoop, or if
> too many 200 MB files would have any impact on the NN.
> Thanks,
> Rahul
> On Mon, Apr 1, 2013 at 9:02 AM, Rahul Bhattacharjee <
>> Hi Sandy,
>> I am also new to Hadoop and have a question here.
>> The Writable does have a DataInput stream so that the objects can be
>> constructed from the byte stream.
>> Are you suggesting saving the stream for later use? But later we cannot
>> ascertain the state of the stream.
>> For a large value, I think we can actually take the useful part and
>> emit it from a mapper. We might also have a custom input format to
>> do this so that the large value doesn't even reach the mapper.
>> Am I missing anything here?
>> Thanks,
>> Rahul
>> On Sat, Mar 30, 2013 at 11:22 PM, Jerry Lam <[EMAIL PROTECTED]> wrote:
>>> Hi everyone,
>>> I'm having trouble streaming individual key-value pairs of 200MB to 1GB
>>> from a MapFile.
>>> I need to stream the large value to an output stream instead of reading
>>> the entire value before processing, because it potentially uses too much
>>> memory.
>>> I read the API for MapFile; next(WritableComparable key, Writable
>>> val) does not return an input stream.
>>> How can I accomplish this?
>>> Thanks,
>>> Jerry