Re: Streaming value (200MB) from a SequenceFile
Sandy Ryza 2013-04-01, 05:59
I don't think saving the stream for later use would work - I was just
suggesting that if only some aggregate statistics needed to be calculated,
they could be calculated at read time instead of in the mapper. Nothing
requires a Writable to contain all the data that it reads.
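The aggregate-at-read-time idea can be sketched in plain Java. This is only a standalone sketch: in a real job the readFields(DataInput) method below would live in a class implementing org.apache.hadoop.io.Writable, and the record layout (a long count followed by that many doubles) is a hypothetical example, not a real SequenceFile format.

```java
import java.io.*;

// Sketch: a value object that computes aggregates while reading,
// instead of buffering the whole record. In a real Hadoop job this
// class would implement org.apache.hadoop.io.Writable; all names
// and the record layout here are hypothetical.
public class AggregatingValue {
    private double sum;
    private long count;

    // Mirrors Writable.readFields: consume the serialized record,
    // keeping only running statistics rather than the raw payload.
    public void readFields(DataInput in) throws IOException {
        count = in.readLong();          // number of doubles that follow
        sum = 0.0;
        for (long i = 0; i < count; i++) {
            sum += in.readDouble();     // aggregate on the fly; nothing is stored
        }
    }

    public double getSum() { return sum; }
    public long getCount() { return count; }

    // Helper for the demo: serialize a batch of doubles the same way
    // a matching Writable.write(DataOutput) would.
    public static byte[] encode(double... values) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeLong(values.length);
        for (double v : values) out.writeDouble(v);
        out.flush();
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        AggregatingValue v = new AggregatingValue();
        v.readFields(new DataInputStream(
                new ByteArrayInputStream(encode(1.0, 2.0, 3.0))));
        System.out.println("count=" + v.getCount() + " sum=" + v.getSum());
    }
}
```

Because only the running sum and count survive the read, memory use stays constant no matter how large the serialized value is.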
That's a good point that you can pass the locations of the files. A
drawback of this approach is that Hadoop attempts to co-locate mappers with
the nodes where their input data is stored, and passing locations instead
of the data would negate that locality benefit.
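The pass-the-location approach discussed here can be sketched as follows. To keep the demo standalone, java.nio.file stands in for HDFS; in a real mapper the open call would be along the lines of FileSystem.get(context.getConfiguration()).open(new Path(location)). The class and method names are hypothetical.

```java
import java.io.*;
import java.nio.file.*;

// Sketch of the "pass the location, not the bytes" idea: the map value
// is just a path string, and the mapper opens its own stream. Shown with
// java.nio.file so it runs standalone; in Hadoop you would open the path
// via FileSystem.get(conf).open(new Path(location)) instead.
public class LocationValueDemo {

    // Process the referenced file incrementally; only an 8 KB buffer is
    // ever in memory, regardless of how large the file is.
    public static long streamAndCount(String location) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(Paths.get(location))) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;   // replace with real per-chunk processing
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("value", ".bin");
        Files.write(tmp, new byte[100_000]);  // stand-in for a 200 MB value
        System.out.println("bytes streamed: " + streamAndCount(tmp.toString()));
        Files.delete(tmp);
    }
}
```

The trade-off Sandy raises still applies: the scheduler places the mapper near its *input split* (the small file of locations), not near the large files those locations point to.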
200 MB is not too small a file for Hadoop. A typical HDFS block size is 64
MB or 128 MB, so a file that's larger than that is not unreasonable.
On Sun, Mar 31, 2013 at 8:56 PM, Rahul Bhattacharjee <
[EMAIL PROTECTED]> wrote:
> Sorry for the multiple replies.
> There is one more thing that can be done (I guess) for streaming the
> values rather than constructing the whole object. We can store the value
> in HDFS as a file and pass its location as the mapper's value. The mapper
> can then open a stream using the specified location.
> Not sure if a 200 MB file would qualify as a small file wrt Hadoop, or if
> too many 200 MB files would have any impact on the NN.
> On Mon, Apr 1, 2013 at 9:02 AM, Rahul Bhattacharjee <
> [EMAIL PROTECTED]> wrote:
>> Hi Sandy,
>> I am also new to Hadoop and have a question here.
>> The Writable does have a DataInput stream so that objects can be
>> constructed from the byte stream.
>> Are you suggesting saving the stream for later use? But later we cannot
>> ascertain the state of the stream.
>> For a large value, I think we can take just the useful part and emit it
>> from the mapper; we might also have a custom input format to do this so
>> that the large value doesn't even reach the mapper.
>> Am I missing anything here?
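The custom-input-format idea above can be sketched with a record-reader-style wrapper that truncates each value at read time, so the mapper only ever sees a bounded prefix. The length-prefixed record layout and every name here are hypothetical; a real version would subclass Hadoop's RecordReader inside a custom InputFormat.

```java
import java.io.*;

// Sketch of the "trim before the mapper sees it" idea: a record-reader-style
// wrapper that keeps only the first maxBytes of each value and discards the
// rest at read time. Hypothetical names; a real version would subclass
// Hadoop's RecordReader. Records here are length-prefixed: an int length,
// then that many payload bytes.
public class TruncatingReader {
    private final DataInputStream in;
    private final int maxBytes;

    public TruncatingReader(InputStream raw, int maxBytes) {
        this.in = new DataInputStream(raw);
        this.maxBytes = maxBytes;
    }

    // Returns at most maxBytes of the next record's payload,
    // or null at end of stream.
    public byte[] nextValue() throws IOException {
        int len;
        try {
            len = in.readInt();
        } catch (EOFException e) {
            return null;                        // no more records
        }
        int keep = Math.min(len, maxBytes);
        byte[] head = new byte[keep];
        in.readFully(head);
        long skipped = 0;
        while (skipped < len - keep) {          // drop the tail without buffering it
            long s = in.skip(len - keep - skipped);
            if (s <= 0) { in.readByte(); skipped++; } else { skipped += s; }
        }
        return head;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(5);                        // a 5-byte "large" value
        out.write(new byte[]{10, 20, 30, 40, 50});
        TruncatingReader r = new TruncatingReader(
                new ByteArrayInputStream(bos.toByteArray()), 3);
        System.out.println(java.util.Arrays.toString(r.nextValue())); // prints [10, 20, 30]
    }
}
```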
>> On Sat, Mar 30, 2013 at 11:22 PM, Jerry Lam <[EMAIL PROTECTED]> wrote:
>>> Hi everyone,
>>> I'm having a problem streaming individual key-value pairs of 200MB to
>>> 1GB from a MapFile.
>>> I need to stream the large value to an output stream instead of reading
>>> the entire value before processing, because the latter potentially uses
>>> too much memory.
>>> I read the API for MapFile; next(WritableComparable key, Writable val)
>>> does not return an input stream.
>>> How can I accomplish this?