Re: basic question about rack awareness and computation migration
I might have missed something but is there a reason for the input of the
mappers to be a list of files and not the files themselves?
The usual way is to provide a path to the files that should be processed
and then Hadoop will figure out for you how best to use data locality.
Is there a reason for not doing that?
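To make "figuring out data locality" a bit more concrete: for each piece of input, Hadoop knows which hosts hold replicas of its blocks, and it prefers to run the task on one of those hosts. The sketch below is plain Java, not Hadoop code, and only models that idea. The `Block` class, the `bestHost` method, and the node names are all made up for illustration; the real scheduler takes more into account (free slots, racks, and so on).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LocalitySketch {
    /** Hypothetical stand-in for one block: its length and the hosts holding a replica. */
    static final class Block {
        final long length;
        final List<String> hosts;

        Block(long length, String... hosts) {
            this.length = length;
            this.hosts = Arrays.asList(hosts);
        }
    }

    /** Returns the host that stores the most bytes of the given blocks (null if there are none). */
    static String bestHost(List<Block> blocks) {
        // TreeMap keeps hosts sorted, so ties are broken alphabetically.
        Map<String, Long> bytesPerHost = new TreeMap<>();
        for (Block b : blocks) {
            for (String host : b.hosts) {
                bytesPerHost.merge(host, b.length, Long::sum);
            }
        }
        String best = null;
        long bestBytes = -1;
        for (Map.Entry<String, Long> e : bytesPerHost.entrySet()) {
            if (e.getValue() > bestBytes) {
                best = e.getKey();
                bestBytes = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Block> imageBlocks = Arrays.asList(
                new Block(128, "nodeA", "nodeB"),
                new Block(128, "nodeB", "nodeC"),
                new Block(64, "nodeC", "nodeA"));
        System.out.println("Preferred host: " + bestHost(imageBlocks));
    }
}
```

If the task input is only a list of paths, the framework sees the blocks of the list file, not those of the images, so a preference like this cannot be computed for the image data.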
How big is each image file? How are they stored?
You could create a non-splittable input format (it is a simple property);
that way you are sure that a single mapper will process the whole file.
Your mapper then trivially compresses the image it is given, and Hadoop will
use one mapper per file and deal with data locality by itself.
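For the Hadoop side, the usual way to get a non-splittable format is to subclass FileInputFormat and override isSplitable to return false. What the mapper then does with a whole file could look roughly like the sketch below. It is plain Java using javax.imageio, with no Hadoop types, and the class and method names (`WholeImageCompressor`, `toJpeg`) are hypothetical. It also assumes the input is in a format ImageIO can decode out of the box (PNG, BMP, GIF and the like); truly raw camera or sensor formats would need their own decoder.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

class WholeImageCompressor {
    static {
        // Keep ImageIO's stream cache in memory rather than in a temp file.
        ImageIO.setUseCache(false);
    }

    /** Decodes one complete image file held in memory and re-encodes it as JPEG. */
    static byte[] toJpeg(byte[] wholeFile) throws IOException {
        BufferedImage image = ImageIO.read(new ByteArrayInputStream(wholeFile));
        if (image == null) {
            throw new IOException("unsupported or truncated image");
        }
        int w = image.getWidth();
        int h = image.getHeight();
        // JPEG has no alpha channel, so copy the pixels into a plain RGB image first.
        BufferedImage rgb = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        rgb.setRGB(0, 0, w, h, image.getRGB(0, 0, w, h, null, 0, w), 0, w);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!ImageIO.write(rgb, "jpg", out)) {
            throw new IOException("no JPEG writer available");
        }
        return out.toByteArray();
    }
}
```

The point of the sketch is that the image library needs the complete file, which is why the input format must not split it: an arbitrary byte range of the file is of no use to the decoder.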
On Wed, Mar 6, 2013 at 4:43 AM, Julian Bui <[EMAIL PROTECTED]> wrote:
> Thanks Harsh,
> > Are your input lists big (for each compressed output)? And is the list
> > arbitrary or a defined list per goal?
> I dictate what my inputs will look like. If they need to be a list of image
> files, then I can do that. If they need to be the images themselves as you
> suggest, then I can do that too but I'm not exactly sure what that would
> look like. Basically, I will try to format my inputs in the way that makes
> the most sense from a locality point of view.
> Since all the keys must be writable, I explored the Writable interface and
> found these interesting sub-classes:
> - FileSplit
> - BlockLocation
> - BytesWritable
> These all look somewhat promising as they kind of reveal the location
> information of the files.
> I'm not exactly sure how I would use these to hint at the data locations.
> Since these chunks of the file appear to be somewhat arbitrary in size and
> offset, I don't know how I could perform imagery operations on them. For
> example, if I knew that bytes 0x100-0x400 lie on node X, then that makes it
> difficult for me to use that information to give to my image libraries -
> does 0x100-0x400 correspond to some region/MBR within the image? I'm not
> sure how to make use of this information.
> The responses I've gotten so far indicate to me that HDFS kind of does the
> computation migration for me but that I have to give it enough information
> to work with. If someone could point to some detailed reading about this
> subject that would be pretty helpful, as I just can't find the
> documentation for it.
> Thanks again,
> On Tue, Mar 5, 2013 at 5:39 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>> Your concern is correct: If your input is a list of files, rather than
>> the files themselves, then the tasks would not be data-local - since
>> the task input would just be the list of files, and the files' data
>> may reside on any node/rack of the cluster.
>> However, your job will still run, since HDFS performs remote reads
>> transparently without developer intervention, and everything will still
>> work as you've written it. If a block is found local to the DN, it is
>> read locally as well - all of this is automatic.
>> Are your input lists big (for each compressed output)? And is the list
>> arbitrary or a defined list per goal?
>> On Tue, Mar 5, 2013 at 5:19 PM, Julian Bui <[EMAIL PROTECTED]> wrote:
>> > Hi hadoop users,
>> > I'm trying to find out if computation migration is something the
>> > developer needs to worry about or if it's supposed to be hidden.
>> > I would like to use hadoop to take in a list of image paths in the hdfs
>> > then have each task compress these large, raw images into something much
>> > smaller - say jpeg files.
>> > Input: list of paths
>> > Output: compressed jpeg
>> > Since I don't really need a reduce task (I'm more using hadoop for its
>> > reliability and orchestration aspects), my mapper ought to just take the
>> > list of image paths and then work on them. As I understand it, each
>> > image will likely be on multiple data nodes.
>> > My question is how will each mapper task "migrate the computation" to the
>> > data nodes? I recall reading that the namenode is supposed to deal with