HDFS, mail # user - Re: basic question about rack awareness and computation migration


Re: basic question about rack awareness and computation migration
Bertrand Dechoux 2013-03-07, 12:35
I might have missed something, but is there a reason for the input of the
mappers to be a list of files rather than the files themselves?
The usual way is to provide a path to the files that should be processed;
Hadoop will then figure out for you how best to use data locality.
Is there a reason for not doing that?

How big is each image file? How are they stored?

You could create an input format that is not splittable (it is a simple
property override); that way you are sure that a single mapper will process
the whole file. Your mapper then simply compresses the image it is given;
Hadoop will use one mapper per file and deal with data locality by itself.
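
A minimal sketch of such a non-splittable input format, assuming the new
(org.apache.hadoop.mapreduce) API; WholeFileInputFormat and its nested record
reader are illustrative names, not classes that ship with Hadoop:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    // The "simple property": never split a file, so one mapper always
    // processes one whole image.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    // Reads the entire file into a single BytesWritable value.
    public static class WholeFileRecordReader
            extends RecordReader<NullWritable, BytesWritable> {

        private FileSplit split;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() {
            return NullWritable.get();
        }

        @Override
        public BytesWritable getCurrentValue() {
            return value;
        }

        @Override
        public float getProgress() {
            return processed ? 1.0f : 0.0f;
        }

        @Override
        public void close() {
            // nothing to close; the stream is closed in nextKeyValue()
        }
    }
}

A driver would then call job.setInputFormatClass(WholeFileInputFormat.class),
giving one map task per image file, with the whole image as the map value.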

Regards

Bertrand

On Wed, Mar 6, 2013 at 4:43 AM, Julian Bui <[EMAIL PROTECTED]> wrote:

> Thanks Harsh,
>
> > Are your input lists big (for each compressed output)? And is the list
> > arbitrary or a defined list per goal?
>
> I dictate what my inputs will look like.  If they need to be a list of image
> files, then I can do that.  If they need to be the images themselves, as you
> suggest, then I can do that too, but I'm not exactly sure what that would
> look like.  Basically, I will try to format my inputs in the way that makes
> the most sense from a locality point of view.
>
> Since all the keys must be writable, I explored the Writable interface and
> found the interesting sub-classes:
>
>    - FileSplit
>    - BlockLocation
>    - BytesWritable
>
> These all look somewhat promising as they kind of reveal the location
> information of the files.
>
> I'm not exactly sure how I would use these to hint at the data locations.
> Since these chunks of the file appear to be somewhat arbitrary in size and
> offset, I don't know how I could perform imagery operations on them.  For
> example, even if I knew that bytes 0x100-0x400 lie on node X, it would be
> difficult for me to pass that information to my image libraries - does
> 0x100-0x400 correspond to some region/MBR within the image?  I'm not sure
> how to make use of this information.
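
For illustration only (not part of the original message): one way to see the
location information these classes expose is FileSystem.getFileBlockLocations(),
which returns one BlockLocation per block together with the hosts that store
it - the same kind of information the framework consults when it schedules a
map task for an InputSplit. A hedged sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[0]);  // e.g. an image file in HDFS
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block covers a byte range and lives on one or more hosts.
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
    }
}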
>
> The responses I've gotten so far indicate to me that HDFS kind of does the
> computation migration for me, but that I have to give it enough information
> to work with.  If someone could point me to some detailed reading on this
> subject, that would be pretty helpful, as I just can't find the
> documentation for it.
>
> Thanks again,
> -Julian
>
> On Tue, Mar 5, 2013 at 5:39 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>
>> Your concern is correct: if your input is a list of files, rather than
>> the files themselves, then the tasks will not be data-local - the task
>> input would just be the list of files, and the files' data may reside on
>> any node/rack of the cluster.
>>
>> However, your job will still run, since HDFS performs remote reads
>> transparently, without developer intervention, and everything will still
>> work as you've written it.  If a block happens to be on the same node as
>> the task, it is read locally as well - all of this is automatic.
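
A hedged sketch of that behaviour, assuming the input really is a text file
listing HDFS paths (CompressImageMapper and the compressToJpeg placeholder are
illustrative names only, not from this thread): the mapper opens each path
directly, and the read succeeds whether the block is local or remote.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CompressImageMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input value is assumed to be one HDFS path to a raw image.
        Path imagePath = new Path(value.toString().trim());
        FileSystem fs = imagePath.getFileSystem(context.getConfiguration());
        long len = fs.getFileStatus(imagePath).getLen();
        byte[] raw = new byte[(int) len];  // assumes the image fits in memory
        try (FSDataInputStream in = fs.open(imagePath)) {
            // HDFS serves these bytes from a local or a remote DataNode
            // transparently; the mapper code is identical either way.
            in.readFully(raw);
        }
        // compressToJpeg(raw) would be the actual image-compression call
        // (placeholder, not shown here).
        context.write(value, NullWritable.get());
    }
}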
>>
>> Are your input lists big (for each compressed output)? And is the list
>> arbitrary or a defined list per goal?
>>
>> On Tue, Mar 5, 2013 at 5:19 PM, Julian Bui <[EMAIL PROTECTED]> wrote:
>> > Hi hadoop users,
>> >
>> > I'm trying to find out if computation migration is something the
>> > developer needs to worry about or if it's supposed to be hidden.
>> >
>> > I would like to use Hadoop to take in a list of image paths in HDFS and
>> > then have each task compress these large, raw images into something much
>> > smaller - say JPEG files.
>> >
>> > Input: list of paths
>> > Output: compressed jpeg
>> >
>> > Since I don't really need a reduce task (I'm using Hadoop more for its
>> > reliability and orchestration aspects), my mapper ought to just take the
>> > list of image paths and then work on them.  As I understand it, each image
>> > will likely be on multiple data nodes.
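
A hedged sketch of that map-only setup (illustrative class names, not from the
original message): setting the number of reduce tasks to zero removes the
shuffle and reduce phases entirely, so each mapper's output is written
straight to HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressImagesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "compress-images");
        job.setJarByClass(CompressImagesJob.class);
        job.setMapperClass(CompressImageMapper.class);  // mapper sketched earlier
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(0);  // map-only: no shuffle, no reduce
        // Default TextInputFormat: each line of the input file is one image path.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}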
>> >
>> > My question is how will each mapper task "migrate the computation" to the
>> > data nodes?  I recall reading that the namenode is supposed to deal with