MapReduce, mail # user - Reading from HDFS from inside the mapper


Re: Reading from HDFS from inside the mapper
Sigurd Spieckermann 2012-09-17, 13:50
OK, I see... Is there any way to change this? I need a guaranteed order for
the map-side join to work correctly, and I need standalone mode for debugging
code that is executed on the mapper/reducer nodes.
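A minimal sketch of one possible workaround, assuming it is acceptable to list the part files and sort them by name yourself before handing them to the job; the directory "xyz" matches the listing example below, and the class name is a placeholder:

// Sketch: list the part files and impose a deterministic (lexicographic)
// order by name, since LocalFileSystem gives no ordering guarantee.
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SortedPartListing {
  public static Path[] listSorted(Configuration conf, Path dir) throws IOException {
    FileSystem fs = dir.getFileSystem(conf);
    FileStatus[] parts = fs.listStatus(dir);
    // Sort by file name so part-00000, part-00001, ... come back in natural order.
    Arrays.sort(parts, new Comparator<FileStatus>() {
      public int compare(FileStatus a, FileStatus b) {
        return a.getPath().getName().compareTo(b.getPath().getName());
      }
    });
    Path[] paths = new Path[parts.length];
    for (int i = 0; i < parts.length; i++) {
      paths[i] = parts[i].getPath();
    }
    return paths;
  }
}

// Usage (placeholder path): Path[] ordered = SortedPartListing.listSorted(conf, new Path("xyz"));

Sorting by file name restores the part-00000, part-00001, ... order regardless of what File#list returns.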

2012/9/17 Harsh J <[EMAIL PROTECTED]>

> Sigurd,
>
> The implementation of fs -ls in the LocalFileSystem relies on Java's
> File#list
> http://docs.oracle.com/javase/6/docs/api/java/io/File.html#list()
> which states "There is no guarantee that the name strings in the
> resulting array will appear in any specific order; they are not, in
> particular, guaranteed to appear in alphabetical order.". That may
> just be what is biting you, since standalone mode uses LFS.
>
> On Mon, Sep 17, 2012 at 6:45 PM, Sigurd Spieckermann
> <[EMAIL PROTECTED]> wrote:
> > I've tracked down the problem to only occur in standalone mode. In
> > pseudo-distributed mode, everything works fine. My underlying OS is Ubuntu
> > 12.04 64bit. When I access the directory in Linux directly, everything
> > looks normal. It's just when I access it through Hadoop. Has anyone seen
> > this problem before, and does anyone know a solution?
> >
> > Thanks,
> > Sigurd
> >
> >
> > 2012/9/17 Sigurd Spieckermann <[EMAIL PROTECTED]>
> >>
> >> I'm experiencing a strange problem right now. I'm writing part-files to
> >> HDFS providing initial data and (which should actually not make a
> >> difference anyway) I write them in ascending order, i.e. part-00000,
> >> part-00001, etc. -- in that order. But when I do "hadoop dfs -ls xyz",
> >> they are in the order part-00001, part-00000, part-00002, part-00003,
> >> etc. How is that possible? Why aren't they shown in natural order? The
> >> map-side join package also considers them in this order, which causes
> >> problems.
> >>
> >>
> >> 2012/9/10 Sigurd Spieckermann <[EMAIL PROTECTED]>
> >>>
> >>> OK, interesting. Just to confirm: is it okay to distribute quite large
> >>> files through the DistributedCache? Dataset B could be on the order of
> >>> gigabytes. Also, if I have many fewer nodes than elements/blocks in A,
> >>> then the probability that every node will have to read (almost) every
> >>> block of B is quite high, so given that the DC is okay here in general,
> >>> it would be more efficient to use the DC than to read from HDFS. What
> >>> about the case, though, where I have m*n nodes? Then every node would
> >>> receive all of B while only needing a small fraction, right? Could you
> >>> maybe elaborate on this in a few sentences just to be sure I understand
> >>> Hadoop correctly?
> >>>
> >>> Thanks,
> >>> Sigurd
> >>>
> >>> 2012/9/10 Harsh J <[EMAIL PROTECTED]>
> >>>>
> >>>> Sigurd,
> >>>>
> >>>> Hemanth's recommendation of DistributedCache does fit your requirement
> >>>> - it is a generic way of distributing files and archives to tasks of a
> >>>> job. It is not something that pushes things automatically in memory,
> >>>> but on the local disk of the TaskTracker your task runs on. You can
> >>>> choose to then use a LocalFileSystem impl. to read it out from there,
> >>>> which would end up being (slightly) faster than your same approach
> >>>> applied to MapFiles on HDFS.
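For reference, a minimal sketch of that flow with the old mapred API (Hadoop 1.x era); the cache file path and class names are placeholders, not something from the thread:

// Sketch of the DistributedCache flow: register a file in the driver,
// then resolve the local on-disk copy in the task.
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class CacheExample {

  // Driver side: register the (possibly large) HDFS file so the framework
  // copies it once per node onto the TaskTracker's local disk.
  public static void registerCache(JobConf conf) {
    DistributedCache.addCacheFile(URI.create("/user/sigurd/datasetB.seq"), conf);
  }

  // Task side: resolve the local copy in configure(); the actual map()
  // logic is omitted in this sketch.
  public static class SideDataMapper extends MapReduceBase {
    private Path localB;

    @Override
    public void configure(JobConf job) {
      try {
        Path[] localFiles = DistributedCache.getLocalCacheFiles(job);
        if (localFiles != null && localFiles.length > 0) {
          localB = localFiles[0]; // local path on the TaskTracker's disk
        }
      } catch (IOException e) {
        throw new RuntimeException("Could not resolve cache files", e);
      }
    }
  }
}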
> >>>>
> >>>> On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann
> >>>> <[EMAIL PROTECTED]> wrote:
> >>>> > I checked DistributedCache, but in general I have to assume that none
> >>>> > of the datasets fits in memory... That's why I was considering the
> >>>> > map-side join, but by default it doesn't fit my problem. I could
> >>>> > probably get it to work, though, but I would have to enforce the
> >>>> > requirements of the map-side join.
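For context, those requirements are that all join inputs are sorted by key and partitioned identically (same partitioner, same number of partitions). A minimal driver sketch using the mapred join package under that assumption; the input paths, property usage, and class name are illustrative placeholders:

// Sketch of a map-side join setup with the old mapred join package.
// Assumes /data/A and /data/B were produced with the same partitioner,
// the same number of partitions, and sorted keys.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MapSideJoinDriver {
  public static JobConf configureJoin(JobConf conf) {
    conf.setInputFormat(CompositeInputFormat.class);
    // Inner join of the two inputs; joined tuples are delivered directly
    // to the mapper, so no reduce phase is needed for the join itself.
    conf.set("mapred.join.expr",
        CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
            "/data/A", "/data/B"));
    conf.setNumReduceTasks(0);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    return conf;
  }
}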
> >>>> >
> >>>> >
> >>>> > 2012/9/10 Hemanth Yamijala <[EMAIL PROTECTED]>
> >>>> >>
> >>>> >> Hi,
> >>>> >>
> >>>> >> You could check DistributedCache
> >>>> >> (http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
> >>>> >> It would allow you to distribute data to the nodes where your tasks