MapReduce >> mail # user >> Reading from HDFS from inside the mapper


Re: Reading from HDFS from inside the mapper
OK, I see... Is there any way to change this? I need a guaranteed order
for the map-side join to work correctly, and I need standalone mode for
debugging code that is executed on the mapper/reducer nodes.
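One way to sketch a workaround, assuming the part-files keep the zero-padded names mentioned later in the thread: sort the listing yourself before handing it to the join, since zero-padded names sort lexicographically into numeric order. This is only an illustration, not something proposed in the thread, and the class/method names are hypothetical:

```java
import java.util.Arrays;

// Sketch: restore deterministic order for part-file names such as
// part-00000, part-00001, ... Because the numbers are zero-padded,
// plain lexicographic sorting equals numeric sorting, so Arrays.sort
// is enough; no custom comparator is needed.
public class SortPartFiles {
    public static String[] sorted(String[] names) {
        String[] copy = names.clone(); // don't mutate the caller's array
        Arrays.sort(copy);             // lexicographic == numeric here
        return copy;
    }

    public static void main(String[] args) {
        // The order reported by "hadoop dfs -ls" in the thread below
        String[] listed = {"part-00001", "part-00000", "part-00002", "part-00003"};
        System.out.println(Arrays.toString(sorted(listed)));
        // -> [part-00000, part-00001, part-00002, part-00003]
    }
}
```

Sorting at the point of use removes the dependence on whatever order the underlying filesystem happens to return.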

2012/9/17 Harsh J <[EMAIL PROTECTED]>

> Sigurd,
>
> The implementation of fs -ls in the LocalFileSystem relies on Java's
> File#list
> http://docs.oracle.com/javase/6/docs/api/java/io/File.html#list()
> which states "There is no guarantee that the name strings in the
> resulting array will appear in any specific order; they are not, in
> particular, guaranteed to appear in alphabetical order.". That may
> just be what is biting you, since standalone mode uses LFS.
>
> On Mon, Sep 17, 2012 at 6:45 PM, Sigurd Spieckermann
> <[EMAIL PROTECTED]> wrote:
> > I've tracked down the problem to only occur in standalone mode. In
> > pseudo-distributed mode, everything works fine. My underlying OS is
> > Ubuntu 12.04 64bit. When I access the directory in Linux directly,
> > everything looks normal. It's just when I access it through Hadoop.
> > Has anyone seen this problem before and knows a solution?
> >
> > Thanks,
> > Sigurd
> >
> >
> > 2012/9/17 Sigurd Spieckermann <[EMAIL PROTECTED]>
> >>
> >> I'm experiencing a strange problem right now. I'm writing part-files
> >> providing initial data to HDFS and (which should actually not make a
> >> difference anyway) write them in ascending order, i.e. part-00000,
> >> part-00001 etc. -- in that order. But when I do "hadoop dfs -ls xyz",
> >> they are in the order part-00001, part-00000, part-00002, part-00003
> >> etc. How is that possible? Why aren't they shown in natural order?
> >> Also the map-side join package considers them in this order, which
> >> causes problems.
> >>
> >>
> >> 2012/9/10 Sigurd Spieckermann <[EMAIL PROTECTED]>
> >>>
> >>> OK, interesting. Just to confirm: is it okay to distribute quite
> >>> large files through the DistributedCache? Dataset B could be on the
> >>> order of gigabytes. Also, if I have much fewer nodes than
> >>> elements/blocks in A, then the probability that every node will have
> >>> to read (almost) every block of B is quite high, so given that DC is
> >>> okay here in general, it would be more efficient to use DC than to
> >>> read from HDFS. How about the case, though, where I have m*n nodes?
> >>> Then every node would receive all of B while only needing a small
> >>> fraction, right? Could you maybe elaborate on this in a few
> >>> sentences just to be sure I understand Hadoop correctly?
> >>>
> >>> Thanks,
> >>> Sigurd
> >>>
> >>> 2012/9/10 Harsh J <[EMAIL PROTECTED]>
> >>>>
> >>>> Sigurd,
> >>>>
> >>>> Hemanth's recommendation of DistributedCache does fit your requirement
> >>>> - it is a generic way of distributing files and archives to tasks of a
> >>>> job. It is not something that pushes things automatically in memory,
> >>>> but on the local disk of the TaskTracker your task runs on. You can
> >>>> choose to then use a LocalFileSystem impl. to read it out from there,
> >>>> which would end up being (slightly) faster than your same approach
> >>>> applied to MapFiles on HDFS.
> >>>>
> >>>> On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann
> >>>>
> >>>> <[EMAIL PROTECTED]> wrote:
> >>>> > I checked DistributedCache, but in general I have to assume that
> >>>> > none of the datasets fits in memory... That's why I was
> >>>> > considering map-side join, but by default it doesn't fit my
> >>>> > problem. I could probably get it to work, though, but I would
> >>>> > have to enforce the requirements of the map-side join.
> >>>> >
> >>>> >
> >>>> > 2012/9/10 Hemanth Yamijala <[EMAIL PROTECTED]>
> >>>> >>
> >>>> >> Hi,
> >>>> >>
> >>>> >> You could check DistributedCache
> >>>> >> (http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache).
> >>>> >> It would allow you to distribute data to the nodes where your tasks
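Harsh's point above about java.io.File#list() can be demonstrated with plain Java, no Hadoop required. A minimal sketch (class name hypothetical): the order returned by list() is unspecified, but sorting the result afterwards makes it deterministic, which is what standalone mode's LocalFileSystem listing would need for the part-files to come back in natural order.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;

// Demo: File#list() makes no ordering guarantee, so a consumer that
// needs a specific order (like a map-side join over part-files) must
// sort the names itself.
public class ListOrderDemo {
    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("listdemo").toFile();
        // Create files in natural creation order: part-00000 .. part-00003
        for (int i = 0; i < 4; i++) {
            new File(dir, String.format("part-%05d", i)).createNewFile();
        }
        String[] names = dir.list(); // order here is filesystem-dependent
        Arrays.sort(names);          // after sorting, the order is guaranteed
        System.out.println(Arrays.toString(names));
        // -> [part-00000, part-00001, part-00002, part-00003]
    }
}
```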