Sigurd Spieckermann
2012-09-10, 09:57
Hemanth Yamijala
2012-09-10, 10:06
Harsh J
2012-09-10, 11:41
Sigurd Spieckermann
2012-09-10, 11:54
Sigurd Spieckermann
2012-09-17, 12:47
Sigurd Spieckermann
2012-09-17, 13:15
Harsh J
2012-09-17, 13:46
Sigurd Spieckermann
2012-09-17, 13:50
-
Reading from HDFS from inside the mapper
Sigurd Spieckermann 2012-09-10, 09:57
Hi,

I would like to perform a map-side join of two large datasets where dataset A consists of m*n elements and dataset B consists of n elements. For the join, every element in dataset B needs to be accessed m times. Each mapper would join one element from A with the corresponding element from B. Elements here are actually data blocks.

Is there a performance problem (and difference compared to a slightly modified map-side join using the join-package) if I set dataset A as the map-reduce input and load the relevant element from dataset B directly from the HDFS inside the mapper? I could store the elements of B in a MapFile for faster random access. In the second case without the join-package I would not have to partition the datasets manually which would allow a bit more flexibility, but I'm wondering if HDFS access from inside a mapper is strictly bad. Also, does Hadoop have a cache for such situations by any chance?

I appreciate any comments!

Sigurd
-
Re: Reading from HDFS from inside the mapper
Hemanth Yamijala 2012-09-10, 10:06
Hi,

You could check DistributedCache (http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache). It would allow you to distribute data to the nodes where your tasks are run.

Thanks
Hemanth
-
Re: Reading from HDFS from inside the mapper
Harsh J 2012-09-10, 11:41
Sigurd,

Hemanth's recommendation of DistributedCache does fit your requirement - it is a generic way of distributing files and archives to tasks of a job. It is not something that pushes things automatically in memory, but on the local disk of the TaskTracker your task runs on. You can choose to then use a LocalFileSystem impl. to read it out from there, which would end up being (slightly) faster than your same approach applied to MapFiles on HDFS.

On Mon, Sep 10, 2012 at 4:15 PM, Sigurd Spieckermann <[EMAIL PROTECTED]> wrote:
> I checked DistributedCache, but in general I have to assume that none of the
> datasets fits in memory... That's why I was considering map-side join, but
> by default it doesn't fit to my problem. I could probably get it to work
> though, but I would have to enforce the requirements of the map-side join.

--
Harsh J
-
Re: Reading from HDFS from inside the mapper
Sigurd Spieckermann 2012-09-10, 11:54
OK, interesting. Just to confirm: is it okay to distribute quite large files through the DistributedCache? Dataset B could be on the order of gigabytes.

Also, if I have far fewer nodes than elements/blocks in A, then the probability that every node will have to read (almost) every block of B is quite high, so, given that DC is okay here in general, it would be more efficient to use DC than to read from HDFS. But what about the case where I have m*n nodes? Then every node would receive all of B while only needing a small fraction, right? Could you maybe elaborate on this in a few sentences, just to be sure I understand Hadoop correctly?

Thanks,
Sigurd
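The trade-off raised in this message can be made concrete with a rough estimate. The sketch below is not from the thread: the class and method names are hypothetical, and the model rests on loudly simplifying assumptions, namely that the DistributedCache copies all n blocks of B to each node once per job, that each of the m*n map tasks reads exactly one block of B from HDFS, and that replica locality and OS caching are ignored.

```java
/**
 * Back-of-the-envelope comparison of the data moved by the two approaches.
 * Hypothetical helper; the cost model is an assumption, not a measurement.
 */
final class TransferEstimate {

    /** DistributedCache: every node localizes all n blocks of B once (assumed). */
    static long distributedCacheBytes(long nodes, long n, long blockBytes) {
        return nodes * n * blockBytes;
    }

    /** Direct reads: each of the m*n map tasks fetches one block of B (assumed). */
    static long directReadBytes(long m, long n, long blockBytes) {
        return m * n * blockBytes;
    }

    public static void main(String[] args) {
        long m = 100, n = 50;
        long blockBytes = 64L * 1024 * 1024; // assumed 64 MiB per element of B

        // Few nodes (10 < m): shipping all of B to every node moves less data.
        System.out.println(distributedCacheBytes(10, n, blockBytes)
                < directReadBytes(m, n, blockBytes)); // prints true

        // One node per task (m*n nodes): the cache moves n times more data.
        System.out.println(distributedCacheBytes(m * n, n, blockBytes)
                / directReadBytes(m, n, blockBytes)); // prints 50
    }
}
```

Under these assumptions the cache moves less data whenever the number of nodes is smaller than m, and n times more data when there is one node per element of A, which matches the intuition in the question.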
-
Re: Reading from HDFS from inside the mapper
Sigurd Spieckermann 2012-09-17, 12:47
I'm experiencing a strange problem right now. I'm writing part-files to the HDFS providing initial data and (which should actually not make a difference anyway) write them in ascending order, i.e. part-00000, part-00001 etc. -- in that order. But when I do "hadoop dfs -ls xyz", they are in the order part-00001, part-00000, part-00002, part-00003 etc. How is that possible? Why aren't they shown in natural order? Also the map-side join package considers them in this order which causes problems.
-
Re: Reading from HDFS from inside the mapper
Sigurd Spieckermann 2012-09-17, 13:15
I've tracked the problem down: it only occurs in standalone mode. In pseudo-distributed mode, everything works fine. My underlying OS is Ubuntu 12.04 64bit. When I access the directory in Linux directly, everything looks normal. The order is only wrong when I access it through Hadoop. Has anyone seen this problem before and know of a solution?

Thanks,
Sigurd
-
Re: Reading from HDFS from inside the mapper
Harsh J 2012-09-17, 13:46
Sigurd,

The implementation of fs -ls in the LocalFileSystem relies on Java's File#list http://docs.oracle.com/javase/6/docs/api/java/io/File.html#list() which states "There is no guarantee that the name strings in the resulting array will appear in any specific order; they are not, in particular, guaranteed to appear in alphabetical order.". That may just be what is biting you, since standalone mode uses LFS.

Harsh J
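The File#list behaviour quoted above can be illustrated with plain Java. The following is a minimal sketch, not something proposed in the thread: the class SortedListing and its method are hypothetical names, and sorting in your own code does not change what Hadoop's LocalFileSystem or the join package do internally. It only shows that code which enumerates a directory itself has to impose the order explicitly.

```java
import java.io.File;
import java.util.Arrays;

/** Hypothetical helper: lists a local directory in a deterministic order. */
final class SortedListing {

    /**
     * Returns the entry names of dir in lexicographic order.
     * File#list makes no ordering promise, so the array is sorted here.
     */
    static String[] sortedNames(File dir) {
        String[] names = dir.list(); // null if dir is not a directory or on I/O error
        if (names == null) {
            return new String[0];
        }
        Arrays.sort(names); // natural String order
        return names;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (String name : sortedNames(dir)) {
            System.out.println(name);
        }
    }
}
```

Because part-file names such as part-00000 are zero-padded to a fixed width, lexicographic order coincides with numeric order for them.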
-
Re: Reading from HDFS from inside the mapper
Sigurd Spieckermann 2012-09-17, 13:50
OK, I see... Is there any way to change this? I need a guaranteed order for the map-side join to work correctly, and I need standalone mode for debugging code that is executed on the mapper/reducer nodes.