|
Ondřej Klimpera
2012-03-29, 15:05
Deniz Demir
2012-03-29, 15:43
Ondřej Klimpera
2012-03-29, 18:26
Ondřej Klimpera
2012-03-30, 10:07
Ioan Eugen Stan
2012-03-30, 10:49
Ondřej Klimpera
2012-03-30, 11:15
Ondřej Klimpera
2012-03-30, 11:30
Ioan Eugen Stan
2012-04-02, 09:34
Ondřej Klimpera
2012-04-02, 10:00
Ioan Eugen Stan
2012-04-02, 11:01
|
-
Working with MapFiles
Ondřej Klimpera 2012-03-29, 15:05
Hello,
I have a MapFile as the product of a MapReduce job, and what I need to do is:

1. If MapReduce produced more than one split as output, merge them into a single file.
2. Copy this merged MapFile to another HDFS location and use it as a distributed cache file for another MapReduce job.

I'm wondering whether it is even possible to merge MapFiles, given their nature, and use them as a distributed cache file.

What I'm trying to achieve is repeated fast lookups in this file during another MapReduce job. If my idea is completely wrong, can you give me any tips on how to do it?

The file is expected to be 20 MB in size. I'm using Hadoop 0.20.203.

Thanks for your reply :)

Ondrej Klimpera
-
Re: Working with MapFiles
Deniz Demir 2012-03-29, 15:43
Not sure if this helps in your use case, but you can put all the output files into the distributed cache and then access them in the subsequent MapReduce job. In the driver code:

    // previous MR job's output
    // fs is the job's FileSystem, e.g. FileSystem.get(job.getConfiguration())
    String pstr = "hdfs://<output_path>/";
    FileStatus[] files = fs.listStatus(new Path(pstr));
    for (FileStatus f : files) {
        if (!f.isDir()) {
            DistributedCache.addCacheFile(f.getPath().toUri(), job.getConfiguration());
        }
    }

I think you can also copy these files to a different location in DFS and then put them into the distributed cache.

Deniz
-
Re: Working with MapFiles
Ondřej Klimpera 2012-03-29, 18:26
Thanks for your fast reply, I'll try this approach:)
-
Re: Working with MapFiles
Ondřej Klimpera 2012-03-30, 10:07
Hello, I've got one more question: how is the seek() (or get()) method implemented in MapFile.Reader? Does it use hashCode(), compareTo(), or another mechanism to find a match in the MapFile's index?

Thanks for your reply.

Ondrej Klimpera
-
Re: Working with MapFiles
Ioan Eugen Stan 2012-03-30, 10:49
Hello Ondrej,

On 29.03.2012 18:05, Ondřej Klimpera wrote:
> I'm wondering if it is even possible to merge MapFiles according to
> their nature and use them as Distributed cache file.

A MapFile is actually two files [1]: one SequenceFile (with sorted keys) and a small index for that file. The MapFile reader does a version of binary search to find your key and performs seek() to go to the byte offset in the file.

> What I'm trying to achieve is repeatedly fast search in this file during
> another MapReduce job.
> If my idea is absolute wrong, can you give me any tip how to do it?
>
> The file is supposed to be 20MB large.
> I'm using Hadoop 0.20.203.

If the file is that small, you could load it all into memory to avoid network I/O. Do that in the setup() method of the MapReduce job.

The distributed cache will also use HDFS [2], and I don't think it will provide you with any benefits.

[1] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
[2] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html

--
Ioan Eugen Stan
http://ieugen.blogspot.com
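The lookup described in this reply (binary search over a small index, then a seek into the sorted data) can be illustrated with a minimal plain-Java sketch. This is a toy model under stated assumptions, not Hadoop's MapFile.Reader code: the class name SparseIndexLookup, the in-memory arrays standing in for the data file, and the interval parameter are all made up for illustration. It also suggests an answer to the earlier hashCode()/compareTo() question: a lookup of this kind relies on key ordering (comparison), not on hashing.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a MapFile-style lookup: sorted data plus a sparse index. */
public class SparseIndexLookup {
    private final String[] keys;    // sorted keys, standing in for the data file
    private final String[] values;  // values, parallel to keys
    private final List<String> indexKeys = new ArrayList<>();  // every Nth key
    private final List<Integer> indexPos = new ArrayList<>();  // its position in the data

    /** sortedKeys must be in ascending order; interval (>= 1) is an assumed index spacing. */
    public SparseIndexLookup(String[] sortedKeys, String[] vals, int interval) {
        this.keys = sortedKeys;
        this.values = vals;
        for (int i = 0; i < sortedKeys.length; i += interval) {
            indexKeys.add(sortedKeys[i]);
            indexPos.add(i);
        }
    }

    /** Returns the value for key, or null if the key is absent. */
    public String get(String key) {
        // Step 1: binary search the index for the last entry <= key.
        int lo = 0, hi = indexKeys.size() - 1, slot = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexKeys.get(mid).compareTo(key) <= 0) {
                slot = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (slot < 0) {
            return null;  // key sorts before the first key in the data
        }
        // Step 2: "seek" to the indexed position and scan forward in sorted order.
        for (int i = indexPos.get(slot); i < keys.length; i++) {
            int c = keys[i].compareTo(key);
            if (c == 0) {
                return values[i];
            }
            if (c > 0) {
                break;  // passed the spot where the key would be
            }
        }
        return null;
    }
}
```

Because the index holds only every Nth key, it stays small enough to keep in memory, while the forward scan after the seek touches at most one interval of records.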
-
Re: Working with MapFiles
Ondřej Klimpera 2012-03-30, 11:15
Hello,
I'm not sure what you mean by using the MapReduce setup() method:

"If the file is that small you could load it all in memory to avoid network IO. Do that in the setup() method of the map reduce job."

Can you please explain a little bit more?

Thanks
-
Re: Working with MapFiles
Ondřej Klimpera 2012-03-30, 11:30
And one more question: is it even possible to add a MapFile (as it consists of an index file and a data file) to the distributed cache?

Thanks
-
Re: Working with MapFiles
Ioan Eugen Stan 2012-04-02, 09:34
Hi Ondrej,

On 30.03.2012 14:30, Ondřej Klimpera wrote:
> And one more question, is it even possible to add a MapFile (as it
> consists of index and data file) to Distributed cache?
> Thanks

Should be no problem, they are just two files.

> On 03/30/2012 01:15 PM, Ondřej Klimpera wrote:
>> I'm not sure what you mean by using map reduce setup()?
>>
>> "If the file is that small you could load it all in memory to avoid
>> network IO. Do that in the setup() method of the map reduce job."
>>
>> Can you please explain little bit more?

Check the javadocs [1]: setup() is called once per task, so you can read the file from HDFS then or perform other initialization.

[1] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Mapper.html

Reading 20 MB into RAM should not be a problem and is preferred if you need to make many requests against that data. It really depends on your use case, so think carefully or just go ahead and test it.

--
Ioan Eugen Stan
http://ieugen.blogspot.com
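The load-once pattern suggested here can be sketched in plain Java. This is only a model, with Hadoop types deliberately left out so that it runs stand-alone: the class InMemoryLookup, its setup() and lookup() methods, and the tab-separated input format are assumptions for illustration. In a real job the loading step would sit in the mapper's setup() method and read the actual file, and the per-record lookups would happen inside map().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Model of "load the small file once per task, then query it per record". */
public class InMemoryLookup {
    private final Map<String, String> table = new HashMap<>();

    /** One-time load; assumes one tab-separated key/value pair per line. */
    public void setup(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue;  // skip lines without a key/value separator
                }
                table.put(line.substring(0, tab), line.substring(tab + 1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Called many times per task; served from memory, no further file or network I/O. */
    public String lookup(String key) {
        return table.get(key);
    }

    /** Number of entries loaded. */
    public int size() {
        return table.size();
    }
}
```

The trade-off is memory for speed: the whole file is held in every task's heap, which is reasonable at around 20 MB but worth testing against the real data, as the reply says.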
-
Re: Working with MapFiles
Ondřej Klimpera 2012-04-02, 10:00
Ok, thanks.
I missed the setup() method because I'm using an older version of Hadoop, so I suppose that the configure() method does the same in Hadoop 0.20.203.

Now I'm able to load a MapFile inside the configure() method into a MapFile.Reader instance held as a private class variable, and it all works fine. I'm just wondering whether the MapFile is replicated on HDFS and the data are read locally, or whether reading from this file will increase network traffic because its data are fetched from another node in the Hadoop cluster.

Hopefully my last question: is reading files from the DistributedCache (a normal text file) limited to a particular job? Before running a job I add a file to the DistributedCache. When getting the file in the Reducer implementation, can it access DistributedCache files from other jobs? In other words, what will this code list:

    // Reducer impl.
    public void configure(JobConf job) {
        URI[] distCacheFileUris = DistributedCache.getCacheFiles(job);
    }

Will the distCacheFileUris variable contain only the URIs for this job, or for any job running on the Hadoop cluster?

Hope it's understandable.
Thanks.
-
Re: Working with MapFiles
Ioan Eugen Stan 2012-04-02, 11:01
Hi Ondrej,

On 02.04.2012 13:00, Ondřej Klimpera wrote:
> Ok, thanks.
>
> I missed setup() method because of using older version of hadoop, so I
> suppose that method configure() does the same in hadoop 0.20.203.

Aha. If it's possible, try upgrading. I don't know how good the support is for versions older than the Hadoop 0.20 branch.

> Now I'm able to load a map file inside configure() method to
> MapFile.Reader instance as a class private variable, all works fine,
> just wondering if the MapFile is replicated on HDFS and data are read
> locally, or if reading from this file will increase the network
> bandwidth because of getting it's data from another computer node in the
> hadoop cluster.

You could use a method-local variable instead of a private class variable if you load the file.

If the MapFile is written to HDFS then yes, it is replicated, and you can configure the replication factor at file creation (and maybe later).

If you use DistributedCache, then the files are not written to HDFS but to the mapred.local.dir [1] folder on every node. The folder size is configurable, so it's possible that the data will still be available there for the next MR job, but don't rely on this.

Please read the docs, I may get things wrong. RTFM will save your life ;).

[1] http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata
[2] https://forums.aws.amazon.com/message.jspa?messageID=152538

> Hopefully last question to bother you is, if reading files from
> DistributedCache (normal text file) is limited to particular job.
> Before running a job I add a file to DistCache. When getting the file in
> Reducer implementation, can it access DistCache files from another jobs?
> In another words what will list this command:
>
> //Reducer impl.
> public void configure(JobConf job) {
>
> URI[] distCacheFileUris = DistributedCache.getCacheFiles(job);
>
> }
>
> will the distCacheFileUris variable contain only URIs for this job, or
> for any job running on Hadoop cluster?
>
> Hope it's understandable.
> Thanks.

It's

--
Ioan Eugen Stan
http://ieugen.blogspot.com |