RE: Hadoop--store a sequence file in distributed cache?

Sofia,

I was about to say that if your file is already on HDFS, you should just be able to open it.
But as I type this, I have something kicking me in the back of the head reminding me that you may not be able to access an HDFS file at the same time someone else is accessing it. (Going from memory: is there an exclusive lock on the file when you open it in HDFS?)

If not, you can just use your file.
If so, you will need to use the distributed cache, which copies the file to a local directory on each node running the task. Within your task you then query the distributed cache for your file and get its local path so you can open it.
Depending on the size of your index (which can get large), you will want to open the file once and just seek back to the beginning whenever you need to re-read it.
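
For reference, pulling a cached file back out inside a task looks roughly like the sketch below. This is against the 0.20-era DistributedCache API, and the HDFS path and job name are made up, so adjust to your own setup:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    // Driver side: register the file (already on HDFS) with the
    // distributed cache. The path below is hypothetical.
    Job job = new Job(new Configuration(), "rtree-queries");
    DistributedCache.addCacheFile(
        new URI("/user/sofia/rtree/part-r-00000"), job.getConfiguration());

    // Mapper side, inside setup(Context context): query the cache for
    // the local copies and open one by its local path.
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    FSDataInputStream in =
        FileSystem.getLocal(context.getConfiguration()).open(cached[0]);

Opened once in setup(), the stream can then be reused across map() calls; FSDataInputStream.seek(0) gets you back to the beginning of the file.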

My suggestion is to consider putting your RTree into HBase, so that HBase holds your index.
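
With the index in HBase, each mapper can do point lookups instead of scanning files. A rough sketch with the 0.90-era client API; the "rtree_index" table and the "node" column family are hypothetical names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "rtree_index");  // hypothetical table
    Get get = new Get(Bytes.toBytes("root-0"));      // row key = node id
    Result result = table.get(get);
    byte[] mbr = result.getValue(Bytes.toBytes("node"), Bytes.toBytes("mbr"));
    table.close();
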
> Date: Sat, 13 Aug 2011 03:02:32 -0700
> From: [EMAIL PROTECTED]
> Subject: Re: Hadoop--store a sequence file in distributed cache?
> To: [EMAIL PROTECTED]
>
> Good morning,
>
> I am a little confused, I have to say.
>
> A summary of the project first: I want to examine how an Rtree on HDFS would speed up spatial queries such as point/range queries, which normally target a very small part of the original input.
>
> I have built my Rtree on HDFS, and now I need to answer queries using it. I thought I could write an MR job that takes as input a text file where each line is a query (say we have 20000 queries). To answer the queries efficiently, I need to check some information about the root nodes of the tree, which is stored in R files (R = the number of reducers of the previous job). These files are small and are read by every mapper, so the idea of the distributed cache fits, right?
>
> I have built an ArrayList during setup() so that I avoid opening all the files in the distributed cache and open only 3-4 of them, for example. I agree, though, that opening and closing these files so many times is a significant overhead. I think, however, that opening these files from HDFS rather than from the distributed cache would be even worse, since file access operations in HDFS are much more "expensive" than accessing files locally.
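>
> Concretely, that setup() pass over the cached files might look roughly like the sketch below (old-style SequenceFile reader API; IntWritable/Text are just stand-ins for whatever key/value classes the previous job actually wrote):
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.filecache.DistributedCache;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
>
> public class QueryMapper extends Mapper<LongWritable, Text, Text, Text> {
>     // Root-node entries, read once in setup() and reused by map().
>     private final List<String> rootInfo = new ArrayList<String>();
>
>     @Override
>     protected void setup(Context context)
>             throws IOException, InterruptedException {
>         Configuration conf = context.getConfiguration();
>         for (Path p : DistributedCache.getLocalCacheFiles(conf)) {
>             SequenceFile.Reader reader =
>                 new SequenceFile.Reader(FileSystem.getLocal(conf), p, conf);
>             IntWritable key = new IntWritable();
>             Text value = new Text();
>             while (reader.next(key, value)) {
>                 rootInfo.add(value.toString());
>             }
>             reader.close();
>         }
>     }
> }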
>
> Thank you all for your response, I would be glad to have more feedback.
> Sofia
>
> ________________________________
> From: "GOEKE, MATTHEW (AG/1000)" <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Sent: Friday, August 12, 2011 7:05 PM
> Subject: RE: Hadoop--store a sequence file in distributed cache?
>
> Sofia, correct me if I am wrong, but Mike, I think this thread was about using the output of a previous job (in this case already in sequence file format) as in-memory join data for another job.
>
> Side note: does anyone know what the rule of thumb on file size is when using the distributed cache vs. just reading from HDFS (join data, not binary files)? I always thought that having a setup phase in a mapper read directly from HDFS was asking for trouble, and that you should always distribute to each node, but I am hearing more and more people say to just read directly from HDFS for larger files to avoid the IO cost of the distributed cache.
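>
> (For comparison, the read-directly-from-HDFS variant of that setup phase is just the sketch below; the path is hypothetical. Every concurrent task then hits the namenode/datanodes instead of a local disk, which is the IO trade-off in question.)
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> // Inside Mapper.setup(Context context): open the shared file straight
> // off HDFS; N concurrent tasks mean N remote reads of the same blocks.
> Configuration conf = context.getConfiguration();
> FileSystem fs = FileSystem.get(conf);
> FSDataInputStream in =
>     fs.open(new Path("/user/matt/join-data/part-r-00000"));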
>
> Matt
>
> -----Original Message-----
> From: Ian Michael Gumby [mailto:[EMAIL PROTECTED]]
> Sent: Friday, August 12, 2011 10:54 AM
> To: [EMAIL PROTECTED]
> Subject: RE: Hadoop--store a sequence file in distributed cache?
>
>
> This whole thread doesn't make a lot of sense.
>
> If your first m/r job creates the sequence files, which you then use as input files to your second job, you don't need the distributed cache, since the output of the first m/r job is already in HDFS.
> (Dino is correct on that account.)
>
> Sofia replied saying that she needed to open and close the sequence file to access the data in each Mapper.map() call.
> Without knowing more about the specific app, Ashook is correct that you could read the file once in Mapper.setup() and then access it in memory.