You understood the scenario correctly.
I see your rationale, and thanks for your suggestions.
To better explain the problem and my point of view, let me give an example.
I want to read two files. In the first one, each row has the form
Airport_Id, User_Id, Time
and indicates a user's position in an airport at a specific time. This
file is very large.
The second file summarizes the flight timetable with the respective airports.
Now, I want a job that takes the first file as input and computes all
the possible flights a user may have taken.
My solution, following what I wrote in the previous mails, would be to
emit tuples from the first file partitioned by Airport_Id.
Then we know that all the tuples with the same Airport_Id go to the same
reducer, and in each reducer we can load in memory only the part of the
second file related to the airports whose keys that reducer receives.
I think this is much faster than performing an MR join, right?
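To make the idea concrete, here is a minimal sketch of the reduce-side filtering described above. It runs outside Hadoop: the class and method names, the file contents, and the comma-separated line layout are illustrative assumptions, but the partition formula is the one used by Hadoop's default HashPartitioner, so the lines kept here are exactly the ones whose Airport_Id would be routed to the given reducer.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: each reducer keeps in memory only the timetable lines whose
// Airport_Id the default HashPartitioner would route to it.
// Names and data below are illustrative, not from the real job.
public class TimetableFilter {

    // Same formula as Hadoop's default HashPartitioner.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Keep only the timetable lines relevant to this reducer's partition.
    static List<String> loadForReducer(List<String> timetableLines,
                                       int myPartition, int numReducers) {
        List<String> kept = new ArrayList<>();
        for (String line : timetableLines) {
            // Assumption: Airport_Id is the first comma-separated field.
            String airportId = line.split(",")[0];
            if (partitionFor(airportId, numReducers) == myPartition) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> timetable = List.of(
            "JFK,LHR,08:00", "LHR,JFK,12:30", "CDG,JFK,09:15");
        int numReducers = 4;
        for (int p = 0; p < numReducers; p++) {
            System.out.println("reducer " + p + " keeps "
                + loadForReducer(timetable, p, numReducers));
        }
    }
}
```

In a real reducer this filtering would happen in setup(), with `myPartition` read from the job configuration, so each reducer holds only its own slice of the timetable.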
On 29 March 2013 04:47, Hemanth Yamijala <[EMAIL PROTECTED]> wrote:
> The way I understand your requirement - you have a file that contains a set
> of keys. You want to read this file on every reducer and take only those
> entries of the set, whose keys correspond to the current reducer.
> If the above summary is correct, can I assume that you are potentially
> reading the entire intermediate output key space on every reducer? Would
> that even work (considering memory constraints, etc.)?
> It seemed to me that your solution is implementing what the framework can
> already do for you. That was the rationale behind my suggestion. Maybe you
> should try and implement both approaches to see which one works better for
> you.
> On Thu, Mar 28, 2013 at 6:37 PM, Alberto Cordioli
> <[EMAIL PROTECTED]> wrote:
>> Yes, that is a possible solution.
>> But the MR job also has another purpose: the mappers already read other
>> (very large) files and output tuples.
>> You cannot control the number of mappers, so the risk is that many
>> mappers are created and each of them also reads the other file, instead
>> of only a small number of reducers doing so.
>> Do you think that the solution I proposed is not so elegant or efficient?
>> On 28 March 2013 13:12, Hemanth Yamijala <[EMAIL PROTECTED]> wrote:
>> > Hmm. That feels like a join. Can't you read the input file on the map
>> > side and output those keys along with the original map output keys?
>> > That way the reducer would automatically get both together?
>> > On Thu, Mar 28, 2013 at 5:20 PM, Alberto Cordioli
>> > <[EMAIL PROTECTED]> wrote:
>> >> Hi Hemanth,
>> >> thanks for your reply.
>> >> Yes, this partially answered my question. I know how the hash
>> >> partitioner works, and I guessed something similar.
>> >> The piece I was missing was that mapred.task.partition returns the
>> >> partition number of the reducer.
>> >> So, putting all the pieces together, I understand that for each key in
>> >> the file I have to call the HashPartitioner.
>> >> Then I have to compare the returned index with the one retrieved by
>> >> Configuration.getInt("mapred.task.partition").
>> >> If they are equal, then such a key will be served by that reducer. Is
>> >> this correct?
>> >> To answer your question:
>> >> On the reduce side of an MR job, I want to load some data from a file
>> >> into an in-memory structure. Actually, I don't need to store the whole
>> >> file in each reducer, but only the lines related to the keys that a
>> >> particular reducer will receive.
>> >> So, my intention is to know the keys in the setup method, so as to
>> >> store only the needed lines.
>> >> Thanks,
>> >> Alberto
>> >> On 28 March 2013 11:01, Hemanth Yamijala <[EMAIL PROTECTED]>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > Not sure if I am answering your question, but this is the background.
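The key-to-reducer check discussed in the quoted thread can be sketched as follows. This is a standalone simulation, not a Hadoop job: in a real reducer `myPartition` would come from `Configuration.getInt("mapred.task.partition", 0)` (as mentioned above), while here it is passed in directly. The formula is the one Hadoop's default HashPartitioner uses; everything else (class name, sample keys) is illustrative.

```java
// Sketch of the check from the thread: a key belongs to the current
// reducer iff the default HashPartitioner would route it here.
public class PartitionCheck {

    static boolean servedByThisReducer(String key, int myPartition,
                                       int numReducers) {
        // Default HashPartitioner: (hash & Integer.MAX_VALUE) % numReduceTasks
        int index = (key.hashCode() & Integer.MAX_VALUE) % numReducers;
        return index == myPartition;
    }

    public static void main(String[] args) {
        int numReducers = 3;
        String[] keys = {"JFK", "LHR", "CDG"};
        // Each key is served by exactly one of the reducers.
        for (String k : keys) {
            for (int p = 0; p < numReducers; p++) {
                if (servedByThisReducer(k, p, numReducers)) {
                    System.out.println(k + " -> reducer " + p);
                }
            }
        }
    }
}
```

Note this only holds when the job actually uses the default HashPartitioner; with a custom Partitioner the same comparison must be done against that class instead.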