-Re: Data locality of map-side join
Harsh J 2012-10-25, 05:49
>From what I've generally noticed, the client-end frameworks (Hive,
Pig, etc.) have gotten much more cleverness and efficiency packed in
their join parts than the MR join package which probably exists to
serve as an example or utility today more than anything else (but
works well for what it does).
Per the code in the join package, there are no such estimates made
today. There is zero use of DistributedCache - the only decisions are
made based on the expression (i.e. to select which form of joining
record reader to use).
Enhancements to this may be accepted though, so feel free to file some
JIRAs if you have something to suggest/contribute. Hopefully one day
we could have a unified library between client-end tools for common
use-cases such as joins, etc. over MR, but there isn't such a thing
right now (AFAIK).
On Tue, Oct 23, 2012 at 2:52 PM, Sigurd Spieckermann
<[EMAIL PROTECTED]> wrote:
> Interesting to know that Hive and Pig are doing something in this direction.
> I'm dealing with the Hadoop join-package which doesn't use DistributedCache
> though but it rather pulls the other partition over the network before
> launching the map task. This is under the assumption that both partitions
> are too big to load into DC or it's just undesirable to use DC. Is there a
> similar mechanism implemented in the join-package that considers the size of
> the two partitions to be joined trying to execute the map task on the
> datanode that holds the bigger partition?
> 2012/10/23 Bejoy KS <[EMAIL PROTECTED]>
>> Hi Sigurd
>> Mapside joins are efficiently implemented in Hive and Pig. I'm talking in
>> terms of how mapside joins are implemented in hive.
>> In map side join, the smaller data set is first loaded into
>> DistributedCache. The larger dataset is streamed as usual and the smaller
>> dataset in memory. For every record in larger data set the look up is made
>> in memory on the smaller set and there by joins are done.
>> In later versions of hive the hive framework itself intelligently
>> determines the smaller data set. In older versions you can specify the
>> smaller data set using some hints in query.
>> Bejoy KS
>> Sent from handheld, please excuse typos.
>> -----Original Message-----
>> From: Sigurd Spieckermann <[EMAIL PROTECTED]>
>> Date: Mon, 22 Oct 2012 22:29:15
>> To: <[EMAIL PROTECTED]>
>> Reply-To: [EMAIL PROTECTED]
>> Subject: Data locality of map-side join
>> Hi guys,
>> I've been trying to figure out whether a map-side join using the
>> join-package does anything clever regarding data locality with respect
>> to at least one of the partitions to join. To be more specific, if I
>> want to join two datasets and some partition of dataset A is larger than
>> the corresponding partition of dataset B, does Hadoop account for this
>> and try to ensure that the map task is executed on the datanode storing
>> the bigger partition thus reducing data transfer (if the other partition
>> does not happen to be located on that same datanode)? I couldn't
>> conclude the one or the other behavior from the source code and I
>> couldn't find any documentation about this detail.
>> Thanks for clarifying!