Re: Extension points available for data locality
Interesting....

You have a MySQL cluster, which is a bit different from a single data source.

When you say data locality, you mean that you want to launch your job and then have each mapper pull data from the local shard.

So you have a couple of issues.

1) You will need to set up Hadoop on the same cluster.
This is doable; you just have to account for the memory and disk on your system.

2) You will need to look at HBase's TableInputFormat class as a model. (What's the real difference between reading from a region server versus reading from a shard?) See the sketch below.

3) You will need to make sure that you have enough metadata to help determine where your data is located.
Outside of that, it's doable.
Right?
Note that since you're not running HBase, Hadoop is a bit more tolerant of swapping, but not by much.
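
To illustrate steps 2) and 3): below is a rough sketch (not code from this thread) of an InputFormat whose splits report the MySQL shard's host via getLocations(), which is what the MR scheduler matches against TaskTracker hosts when it tries to place data-local mappers. The class names, hostnames and shard ids are all invented for illustration.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class ShardInputFormat extends InputFormat<LongWritable, Text> {

  /** One split per MySQL shard; getLocations() is the locality hint. */
  public static class ShardSplit extends InputSplit implements Writable {
    private String shardHost; // node that holds this shard (and runs a TaskTracker)
    private String shardId;   // what a RecordReader would use to query the shard

    public ShardSplit() {}    // no-arg constructor required for deserialization

    public ShardSplit(String shardHost, String shardId) {
      this.shardHost = shardHost;
      this.shardId = shardId;
    }

    @Override
    public long getLength() {
      return 0; // unknown; a real version could estimate rows/bytes per shard
    }

    @Override
    public String[] getLocations() {
      // The scheduler compares these hostnames with TaskTracker hosts
      // when it tries to launch data-local map tasks.
      return new String[] { shardHost };
    }

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeUTF(shardHost);
      out.writeUTF(shardId);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      shardHost = in.readUTF();
      shardId = in.readUTF();
    }
  }

  @Override
  public List<InputSplit> getSplits(JobContext context) {
    // The shard layout would normally come from your own metadata (step 3);
    // hard-coded here purely for illustration.
    Map<String, String> hostToShard = new LinkedHashMap<String, String>();
    hostToShard.put("mysql-node-1.example.com", "shard_01");
    hostToShard.put("mysql-node-2.example.com", "shard_02");

    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (Map.Entry<String, String> e : hostToShard.entrySet()) {
      splits.add(new ShardSplit(e.getKey(), e.getValue()));
    }
    return splits;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                             TaskAttemptContext context) {
    // A real implementation would open a JDBC connection to the shard named in
    // the split and stream rows to the mapper; omitted from this sketch.
    throw new UnsupportedOperationException("JDBC RecordReader not shown");
  }
}

In a real job, getSplits() would read the shard layout from your own metadata, and the RecordReader would connect to the (ideally local) shard and feed rows to the mapper.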

Good luck.

On Aug 21, 2012, at 7:44 AM, Tharindu Mathew <[EMAIL PROTECTED]> wrote:

> Dino, Feng,
>
> Thanks for the options, but I guess I need to do it myself.
>
> Harsh,
>
> What you said was the initial impression I got, but I thought I needed to do something more with the name node. Thanks for clearing that up.
>
> My guess is that this probably works by using getLocations() and matching the returned location's IP (or host) against the IP (or host) of the task tracker? Is this correct?
>
>
> On Tue, Aug 21, 2012 at 3:14 PM, feng lu <[EMAIL PROTECTED]> wrote:
> Hi Tharindu
>
> Maybe you can try Gora. The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key-value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.
>
> It now supports MySQL via the gora-sql module.
>
>  http://gora.apache.org/
>
>
> On Tue, Aug 21, 2012 at 5:39 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> Tharindu,
>
> (Am assuming you've done enough research to know that there's benefit
> in what you're attempting to do.)
>
> Locality of tasks is determined by the job's InputFormat class.
> Specifically, the locality information returned by the InputSplit
> objects via InputFormat#getSplits(…) API is what the MR scheduler
> looks at when trying to launch data local tasks.
>
> You can tweak your InputFormat (the one that uses this DB as input?)
> to return relevant locations based on your "DB Cluster", in order to
> achieve this.
>
> On Tue, Aug 21, 2012 at 2:36 PM, Tharindu Mathew <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I'm doing some research that involves pulling data stored in a MySQL cluster
> > directly for a MapReduce job, without storing the data in HDFS.
> >
> > I'd like to run Hadoop task tracker nodes directly on the MySQL cluster
> > nodes. The purpose of this is to start mappers directly on the node
> > closest to the data, if possible (data locality).
> >
> > I notice that with HDFS, since the name node knows exactly where each data
> > block is, it uses this to achieve data locality.
> >
> > Is there a way to achieve my requirement possibly by extending the name node
> > or otherwise?
> >
> > Thanks in advance.
> >
> > --
> > Regards,
> >
> > Tharindu
> >
> > blog: http://mackiemathew.com/
> >
>
>
>
> --
> Harsh J
>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>
>
>
> --
> Regards,
>
> Tharindu
>
> blog: http://mackiemathew.com/
>