-
Extension points available for data locality
Tharindu Mathew 2012-08-21, 09:06
Hi,

I'm doing some research that involves pulling data stored in a MySQL cluster directly into a MapReduce job, without storing the data in HDFS.

I'd like to run Hadoop task tracker nodes directly on the MySQL cluster nodes. The purpose is to start mappers on the node closest to the data where possible (data locality).

I notice that with HDFS, since the name node knows exactly where each data block is, it uses this to achieve data locality.

Is there a way to achieve my requirement, possibly by extending the name node or otherwise?

Thanks in advance.

--
Regards,
Tharindu
blog: http://mackiemathew.com/
-
Re: Extension points available for data locality
Dino Kečo 2012-08-21, 09:22
Hi Mathew,

You should check out this project: http://db.cs.yale.edu/hadoopdb/hadoopdb.html

It uses Hadoop and an RDBMS for analytics.

Regards,
Dino Kečo
-
Re: Extension points available for data locality
Harsh J 2012-08-21, 09:39
Tharindu,

(I'm assuming you've done enough research to know that there's benefit in what you're attempting to do.)

The locality of tasks is determined by the job's InputFormat class. Specifically, the locality information returned by the InputSplit objects via the InputFormat#getSplits(…) API is what the MR scheduler looks at when trying to launch data-local tasks.

You can tweak your InputFormat (the one that uses this DB as input?) to return relevant locations based on your "DB cluster" in order to achieve this.

--
Harsh J
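To make the suggestion above concrete, here is a minimal sketch in plain Java, not code from anyone in this thread. It deliberately does not depend on Hadoop: `ShardSplits`, `ShardSplit`, and the shard-to-hosts map are hypothetical names made up for illustration. In a real job the split class would extend Hadoop's `InputSplit` (and be serializable by the framework), and the list would be returned from the custom InputFormat's `getSplits`. The shard metadata is assumed to be available already.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Hadoop-free model of the idea. Nothing here is a Hadoop class.
final class ShardSplits {

    /** One unit of work: a shard plus the hosts that hold a copy of it. */
    static final class ShardSplit {
        private final String shardId;
        private final String[] hosts;

        ShardSplit(String shardId, String[] hosts) {
            this.shardId = shardId;
            this.hosts = hosts.clone();
        }

        String getShardId() {
            return shardId;
        }

        // Plays the role of InputSplit#getLocations(): hostnames used as locality hints.
        String[] getLocations() {
            return hosts.clone();
        }
    }

    // Plays the role of InputFormat#getSplits(...): one split per shard in the metadata.
    static List<ShardSplit> getSplits(Map<String, String[]> shardToHosts) {
        List<ShardSplit> splits = new ArrayList<>();
        for (Map.Entry<String, String[]> entry : shardToHosts.entrySet()) {
            splits.add(new ShardSplit(entry.getKey(), entry.getValue()));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed metadata: which MySQL nodes hold which shard (primary first).
        Map<String, String[]> metadata = new LinkedHashMap<>();
        metadata.put("shard-0", new String[] {"db-node-1", "db-node-2"});
        metadata.put("shard-1", new String[] {"db-node-2", "db-node-3"});
        for (ShardSplit split : getSplits(metadata)) {
            System.out.println(split.getShardId() + " -> " + String.join(",", split.getLocations()));
        }
    }
}
```

Listing every node that holds a copy of the shard, rather than only the primary, gives the scheduler more than one candidate host for a data-local task.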
-
Re: Extension points available for data locality
feng lu 2012-08-21, 09:44
Hi Tharindu,

Maybe you can try Gora. The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key-value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.

It now supports MySQL in the gora-sql module.

http://gora.apache.org/

--
Don't Grow Old, Grow Up... :-)
-
Re: Extension points available for data locality
Tharindu Mathew 2012-08-21, 12:44
Dino, Feng,

Thanks for the options, but I guess I need to do it myself.

Harsh,

What you said was the initial impression I got, but I thought I needed to do something more with the name node. Thanks for clearing that up.

My guess is that this works by taking the value returned by getLocations and matching that location's IP (or host) against the IP (or host) of the task tracker. Is this correct?
-
Re: Extension points available for data locality
Michael Segel 2012-08-21, 13:28
Interesting....

You have a cluster of MySQL nodes, which is a bit different from a single data source.

When you say data locality, you mean that you want to launch your job and then have each mapper pull data from the local shard.

So you have a couple of issues.

1) You will need to set up Hadoop on the same cluster. This is doable; you just have to account for the memory and disk on your system.

2) You will need to look at the HTable Input Format class. (What's the difference between looking at an RS (region server) versus a shard?)

3) You will need to make sure that you have enough metadata to help determine where your data is located.

Outside of that, it's doable. Right?

Note that since you're not running HBase, Hadoop is a bit more tolerant of swapping, but not by much.

Good luck.
-
Re: Extension points available for data locality
Tharindu Mathew 2012-08-21, 13:54
Yes, Michael. You are thinking along the right lines.

I just want to understand the inner workings of this, so I can rule out guesswork when it comes to making my implementation reliable.

For example, if a node in the MySQL cluster goes down and the failover node takes over, I want to make sure Hadoop picks the failover node to pull the data from and doesn't fail the job because the original node is unavailable.

Hence my extensive questions on this matter. As you said, of course you need to have the metadata to know which node holds what. Let's assume that metadata is available.
-
Re: Extension points available for data locality
Michael Segel 2012-08-21, 14:19
On Aug 21, 2012, at 8:54 AM, Tharindu Mathew wrote:
> For example, if a node in the mysql cluster goes down and the failover node takes over, I want to make sure Hadoop picks the failover node to pull the data from and doesn't fail the job because the original node is unavailable.

That could be problematic. I mean, just shooting from the hip now... If you are running a job and you lose the connection to that shard midstream, your task will time out and fail. When it gets restarted, you could catch the exception that indicates the server is down and then go to the backup.

Again, this code will be all yours, so you would have to write this feature in.

> Hence, my extensive questions on this matter. As you said, of course you need to have the meta data to know which node holds what. Let's assume that meta data is available.

At a minimum, the metadata should be available. How else do you partition the data in the first place? Your cluster's configuration data also has to be available.

HTH
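The "catch the exception and go to the backup" idea could look roughly like the sketch below. It is only an illustration under assumptions: `FailoverConnect`, `Connector`, and the host names are hypothetical, and `Connector` stands in for whatever actually opens a connection to a MySQL node (for example a JDBC call in the record reader). The ordered host list is assumed to come from the cluster metadata, primary first.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names come from Hadoop or MySQL.
final class FailoverConnect {

    /** Stand-in for the code that opens a connection to one database node. */
    interface Connector<C> {
        C connect(String host) throws Exception;
    }

    // Try each candidate host in order and return the first connection that opens.
    // If every host fails, rethrow the last failure so the task still fails visibly.
    static <C> C connectToFirstAvailable(List<String> hosts, Connector<C> connector) throws Exception {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("no hosts given");
        }
        Exception last = null;
        for (String host : hosts) {
            try {
                return connector.connect(host);
            } catch (Exception e) {
                last = e; // this node looks down; move on to the next candidate
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulated cluster: the primary refuses connections, the backup answers.
        Connector<String> fake = host -> {
            if (host.equals("db-primary")) {
                throw new java.io.IOException("connection refused: " + host);
            }
            return "connected to " + host;
        };
        System.out.println(connectToFirstAvailable(Arrays.asList("db-primary", "db-backup"), fake));
    }
}
```

Doing the fallback inside the task means a restarted attempt can reach the backup without any change on the name node side, at the cost of possibly reading from a non-local node.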
-
Re: Extension points available for data locality
Tharindu Mathew 2012-08-21, 18:40
On Tue, Aug 21, 2012 at 7:49 PM, Michael Segel wrote:
> If you are running a job, and mid stream you lose connection to that shard, then your task will time out and fail. As it gets restarted it could be that you catch an exception that indicates the server is down and to then go to the backup.
>
> Again this code will be all yours so you would have to write this feature in.

That feels inefficient. I assume FileInputFormat handles this much more efficiently than that. This is why I asked whether I have to modify the name node so that it inherently knows the replicated locations of my data.

OTOH, based on the answers in this thread, I assume that through the InputFormat API I can feed in the available node dynamically if a node goes down.
-
Re: Extension points available for data locality
Minh Duc Nguyen 2012-08-21, 19:17
Tharindu, have you considered using something like Sqoop? For efficiency, your idea is to run a Hadoop cluster on the same nodes as your MySQL cluster, in effect moving your processing to your data. If you use something like Sqoop, you could instead move your data to your Hadoop cluster. While it may not make sense for what you're trying to accomplish, I thought I'd at least offer up the idea.

HTH,
Minh
-
Re: Extension points available for data locality
Harsh J 2012-08-22, 02:16
Hi,

On Tue, Aug 21, 2012 at 6:14 PM, Tharindu Mathew wrote:
> My guess is that this probably works by using getLocations and mapping this location ip (or host) with the ip (or host) of the task tracker? Is this correct?

Yes, this is correct; the TT's location (hostname/IP) is what it would map to.

--
Harsh J
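As a simplified picture of that mapping (an illustration only, not Hadoop's scheduler code), the check amounts to asking whether the task tracker's host appears among the locations a split reports. `LocalityCheck` and `isDataLocal` are hypothetical names; the real scheduler goes through its own host and topology handling, so treat this purely as a mental model.

```java
import java.util.Arrays;

// Hypothetical helper; it models the comparison, not the actual scheduler.
final class LocalityCheck {

    // True when the tracker's host name appears among the split's reported locations.
    static boolean isDataLocal(String[] splitLocations, String trackerHost) {
        return Arrays.stream(splitLocations)
                .anyMatch(location -> location.equalsIgnoreCase(trackerHost));
    }

    public static void main(String[] args) {
        String[] locations = {"db-node-1", "db-node-2"};
        System.out.println(isDataLocal(locations, "db-node-2")); // tracker runs on a node holding the shard
        System.out.println(isDataLocal(locations, "db-node-9")); // tracker runs elsewhere
    }
}
```

One practical consequence, assuming the comparison is on the names as reported: the custom InputFormat should return hosts in the same form the task trackers use (the same hostname or the same IP), otherwise a split may never be recognized as local.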
-
Re: Extension points available for data locality
Tharindu Mathew 2012-08-22, 06:30
On Wed, Aug 22, 2012 at 7:46 AM, Harsh J wrote:
> Yes this is correct, the TT's location (hostname/IP) is what it would map to.

Thanks, Harsh. Exactly what I needed to hear.