|
Brian McSweeney
2011-01-08, 23:33
Sonal Goyal
2011-01-09, 02:57
Konstantin Boudnik
2011-01-09, 03:18
Ted Dunning
2011-01-09, 08:55
Black, Michael
2011-01-09, 12:20
Brian McSweeney
2011-01-09, 13:04
Brian McSweeney
2011-01-09, 13:21
Brian McSweeney
2011-01-09, 13:26
Brian McSweeney
2011-01-09, 13:30
Black, Michael
2011-01-09, 13:51
Ted Dunning
2011-01-09, 21:18
arvind@...)
2011-01-09, 21:37
Jeff Hammerbacher
2011-01-10, 00:00
Brian McSweeney
2011-01-10, 01:19
Brian McSweeney
2011-01-10, 01:23
Brian McSweeney
2011-01-10, 01:27
Brian McSweeney
2011-01-10, 01:27
Black, Michael
2011-01-10, 13:21
Brian
2011-01-10, 20:00
Black, Michael
2011-01-10, 20:46
Ted Dunning
2011-01-10, 21:51
Brian McSweeney
2011-01-10, 23:19
Brian McSweeney
2011-01-11, 00:54
Mark Kerzner
2011-01-14, 06:02
Brian McSweeney
2011-01-14, 20:24
|
-
Import data from mysql
Brian McSweeney
2011-01-08, 23:33
Hi folks,

I'm a TOTAL newbie on hadoop. I have an existing webapp that has a growing number of rows in a mysql database that I have to compare against one another once a day from a batch job. This is an exponential problem as every row must be compared against every other row. I was thinking of parallelizing this computation via hadoop. As such, I was thinking that perhaps the first thing to look at is how to bring info from a database to a hadoop job and vice versa. I have seen the following relevant info

https://issues.apache.org/jira/browse/HADOOP-2536

and also

http://architects.dzone.com/articles/tools-moving-sql-database

Any advice on what approach to use?

cheers,
Brian
-
Re: Import data from mysql
Sonal Goyal
2011-01-09, 02:57
Hi Brian,

You can check HIHO at https://github.com/sonalgoyal/hiho which can help you load data from any JDBC database to the Hadoop file system. If your table has a date or id field, or any indicator for modified/newly added rows, you can import only the altered rows every day. Please let me know if you need help.

Thanks and Regards,
Sonal
Nube Technologies <http://www.nubetech.co>
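The "import only the altered rows" idea can be sketched roughly as below. This is not HIHO's API; it is a minimal illustration that uses Python's built-in sqlite3 as a stand-in for MySQL, and the `users` table, the `updated_at` column and the `fetch_changed_rows` helper are all hypothetical names chosen for the example.

```python
import sqlite3


def fetch_changed_rows(conn, last_run):
    """Return rows modified after the previous run's high-water mark."""
    cur = conn.execute(
        "SELECT id, name, updated_at FROM users "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_run,),
    )
    return cur.fetchall()


# In-memory database standing in for the live MySQL instance (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "alice", "2011-01-07 10:00:00"),
        (2, "bob", "2011-01-08 09:30:00"),
        (3, "carol", "2011-01-08 18:45:00"),
    ],
)

# Only rows touched since the last daily export are selected.
changed = fetch_changed_rows(conn, "2011-01-08 00:00:00")
print(changed)
```

The daily job would store the largest `updated_at` it has seen and pass it in as `last_run` on the next run, so each export only moves the delta.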
-
Re: Import data from mysql
Konstantin Boudnik
2011-01-09, 03:18
There's a supported tool with all bells and whistles:
http://www.cloudera.com/downloads/sqoop/

--
Take care,
Konstantin (Cos) Boudnik
-
Re: Import data from mysql
Ted Dunning
2011-01-09, 08:55
It is, of course, only quadratic, even if you compare all rows to all other rows. You can reduce this cost to O(n log n) by ordinary sorting and you can further reduce the cost to O(n) using radix sort on hashes.

Practically speaking, in either the parallel or non-parallel setting, try sorting each batch of inputs and then doing a merge pass to find duplicated rows. Hashing your rows and doing the sort will make things faster if your rows are very long or if you use radix sort. Unless your data is vast, this would probably work on a single machine with no need for parallelism, since sorting billions of items would require <10 passes through your data with a 2^16-way radix sort.

To do this with hadoop, simply do the hashing as before and run a typical word count. Then the rows that duplicate are simply the ones with count > 1 and these can be preferentially output by the reducer.
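As a rough single-machine sketch of the hash-then-count idea described above (the same shape as a word count, where the map step emits a row hash and the reduce step keeps hashes with count > 1), something like the following would do. The helper names `row_hash` and `find_duplicate_hashes`, the choice of SHA-1 and the sample rows are assumptions made for illustration only.

```python
import hashlib
from collections import Counter


def row_hash(row):
    """Hash a row to a fixed-size key so long rows are cheap to compare."""
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(str(value) for value in row)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def find_duplicate_hashes(rows):
    """'Word count' over row hashes; duplicates are hashes seen more than once."""
    counts = Counter(row_hash(row) for row in rows)
    return {h for h, count in counts.items() if count > 1}


rows = [
    ("alice", 30, "Dublin"),
    ("bob", 25, "Cork"),
    ("alice", 30, "Dublin"),
    ("carol", 41, "Galway"),
]

dups = find_duplicate_hashes(rows)
duplicate_rows = sorted({row for row in rows if row_hash(row) in dups})
print(duplicate_rows)
```

The counting pass is linear in the number of rows, which is the point of the message: exact-duplicate detection does not need an all-pairs comparison.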
-
RE: Import data from mysql
Black, Michael
2011-01-09, 12:20
What kind of compare do you have to do?

You should be able to compute a checksum or such for each row when you insert them and only have to look at the subset that matches if you're doing some sort of substring or such.

Michael D. Black
Senior Scientist
Advanced Analytics Directorate
Northrop Grumman Information Systems
-
Re: Import data from mysql
Brian McSweeney
2011-01-09, 13:04
thanks Sonal,

I'll check it out

--
Brian McSweeney
Technology Director
Smarter Technology
web: http://www.smarter.ie
phone: +353868578212
-
Re: Import data from mysql
Brian McSweeney
2011-01-09, 13:21
Thanks Konstantin,

I had seen sqoop. I wonder whether it is normally used as a one-off process, or whether it can also be used effectively on a live database system on a daily basis for batch export. Are there performance issues with this approach? Or how would it compare to some of the other classes that I have seen, such as those in the database library http://hadoop.apache.org/mapreduce/docs/current/api/

I have also seen a few alternatives out there such as cascading and cascading-dbmigrate

http://architects.dzone.com/articles/tools-moving-sql-database

But from the hadoop api above it also seems that some of this functionality is perhaps now in the main api. I suppose any experience people have is welcome. I would want to run a batch job to export every day, perform my map reduce, and then import the results back into mysql afterwards.

cheers,
Brian
-
Re: Import data from mysql
Brian McSweeney
2011-01-09, 13:26
Thanks Ted,

You're right but I suppose I was too brief in my initial statement. I should have said that I have to run an operation on all rows with respect to each other. It's not a case of just comparing them and thus sorting them, so unfortunately I don't think this will help much. Some of the values in the rows have to be multiplied together, some have to be compared, some have to have a function run against them etc.

cheers,
Brian
-
Re: Import data from mysql
Brian McSweeney
2011-01-09, 13:30
Hi Michael,

yeah, sorry, I shouldn't have said a compare as that would be a simplified problem. For each two rows I have to calculate a score based on multiplying some of the column values together, running some functions against each other etc. I could do this as the rows are entered into the db, cutting down the problem; however, unfortunately the values in the existing rows change every day, so I think the only thing to do is export the lot and run a job once a day to come up with the new scores. This is why I'm looking at hadoop, as it has become too big a job to do serially.

cheers,
Brian
-
Re: Import data from mysql
Black, Michael
2011-01-09, 13:51
All you're doing is delaying the inevitable by going to hadoop. There's no magic to hadoop. It doesn't run as fast as individual processes. There's just the ability to split jobs across a cluster, which works for some problems. You won't even get a linear improvement in speed.

At least I assume you don't have some magical-automatically-growing-forest-of-computers.

Do ALL the values change every day? You would still be better off doing it as updates are made. You can multithread your application with OpenMP really easily and if you've got 8 cores get close to an 8X improvement with hardly any effort at all.

It sounds like you have an exploding data problem, which means you need to readdress what you're doing so you're not in N^2 space any more. That's completely untenable, which you're starting to see. You quite obviously cannot keep this up for long...

So...if you want to open up your kimono a bit and show an example of what you're doing maybe we can help.

Michael D. Black
-
Re: Import data from mysql
Ted Dunning
2011-01-09, 21:18
You still have to knock down the quadratic cost.

Any equality checks you have in your problem can be used to limit the problem to growing quadratically in the number of records equal by that comparison. That may be enough to fix things (for now). Unfortunately heavily skewed data are very common, so this smaller quadratic will be orders of magnitude smaller than the original, but still unscalable. Hadoop makes this grouping by equality much easier of course and the internal scan can be done by conventional techniques.

Beyond that, you need to look at more interesting techniques to really make this a viable option.

I would recommend:

- if the multiplication is part of a cosine similarity measurement, then look at expressing it as a difference instead and bound the largest component of the difference.

- take a look at locality sensitive hashing. This gives you an approximate nearest neighbor solution that will allow good probabilistic bounds on the number of cases that you miss in return for a scalable solution. The error bounds can be made fairly tight. See http://www.mit.edu/~andoni/LSH/

- if you decide that LSH is the way to go, check out Mahout which has a minhash clustering implementation.

- if you can't restate the problem as non-quadratic, then start over. Quadratic algorithms are not scalable as Michael Black has stated eloquently enough in another thread.

- consider telling the group more about your problem. You get more if you give more.
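To give a feel for the locality sensitive hashing suggestion, here is a small MinHash sketch with banding. It is not Mahout's implementation, just a toy under stated assumptions: each record is reduced to a non-empty set of string tokens, MD5 with a seed prefix stands in for a family of hash functions, and the names `minhash_signature`, `candidate_pairs`, and the band sizes are invented for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash_signature(tokens, num_hashes):
    """One minimum per seeded hash function; similar sets share many minima."""
    signature = []
    for seed in range(num_hashes):
        signature.append(
            min(
                int.from_bytes(
                    hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()[:8],
                    "big",
                )
                for token in tokens
            )
        )
    return tuple(signature)


def candidate_pairs(items, bands=16, rows_per_band=2):
    """Bucket signatures band by band; only records sharing a bucket are paired."""
    buckets = defaultdict(list)
    for key, tokens in items.items():
        sig = minhash_signature(tokens, bands * rows_per_band)
        for b in range(bands):
            band = sig[b * rows_per_band:(b + 1) * rows_per_band]
            buckets[(b, band)].append(key)
    pairs = set()
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


items = {
    "u1": {"age:30s", "state:CA", "religion:none"},
    "u2": {"age:30s", "state:CA", "religion:none"},
    "u3": {"age:50s", "state:NY", "religion:catholic"},
}
pairs = candidate_pairs(items)
print(pairs)
```

Only the candidate pairs would then go through the expensive scoring function, which is how the approach trades a bounded chance of missed matches for avoiding the full all-pairs scan.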
-
Re: Import data from mysql
arvind@...)
2011-01-09, 21:37
Hi Brian,

Sqoop supports incremental imports that can be run against a live database system on a daily basis for importing the new data. Unless your data is large and cannot be split into comparable slices for parallel imports, I do not see any concerns regarding performance.

Regarding the database library you have pointed out, it is fundamentally very close to what Sqoop does. However, Sqoop goes way beyond these libraries to ensure that you can address your use-case out of the box without having to modify anything. If, on the other hand, you are more inclined to code your own solution, then perhaps the other tools or these low-level APIs may come in handy.

Arvind
-
Re: Import data from mysql
Jeff Hammerbacher
2011-01-10, 00:00
Hey Brian,

One final point about Sqoop: it's a part of Cloudera's Distribution for Hadoop, so it's Apache 2.0 licensed and tightly integrated with the other platform components. This means, for example, that we have added a Sqoop action to Oozie, which makes integrating data import and export into your workflows trivial; see http://archive.cloudera.com/cdh/3/oozie-2.2.1+82/WorkflowActionExtensionsSpec.html#AE.2_Sqoop_Action for more details.

For further discussion of Sqoop, I'd recommend using the Sqoop user list at https://groups.google.com/a/cloudera.org/group/sqoop-user. For questions about CDH in general, see https://groups.google.com/a/cloudera.org/group/cdh-user.

Regards,
Jeff
-
Re: Import data from mysqlBrian McSweeney 2011-01-10, 01:19
Hi Michael,
Firstly, thanks for the reply. Secondly, I have to give you credit for being the first person who has ever asked me if I want to open up my kimono a little, and also the first person on a tech list who has ever made me laugh out loud. :)

Ok, I hear you, and you raise some very valid issues, so I'll show a little leg :)

So, my application is a dating application and my problem is with respect to users in the system. The row comparison I was referring to is between users in my system. Each user has a set of profile attributes (age, location, gender, race, religion, etc.). Each user also has a set of preferences in terms of ideal dates. In order to determine if people are a good fit for each other, each user's preferences are compared with other users' profile attributes, and a score is created.

In an ideal world, each user would create a score for each other user. This would be, as you have pointed out, an N^2 problem. Also, there is a Bayesian factor that is applied to every user on a daily basis, based on a number of factors such as activity on the site. This is why I said that all users must be compared on a daily basis.

So, where am I at the moment with this? I have also realised that this is an unrealistic strategy long term as the numbers grow, so I have looked at partitioning the space so that only groups of users under a certain limit are compared, e.g. users in the same state, or under a maximum limit. Thus, I was hoping that if I put that limit at, say, comparing 1000 users (say 1000 men and 1000 women), which is 1 million ranks, then I could push each one of these comparisons to hadoop, where they could be run in parallel and therefore quicker than running several batch comparisons of 1000 users sequentially on one box.

I hope this makes sense and I hope I have opened up my kimono enough for you to get a sense of what I'm talking about :)

thanks very much,
Brian

Brian McSweeney
Technology Director
Smarter Technology
web: http://www.smarter.ie
phone: +353868578212
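A rough Python sketch of the partitioning idea in Brian's message. The field names (state, gender, age, religion), the preference format and the score function are all invented for illustration and are not the real scoring logic. The point is only that each partition is an independent unit of work, so it could in principle be handed to a separate parallel task.

```python
from collections import defaultdict
from itertools import product


def score(seeker, candidate):
    """Hypothetical score: one point per preference the candidate satisfies."""
    prefs = seeker["prefs"]
    points = 0
    if prefs["min_age"] <= candidate["age"] <= prefs["max_age"]:
        points += 1
    if candidate["religion"] in prefs["religions"]:
        points += 1
    return points


def partition_by_state(users):
    """Group users by state, then by gender, so only same-state users are compared."""
    groups = defaultdict(lambda: defaultdict(list))
    for user in users:
        groups[user["state"]][user["gender"]].append(user)
    return groups


def score_partition(men, women):
    """Score every man against every woman in one partition, in both directions.

    With 1000 men and 1000 women this visits 1,000,000 pairs.
    """
    scores = {}
    for m, w in product(men, women):
        scores[(m["id"], w["id"])] = score(m, w)  # how well w fits m's preferences
        scores[(w["id"], m["id"])] = score(w, m)  # how well m fits w's preferences
    return scores
```

Because `score_partition` only reads the two lists it is given, partitions can be processed in any order or on different machines, which is what makes the problem a candidate for a map-style job.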
-
Re: Import data from mysql
Brian McSweeney 2011-01-10, 01:23
Hi Ted,
I agree about reducing the quadratic cost, and hopefully my reply to Michael shows what my idea has been in this regard. I really appreciate the pointers on LSH and Mahout; I'll read up on them and see if they help.

thanks very much for your help.

cheers,
Brian
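For readers who have not met locality sensitive hashing (LSH), here is a minimal MinHash sketch in Python. It assumes each user can be represented as a non-empty set of attribute tokens, which is an invented representation and not something stated in this thread. It estimates Jaccard similarity from short signatures and uses banding to pick candidate pairs. It is a toy illustration of the technique, not Mahout's implementation.

```python
import hashlib


def _hash(token, seed):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(tokens, num_hashes=64):
    """One minimum per seeded hash function; tokens must be non-empty."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """Return id pairs whose signatures agree on at least one whole band.

    Only these pairs would need the expensive exact comparison. If the
    signature length is not a multiple of `bands`, trailing values are ignored.
    """
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for uid, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(uid)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs
```

The trade-off is the one Ted describes: some similar pairs can be missed, but the number of exact comparisons no longer has to grow with the square of the number of users.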
-
Re: Import data from mysql
Brian McSweeney 2011-01-10, 01:27
Thanks Jeff,
Great info and I really appreciate it.

cheers,
Brian
-
Re: Import data from mysql
Brian McSweeney 2011-01-10, 01:27
Hi Arvind,
thanks very much for that. Very good to know. Sounds like Sqoop is just what I'm looking for.

cheers,
Brian
-
Re: Import data from mysql
Black, Michael 2011-01-10, 13:21
I had no idea the kimono comment would be so applicable to your problem...
Everything makes sense except the Bayesian computation.

Your "score" can be computed on subsets....in particular you only need to do it on "new" and "changed" records. Most records should be pretty static (age needs to be stored as a birthdate so it's computable). Note that comparing a new person to the database is at worst N, assuming they don't have a sexual preference, and approximately N/2 if they do. Lots better than N^2 every day.

So just set a flag in your database for new/changed and process just those records every day. Computing any other records has already been done on previous days. Even if you allow your users a custom scoring threshold, that can all be done in the SQL query.

Then whenever there's a change or new addition you simply take the new values and perform a query to get the people that match best. You do this by adding your own compare function to the database so you can use it in a query. You don't need to do the other-way-round by comparing EVERYBODY to the new stuff. You get the same answer with a much smaller set (linear by the way) if you only look at new/changed.

http://dev.mysql.com/doc/refman/5.5/en/adding-functions.html

Your Bayesian is another story though...wazzup wit' 'dat...but again...sounds like it only needs to be done on "active" users, so again it is not N^2 and is only a computation internal to the record and not cross-record, yes? If so it's linear and not N^2 and can be done on a subset of active users....otherwise it hasn't changed, has it?

Michael D. Black
Senior Scientist
Advanced Analytics Directorate
Northrop Grumman Information Systems
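One possible reading of the flag-and-process approach, sketched in Python. The in-memory dictionaries, the "changed" flag name and the caller-supplied score function are assumptions made for illustration; in practice the flag would live in a database column. This sketch recomputes scores in both directions for each flagged user, which is a conservative choice and not necessarily what Michael has in mind.

```python
def rescore_changed(users, matches, score):
    """Recompute scores only for users flagged as new or changed.

    users:   dict of user id -> record containing a boolean "changed" flag
    matches: dict of (seeker id, candidate id) -> score, updated in place
    score:   caller-supplied function score(seeker, candidate)
    Returns the number of score computations performed.
    """
    changed_ids = [uid for uid, user in users.items() if user["changed"]]
    computed = 0
    for uid in changed_ids:
        for other_id, other in users.items():
            if other_id == uid:
                continue
            # Pairs where both users changed are simply recomputed twice here.
            matches[(uid, other_id)] = score(users[uid], other)
            matches[(other_id, uid)] = score(other, users[uid])
            computed += 2
        users[uid]["changed"] = False  # clear the flag once processed
    return computed
```

With k flagged users out of N, the work is roughly 2 * k * (N - 1) score computations per run, which grows linearly in N for a fixed k.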
-
Re: Import data from mysql
Brian 2011-01-10, 20:00
Hi Michael,
that all makes total sense and I very much appreciate your help. Leaving the Bayesian issue aside for a moment, I still think I'm stuck with a potentially big calculating problem, even if it is not quadratic.

For example, imagine I've got 10,000 users of each gender. If only 100 update their preferences and another 100 join, I'm still talking about 2 million calculations for the new/updated users to score everyone else, and another 2 million for existing users to create scores for the new users.

Thus, I would greatly appreciate your opinion on whether or not using hadoop for this would make sense in order to parallelize the task if it gets too slow.

Thanks again,
Brian
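The arithmetic behind Brian's estimate, written out in Python. Like his figures, it assumes each new or updated user is compared against all 10,000 users of the other gender. The full-recompute figure is added here for comparison and is not a number from the thread.

```python
users_per_gender = 10_000
updated = 100
joined = 100
touched = updated + joined  # 200 new or changed users

# Each touched user scores every user of the other gender...
touched_scoring_others = touched * users_per_gender
# ...and every user of the other gender scores each touched user.
others_scoring_touched = users_per_gender * touched

incremental_total = touched_scoring_others + others_scoring_touched

# For comparison: every man/woman pair, scored in both directions.
full_recompute = 2 * users_per_gender * users_per_gender

print(touched_scoring_others)               # 2000000
print(incremental_total)                    # 4000000
print(full_recompute // incremental_total)  # 50
```

So under these assumptions the incremental job is about 4 million score computations a day, roughly one fiftieth of recomputing everything.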
-
Re: Import data from mysql
Black, Michael 2011-01-10, 20:46
You need to stop looking at this as an all-or-nothing...and look at it more like real-time.
You only need to do an absolute max of 1*10,000 at a time. And...you actually only need to do considerably less than that with age preference and other factors for the users....and doing the computation via a built-in in your database will prevent having to retrieve all the data...split it...start jvms...reduce it...spit out the file...read in the file...etc....saving lots more time than using hadoop.

You should be able to do 10,000 computations inside MySQL in less than a second. It will take you a minute or more to do it in hadoop.

Just do them as they occur and don't worry about the once-per-day thing. Then you're left with a linear growth pattern which can be overcome by using MySQL in a cluster, rather than N^2 using hadoop.

Give it a try and see how it performs for you...

The whole thing will boil down to one SQL statement where you add potential matches via the compare function.

Something like this:

    idcur = current id of add/change
    select id, score(idcur, id) from people
      where religion='RELIGIONX' and SEX='M' and AGE BETWEEN X and Y

You then update the match table for the users returned with some score threshold (I would assume there's a threshold).

I don't know if you care to elucidate your "score", as I don't see a whole lot of numeric flexibility in matching people...unless you're doing personality profiles too.

Michael D. Black
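To make the "score function inside the query" idea concrete, here is a small runnable analogue in Python. It swaps in SQLite (via the standard `sqlite3` module) for MySQL, because SQLite lets a Python function be registered for use in SQL. A real MySQL setup would add the function through the mechanism in the MySQL manual linked earlier in the thread. The table layout, the sample rows and the `closeness` scoring rule are all invented for illustration.

```python
import sqlite3


def closeness(seeker_age, candidate_age):
    """Hypothetical score: 1.0 for the same age, falling off with the age gap."""
    return 1.0 / (1 + abs(seeker_age - candidate_age))


def make_demo_db():
    """Build an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    # Register the Python function so SQL can call it as score(a, b).
    conn.create_function("score", 2, closeness)
    conn.execute(
        "CREATE TABLE people (id INTEGER PRIMARY KEY, sex TEXT, religion TEXT, age INTEGER)"
    )
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?, ?)",
        [(1, "M", "X", 30), (2, "M", "X", 34), (3, "M", "Y", 31),
         (4, "F", "X", 30), (5, "M", "X", 50)],
    )
    return conn


def best_matches(conn, seeker_age, religion, sex, min_age, max_age):
    """Filter with plain WHERE clauses first, then score only the survivors."""
    cursor = conn.execute(
        "SELECT id, score(?, age) AS s FROM people "
        "WHERE religion = ? AND sex = ? AND age BETWEEN ? AND ? "
        "ORDER BY s DESC, id",
        (seeker_age, religion, sex, min_age, max_age),
    )
    return cursor.fetchall()
```

For a 31-year-old looking for men of religion X aged 25 to 40, `best_matches(make_demo_db(), 31, "X", "M", 25, 40)` returns `[(1, 0.5), (2, 0.25)]`: the WHERE clause removes rows 3, 4 and 5 before any scoring happens, which is the saving Michael is pointing at.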
-
Re: Import data from mysql
Ted Dunning 2011-01-10, 21:51
Yes. Hadoop can definitely help with this.
On Mon, Jan 10, 2011 at 12:00 PM, Brian <[EMAIL PROTECTED]> wrote:
> Thus, I would greatly appreciate your opinion on whether or not using
> hadoop for this would make sense in order to parallelize the task if it gets
> too slow.
-
Re: Import data from mysql
Brian McSweeney 2011-01-10, 23:19
Thanks Michael,
As you say, I'll give your suggestion a try and see how it performs.

thanks for all your help. I really appreciate it,
Brian
-
Re: Import data from mysql
Brian McSweeney 2011-01-11, 00:54
Thanks Ted,
Good to know that hadoop can help. I'll look more into it also.

really appreciate it.
Brian
-
Re: Import data from mysql
Mark Kerzner 2011-01-14, 06:02
Brian,
I read with fascination your thread on MySQL and Hadoop. I enjoyed your polite answers to every person. Your problem is interesting. Your helpers were brilliant. Disclaimer: I have a vested interest, as I am writing "Hadoop in Practice" for Manning, and I was at the beginning of chapter 3, "SQL Databases and Hadoop", when you asked your question. You can imagine that I was thrilled and stored the thread to be read later. Which is now.

I think that your problem has two different components.

1. Import of MySQL data into Hadoop. This can be done with Sqoop, HIHO, custom file formats on top of the Hadoop API, Cascading, or cascading-dbmigrate. I imagine that you would dump the files in text format for Hadoop into HDFS;
2. Changing and enhancing the architecture, using update-only-what-changed, data grouping, or some other clever heuristics.

I am thinking about both questions. For 1., I am planning to look at every one of the tools, then prepare a section with an example on each, because that is how the whole book is constructed. For 2., I am thinking about other approaches. Essentially, you have a big matrix, and you want to compute something similar to matrix multiplication. If so, can you normalize the matrix beforehand? Or can you express this as an optimization problem: "I am trying to find the max number of best matches, according to some criteria, and do it in a reasonable time"? I would not be very happy to change the algorithm just for the purpose of optimizing the speed. At the very least, it should not be done on the first iteration, as this would be a case of premature optimization. I also wonder if graph operations, something like Pregel (Hama), can be useful here.

On the subject of kimono, your site, http://www.smarter.ie/, is about auctioning car insurance, and perhaps other types of insurance. Is it only for Europe? The site uses the Ireland domain name, ie. Also, is your real problem in insurance matching, and you just used dating as a metaphor? Why would I ask? I see in this a wonderful practical application example - and nothing beats practice - so I would like to describe it as a practical use case, in some general terms. Thus, I would like to know, so that I can be closer to reality.

Thank you. Sincerely,
Mark
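A small Python sketch of the "big matrix" view Mark describes. It assumes, purely for illustration, that each user's preferences can be written as a row of weights over a fixed list of features and each profile as a row of 0/1 feature flags. Whether the real scoring fits this form is not known from the thread. Under that assumption the full score table is the product of the preference matrix and the transposed profile matrix.

```python
def score_matrix(preferences, profiles):
    """Return S where S[i][j] is the dot product of preferences[i] and profiles[j].

    preferences: list of weight rows, one per seeking user
    profiles:    list of 0/1 feature rows, one per candidate user
    """
    return [
        [sum(w * x for w, x in zip(pref, prof)) for prof in profiles]
        for pref in preferences
    ]


def best_candidate(scores_row):
    """Index of the highest-scoring candidate in one row (first one on ties)."""
    return max(range(len(scores_row)), key=lambda j: scores_row[j])


# Hypothetical features: [non-smoker, likes hiking, same city]
preferences = [[2, 1, 0], [0, 1, 3]]
profiles = [[1, 0, 1], [1, 1, 0]]
scores = score_matrix(preferences, profiles)
print(scores)  # [[2, 3], [3, 1]]
```

Each row of the result depends only on one preference row and the full profile list, so rows can be computed independently, which is the same property that makes matrix multiplication easy to split across workers.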
-
Re: Import data from mysql
Brian McSweeney 2011-01-14, 20:24
Hi Mark,
what a very interesting email! And it sounds like you are writing a very interesting and timely book. I'm glad you enjoyed the thread. I did too :-)

I would love to help you all I can with your book, and would be fascinated to read the chapter you're writing that is related to my initial question.

With respect to my problem, actually, it has nothing to do with the insurance site I have. Smarter.ie is an Irish-focused insurance auction site; we have other similar sites in progress for other locations. However, my matching problem is unrelated to our insurance sites and was actually broadly as I described. I did withhold some of the details, and that may have made it slightly confusing, but that is because there are some commercially sensitive issues.

I would love to help you with your book and be happy to help with it as an example, but I think to go further in it we should probably discuss it off the mailing list, as there is some commercially sensitive stuff in the example, and if it were to be used in your book I would want it to be generalized. But yes, you are on the money with regards to your graph theory.

Anyway, feel free to mail me directly at my gmail address and I'd be very happy to help all I can.

kind regards and best of luck with the book!
Brian