|
Bhavesh Shah
2012-05-14, 07:48
Nitin Pawar
2012-05-14, 08:37
Bhavesh Shah
2012-05-14, 09:01
Nitin Pawar
2012-05-14, 09:43
Justin Coffey
2012-05-14, 10:46
Bhavesh Shah
2012-05-14, 11:08
Nitin Pawar
2012-05-14, 12:35
Bhavesh Shah
2012-05-14, 13:17
Nanda Vijaydev
2012-05-14, 19:45
Bhavesh Shah
2012-05-15, 11:33
Nitin Pawar
2012-05-15, 11:44
|
-
Re: Is my Use Case possible with Hive?Bhavesh Shah 2012-05-14, 07:48
I have near about 1 billion records in my relational database.
Currently locally I am using just one cluster. But I also tried this on Amazon Elastic Mapreduce with 10 nodes. But the time taken to execute the complete program is same as that on my single local machine. On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote: > how many # records? > > what is your hadoop cluster setup? how many nodes? > if you are running hadoop on a single node setup with normal desktop, i > doubt it will be of any help. > > You need a stronger cluster setup for better query runtimes and ofcourse > query optimization which I guess you would have already taken care. > > > > On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: > >> Hello all, >> My Use Case is: >> 1) I have a relational database which has a very large data. (MS SQL >> Server) >> 2) I want to do analysis on these huge data and want to generate reports >> on it after analysis. >> Like this I have to generate various reports based on different analysis. >> >> I tried to implement this using Hive. What I did is: >> 1) I imported all tables in Hive from MS SQL Server using SQOOP. >> 2) I wrote many queries in Hive which is executing using JDBC on Hive >> Thrift Server >> 3) I am getting the correct result in table form, which I am expecting >> 4) But the problem is that the time which require to execute is too much >> long. >> (My complete program is executing in near about 3-4 hours on *small >> amount of data*). >> >> >> I decided to do this using Hive. >> And as I told previously how much time Hive consumed for execution. my >> organization is expecting to complete this task in near about less than >> 1/2 hours >> >> Now after spending too much time for complete execution for this task what >> should I do? >> I want to ask one thing that: >> *Is this Use Case is possible with Hive?* If possible what should I do in >> >> my program to increase the performance? >> *And If not possible what is the other good way to implement this Use >> Case?* >> >> >> Please reply me. >> Thanks >> >> >> -- >> Regards, >> Bhavesh Shah >> > > > > -- > Nitin Pawar > > -- Regards, Bhavesh Shah
-
Re: Is my Use Case possible with Hive?Nitin Pawar 2012-05-14, 08:37
with a 10 node cluster the performance should improve.
how many maps and reducers are being launched? On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: > I have near about 1 billion records in my relational database. > Currently locally I am using just one cluster. But I also tried this on > Amazon Elastic Mapreduce with 10 nodes. But the time taken to execute the > complete program is same as that on my single local machine. > > > On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote: > >> how many # records? >> >> what is your hadoop cluster setup? how many nodes? >> if you are running hadoop on a single node setup with normal desktop, i >> doubt it will be of any help. >> >> You need a stronger cluster setup for better query runtimes and ofcourse >> query optimization which I guess you would have already taken care. >> >> >> >> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: >> >>> Hello all, >>> My Use Case is: >>> 1) I have a relational database which has a very large data. (MS SQL >>> Server) >>> 2) I want to do analysis on these huge data and want to generate reports >>> on it after analysis. >>> Like this I have to generate various reports based on different analysis. >>> >>> I tried to implement this using Hive. What I did is: >>> 1) I imported all tables in Hive from MS SQL Server using SQOOP. >>> 2) I wrote many queries in Hive which is executing using JDBC on Hive >>> Thrift Server >>> 3) I am getting the correct result in table form, which I am expecting >>> 4) But the problem is that the time which require to execute is too much >>> long. >>> (My complete program is executing in near about 3-4 hours on *small >>> amount of data*). >>> >>> >>> I decided to do this using Hive. >>> And as I told previously how much time Hive consumed for execution. >>> my >>> organization is expecting to complete this task in near about less than >>> 1/2 hours >>> >>> Now after spending too much time for complete execution for this task >>> what >>> should I do? >>> I want to ask one thing that: >>> *Is this Use Case is possible with Hive?* If possible what should I do in >>> >>> my program to increase the performance? >>> *And If not possible what is the other good way to implement this Use >>> Case?* >>> >>> >>> Please reply me. >>> Thanks >>> >>> >>> -- >>> Regards, >>> Bhavesh Shah >>> >> >> >> >> -- >> Nitin Pawar >> >> > > > -- > Regards, > Bhavesh Shah > > -- Nitin Pawar
-
Re: Is my Use Case possible with Hive?Bhavesh Shah 2012-05-14, 09:01
That I fail to know, how many maps and reducers are there. Because due to
some reason my instance get terminated :( I want to know one thing that If we use multiple nodes, then what should be the count of maps and reducers. Actually I am confused about that. How to decide it? Also I want to try the different properties like block size, compress output, size of in-memorybuffer, parallel execution etc. Will these all properties matters to increase the performance? Nitin, you have read all my use case. Whatever the thing I did to implement with the help of Hadoop is correct? Is it possible to increase the performance? Thanks Nitin for your reply. :) -- Regards, Bhavesh Shah On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote: > with a 10 node cluster the performance should improve. > how many maps and reducers are being launched? > > > On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: > >> I have near about 1 billion records in my relational database. >> Currently locally I am using just one cluster. But I also tried this on >> Amazon Elastic Mapreduce with 10 nodes. But the time taken to execute the >> complete program is same as that on my single local machine. >> >> >> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote: >> >>> how many # records? >>> >>> what is your hadoop cluster setup? how many nodes? >>> if you are running hadoop on a single node setup with normal desktop, i >>> doubt it will be of any help. >>> >>> You need a stronger cluster setup for better query runtimes and ofcourse >>> query optimization which I guess you would have already taken care. >>> >>> >>> >>> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: >>> >>>> Hello all, >>>> My Use Case is: >>>> 1) I have a relational database which has a very large data. (MS SQL >>>> Server) >>>> 2) I want to do analysis on these huge data and want to generate >>>> reports >>>> on it after analysis. >>>> Like this I have to generate various reports based on different >>>> analysis. >>>> >>>> I tried to implement this using Hive. What I did is: >>>> 1) I imported all tables in Hive from MS SQL Server using SQOOP. >>>> 2) I wrote many queries in Hive which is executing using JDBC on Hive >>>> Thrift Server >>>> 3) I am getting the correct result in table form, which I am expecting >>>> 4) But the problem is that the time which require to execute is too much >>>> long. >>>> (My complete program is executing in near about 3-4 hours on *small >>>> amount of data*). >>>> >>>> >>>> I decided to do this using Hive. >>>> And as I told previously how much time Hive consumed for execution. >>>> my >>>> organization is expecting to complete this task in near about less than >>>> 1/2 hours >>>> >>>> Now after spending too much time for complete execution for this task >>>> what >>>> should I do? >>>> I want to ask one thing that: >>>> *Is this Use Case is possible with Hive?* If possible what should I do >>>> in >>>> >>>> my program to increase the performance? >>>> *And If not possible what is the other good way to implement this Use >>>> Case?* >>>> >>>> >>>> Please reply me. >>>> Thanks >>>> >>>> >>>> -- >>>> Regards, >>>> Bhavesh Shah >>>> >>> >>> >>> >>> -- >>> Nitin Pawar >>> >>> >> >> >> -- >> Regards, >> Bhavesh Shah >> >> > > > -- > Nitin Pawar > >
-
Re: Is my Use Case possible with Hive?Nitin Pawar 2012-05-14, 09:43
it is definitely possible to increase your performance.
I have run queries where more than 10 billion records were involved. If you are doing joins in your queries, you may have a look at different kind of joins supported by hive. If one of your table is very small in size compared to another table then you may consider mapside join etc Also the number of maps and reducers are decided by the split size you provide to maps. I would suggest before you go full speed, decide on how you want to layout data for hive. You can try loading some data, partition the data and write queries based on partition then performance will improve but in that case your queries will be in batch processing format. there are other approaches as well. On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: > That I fail to know, how many maps and reducers are there. Because due to > some reason my instance get terminated :( > I want to know one thing that If we use multiple nodes, then what should > be the count of maps and reducers. > Actually I am confused about that. How to decide it? > > Also I want to try the different properties like block size, compress > output, size of in-memorybuffer, parallel execution etc. > Will these all properties matters to increase the performance? > > Nitin, you have read all my use case. Whatever the thing I did to > implement with the help of Hadoop is correct? > Is it possible to increase the performance? > > Thanks Nitin for your reply. :) > > -- > Regards, > Bhavesh Shah > > > On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote: > >> with a 10 node cluster the performance should improve. >> how many maps and reducers are being launched? >> >> >> On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: >> >>> I have near about 1 billion records in my relational database. >>> Currently locally I am using just one cluster. But I also tried this on >>> Amazon Elastic Mapreduce with 10 nodes. But the time taken to execute the >>> complete program is same as that on my single local machine. >>> >>> >>> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote: >>> >>>> how many # records? >>>> >>>> what is your hadoop cluster setup? how many nodes? >>>> if you are running hadoop on a single node setup with normal desktop, i >>>> doubt it will be of any help. >>>> >>>> You need a stronger cluster setup for better query runtimes and >>>> ofcourse query optimization which I guess you would have already taken care. >>>> >>>> >>>> >>>> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah <[EMAIL PROTECTED] >>>> > wrote: >>>> >>>>> Hello all, >>>>> My Use Case is: >>>>> 1) I have a relational database which has a very large data. (MS SQL >>>>> Server) >>>>> 2) I want to do analysis on these huge data and want to generate >>>>> reports >>>>> on it after analysis. >>>>> Like this I have to generate various reports based on different >>>>> analysis. >>>>> >>>>> I tried to implement this using Hive. What I did is: >>>>> 1) I imported all tables in Hive from MS SQL Server using SQOOP. >>>>> 2) I wrote many queries in Hive which is executing using JDBC on Hive >>>>> Thrift Server >>>>> 3) I am getting the correct result in table form, which I am expecting >>>>> 4) But the problem is that the time which require to execute is too >>>>> much >>>>> long. >>>>> (My complete program is executing in near about 3-4 hours on *small >>>>> amount of data*). >>>>> >>>>> >>>>> I decided to do this using Hive. >>>>> And as I told previously how much time Hive consumed for >>>>> execution. my >>>>> organization is expecting to complete this task in near about less than >>>>> 1/2 hours >>>>> >>>>> Now after spending too much time for complete execution for this task >>>>> what >>>>> should I do? >>>>> I want to ask one thing that: >>>>> *Is this Use Case is possible with Hive?* If possible what should I do >>>>> in >>>>> >>>>> my program to increase the performance? Nitin Pawar
-
Re: Is my Use Case possible with Hive?Justin Coffey 2012-05-14, 10:46
You can also have a reduce-side bottleneck if, for example, you are doing
distinct counts or with skewed group sizes (ie one aggregation group is much larger than others). But to know this you really need to look at the stats of your jobs via the jobtracker and even the progress counter output of hive. If you do see one reducer getting all the work you can try setting hive.groupby.skewindata=true. -Justin On Mon, May 14, 2012 at 11:43 AM, Nitin Pawar <[EMAIL PROTECTED]>wrote: > it is definitely possible to increase your performance. > > I have run queries where more than 10 billion records were involved. > If you are doing joins in your queries, you may have a look at different > kind of joins supported by hive. > If one of your table is very small in size compared to another table then > you may consider mapside join etc > > Also the number of maps and reducers are decided by the split size you > provide to maps. > > I would suggest before you go full speed, decide on how you want to layout > data for hive. > > You can try loading some data, partition the data and write queries based > on partition then performance will improve but in that case your queries > will be in batch processing format. there are other approaches as well. > > > On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: > >> That I fail to know, how many maps and reducers are there. Because due to >> some reason my instance get terminated :( >> I want to know one thing that If we use multiple nodes, then what should >> be the count of maps and reducers. >> Actually I am confused about that. How to decide it? >> >> Also I want to try the different properties like block size, compress >> output, size of in-memorybuffer, parallel execution etc. >> Will these all properties matters to increase the performance? >> >> Nitin, you have read all my use case. Whatever the thing I did to >> implement with the help of Hadoop is correct? >> Is it possible to increase the performance? >> >> Thanks Nitin for your reply. :) >> >> -- >> Regards, >> Bhavesh Shah >> >> >> On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote: >> >>> with a 10 node cluster the performance should improve. >>> how many maps and reducers are being launched? >>> >>> >>> On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: >>> >>>> I have near about 1 billion records in my relational database. >>>> Currently locally I am using just one cluster. But I also tried this on >>>> Amazon Elastic Mapreduce with 10 nodes. But the time taken to execute the >>>> complete program is same as that on my single local machine. >>>> >>>> >>>> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote: >>>> >>>>> how many # records? >>>>> >>>>> what is your hadoop cluster setup? how many nodes? >>>>> if you are running hadoop on a single node setup with normal desktop, >>>>> i doubt it will be of any help. >>>>> >>>>> You need a stronger cluster setup for better query runtimes and >>>>> ofcourse query optimization which I guess you would have already taken care. >>>>> >>>>> >>>>> >>>>> On Mon, May 14, 2012 at 12:39 PM, Bhavesh Shah < >>>>> [EMAIL PROTECTED]> wrote: >>>>> >>>>>> Hello all, >>>>>> My Use Case is: >>>>>> 1) I have a relational database which has a very large data. (MS SQL >>>>>> Server) >>>>>> 2) I want to do analysis on these huge data and want to generate >>>>>> reports >>>>>> on it after analysis. >>>>>> Like this I have to generate various reports based on different >>>>>> analysis. >>>>>> >>>>>> I tried to implement this using Hive. What I did is: >>>>>> 1) I imported all tables in Hive from MS SQL Server using SQOOP. >>>>>> 2) I wrote many queries in Hive which is executing using JDBC on Hive >>>>>> Thrift Server >>>>>> 3) I am getting the correct result in table form, which I am expecting >>>>>> 4) But the problem is that the time which require to execute is too >>>>>> much >>>>>> long. [EMAIL PROTECTED]
-
Re: Is my Use Case possible with Hive?Bhavesh Shah 2012-05-14, 11:08
Hello Nitin,
Thanks for suggesting me about the partition. But I want to tell one thing that I forgot to mention before is that :* I am using Indexes on all tables tables which are used again and again. * But the problem is that after execution I didn't see the difference in performance (before applying the index and after applying it) I have created the indexes as below: sql = "CREATE INDEX INDEX_VisitDate ON TABLE Tmp(Uid,VisitDate) as 'COMPACT' WITH DEFERRED REBUILD stored as RCFILE"; res2 = stmt2.executeQuery(sql); sql = (new StringBuilder(" INSERT OVERWRITE TABLE Tmp select C1.Uid, C1.VisitDate, C1.ID from TmpElementTable C1 LEFT OUTER JOIN Tmp T on C1.Uid=T.Uid and C1.VisitDate=T.VisitDate").toString(); stmt2.executeUpdate(sql); sql = "load data inpath '/user/hive/warehouse/tmp' overwrite into table TmpElementTable"; stmt2.executeUpdate(sql); sql = "alter index clinical_index on TmpElementTable REBUILD"; res2 = stmt2.executeQuery(sql); *Did I use it in correct way?* As you told me told me to try with partition Actually I am altering the table with large number of columns at the runtime only. If i use partition in such situation then is it good to use partition for all columns? So, I want to know that After using the partition Will it be able to improve the performance or do I need to use both Partition and Indexes? -- Regards, Bhavesh Shah On Mon, May 14, 2012 at 3:13 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote: > it is definitely possible to increase your performance. > > I have run queries where more than 10 billion records were involved. > If you are doing joins in your queries, you may have a look at different > kind of joins supported by hive. > If one of your table is very small in size compared to another table then > you may consider mapside join etc > > Also the number of maps and reducers are decided by the split size you > provide to maps. > > I would suggest before you go full speed, decide on how you want to layout > data for hive. > > You can try loading some data, partition the data and write queries based > on partition then performance will improve but in that case your queries > will be in batch processing format. there are other approaches as well. > > > On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: > >> That I fail to know, how many maps and reducers are there. Because due to >> some reason my instance get terminated :( >> I want to know one thing that If we use multiple nodes, then what should >> be the count of maps and reducers. >> Actually I am confused about that. How to decide it? >> >> Also I want to try the different properties like block size, compress >> output, size of in-memorybuffer, parallel execution etc. >> Will these all properties matters to increase the performance? >> >> Nitin, you have read all my use case. Whatever the thing I did to >> implement with the help of Hadoop is correct? >> Is it possible to increase the performance? >> >> Thanks Nitin for your reply. :) >> >> -- >> Regards, >> Bhavesh Shah >> >> >> On Mon, May 14, 2012 at 2:07 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote: >> >>> with a 10 node cluster the performance should improve. >>> how many maps and reducers are being launched? >>> >>> >>> On Mon, May 14, 2012 at 1:18 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: >>> >>>> I have near about 1 billion records in my relational database. >>>> Currently locally I am using just one cluster. But I also tried this on >>>> Amazon Elastic Mapreduce with 10 nodes. But the time taken to execute the >>>> complete program is same as that on my single local machine. >>>> >>>> >>>> On Mon, May 14, 2012 at 1:13 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote: >>>> >>>>> how many # records? >>>>> >>>>> what is your hadoop cluster setup? how many nodes? >>>>> if you are running hadoop on a single node setup with normal desktop, >>>>> i doubt it will be of any help. >>>>> >>>>> You need a stronger cluster setup for better query runtimes and
-
Re: Is my Use Case possible with Hive?Nitin Pawar 2012-05-14, 12:35
partitioning is mainly used when you want to access the table based on
value of a particular column and dont want to go through entire table for same operation. This actually means if there are few columns whose values are repeated in all the records, then you can consider partitioning on them. Other approach will be partition data based on date/time if applicable. >From the queries you showed, i am just seeing inserting and creating indexes. loading data to tables should not take much time and I personally have never used indexing so can not tell about that particular query execution time. if I understand correctly following is your execution approach 1) Import data from MS-SQL to hive using sqoop should be over quickly depending on how much time MS-SQL takes to export 2) example of queries which you are doing on the data being dumped in hive will be good to know if we can decide on the data layout and change the queries as per needed if needed 3) Once query execution is over you are putting the result back in MS-SQL can you note individually how much time each step is taking? On Mon, May 14, 2012 at 4:38 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: > Hello Nitin, > Thanks for suggesting me about the partition. > But I want to tell one thing that I forgot to mention before is that :* > I am using Indexes on all tables tables which are used again and again. * > But the problem is that after execution I didn't see the difference in > performance (before applying the index and after applying it) > I have created the indexes as below: > sql = "CREATE INDEX INDEX_VisitDate ON TABLE Tmp(Uid,VisitDate) as > 'COMPACT' WITH DEFERRED REBUILD stored as RCFILE"; > res2 = stmt2.executeQuery(sql); > sql = (new StringBuilder(" INSERT OVERWRITE TABLE Tmp select C1.Uid, > C1.VisitDate, C1.ID from > TmpElementTable C1 LEFT OUTER JOIN Tmp T on C1.Uid=T.Uid and > C1.VisitDate=T.VisitDate").toString(); > stmt2.executeUpdate(sql); > sql = "load data inpath '/user/hive/warehouse/tmp' overwrite into table > TmpElementTable"; > stmt2.executeUpdate(sql); > sql = "alter index clinical_index on TmpElementTable REBUILD"; > res2 = stmt2.executeQuery(sql); > *Did I use it in correct way?* > > As you told me told me to try with partition > Actually I am altering the table with large number of columns at the > runtime only. > If i use partition in such situation then is it good to use partition for > all columns? > > So, I want to know that After using the partition Will it be able to > improve the performance or > do I need to use both Partition and Indexes? > > > > > -- > Regards, > Bhavesh Shah > > > On Mon, May 14, 2012 at 3:13 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote: > >> it is definitely possible to increase your performance. >> >> I have run queries where more than 10 billion records were involved. >> If you are doing joins in your queries, you may have a look at different >> kind of joins supported by hive. >> If one of your table is very small in size compared to another table then >> you may consider mapside join etc >> >> Also the number of maps and reducers are decided by the split size you >> provide to maps. >> >> I would suggest before you go full speed, decide on how you want to >> layout data for hive. >> >> You can try loading some data, partition the data and write queries based >> on partition then performance will improve but in that case your queries >> will be in batch processing format. there are other approaches as well. >> >> >> On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: >> >>> That I fail to know, how many maps and reducers are there. Because due >>> to some reason my instance get terminated :( >>> I want to know one thing that If we use multiple nodes, then what should >>> be the count of maps and reducers. >>> Actually I am confused about that. How to decide it? >>> >>> Also I want to try the different properties like block size, compress >>> output, size of in-memorybuffer, parallel execution etc. Nitin Pawar
-
Re: Is my Use Case possible with Hive?Bhavesh Shah 2012-05-14, 13:17
Thanks Nitin for your continous support.
*Here is my data layout and change the queries as per needed*: 1) Initially after importing the tables from MS SQL Server, 1st basic task I am doing is that *PIVOTING.* As SQL stores data in name value pair. 2) Pivoting results in subset of data, Using this subset we are running complex queries on history data and retrieves result for each row in subset. again *data is updated into pivoted columns*. (I am not using partition. updated by INSERT OVERWRITE) As update is not supporting, I have to again do *INSERT OVERWRITE TABLE *3) Likewise I have to do near about 20-30 times. (Depends upon Business rules and scenario if needed to Business rules) 4) After this I have to do computation which has very large queries from above generated tables. (Each query has near about 10-11 jobs query) This again repeats for 30 times. (My all queries contains - case when, group by, cast function, etc ) -- Regards, Bhavesh Shah On Mon, May 14, 2012 at 6:05 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote: > partitioning is mainly used when you want to access the table based on > value of a particular column and dont want to go through entire table for > same operation. This actually means if there are few columns whose values > are repeated in all the records, then you can consider partitioning on > them. Other approach will be partition data based on date/time if > applicable. > > From the queries you showed, i am just seeing inserting and creating > indexes. loading data to tables should not take much time and I personally > have never used indexing so can not tell about that particular query > execution time. > > if I understand correctly following is your execution approach > > 1) Import data from MS-SQL to hive using sqoop > should be over quickly depending on how much time MS-SQL takes to > export > 2) example of queries which you are doing on the data being dumped in hive > will be good to know if we can decide on the data layout and change the > queries as per needed if needed > 3) Once query execution is over you are putting the result back in MS-SQL > > can you note individually how much time each step is taking? > > > On Mon, May 14, 2012 at 4:38 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: > >> Hello Nitin, >> Thanks for suggesting me about the partition. >> But I want to tell one thing that I forgot to mention before is that :* >> I am using Indexes on all tables tables which are used again and again. * >> But the problem is that after execution I didn't see the difference in >> performance (before applying the index and after applying it) >> I have created the indexes as below: >> sql = "CREATE INDEX INDEX_VisitDate ON TABLE Tmp(Uid,VisitDate) as >> 'COMPACT' WITH DEFERRED REBUILD stored as RCFILE"; >> res2 = stmt2.executeQuery(sql); >> sql = (new StringBuilder(" INSERT OVERWRITE TABLE Tmp select C1.Uid, >> C1.VisitDate, C1.ID from >> TmpElementTable C1 LEFT OUTER JOIN Tmp T on C1.Uid=T.Uid and >> C1.VisitDate=T.VisitDate").toString(); >> stmt2.executeUpdate(sql); >> sql = "load data inpath '/user/hive/warehouse/tmp' overwrite into table >> TmpElementTable"; >> stmt2.executeUpdate(sql); >> sql = "alter index clinical_index on TmpElementTable REBUILD"; >> res2 = stmt2.executeQuery(sql); >> *Did I use it in correct way?* >> >> As you told me told me to try with partition >> Actually I am altering the table with large number of columns at the >> runtime only. >> If i use partition in such situation then is it good to use partition for >> all columns? >> >> So, I want to know that After using the partition Will it be able to >> improve the performance or >> do I need to use both Partition and Indexes? >> >> >> >> >> -- >> Regards, >> Bhavesh Shah >> >> >> On Mon, May 14, 2012 at 3:13 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote: >> >>> it is definitely possible to increase your performance. >>> >>> I have run queries where more than 10 billion records were involved.
-
Re: Is my Use Case possible with Hive?Nanda Vijaydev 2012-05-14, 19:45
Hadoop in general does well with fewer large data files instead of more
smaller data files. RDBMS type of indexing and run time optimization is not exactly available in Hadoop/Hive yet. So one suggestion is to combine some of this data, if you can, into fewer tables as you are doing sqoop. Even if there is a slight redundancy it should be OK. Storage is cheap and helps during read. Other suggestions as given in this thread is to set map side and reduce side hive optimization parameters. Querying via jdbc is generally slow as well. There are certain products in Hadoop space that allow for hive querying without jdbc interface. Give it a try and it should improve performance. Good luck On Mon, May 14, 2012 at 6:17 AM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: > Thanks Nitin for your continous support. > *Here is my data layout and change the queries as per needed*: > 1) Initially after importing the tables from MS SQL Server, 1st basic task > I am doing is that *PIVOTING.* > As SQL stores data in name value pair. > 2) Pivoting results in subset of data, Using this subset we are running > complex queries on history data and retrieves result for each row in > subset. > again *data is updated into pivoted columns*. (I am not using > partition. updated by INSERT OVERWRITE) > As update is not supporting, I have to again do *INSERT OVERWRITE > TABLE > *3) Likewise I have to do near about 20-30 times. (Depends upon Business > rules and scenario if needed to Business rules) > 4) After this I have to do computation which has very large queries from > above generated tables. > (Each query has near about 10-11 jobs query) > This again repeats for 30 times. > > (My all queries contains - case when, group by, cast function, etc ) > > -- > Regards, > Bhavesh Shah > > > On Mon, May 14, 2012 at 6:05 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote: > >> partitioning is mainly used when you want to access the table based on >> value of a particular column and dont want to go through entire table for >> same operation. This actually means if there are few columns whose values >> are repeated in all the records, then you can consider partitioning on >> them. Other approach will be partition data based on date/time if >> applicable. >> >> From the queries you showed, i am just seeing inserting and creating >> indexes. loading data to tables should not take much time and I personally >> have never used indexing so can not tell about that particular query >> execution time. >> >> if I understand correctly following is your execution approach >> >> 1) Import data from MS-SQL to hive using sqoop >> should be over quickly depending on how much time MS-SQL takes to >> export >> 2) example of queries which you are doing on the data being dumped in >> hive will be good to know if we can decide on the data layout and change >> the queries as per needed if needed >> 3) Once query execution is over you are putting the result back in MS-SQL >> >> can you note individually how much time each step is taking? >> >> >> On Mon, May 14, 2012 at 4:38 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: >> >>> Hello Nitin, >>> Thanks for suggesting me about the partition. >>> But I want to tell one thing that I forgot to mention before is that :* >>> I am using Indexes on all tables tables which are used again and again. >>> * >>> But the problem is that after execution I didn't see the difference in >>> performance (before applying the index and after applying it) >>> I have created the indexes as below: >>> sql = "CREATE INDEX INDEX_VisitDate ON TABLE Tmp(Uid,VisitDate) as >>> 'COMPACT' WITH DEFERRED REBUILD stored as RCFILE"; >>> res2 = stmt2.executeQuery(sql); >>> sql = (new StringBuilder(" INSERT OVERWRITE TABLE Tmp select C1.Uid, >>> C1.VisitDate, C1.ID from >>> TmpElementTable C1 LEFT OUTER JOIN Tmp T on C1.Uid=T.Uid and >>> C1.VisitDate=T.VisitDate").toString(); >>> stmt2.executeUpdate(sql); >>> sql = "load data inpath '/user/hive/warehouse/tmp' overwrite into table
-
Re: Is my Use Case possible with Hive?Bhavesh Shah 2012-05-15, 11:33
Thanks all for their replies.
Just now I tried one thing that as folows: 1) I open tho two hive CLI. hive> 2) I have one query which takes 7 jobs for execution. I submitted that query to both the CLI. 3) one of the hive CLI took 147.319 seconds and second one took: 161.542 seconds 4) Later I tried that query only on one CLI and it took 122.307 seconds The thing what I want to ask is this, if multiple query runs parallel it takes less time to execute compare to execute one by one. If I want to execute such parallel queries through JDBC, how can I do it. I know that hive can accept at a time one connection. But still is there any way to so it? Pls suggest me some solution for this. -- Regards, Bhavesh Shah On Tue, May 15, 2012 at 1:15 AM, Nanda Vijaydev <[EMAIL PROTECTED]>wrote: > Hadoop in general does well with fewer large data files instead of more > smaller data files. RDBMS type of indexing and run time optimization is not > exactly available in Hadoop/Hive yet. So one suggestion is to combine some > of this data, if you can, into fewer tables as you are doing sqoop. Even if > there is a slight redundancy it should be OK. Storage is cheap and helps > during read. > > Other suggestions as given in this thread is to set map side and reduce > side hive optimization parameters. Querying via jdbc is generally slow as > well. There are certain products in Hadoop space that allow for hive > querying without jdbc interface. Give it a try and it should improve > performance. > > Good luck > > > > On Mon, May 14, 2012 at 6:17 AM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: > >> Thanks Nitin for your continous support. >> *Here is my data layout and change the queries as per needed*: >> 1) Initially after importing the tables from MS SQL Server, 1st basic >> task I am doing is that *PIVOTING.* >> As SQL stores data in name value pair. >> 2) Pivoting results in subset of data, Using this subset we are running >> complex queries on history data and retrieves result for each row in >> subset. >> again *data is updated into pivoted columns*. (I am not using >> partition. updated by INSERT OVERWRITE) >> As update is not supporting, I have to again do *INSERT OVERWRITE >> TABLE >> *3) Likewise I have to do near about 20-30 times. (Depends upon Business >> rules and scenario if needed to Business rules) >> 4) After this I have to do computation which has very large queries from >> above generated tables. >> (Each query has near about 10-11 jobs query) >> This again repeats for 30 times. >> >> (My all queries contains - case when, group by, cast function, etc ) >> >> -- >> Regards, >> Bhavesh Shah >> >> >> On Mon, May 14, 2012 at 6:05 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote: >> >>> partitioning is mainly used when you want to access the table based on >>> value of a particular column and dont want to go through entire table for >>> same operation. This actually means if there are few columns whose values >>> are repeated in all the records, then you can consider partitioning on >>> them. Other approach will be partition data based on date/time if >>> applicable. >>> >>> From the queries you showed, i am just seeing inserting and creating >>> indexes. loading data to tables should not take much time and I personally >>> have never used indexing so can not tell about that particular query >>> execution time. >>> >>> if I understand correctly following is your execution approach >>> >>> 1) Import data from MS-SQL to hive using sqoop >>> should be over quickly depending on how much time MS-SQL takes to >>> export >>> 2) example of queries which you are doing on the data being dumped in >>> hive will be good to know if we can decide on the data layout and change >>> the queries as per needed if needed >>> 3) Once query execution is over you are putting the result back in >>> MS-SQL >>> >>> can you note individually how much time each step is taking? >>> >>> >>> On Mon, May 14, 2012 at 4:38 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote:
-
Re: Is my Use Case possible with Hive?Nitin Pawar 2012-05-15, 11:44
the problem with hive server with jdbc currently is that it does not handle
concurrent connection in a seamless manner and chokes down on larger number of parallel query executions. For this one reason, I had actually written a pipeline kind of infra using shell scripts which used to run queries after one another and used to run them from different terminals or run them as background processes (but this needed a larger memory on hive client cli as lot of times hive cli went OOM as too many queries were doing some pre query processing (like mapside joins etc) On Tue, May 15, 2012 at 5:03 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: > Thanks all for their replies. > Just now I tried one thing that as folows: > 1) I open tho two hive CLI. hive> > 2) I have one query which takes 7 jobs for execution. I submitted that > query to both the CLI. > 3) one of the hive CLI took 147.319 seconds and second one took: 161.542 > seconds > 4) Later I tried that query only on one CLI and it took 122.307 seconds > The thing what I want to ask is this, if multiple query runs parallel it > takes less time to execute compare to execute one by one. > > If I want to execute such parallel queries through JDBC, how can I do it. > I know that hive can accept at a time one connection. But still is > there any way to so it? > Pls suggest me some solution for this. > > > -- > Regards, > Bhavesh Shah > > > On Tue, May 15, 2012 at 1:15 AM, Nanda Vijaydev <[EMAIL PROTECTED]>wrote: > >> Hadoop in general does well with fewer large data files instead of more >> smaller data files. RDBMS type of indexing and run time optimization is not >> exactly available in Hadoop/Hive yet. So one suggestion is to combine some >> of this data, if you can, into fewer tables as you are doing sqoop. Even if >> there is a slight redundancy it should be OK. Storage is cheap and helps >> during read. >> >> Other suggestions as given in this thread is to set map side and reduce >> side hive optimization parameters. Querying via jdbc is generally slow as >> well. There are certain products in Hadoop space that allow for hive >> querying without jdbc interface. Give it a try and it should improve >> performance. >> >> Good luck >> >> >> >> On Mon, May 14, 2012 at 6:17 AM, Bhavesh Shah <[EMAIL PROTECTED]>wrote: >> >>> Thanks Nitin for your continous support. >>> *Here is my data layout and change the queries as per needed*: >>> 1) Initially after importing the tables from MS SQL Server, 1st basic >>> task I am doing is that *PIVOTING.* >>> As SQL stores data in name value pair. >>> 2) Pivoting results in subset of data, Using this subset we are running >>> complex queries on history data and retrieves result for each row in >>> subset. >>> again *data is updated into pivoted columns*. (I am not using >>> partition. updated by INSERT OVERWRITE) >>> As update is not supporting, I have to again do *INSERT OVERWRITE >>> TABLE >>> *3) Likewise I have to do near about 20-30 times. (Depends upon >>> Business rules and scenario if needed to Business rules) >>> 4) After this I have to do computation which has very large queries from >>> above generated tables. >>> (Each query has near about 10-11 jobs query) >>> This again repeats for 30 times. >>> >>> (My all queries contains - case when, group by, cast function, etc ) >>> >>> -- >>> Regards, >>> Bhavesh Shah >>> >>> >>> On Mon, May 14, 2012 at 6:05 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote: >>> >>>> partitioning is mainly used when you want to access the table based on >>>> value of a particular column and dont want to go through entire table for >>>> same operation. This actually means if there are few columns whose values >>>> are repeated in all the records, then you can consider partitioning on >>>> them. Other approach will be partition data based on date/time if >>>> applicable. >>>> >>>> From the queries you showed, i am just seeing inserting and creating >>>> indexes. loading data to tables should not take much time and I personally Nitin Pawar |