Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive, mail # user - Re: Is my Use Case possible with Hive?

Bhavesh Shah 2012-05-14, 07:48
Nitin Pawar 2012-05-14, 08:37
Bhavesh Shah 2012-05-14, 09:01
Nitin Pawar 2012-05-14, 09:43
Bhavesh Shah 2012-05-14, 11:08
Copy link to this message
Re: Is my Use Case possible with Hive?
Nitin Pawar 2012-05-14, 12:35
partitioning is mainly used when you want to access the table based on
value of a particular column and dont want to go through entire table for
same operation. This actually means if there are few columns whose values
are repeated in all the records, then you can consider partitioning on
them. Other approach will be partition data based on date/time if

>From the queries you showed, i am just seeing inserting and creating
indexes. loading data to tables should not take much time and I personally
have never used indexing so can not tell about that particular query
execution time.

if I understand correctly following is your execution approach

1) Import data from MS-SQL to hive using sqoop
    should be over quickly depending on how much time MS-SQL takes to
2) example of queries which you are doing on the data being dumped in hive
will be good to know if we can decide on the data layout and change the
queries as per needed if needed
3) Once query execution is over you are putting the result back in MS-SQL

can you note individually how much time each step is taking?

On Mon, May 14, 2012 at 4:38 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote:

> Hello Nitin,
> Thanks for suggesting me about the partition.
> But I want to tell one thing that I forgot to mention before is that :*
> I am using Indexes on all tables tables which are used again and again. *
> But the problem is that after execution I didn't see the difference in
> performance (before applying the index and after applying it)
> I have created the indexes as below:
> sql = "CREATE INDEX INDEX_VisitDate ON TABLE Tmp(Uid,VisitDate) as
> res2 = stmt2.executeQuery(sql);
> sql = (new StringBuilder(" INSERT OVERWRITE TABLE Tmp  select C1.Uid,
> C1.VisitDate, C1.ID from
>        TmpElementTable C1 LEFT OUTER JOIN Tmp T on C1.Uid=T.Uid and
> C1.VisitDate=T.VisitDate").toString();
> stmt2.executeUpdate(sql);
> sql = "load data inpath '/user/hive/warehouse/tmp' overwrite into table
> TmpElementTable";
> stmt2.executeUpdate(sql);
> sql = "alter index clinical_index on TmpElementTable REBUILD";
> res2 = stmt2.executeQuery(sql);
> *Did I use it in correct way?*
> As you told me told me to try with partition
> Actually I am altering the table with large number of columns at the
> runtime only.
> If i use partition in such situation then is it good to use partition for
> all columns?
> So, I want to know that After using the partition Will it be able to
> improve the performance or
> do I need to use both Partition and Indexes?
> --
> Regards,
> Bhavesh Shah
> On Mon, May 14, 2012 at 3:13 PM, Nitin Pawar <[EMAIL PROTECTED]>wrote:
>> it is definitely possible to increase your performance.
>> I have run queries where more than 10 billion records were involved.
>> If you are doing joins in your queries, you may have a look at different
>> kind of joins supported by hive.
>> If one of your table is very small in size compared to another table then
>> you may consider mapside join etc
>> Also the number of maps and reducers are decided by the split size you
>> provide to maps.
>> I would suggest before you go full speed, decide on how you want to
>> layout data for hive.
>> You can try loading some data, partition the data and write queries based
>> on partition then performance will improve but in that case your queries
>> will be in batch processing format. there are other approaches as well.
>> On Mon, May 14, 2012 at 2:31 PM, Bhavesh Shah <[EMAIL PROTECTED]>wrote:
>>> That I fail to know, how many maps and reducers are there. Because due
>>> to some reason my instance get terminated   :(
>>> I want to know one thing that If we use multiple nodes, then what should
>>> be the count of maps and reducers.
>>> Actually I am confused about that. How to decide it?
>>> Also I want to try the different properties like block size, compress
>>> output, size of in-memorybuffer, parallel execution etc.
Nitin Pawar
Bhavesh Shah 2012-05-14, 13:17
Nanda Vijaydev 2012-05-14, 19:45
Bhavesh Shah 2012-05-15, 11:33
Nitin Pawar 2012-05-15, 11:44
Justin Coffey 2012-05-14, 10:46