

Re: Is my Use Case possible with Hive?
The problem with the Hive server over JDBC currently is that it does not
handle concurrent connections seamlessly and chokes on a larger number of
parallel query executions.

For this very reason, I actually wrote a pipeline-style infrastructure using
shell scripts, which ran queries one after another and could also run them
from different terminals or as background processes. (This needed more memory
for the Hive CLI client, as the CLI often went OOM when too many queries were
doing pre-query processing such as map-side joins.)
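
A minimal sketch of such a shell-script pipeline, assuming the queries sit in
numbered .hql files and the hive CLI is on the PATH (file names, paths, and
the HADOOP_HEAPSIZE value are illustrative only):

#!/bin/bash
# Give the Hive CLI a larger heap up front, since pre-query work such as
# map-side join preparation can otherwise push it OOM (value is illustrative).
export HADOOP_HEAPSIZE=2048

# Run each query file one after another; stop on the first failure.
for f in queries/query_*.hql; do
    echo "Running $f"
    hive -f "$f" || { echo "Query $f failed" >&2; exit 1; }
done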

On Tue, May 15, 2012 at 5:03 PM, Bhavesh Shah <[EMAIL PROTECTED]> wrote:

> Thanks to all for your replies.
> Just now I tried one thing, as follows:
> 1) I opened two Hive CLI sessions (hive>).
> 2) I have one query which takes 7 jobs to execute. I submitted that
> query to both CLIs.
> 3) One of the Hive CLIs took 147.319 seconds and the second one took
> 161.542 seconds.
> 4) Later I tried that query on only one CLI and it took 122.307 seconds.
>  The thing I want to ask is this: if multiple queries run in parallel, the
> total time is less than executing them one by one.
>
>   If I want to execute such parallel queries through JDBC, how can I do it?
>   I know that Hive can accept only one connection at a time, but is there
> still any way to do it?
>   Please suggest a solution for this.
>
>
> --
> Regards,
> Bhavesh Shah
>
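
For comparison, a minimal sketch of the same two-way experiment run
non-interactively, with two background hive CLI processes instead of two open
terminals and no JDBC involved (query.hql is a placeholder for the 7-job
query):

#!/bin/bash
# Launch the same query twice as background hive CLI processes and wait for
# both; total elapsed time is roughly that of the slower run, not the sum.
hive -f query.hql > run1.log 2>&1 &
hive -f query.hql > run2.log 2>&1 &
wait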
>
> On Tue, May 15, 2012 at 1:15 AM, Nanda Vijaydev <[EMAIL PROTECTED]> wrote:
>
>> Hadoop in general does well with fewer large data files instead of many
>> smaller data files. RDBMS-style indexing and run-time optimization are not
>> exactly available in Hadoop/Hive yet. So one suggestion is to combine some
>> of this data, if you can, into fewer tables while you are doing the Sqoop
>> import. Even if there is a slight redundancy it should be OK. Storage is
>> cheap and it helps during reads.
>>
>> Other suggestions given in this thread are to set the map-side and
>> reduce-side Hive optimization parameters. Querying via JDBC is generally
>> slow as well. There are certain products in the Hadoop space that allow
>> Hive querying without a JDBC interface. Give them a try and they should
>> improve performance.
>>
>> Good luck
>>
>>
>>
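
A minimal sketch of the kind of map-side and parallel-execution settings
referred to above; which of these parameters exist and help depends on the
Hive version in use, so treat both the settings and the file names as
illustrative:

#!/bin/bash
# Put the session-level tuning settings in an init file and apply them
# before running the actual query file.
cat > tuning.hql <<'EOF'
set hive.exec.parallel=true;        -- run independent stages of a query in parallel
set hive.auto.convert.join=true;    -- convert small-table joins into map-side joins
set hive.map.aggr=true;             -- partial GROUP BY aggregation on the map side
EOF
hive -i tuning.hql -f big_query.hql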
>> On Mon, May 14, 2012 at 6:17 AM, Bhavesh Shah <[EMAIL PROTECTED]> wrote:
>>
>>> Thanks, Nitin, for your continuous support.
>>> *Here is my data layout and how the queries change as needed:*
>>> 1) Initially, after importing the tables from MS SQL Server, the first
>>> basic task I do is *pivoting*, since SQL Server stores the data as
>>> name-value pairs.
>>> 2) Pivoting produces a subset of the data. Using this subset, we run
>>> complex queries on the history data and retrieve a result for each row
>>> in the subset. Then the *data is updated into the pivoted columns* (I am
>>> not using partitions; the update is done by INSERT OVERWRITE). As UPDATE
>>> is not supported, I have to do *INSERT OVERWRITE TABLE* again.
>>> 3) Likewise, I have to do this about 20-30 times (depending upon the
>>> business rules and scenario).
>>> 4) After this I have to do computations, which involve very large
>>> queries over the tables generated above (each query takes about 10-11
>>> jobs). This again repeats about 30 times.
>>>
>>> (All my queries contain CASE WHEN, GROUP BY, CAST functions, etc.)
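
A minimal sketch of the pivoting step described above, assuming a
hypothetical name-value source table raw_values(record_id, name, value) and a
pre-created target table pivoted; all table, column, and attribute names are
illustrative:

#!/bin/bash
# Pivot name/value rows into columns and rewrite the whole target table,
# since this version of Hive has no UPDATE -- INSERT OVERWRITE replaces the data.
hive -e "
INSERT OVERWRITE TABLE pivoted
SELECT record_id,
       MAX(CASE WHEN name = 'height' THEN value END) AS height,
       MAX(CASE WHEN name = 'weight' THEN value END) AS weight
FROM raw_values
GROUP BY record_id;
"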
>>>
>>> --
>>> Regards,
>>> Bhavesh Shah
>>>
>>>
>>> On Mon, May 14, 2012 at 6:05 PM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>>>
>>>> Partitioning is mainly used when you want to access a table based on
>>>> the value of a particular column and don't want to scan the entire
>>>> table for that operation. In practice this means that if there are a
>>>> few columns whose values are repeated across the records, you can
>>>> consider partitioning on them. Another approach is to partition the
>>>> data based on date/time, if applicable.
>>>>
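
A minimal sketch of the date-based partitioning suggested above; the table
layout, the event_time column, and the dynamic-partition settings are
illustrative and depend on the Hive version:

#!/bin/bash
# Build a date-partitioned copy of the history table so that later queries
# which filter on dt only read the matching partitions.
hive -e "
CREATE TABLE history_by_day (record_id BIGINT, name STRING, value STRING)
PARTITIONED BY (dt STRING);

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE history_by_day PARTITION (dt)
SELECT record_id, name, value, to_date(event_time) AS dt
FROM history;
"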
>>>> From the queries you showed, I am just seeing inserts and index
>>>> creation. Loading data into tables should not take much time, and I personally
Nitin Pawar