Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Help in debugging Hive Query


Copy link to this message
-
Re: Help in debugging Hive Query
Hi Sanjay,
 
Thanks for taking the time to write all the details. I did a silly mistake. The data type for visit_page_num, i created it as string. The string was causing issues when I am using the max function. A type cast to int in the query worked for me.
 
Regards,
Raj
________________________________
From: Sanjay Subramanian <[EMAIL PROTECTED]>
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Sent: Thursday, July 25, 2013 1:41 PM
Subject: Re: Help in debugging Hive Query

The query is correct but since u r creating a managed table , that is possibly creating some issue and the records are not all getting created

This is what I would propose

CHECKPOINT  1 : Is this query running at all ?
==================================Use this option in BOLD and run the QUERY ONLY (without any table creation) to log errors and pipe to a log file by using nohup or some other way that u prefer
hive -hiveconf hive.root.logger=INFO,console -e

select a.evar23,sum(b.max_visit_page_num) from
(select distinct visid_high,visid_low,evar23 from web.omniture_web_data) a
JOIN
(select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from omniture_web_data group by visid_high,visid_low) b
where a.visid_high=b.visid_high and a.visid_low=b.visid_low
group by a.evar23;
CHECKPOINT 2 : Run the query (using the CREATE TABLE option) with these additional options
==========================================================================Required params:
------------------------
SET mapreduce.job.maps=500; 
SET mapreduce.job.reduces=8; 
SET mapreduce.tasktracker.map.tasks.maximum=12; 
SET mapreduce.tasktracker.reduce.tasks.maximum=8; 
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec; 
SET mapreduce.map.output.compress=true; 
Optional params:
-----------------------
If u r using compression in output , use the following ; u can change the LzoCodec to whatever u r using for compression 
SET hive.exec.compress.intermediate=true; 
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec; 
SET mapreduce.output.fileoutputformat.compress=true; 
Thanks

Sanjay

From: Raj Hadoop <[EMAIL PROTECTED]>
Reply-To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>, Raj Hadoop <[EMAIL PROTECTED]>
Date: Thursday, July 25, 2013 5:00 AM
To: Hive <[EMAIL PROTECTED]>
Subject: Help in debugging Hive Query
All,

I am trying to determine visits for customer from omniture weblog file using Hive.

Table: omniture_web_data
Columns: visid_high,visid_low,evar23,visit_page_num

Sample Data:
visid_high,visid_low,evar23,visit_page_num
999,888,1003,10
999,888,1003,14
999,888,1003,6
999,777,1003,12
999,777,1003,20

I want to calculate for each Customer Number ( evar23 is  Customer Number ) , total visits. visid_high and visid_low determines a unique visit.
For each distinct visitor, calculate sum of maximum visit_page_num. In above example

14 + 20 = 34 should be the total visits for the customer 1003.

I am trying to run the following queries - Method 1 is almost the same as Method 2. Except in Method 1 I only choose a particualr customer number 1003. In method 2 , i generalized to all.

In Method 1 , I am getting the accurate result. In metnhod 2 , I am not getting the same result as Method 1.

Any suggestions on how to trouble shoot. ALso, any alternative approaches.

// Method 1
select a.evar23,sum(b.max_visit_page_num) from
(select distinct visid_high,visid_low,evar23 from web.omniture_web_data where evar23='1003') a
JOIN
(select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from omniture_web_data where evar23='1003' group by visid_high,visid_low) b
where a.visid_high=b.visid_high and a.visid_low=b.visid_low
group by a.evar23;

///// Result of Method 1

1003    34

// Method 2

create table temp123 as
select a.evar23,sum(b.max_visit_page_num) from
(select distinct visid_high,visid_low,evar23 from web.omniture_web_data) a
JOIN
(select visid_high,visid_low,max(visit_page_num) as max_visit_page_num from omniture_web_data group by visid_high,visid_low) b
where a.visid_high=b.visid_high and a.visid_low=b.visid_low
group by a.evar23;

select * from temp123 where evar23='1003';

// The Result of Method 2 is not the same as Method 1. It is showing a different number.

Thanks,
Raj

 

CONFIDENTIALITY NOTICE
=====================This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB