Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> question..


Copy link to this message
-
RE: question..
Can you decompose the problem a bit .e.g

can you just count the number of rows returned by the subq to ensure that all the predicates are returning the desired results?

Also what does count(CB.ID) mean?

Ashish

________________________________
From: Ronak Bhatt [mailto:[EMAIL PROTECTED]]
Sent: Friday, August 20, 2010 1:56 PM
To: [EMAIL PROTECTED]
Subject: Re: question..

Hi - Ashish was right that in original query, I was using a wrong field to join e.g. used cb.id<http://cb.id> instead of cb.page_id (however, that one should have also brought the result because the content was integer values comprising of 1,2,3 etc..)

But to not leave any doubt, I did change the query to join on right fields...i.e. page.id<http://page.id> = cb.page_id ..but I was getting heap error, so put the inline table on left hand and callbacks on right side (as I learned from forum that smaller table on lhs should help address the error)

The query (pasted below) ran without any error, but the result is the same (pasted below) i.e. no records are coming as a result of query...but I verified in files that the data does exist i.e. there are records in CB where page_id for example has value 1 and that 1 has page like '%google%'

Any pointers as to what could be going wrong?

================== query ===============
select substr(CB.EXEC_DATE,1,10), count(CB.ID<http://CB.ID>)
    from
      (select * from  pages p where
               p.page like '%google.com/search%<http://google.com/search%><http://google.com/search%>'
            or p.page like '%google.com/custom%<http://google.com/custom%><http://google.com/custom%>'
            or p.page like '%google.com/#hl%<http://google.com/#hl%><http://google.com/#hl%>'
            or p.page like '%google.com/cse%<http://google.com/cse%><http://google.com/cse%>'
            or p.page like '%search.yahoo.com/search%<http://search.yahoo.com/search%><http://search.yahoo.com/search%>'
            or p.page like '%bing.com/search%<http://bing.com/search%><http://bing.com/search%>'
            or p.page like '%google.com/product%<http://google.com/product%><http://google.com/product%>' ) s join callbacks CB ON s.id<http://s.id> = cb.page_id
    group by substr(CB.EXEC_DATE,1,10);
==================query output ============
10/08/20 13:43:32 INFO mapred.LocalJobRunner: reduce > reduce
10/08/20 13:43:32 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
2010-08-20 01:43:32,839 map = 100%,  reduce =100%
10/08/20 13:43:32 INFO exec.ExecDriver: 2010-08-20 01:43:32,839 map = 100%,  reduce =100%
Ended Job = job_local_0001
10/08/20 13:43:32 INFO exec.ExecDriver: Ended Job = job_local_0001
10/08/20 13:43:32 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001<http://hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001> to: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001.intermediate<http://hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001.intermediate>
10/08/20 13:43:32 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001.intermediate<http://hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/_tmp.10001.intermediate> to: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/10001<http://hdp01.billeo.com:54310/tmp/hive-hadoop/1895081082/10001>
OK
Time taken: 2241.278 seconds
thanks, ronak

408 504 4847
My Blog : http://ronakbaps.posterous.com

On Fri, Aug 20, 2010 at 12:18 PM, Ashish Thusoo <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote:
I think the query did finish properly. Can you recheck the data to see if you would really get a few rows of output?

Ashish

________________________________
From: Ronak Bhatt [mailto:[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>]
Sent: Friday, August 20, 2010 12:12 PM
To: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>
Subject: question..

Hi - in my HIVE environment, I ran the following query and expecting to see the rows (the data is present)...however, after 2339 seconds, the o/p I get is as shown below in the email (I've pasted last 5~10 lines of screen output)...

Is there anything that I'm missing? Did the process finish correctly? should there be something that could point me how to debug?
select substr(CB.EXEC_DATE,1,10), count(CB.ID<http://CB.ID>)
    from callbacks CB JOIN
      (select * from  pages p where
               p.page like '%google.com/search%<http://google.com/search%><http://google.com/search%>'
            or p.page like '%google.com/custom%<http://google.com/custom%><http://google.com/custom%>'
            or p.page like '%google.com/#hl%<http://google.com/#hl%><http://google.com/#hl%>'
            or p.page like '%google.com/cse%<http://google.com/cse%><http://google.com/cse%>'
            or p.page like '%search.yahoo.com/search%<http://search.yahoo.com/search%><http://search.yahoo.com/search%>'
            or p.page like '%bing.com/search%<http://bing.com/search%><http://bing.com/search%>'
            or p.page like '%google.com/product%<http://google.com/product%><http://google.com/product%>' ) s
ON s.id<http://s.id> = cb.id<http://cb.id>
    group by substr(CB.EXEC_DATE,1,10);
================= o/p on screen =================
10/08/20 11:13:38 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
2010-08-20 11:13:39,526 map = 100%,  reduce =100%
10/08/20 11:13:39 INFO exec.ExecDriver: 2010-08-20 11:13:39,526 map = 100%,  reduce =100%
Ended Job = job_local_0001
10/08/20 11:13:39 INFO exec.ExecDriver: Ended Job = job_local_0001
10/08/20 11:13:39 INFO exec.FileSinkOperator: Moving tmp dir: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001<http://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001> to: hdfs://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.10001.intermediate<http://hdp01.billeo.com:54310/tmp/hive-hadoop/234840696/_tmp.100
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB