HBase >> mail # user >> Why HBase integration with Hive makes Hive slow


Re: Why HBase integration with Hive makes Hive slow
Thank you, Lars.

Performance improved considerably after setting scanner caching to 10000.
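
As a rough illustration of why the caching value matters (a sketch under stated assumptions, not HBase code): if each scanner next() RPC returns at most `caching` rows, a full scan needs about ceil(rows / caching) round trips. The function name and the row count below are invented for illustration, and the estimate ignores the extra calls for opening and closing scanners and crossing region boundaries.

```python
def scan_round_trips(total_rows: int, caching: int) -> int:
    """Estimate scanner RPCs for a full scan.

    Assumption: every RPC returns up to `caching` rows; scanner
    open/close calls and region boundaries are ignored.
    """
    if total_rows < 0:
        raise ValueError("total_rows must be >= 0")
    if caching < 1:
        raise ValueError("caching must be >= 1")
    # Integer ceiling division avoids floating-point rounding.
    return (total_rows + caching - 1) // caching


# Hypothetical table of 1,000,000 rows.
print(scan_round_trips(1_000_000, 1))       # one round trip per row
print(scan_round_trips(1_000_000, 10_000))  # 10000 rows per round trip
```

With these made-up numbers the estimate drops from 1,000,000 round trips to 100, which matches the direction of the improvement observed here.
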
However, I am still running into a problem.

When loading data into an HBase table via Hive, I get a NullPointerException:

java.lang.NullPointerException
     at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector.get(WritableIntObjectInspector.java:35)
     at org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:199)
     at org.apache.hadoop.hive.hbase.HBaseSerDe.serialize(HBaseSerDe.java:696)
     at org.apache.hadoop.hive.hbase.HBaseSerDe.serialize(HBaseSerDe.java:758)
     at org.apache.hadoop.hive.hbase.HBaseSerDe.serialize(HBaseSerDe.java:713)
     at org.apache.hadoop.hive.hbase.HBaseSerDe.serialize(HBaseSerDe.java:758)
     at org.apache.hadoop.hive.hbase.HBaseSerDe.serialize(HBaseSerDe.java:713)
     at org.apache.hadoop.hive.hbase.HBaseSerDe.serialize(HBaseSerDe.java:685)
     at org.apache.hadoop.hive.hbase.HBaseSerDe.serializeField(HBaseSerDe.java:648)
     at org.apache.hadoop.hive.hbase.HBaseSerDe.serialize(HBaseSerDe.java:560)
     at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:568)
     at shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:73)
     at shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:72)
     at scala.collection.Iterator$class.foreach(Iterator.scala:772)
     at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399)
     at shark.execution.FileSinkOperator.processPartition(FileSinkOperator.scala:72)
     at shark.execution.FileSinkOperator$.writeFiles$1(FileSinkOperator.scala:133)
     at shark.execution.FileSinkOperator$$anonfun$executeProcessFileSinkPartition$1.apply(FileSinkOperator.scala:138)
     at shark.execution.FileSinkOperator$$anonfun$executeProcessFileSinkPartition$1.apply(FileSinkOperator.scala:138)
     at spark.scheduler.ResultTask.run(ResultTask.scala:77)
     at spark.executor.Executor$TaskRunner.run(Executor.scala:98)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:724)

Here are the queries involved:

CREATE TABLE hbase_byg_client (
  idclient string,
  isfictif boolean,
  visites array<struct<
    idvisite:string,
    datevisite:string,
    isauthent:boolean,
    affichages:array<struct<page:string, idcategorie:int, freq:int>>
  >>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,navi:isfictif,navi:visites#s")
TBLPROPERTIES ("hbase.table.name" = "byg_client");

INSERT OVERWRITE TABLE hbase_byg_client
SELECT idClient, isfictif,
       collect_row(named_struct('idvisite', idVisite, 'dateVisite', dateVisit,
                                'isAuthent', isAuthent, 'affichages', t.affichages)) AS visites
FROM (
  SELECT idClient, isfictif, idVisite, dateVisit, isAuthent,
         collect_row(named_struct('page', page, 'IdCategorie', IdCategorie,
                                  'freq', freq)) AS affichages
  FROM v_byg_clean
  GROUP BY idClient, isfictif, idVisite, dateVisit, isAuthent
) t
GROUP BY idClient, isfictif;

Note that hbase_byg_client contains a field of a complex (non-primitive) type: visites is an array of structs, each of which holds a nested array of structs.
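
The top frame of the trace is WritableIntObjectInspector.get, which would be consistent with a null value reaching one of the int fields (idcategorie or freq) during serialization. That is an assumption and has not been confirmed. One way to test it is to look for nulls inside the nested values before loading. The sketch below is plain Python on an invented sample row shaped like the table; the helper name `null_paths` and the data are hypothetical, and dicts and lists stand in for Hive structs and arrays.

```python
from typing import Any, Iterator


def null_paths(value: Any, path: str = "") -> Iterator[str]:
    """Yield the path of every None inside nested dicts (structs) and lists (arrays)."""
    if value is None:
        yield path or "<root>"
    elif isinstance(value, dict):
        for key, child in value.items():
            yield from null_paths(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for i, child in enumerate(value):
            yield from null_paths(child, f"{path}[{i}]")


# Hypothetical row shaped like hbase_byg_client; all values are invented.
row = {
    "idclient": "c1",
    "isfictif": False,
    "visites": [
        {
            "idvisite": "v1",
            "datevisite": "2013-08-01",
            "isauthent": True,
            "affichages": [
                {"page": "home", "idcategorie": None, "freq": 3},
            ],
        }
    ],
}

print(list(null_paths(row)))
```

If a check like this reports null int fields in the real data, replacing them with a default value in the inner SELECT would be one thing to try.
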

Is there a workaround?

Thank you.

Hao

On 01/08/2013 21:00, lars hofhansl wrote:
> You need to set scanner caching; otherwise each call to next will be a network RTT.
>
>
>
> ________________________________
>   From: Hao Ren <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Thursday, August 1, 2013 7:45 AM
> Subject: Why HBase integration with Hive makes Hive slow
>  
>
> Hi,
>
> I have a cluster (1 master + 3 slaves) running Hive, HBase, and
> Hadoop.
>
> In order to run a daily row-level update routine, we need to integrate
> HBase with Hive, but the performance is not good.
>
> E.g., there are 2 tables in Hive:
>       hbase_table:  an HBase table created via Hive
Hao Ren
ClaraVista
www.claravista.fr