Hive, mail # user - Problem with NULLs in HBase "leaking" into following rows


Problem with NULLs in HBase "leaking" into following rows
Lars Francke 2012-06-21, 15:42
Hi,

we're using the HBase integration in Hive 0.9 and are running into
problems when there are rows with NULL values (which would map to a
non-existing cell in HBase).

We're using a UDF[1] but see the same behavior without it.

As an example, we have a table with just two rows.

In HBase Shell:

create 'hive_hbase_test', 'test'
put 'hive_hbase_test', '1', 'test:c1', 'c1-1'
put 'hive_hbase_test', '1', 'test:c2', 'c2-1'
put 'hive_hbase_test', '1', 'test:c3', 'c3-1'
put 'hive_hbase_test', '2', 'test:c1', 'c1-2'

In Hive:

DROP TABLE IF EXISTS hive_hbase_test;
CREATE EXTERNAL TABLE hive_hbase_test (
  id int,
  c1 string,
  c2 string,
  c3 string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#s,test:c1#s,test:c2#s,test:c3#s")
TBLPROPERTIES("hbase.table.name" = "hive_hbase_test");

hive> select * from hive_hbase_test;
OK
1 c1-1 c2-1 c3-1
2 c1-2 NULL NULL

hive> select c1 from hive_hbase_test;
c1-1
c1-2

hive> select c1, c2 from hive_hbase_test;
c1-1 c2-1
c1-2 NULL

So far everything is correct but now:

hive> select c1, c2, c2 from hive_hbase_test;
c1-1 c2-1 c2-1
c1-2 NULL c2-1

Selecting c2 twice correctly returns NULL the first time, but the second
time we actually get the value from the previous row.

hive> select c1, c3, c2, c2, c3, c3, c1 from hive_hbase_test;
c1-1 c3-1 c2-1 c2-1 c3-1 c3-1 c1-1
c1-2 NULL NULL c2-1 c3-1 c3-1 c1-2

The same queries work correctly with a "native" HDFS-backed table.

In our UDF we started logging (the UDF takes a year, month, and day,
any of which might be null) and tested it against a simple two-row table.

hive> SELECT id, year, month, parseDate(year, month, day) FROM
naughty_occurrence;

First row (data in HBase, 1997-1-1):
deferred: [1997] - convertedObject: [1997]
deferred: [1] - convertedObject: [1]
deferred: [1] - convertedObject: [1]
Year: [1997], Month: [1], Day: [1]

Second row (data in HBase: 2006-null-null):
deferred: [2006] - convertedObject: [2006]
deferred: [1] - convertedObject: [1]
deferred: [1] - convertedObject: [null]
Year: [2006], Month: [1], Day: [null]

I know this looks very confusing and I hope I haven't overdone it with
the examples but this seems like a rather serious problem with the
HBase integration. Values from previous rows are "leaking" into null
values in following rows. We're not 100% sure if we're doing something
wrong but I don't see what we could do wrong here. I'll open an issue
if no one has an idea what's going on here. Tried looking at the HBase
Handler code but was confused by it. Will try again tomorrow.
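
To make the symptom concrete, here is a minimal Java sketch of one way such a leak could arise. It is purely an illustration and an assumption on my part, not the actual HBase handler code: ReusedLazyRow, NullLeakDemo, and the clearStaleValue flag are all made-up names. The idea is a row object that is reused for every row and resets its "already parsed" flags per row, but keeps the previously parsed value when a cell is missing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a lazily parsed row object that is reused across rows.
// NOT Hive's HBase handler code: every name here is invented for illustration.
class ReusedLazyRow {
    private final boolean clearStaleValue;                       // true = "fixed" variant
    private Map<String, String> cells = new HashMap<>();         // cells of the current row
    private final Map<String, String> parsed = new HashMap<>();  // parsed value per column
    private final Set<String> inited = new HashSet<>();          // columns parsed for this row

    ReusedLazyRow(boolean clearStaleValue) {
        this.clearStaleValue = clearStaleValue;
    }

    // Called once per row. Only the "inited" flags are reset, not the parsed values.
    void init(Map<String, String> newCells) {
        cells = newCells;
        inited.clear();
    }

    String getField(String column) {
        if (inited.add(column)) {              // first access of this column in this row
            String raw = cells.get(column);    // null when the cell does not exist
            if (raw == null) {
                if (clearStaleValue) {
                    parsed.remove(column);     // the fix: forget the previous row's value
                }
                return null;                   // first access correctly yields NULL
            }
            parsed.put(column, raw);
        }
        return parsed.get(column);             // later accesses trust the parsed value
    }
}

class NullLeakDemo {
    // Mimics "SELECT columns..." for one row using the shared row object.
    static List<String> select(ReusedLazyRow row, Map<String, String> cells, String... columns) {
        row.init(cells);
        List<String> out = new ArrayList<>();
        for (String column : columns) {
            out.add(row.getField(column));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> row1 = new HashMap<>();
        row1.put("c1", "c1-1");
        row1.put("c2", "c2-1");
        row1.put("c3", "c3-1");
        Map<String, String> row2 = new HashMap<>();
        row2.put("c1", "c1-2");                // c2 and c3 do not exist for row 2

        ReusedLazyRow buggy = new ReusedLazyRow(false);
        System.out.println(select(buggy, row1, "c1", "c2", "c2")); // [c1-1, c2-1, c2-1]
        System.out.println(select(buggy, row2, "c1", "c2", "c2")); // [c1-2, null, c2-1]

        ReusedLazyRow fixed = new ReusedLazyRow(true);
        System.out.println(select(fixed, row1, "c1", "c2", "c2")); // [c1-1, c2-1, c2-1]
        System.out.println(select(fixed, row2, "c1", "c2", "c2")); // [c1-2, null, null]
    }
}
```

With clearStaleValue set to false the sketch reproduces the pattern from the queries above: the first read of a missing column is NULL, and every later read in the same row returns the previous row's value. Whether the real handler behaves like this is exactly what I have not been able to confirm.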

Thanks for bearing with me.

Cheers,
Lars

[1] I would very much appreciate a review of our usage of
DeferredObjects etc.:
<https://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-store/src/main/java/org/gbif/occurrencestore/hive/udf/DateParsingUDF.java>