Problem with NULLs in HBase "leaking" into following rows
Lars Francke 2012-06-21, 15:42
Hi,
we're using the HBase integration in Hive 0.9 and are running into problems when there are rows with NULL values (which map to non-existent cells in HBase). We're using a UDF[1] but see the same behavior without it.

As an example, we have a table with just two rows.

In HBase Shell:

    create 'hive_hbase_test', 'test'
    put 'hive_hbase_test', '1', 'test:c1', 'c1-1'
    put 'hive_hbase_test', '1', 'test:c2', 'c2-1'
    put 'hive_hbase_test', '1', 'test:c3', 'c3-1'
    put 'hive_hbase_test', '2', 'test:c1', 'c1-2'

In Hive:

    DROP TABLE IF EXISTS hive_hbase_test;
    CREATE EXTERNAL TABLE hive_hbase_test (
      id int,
      c1 string,
      c2 string,
      c3 string
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key#s,test:c1#s,test:c2#s,test:c3#s")
    TBLPROPERTIES("hbase.table.name" = "hive_hbase_test");

    hive> select * from hive_hbase_test;
    OK
    1 c1-1 c2-1 c3-1
    2 c1-2 NULL NULL

    hive> select c1 from hive_hbase_test;
    c1-1
    c1-2

    hive> select c1, c2 from hive_hbase_test;
    c1-1 c2-1
    c1-2 NULL

So far everything is correct, but now:

    hive> select c1, c2, c2 from hive_hbase_test;
    c1-1 c2-1 c2-1
    c1-2 NULL c2-1

Selecting c2 twice works the first time, but the second time we actually get the value from the previous row.

    hive> select c1, c3, c2, c2, c3, c3, c1 from hive_hbase_test;
    c1-1 c3-1 c2-1 c2-1 c3-1 c3-1 c1-1
    c1-2 NULL NULL c2-1 c3-1 c3-1 c1-2

The same queries work correctly with a "native" HDFS-backed table.

We then added logging to our UDF (it takes a year, month and day, any of which might be null) and tested it on a simple two-row table.

    hive> SELECT id, year, month, parseDate(year, month, day) FROM naughty_occurrence;

First row (data in HBase: 1997-1-1):

    deferred: [1997] - convertedObject: [1997]
    deferred: [1] - convertedObject: [1]
    deferred: [1] - convertedObject: [1]
    Year: [1997], Month: [1], Day: [1]

Second row (data in HBase: 2006-null-null):

    deferred: [2006] - convertedObject: [2006]
    deferred: [1] - convertedObject: [1]
    deferred: [1] - convertedObject: [null]
    Year: [2006], Month: [1], Day: [null]

I know this looks very confusing, and I hope I haven't overdone it with the examples, but this seems like a rather serious problem with the HBase integration: values from previous rows are "leaking" into NULL values in following rows.

We're not 100% sure whether we're doing something wrong, but I don't see what it could be. I'll open an issue if no one has an idea what's going on here. I tried looking at the HBase handler code but was confused by it. Will try again tomorrow.

Thanks for bearing with me.

Cheers,
Lars

[1] I would very much appreciate a review of our usage of DeferredObjects etc.: <https://code.google.com/p/gbif-occurrencestore/source/browse/trunk/occurrence-store/src/main/java/org/gbif/occurrencestore/hive/udf/DateParsingUDF.java>