Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - Hive Problems Reading Avro+Snappy Data


Copy link to this message
-
RE: Hive Problems Reading Avro+Snappy Data
Connell, Chuck 2013-04-07, 23:32
When you do SELECT *, Hive does not run a real MapReduce job, so it is not a good test. Something is wrong with SerDe or InputFormat.

Chuck
From: Thomas, Matthew [mailto:[EMAIL PROTECTED]]
Sent: Sunday, April 07, 2013 5:41 PM
To: [EMAIL PROTECTED]
Subject: Hive Problems Reading Avro+Snappy Data

Hive users,

I am having problems performing "complex" queries on Avro+Snappy data.  If I do a "SELECT * FROM Blah LIMIT 50", I see the data coming back as it should be.  But if I perform any kind of more complex query such as "SELECT count(*) FROM Blah" I am receive several rows of NULL values.  My workflow of how I created the table is described below along with some of the setup.

- I am running CDH4.2 with Avro 1.7.3

hive> select * From mthomas_testavro limit 1;
OK
Field1 Field2
03-19-2013 a
03-19-2013 b
03-19-2013 c
03-19-2013 c
Time taken: 0.103 seconds

hive> select count(*) From mthomas_testavro;
...
Total MapReduce CPU Time Spent: 6 seconds 420 msec
OK
NULL
NULL
NULL
NULL
Time taken: 17.634 seconds
...
CREATE EXTERNAL TABLE mthomas_testavro
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/tmp/testavro/'
TBLPROPERTIES (
'avro.schema.literal'='{
"namespace": "hello.world",
"name": "some_schema",
"type": "record",
"fields": [
{ "name":"field1","type":"string"},
{ "name":"field2","type":"string"}
]
}')
;

SET avro.output.codec=snappy;
SET mapred.output.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

INSERT OVERWRITE TABLE mthomas_testavro SELECT * FROM identical_table_inGzip_format;

If I cat the output file in the external table, I see "Objavro.codec^Lsnappyavro.schema?{"type"..." at the beginning followed by the rest of the schema and binary data.  So I am assuming the snappy compression worked.  Furthermore, I also tried to query this table via Impala and both queries worked just fine.

Maybe it is related to https://issues.apache.org/jira/browse/HIVE-3308  ???

Any ideas?

Thanks.

Matt
"This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed, and may contain information that is non-public, proprietary, privileged, confidential and exempt from disclosure under applicable law or may be constituted as attorney work product. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. If you have received this message in error, notify sender immediately and delete this message immediately."