Hive >> mail # dev >> How is the STORED AS PARQUET used?


Re: How is the STORED AS PARQUET used?
Hi,

CTAS (CREATE TABLE ... AS SELECT) still needs to be implemented for
Parquet + Hive. There are more details here:
https://issues.apache.org/jira/browse/HIVE-6375

For a basic guide, I'd look at the following files in the patch:

parquet_partitioned.q and parquet_create.q
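
Until HIVE-6375 is resolved, one possible workaround is to declare the table with an explicit column list and populate it with a separate INSERT instead of using CTAS. The following is an untested sketch, not taken from the patch: the column names and types are copied from the SHOW CREATE TABLE output quoted below, and it assumes that only the CTAS path is affected and that a plain CREATE TABLE ... STORED AS PARQUET works on this build.

```sql
-- Assumption: only CTAS is broken; CREATE followed by INSERT is unaffected.
-- Remove the table left behind by the failed CTAS attempt.
DROP TABLE IF EXISTS alltypes_parquet;

-- Declare the schema explicitly so it is recorded in the table definition
-- before any Parquet file is written.
CREATE TABLE alltypes_parquet (
  cint      INT,
  ctinyint  TINYINT,
  csmallint SMALLINT,
  cdouble   DOUBLE,
  cfloat    FLOAT,
  cstring1  STRING
)
STORED AS PARQUET;

-- Populate the Parquet table from the existing ORC test table.
INSERT OVERWRITE TABLE alltypes_parquet
SELECT cint, ctinyint, csmallint, cdouble, cfloat, cstring1
FROM alltypesorc;

-- Read the data back to check that the columns are now found.
SELECT * FROM alltypes_parquet LIMIT 10;
```

If the final SELECT still fails with the same InvalidRecordException, the problem is not limited to CTAS and the details belong on the JIRA issue.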

I have the Parquet documentation on my calendar for Thursday/Friday.

Brock

On Wed, Feb 5, 2014 at 8:27 AM, Remus Rusanu <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> I tried the following on a build that has the latest HIVE-5783 patch applied over trunk:
>
> hive> set hive.aux.jars.path=file:///usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar,file:///usr/lib/hive/lib/parquet-hadoop-bundle-1.3.2.jar;
> hive> create table alltypes_parquet stored as parquet as select cint, ctinyint, csmallint, cdouble, cfloat, cstring1 from alltypesorc;
> hive> show create table alltypes_parquet;
> OK
> CREATE  TABLE `alltypes_parquet`(
>   `cint` int COMMENT 'from deserializer',
>   `ctinyint` tinyint COMMENT 'from deserializer',
>   `csmallint` smallint COMMENT 'from deserializer',
>   `cdouble` double COMMENT 'from deserializer',
>   `cfloat` float COMMENT 'from deserializer',
>   `cstring1` string COMMENT 'from deserializer')
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   'hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/alltypes_parquet'
> TBLPROPERTIES (
>   'numFiles'='1',
>   'transient_lastDdlTime'='1391609238',
>   'COLUMN_STATS_ACCURATE'='true',
>   'totalSize'='256959',
>   'numRows'='12288',
>   'rawDataSize'='73728')
> Time taken: 0.256 seconds, Fetched: 22 row(s)
>
> hive> select * from alltypes_parquet where 1=1;
> ...
> Error:
> Caused by: parquet.io.InvalidRecordException: cint not found in message table_schema {
> }
>         at parquet.schema.GroupType.getFieldIndex(GroupType.java:104)
>         at parquet.schema.GroupType.getType(GroupType.java:136)
>         at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:93)
>         at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:205)
>         at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:79)
>         at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:66)
>         at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)
>         at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
>
> So what am I missing? The catalog info seems at odds with the record structure after CREATE TABLE.
>
> Thanks,
> ~Remus
>
> PS. alltypesorc is the test ORC table based on data from <enlistment>\data\files\alltypesorc

--
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
