Hive >> mail # user >> bz2 compressed table usage?


bz2 compressed table usage?
Hi folks,

Anyone have any experience using bz2 based compressed tables? I have
the following .q file:

=
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=500;
SET hive.exec.max.dynamic.partitions.pernode=500;
SET hive.exec.compress.output=true ;
SET mapred.output.compress=true ;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec ;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec ;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.type=BLOCK;
SET mapred.compress.map.output=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=500;
SET hive.exec.max.dynamic.partitions.pernode=500;
SET mapred.child.java.opts=-Xmx2048m ;
SET mapred.reduce.tasks=40 ;
SET hive.mapred.reduce.tasks.speculative.execution=false ;
CREATE TABLE stopwords_rcf_bzip2 (word STRING ) STORED AS RCFILE;
INSERT OVERWRITE TABLE stopwords_rcf_bzip2 select * from stopwords;
=
(where stopwords is a pre-existing textfile based table that has
various words loaded in from the standard linux dictionary.)

After doing this, the write succeeds, and the output appears to be
compressed. I tested this by doing a hadoop fs -get and inspecting the
file manually: it seems to have RCFile headers, some metadata naming the
bz2 compression classes, and then the BZ marker and binary data.
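
The manual inspection described above can be sketched as a small script. This is only an illustration, not part of the original report: `sniff_header` and `sniff_file` are hypothetical helpers, and the magic values are assumptions about the formats involved (bzip2 streams begin with `BZh`, gzip with the bytes 0x1f 0x8b, SequenceFiles with `SEQ`, and newer RCFiles with `RCF`, while older RCFiles reuse the `SEQ` header).

```python
# Hypothetical helper: classify a file by its leading magic bytes.
# The magic values below are assumptions about each format's header.
MAGICS = [
    (b"BZh", "bzip2"),
    (b"\x1f\x8b", "gzip"),
    (b"RCF", "rcfile"),
    (b"SEQ", "sequencefile-or-old-rcfile"),
]


def sniff_header(data):
    """Return a format name for the given leading bytes, or 'unknown'."""
    for magic, name in MAGICS:
        if data.startswith(magic):
            return name
    return "unknown"


def sniff_file(path):
    """Read the first few bytes of a local file and classify them."""
    with open(path, "rb") as f:
        return sniff_header(f.read(4))
```

A file fetched with hadoop fs -get can be passed to `sniff_file`. If the table really is stored as an RCFile, the file should start with the container header rather than `BZh`, with the compressed data further in, which would be consistent with what I saw.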

If I try reading this, though, I get the following error:

=
hive -e 'select * from stopwords_rcf_bzip2 limit 20;'
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0.23.jar!/hive-log4j.properties
Hive history file=/tmp/sush/hive_job_log_hrt_qa_201304031715_2012500346.txt
OK
Failed with exception java.io.IOException:java.io.IOException: Stream is not BZip2 formatted: expected 'h' as first byte but got '#'
Time taken: 2.582 seconds
=
If I run the same commands with the Gzip codec instead of BZip2, it
works fine. Does anyone have any idea what I'm doing wrong? Or is
this a bug we need to fix?

Also, replacing rcfile with textfile doesn't work either. In that case,
instead of an IOException about the stream not being a bz2 stream, I
just get garbled binary output from the select.
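
One way to narrow down the textfile case would be to fetch an output file with hadoop fs -get and check locally whether it decompresses as bzip2. A minimal sketch, assuming the part file has been copied to a local path; `first_lines` is a hypothetical helper, not anything from Hive or Hadoop.

```python
import bz2
from itertools import islice


def first_lines(path, n=5):
    """Return the first n decoded lines of a bzip2-compressed text file."""
    # errors="replace" keeps the check from failing on non-UTF-8 bytes.
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n)]
```

If this returns dictionary words, the data on disk is probably fine and the problem is more likely on the read side. That is a guess about how to split the problem, not a confirmed diagnosis.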

Thanks,
-Sushanth