Re: IndexOutOfBoundsException with Snappy compressed SequenceFile from Flume
Hi Keith,

Were you able to resolve this? Or, is this still an issue?

Thanks.
Shreepadma
On Tue, May 28, 2013 at 6:02 AM, Keith Wright <[EMAIL PROTECTED]> wrote:

> Hi all,
>
>    This is my first post to the Hive mailing list and I was hoping to get
> some help with the exception I am getting below.  I am using CDH4.2 (Hive
> 0.10.0) to query Snappy-compressed SequenceFiles that are built using
> Flume (the relevant portion of the Flume conf is below as well).  Note that
> I'm using a SequenceFile as it was needed for Impala integration.  Has
> anyone seen this error before?  A couple of additional points to help diagnose:
>
>    1. Queries seem to be able to process some mappers without issues.
>    In fact, I can do a simple select * from <table> limit 10 without issue.
>    However, if I make the limit high enough, it will eventually fail,
>    presumably once it needs to read a file that has this issue (see the
>    example query after this list).
>    2. The same query runs in Impala without errors but appears to "skip"
>    some data.  I can confirm that the missing data is present via a custom
>    map/reduce job.
>    3. I am able to write a map/reduce job that reads through all of the
>    same data without issue and have been unable to identify any data
>    corruption.
>    4. This is a partitioned table, and queries that touch ANY of the
>    partitions fail (and there are hundreds of them), so this does not appear
>    to be a sporadic data integrity problem (table definition below).
>    5. We are using '\001' as our field separator.  We are capturing other
>    data with SequenceFile and Snappy as well, but using '|' as our delimiter,
>    and we do not have any issues querying there, although we are using a
>    different Flume source.
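>
> For reference, the kind of query that trips this looks roughly like the
> following (illustrative only; the partition values are examples, not the
> exact statements I ran):
>
> -- a small LIMIT succeeds because it only touches the first few blocks
> SELECT * FROM organic_events LIMIT 10;
>
> -- restricting to any single partition and reading more of it eventually
> -- hits the IndexOutOfBoundsException shown below
> SELECT COUNT(*)
> FROM organic_events
> WHERE year = 2013 AND month = 5 AND day = 28 AND hour = 12;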
>
> My next step for debugging was to disable Snappy compression and see if I
> could query the data, and if not, to switch from SequenceFile to plain text
> (roughly the config changes sketched below).
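>
> Concretely, the change I have in mind is something like this (untested;
> only the keys I would touch are shown):
>
> # step 1: keep SequenceFile but drop Snappy by leaving hdfs.codeC unset
> agent.sinks.exhaustHDFSSink3.hdfs.fileType = SequenceFile
> agent.sinks.exhaustHDFSSink3.hdfs.writeFormat = Text
> # agent.sinks.exhaustHDFSSink3.hdfs.codeC = snappy
>
> # step 2 (if step 1 still fails): plain uncompressed text output
> agent.sinks.exhaustHDFSSink3.hdfs.fileType = DataStream
> agent.sinks.exhaustHDFSSink3.hdfs.writeFormat = Text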
>
> I appreciate the help!!!
>
> CREATE EXTERNAL TABLE ORGANIC_EVENTS (
> event_id BIGINT,
> app_id INT,
> user_id BIGINT,
> type STRING,
> name STRING,
> value STRING,
> extra STRING,
> ip_address STRING,
> user_agent STRING,
> referrer STRING,
> event_time BIGINT,
> install_flag TINYINT,
> first_for_user TINYINT,
> cookie STRING)
> PARTITIONED BY (year INT, month INT, day INT, hour INT)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
> COLLECTION ITEMS TERMINATED BY '\002'
> MAP KEYS TERMINATED BY '\003'
> STORED AS SEQUENCEFILE
> LOCATION '/events/organic';
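>
> Since this is an external partitioned table, each hourly partition has to
> be registered explicitly; the statement looks roughly like this (the
> location is only an illustration of the layout, not our exact paths):
>
> ALTER TABLE ORGANIC_EVENTS ADD IF NOT EXISTS
>   PARTITION (year=2013, month=5, day=28, hour=12)
>   LOCATION '/events/organic/2013/05/28/12';  -- example path only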
>
> agent.sinks.exhaustHDFSSink3.type = HDFS
> agent.sinks.exhaustHDFSSink3.channel = exhaustFileChannel
> agent.sinks.exhaustHDFSSink3.hdfs.path = hdfs://lxscdh001.nanigans.com:8020%{path}
> agent.sinks.exhaustHDFSSink3.hdfs.filePrefix = 3.%{hostname}
> agent.sinks.exhaustHDFSSink3.hdfs.rollInterval = 0
> agent.sinks.exhaustHDFSSink3.hdfs.idleTimeout = 600
> agent.sinks.exhaustHDFSSink3.hdfs.rollSize = 0
> agent.sinks.exhaustHDFSSink3.hdfs.rollCount = 0
> agent.sinks.exhaustHDFSSink3.hdfs.batchSize = 5000
> agent.sinks.exhaustHDFSSink3.hdfs.txnEventMax = 5000
> agent.sinks.exhaustHDFSSink3.hdfs.fileType = SequenceFile
> agent.sinks.exhaustHDFSSink3.hdfs.maxOpenFiles = 100
> agent.sinks.exhaustHDFSSink3.hdfs.codeC = snappy
> agent.sinks.exhaustHDFSSink3.hdfs.writeFormat = Text
>
> 2013-05-28 12:29:00,919 WARN org.apache.hadoop.mapred.Child: Error running child
> java.io.IOException: java.io.IOException: java.lang.IndexOutOfBoundsException
>     at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>     at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>     at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:330)
>     at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:246)