
Hive, mail # user - Re: IndexOutOfBoundsException with Snappy compressed SequenceFile from Flume


Re: IndexOutOfBoundsException with Snappy compressed SequenceFile from Flume
Shreepadma Venugopalan 2013-07-15, 23:24
Thanks Keith.
On Mon, Jul 15, 2013 at 4:22 PM, Keith Wright <[EMAIL PROTECTED]> wrote:

>  I was able to resolve it.  Turns out we had a misconfiguration with one of our flume hdfs sinks where 2 were writing as sequence file text while one was writing sequence file bytes.  It was due to one bad character in the flume conf.  My manual map/reduce jobs worked fine because they automatically handled both cases.  Hive, on the other hand, would work for the beginning of the query until it hit a non-text file.  One of those bugs where 60 hours of debugging leads to a one-character change.  Hopefully this will be a good lesson for others in the future!
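A minimal sketch (not from the thread) of how one might confirm which value class each Flume output file actually contains: the SequenceFile header records the key and value classes, so a sink that was writing bytes shows BytesWritable where the text-writing sinks show Text. The class name SeqFileValueClasses and the directory argument are placeholders; it assumes a CDH4-era Hadoop classpath with the Snappy codec available.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

// Print the key/value classes recorded in each SequenceFile header under a directory.
public class SeqFileValueClasses {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path dir = new Path(args[0]);        // e.g. a partition directory under /events/organic (placeholder)
    FileSystem fs = dir.getFileSystem(conf);
    for (FileStatus stat : fs.listStatus(dir)) {
      if (!stat.isFile()) {
        continue;                        // skip subdirectories
      }
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, stat.getPath(), conf);
      try {
        System.out.println(stat.getPath() + "\t"
            + reader.getKeyClassName() + "\t" + reader.getValueClassName());
      } finally {
        reader.close();
      }
    }
  }
}

Running this against one directory per sink should make the odd one out (the sink writing BytesWritable values) stand out immediately.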
>
> Shreepadma Venugopalan <[EMAIL PROTECTED]> wrote:
>
>
>  Hi Keith,
>
>  Were you able to resolve this? Or, is this still an issue?
>
>  Thanks.
> Shreepadma
>
>
> On Tue, May 28, 2013 at 6:02 AM, Keith Wright <[EMAIL PROTECTED]> wrote:
>
>>  Hi all,
>>
>>     This is my first post to the hive mailing list and I was hoping to
>> get some help with the exception I am getting below.  I am using CDH4.2
>> (hive 0.10.0) to query Snappy-compressed SequenceFiles that are built
>> using Flume (the relevant portion of the flume conf is below as well).  Note that
>> I'm using a SequenceFile as it was needed for Impala integration.  Has anyone
>> seen this error before?  A couple of additional points to help diagnose:
>>
>>    1. Queries seem to be able to process some mappers without issues.
>>    In fact, I can do a simple select * from <table> limit 10 without issue.
>>    However if I make the limit high enough, it will eventually fail presumably
>>    as it needs to read in a file that has this issue.
>>    2. The same query runs in Impala without errors but appears to "skip"
>>    some data.  I can confirm that the missing data is present via a custom
>>    map/reduce job.
>>    3. I am able to write a map/reduce job that reads through all of the
>>    same data without issue and have been unable to identify any data corruption.
>>    4. This is a partitioned table and queries fail that touch ANY of the
>>    partitions (and there are hundreds), so this does not appear to be a
>>    sporadic data-integrity problem (table definition below).
>>    5. We are using '\001' as our field separator.  We are also capturing
>>    other data as Snappy-compressed SequenceFiles but using '|' as our delimiter,
>>    and we do not have any issues querying that data, although it uses a
>>    different flume source.
>>
>> My next step for debugging is to disable snappy compression and see if I
>> can query the data.  If not, I will switch from SequenceFile to simple text.
>>
>>  I appreciate the help!!!
>>
>>  CREATE EXTERNAL TABLE ORGANIC_EVENTS (
>> event_id BIGINT,
>> app_id INT,
>> user_id BIGINT,
>> type STRING,
>> name STRING,
>> value STRING,
>> extra STRING,
>> ip_address STRING,
>> user_agent STRING,
>> referrer STRING,
>> event_time BIGINT,
>> install_flag TINYINT,
>> first_for_user TINYINT,
>> cookie STRING)
>> PARTITIONED BY (year INT, month INT, day INT, hour INT)
>>  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
>> COLLECTION ITEMS TERMINATED BY '\002'
>> MAP KEYS TERMINATED BY '\003'
>> STORED AS SEQUENCEFILE
>>  LOCATION '/events/organic';
>>
>>  agent.sinks.exhaustHDFSSink3.type = HDFS
>> agent.sinks.exhaustHDFSSink3.channel = exhaustFileChannel
>> agent.sinks.exhaustHDFSSink3.hdfs.path = hdfs://lxscdh001.nanigans.com:8020%{path}
>> agent.sinks.exhaustHDFSSink3.hdfs.filePrefix = 3.%{hostname}
>> agent.sinks.exhaustHDFSSink3.hdfs.rollInterval = 0
>> agent.sinks.exhaustHDFSSink3.hdfs.idleTimeout = 600
>> agent.sinks.exhaustHDFSSink3.hdfs.rollSize = 0
>> agent.sinks.exhaustHDFSSink3.hdfs.rollCount = 0
>> agent.sinks.exhaustHDFSSink3.hdfs.batchSize = 5000
>> agent.sinks.exhaustHDFSSink3.hdfs.txnEventMax = 5000
>> agent.sinks.exhaustHDFSSink3.hdfs.fileType = SequenceFile
>> agent.sinks.exhaustHDFSSink3.hdfs.maxOpenFiles = 100
>> agent.sinks.exhaustHDFSSink3.hdfs.codeC = snappy
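
The thread does not say which setting held the bad character, but in the Flume HDFS sink the property that controls whether SequenceFile values are written as Text or as raw BytesWritable is hdfs.writeFormat (the default is Writable, i.e. bytes). A sketch of the setting for the sink shown above, assuming this sink is meant to match the text-writing ones:

agent.sinks.exhaustHDFSSink3.hdfs.writeFormat = Text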