Hive, mail # user - bz2 Splits.

RE: bz2 Splits.
Ashish Thusoo 2009-07-21, 19:37
I don't think these files are splittable. Compressed SequenceFiles, by contrast, can be split at SequenceFile block boundaries.
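
To illustrate the difference, here is a rough Python sketch using only the standard bz2 module. It is an analogy, not the SequenceFile format or Hadoop's input-split code: the helper names (compress_whole, compress_in_blocks, can_decompress) and the record layout are made up for the example. It shows that a reader dropped at an arbitrary byte offset in a single bz2 stream cannot simply start decompressing, while independently compressed blocks can each be handled by a separate reader.

```python
import bz2


def compress_whole(records):
    """Compress all records as one bz2 stream (like a single *.bz2 file)."""
    return bz2.compress(b"".join(records))


def compress_in_blocks(records, records_per_block):
    """Compress records in independent blocks (hypothetical block layout)."""
    blocks = []
    for start in range(0, len(records), records_per_block):
        chunk = b"".join(records[start:start + records_per_block])
        blocks.append(bz2.compress(chunk))
    return blocks


def can_decompress(data):
    """Return True if data decompresses as a complete bz2 stream."""
    try:
        bz2.decompress(data)
        return True
    except (OSError, ValueError):
        # OSError: invalid stream; ValueError: stream ended prematurely.
        return False


if __name__ == "__main__":
    records = [b"record %d\n" % i for i in range(10000)]

    whole = compress_whole(records)
    second_half = whole[len(whole) // 2:]  # a naive byte-offset "split"
    print("whole stream readable:", can_decompress(whole))
    print("naive mid-stream split readable:", can_decompress(second_half))

    blocks = compress_in_blocks(records, 1000)
    print("independent blocks readable:",
          all(can_decompress(b) for b in blocks))
```

With a single stream, one task ends up reading the whole file, which matches the long-running tasks described below. With block-level compression, each block can go to a different task.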


-----Original Message-----
From: Bill Craig [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, July 21, 2009 8:06 AM
Subject: bz2 Splits.

I loaded 5 files of bzip2-compressed data into a table in Hive. Three are small test files containing 10,000 records. Two are large, ~8 GB compressed.
When I run a query against the table, I see three tasks that complete almost immediately and two tasks that run for a very long time. It appears to me that Hive/Hadoop is not splitting the input of the *.bz2 files. I have seen some old mails about this, but could not find any resolution for this problem. I compressed the files using the Apache bz2 jar; the files are named *.bz2. I am using Hadoop 0.19.1 r745977.