|
Bill Craig
2009-07-21, 15:06
Ashish Thusoo
2009-07-21, 19:37
Zheng Shao
2009-07-21, 19:58
Saurabh Nanda
2009-07-24, 15:09
bcraig7@...
2009-07-24, 15:48
Neal Richter
2009-07-24, 16:32
Saurabh Nanda
2009-07-25, 10:05
Zheng Shao
2009-07-25, 10:14
Saurabh Nanda
2009-07-25, 10:27
Zheng Shao
2009-07-25, 10:44
Saurabh Nanda
2009-07-25, 10:46
Saurabh Nanda
2009-07-25, 10:48
Zheng Shao
2009-07-25, 11:00
Saurabh Nanda
2009-07-27, 05:05
Zheng Shao
2009-07-27, 05:13
Saurabh Nanda
2009-07-27, 08:29
Saurabh Nanda
2009-07-27, 10:06
Saurabh Nanda
2009-07-27, 10:51
Saurabh Nanda
2009-07-27, 15:38
Prasad Chakka
2009-07-27, 17:25
Saurabh Nanda
2009-07-28, 04:13
Zheng Shao
2009-07-28, 04:53
Saurabh Nanda
2009-07-28, 05:02
Zheng Shao
2009-07-28, 05:34
Saurabh Nanda
2009-07-28, 05:38
Zheng Shao
2009-07-28, 05:55
Saurabh Nanda
2009-07-28, 06:08
Zheng Shao
2009-07-28, 06:22
Edward Capriolo
2009-07-28, 15:02
Edward Capriolo
2009-07-28, 16:39
Saurabh Nanda
2009-07-29, 05:01
|
-
bz2 Splits.Bill Craig 2009-07-21, 15:06
I loaded 5 files of bzip2-compressed data into a table in Hive. Three are small test files containing 10,000 records. Two are large, ~8 GB compressed. When I run a query against the table I see three tasks that complete almost immediately and two tasks that run for a very long time. It appears to me that Hive/Hadoop is not splitting the input of the *.bz2 files. I have seen some old mails about this, but could not find any resolution for this problem. I compressed the files using the Apache bz2 jar, and the files are named *.bz2. I am using Hadoop 0.19.1 r745977.
-
RE: bz2 Splits.Ashish Thusoo 2009-07-21, 19:37
I don't think these are splittable. Compression on sequencefiles is splittable across sequencefile blocks.
Ashish
-
Re: bz2 Splits.Zheng Shao 2009-07-21, 19:58
There is some work along this direction in Hadoop land, but it's not committed yet: https://issues.apache.org/jira/browse/HADOOP-4012

For the short term, we won't be able to split bzip files.

If your bzip files are generated outside of Hadoop, please split the files before compressing them (so you will load many smaller files into Hadoop/Hive). If your bzip files are generated by Hadoop/Hive, please change the output file format to SequenceFile. SequenceFiles are splittable.

Zheng
-
Re: bz2 Splits.Saurabh Nanda 2009-07-24, 15:09
Please excuse my ignorance, but can I import gzip-compressed files directly as Hive tables? I have separate gzip files for each day's weblog data. Right now I am gunzipping them and then importing into a raw table. Can I import the gzipped files directly into Hive?

Saurabh.
--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com
-
Re: Re: bz2 Splits.bcraig7@... 2009-07-24, 15:48
I have not checked gzip out yet, but Hive is happy with .bz2 files. The documentation on this is spotty. It seems that any Hadoop-supported compression will work. The issue with .gz files is that they will not be splittable. That is, one map will process an entire file, so if your .gz files are large and you have more map capacity than files, you will not be able to make use of it.
-
Re: Re: bz2 Splits.Neal Richter 2009-07-24, 16:32
gz files work fine. We're attaching daily directories of gzipped logs in S3 as Hive table partitions.

Best to have your logrotator do hourly rotation to create lots of gz files for better mapping. Or one could use zcat, split, and gzip to divide into smaller chunks if you really only have one gz file per partition.
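The zcat/split/gzip idea above might look roughly like the following. This is a minimal sketch, not a tested production script: the function name rechunk_gz, its arguments, and the chunk naming are all hypothetical, and it assumes the standard gzip and split utilities are available.

```shell
#!/bin/sh
# Re-chunk one large .gz file into several smaller .gz files so that each
# piece can be handed to its own mapper. All names here are hypothetical.
# usage: rechunk_gz INPUT.gz LINES_PER_CHUNK OUTPUT_PREFIX
rechunk_gz() {
  input=$1
  lines=$2
  prefix=$3
  # Decompress to stdout and cut on line boundaries so no record is torn.
  gzip -dc "$input" | split -l "$lines" - "$prefix" || return 1
  # Compress every chunk in place: PREFIXaa becomes PREFIXaa.gz, and so on.
  # Assumes no older files with the same prefix are lying around.
  for chunk in "$prefix"*; do
    gzip "$chunk" || return 1
  done
}
```

For example, `rechunk_gz access.log.gz 1000000 access_part_` would leave files such as access_part_aa.gz in the current directory. Splitting by line count rather than byte count keeps each log record whole.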
-
Re: Re: bz2 Splits.Saurabh Nanda 2009-07-25, 10:05
I tried the following and ran into an error message:

create table compressed_raw(line string) partitioned by(dt string)
row format delimited fields terminated by '\t' lines terminated by '\n'
stored as sequencefile;

hive> load data local inpath '/tmp/weblogs/20090602000000-172.16.1.40-access.log.gz' into table compressed_raw partition(dt='2009-06-01');
Copying data from file:/tmp/weblogs/20090602000000-172.16.1.40-access.log.gz
Loading data to table compressed_raw partition {dt=2009-06-01}
Failed with exception Cannot load text files into a table stored as SequenceFile.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

I guess this is what the following thread is talking about: http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200903.mbox/%[EMAIL PROTECTED]%3E

To sum up the discussion there, do I have to first import into a textfile table, set hive.exec.compress.output to true, and then insert into a sequencefile table? If that's the case, I don't understand why I have to explicitly set hive.exec.compress.output. Shouldn't the fact that the target is a sequencefile table achieve the desired result?

I'm on hadoop-0.18.3 & hive-0.3.0.

PS: More details on the Wiki around compressed storage would be really appreciated.

Saurabh.
-
Re: Re: bz2 Splits.Zheng Shao 2009-07-25, 10:14
Hi Saurabh,
If you want to load data (in compressed or uncompressed text format) into a table, you have to define the table as "stored as textfile" instead of "stored as sequencefile".

Can you try again and let us know?

Zheng
-
Re: Re: bz2 Splits.Saurabh Nanda 2009-07-25, 10:27
> If you want to load data (in compressed/uncompressed text format) into
> a table, you have to defined the table as "stored as textfile" instead
> of "stored as sequencefile".

I'm completely confused right now. If sequencefiles are not used for compressed data storage, then what are they used for?

If I have a gz file and I want to import it as is (without gunzipping or using an intermediate table), what should I be doing?

Saurabh.
-
Re: Re: bz2 Splits.Zheng Shao 2009-07-25, 10:44
Both TextFile and SequenceFile can be compressed or uncompressed.

TextFile means a plain text file (records delimited by "\n"). Compressed TextFiles are just text files compressed by the gzip or bzip2 utility. SequenceFile is a special file format that only Hadoop can understand.

Since your files are compressed TextFiles, you have to create a table with TextFile format in order to load the data without any conversion. (Compression is detected automatically for both TextFile and SequenceFile; you don't need to specify it when creating a table.)

Does this make things a bit clearer?

Zheng
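As a rough illustration of the point above, the following sketch prints HiveQL that defines a TextFile table and loads a gzipped log into it unchanged. The table name raw_gz, the helper textfile_load_hql, and the example path are hypothetical; the statements reuse the create table and load data syntax already shown in this thread.

```shell
#!/bin/sh
# Print HiveQL that creates a TextFile table and loads a gzipped log as-is.
# The table name raw_gz and the helper name are hypothetical placeholders.
# usage: textfile_load_hql GZ_PATH DT
textfile_load_hql() {
  gz_path=$1
  dt=$2
  cat <<EOF
create table raw_gz(line string) partitioned by(dt string) stored as textfile;
load data local inpath '${gz_path}' into table raw_gz partition(dt='${dt}');
EOF
}

# To run it (assumes the hive CLI is on PATH):
#   hive -e "$(textfile_load_hql /tmp/weblogs/access.log.gz 2009-06-01)"
```

Nothing in the table definition mentions gzip; per the explanation above, the compression is expected to be detected when the file is read.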
-
Re: Re: bz2 Splits.Saurabh Nanda 2009-07-25, 10:46
> If you want to load data (in compressed/uncompressed text format) into
> a table, you have to defined the table as "stored as textfile" instead
> of "stored as sequencefile".

I tried both the approaches.

Approach #1:
a) gunzip log file
b) import into textfile table
c) set hive.exec.compress.output to true
d) inserted into sequencefile table

It seems to have given me 125 files named 'attempt_*' in the partition's directory, all under 10MB. (How do I find out the total size of a directory? I need to see how much saving the compression resulted in.)

Approach #2: imported gzip log files into a textfile table

The files seem to have been copied as-is into the partition's directory. But every query is always split up into 8 maps (which is the number of files I imported). This, I guess, won't help me much because I would be underutilizing the map power I have.

Here's something interesting. I ran a SELECT COUNT(1) on all the three tables and got different results and wildly different response times.

Gunzipped files imported into textfile table: 8,259,720 (108 sec)
sequencefile table populated by step 1d above: 8,316,946 (114 sec)
Gzip files imported into textfile tables: 8,619,980 (50 sec)

How is a simple row count differing? And surprisingly, fewer maps resulted in better performance!

Saurabh.
-
Re: Re: bz2 Splits.Saurabh Nanda 2009-07-25, 10:48
> TextFile means the plain text file (records delimited by "\n").
> Compressed TextFiles are just text files compressed by gzip or bzip2
> utility. SequenceFile is a special file format that only Hadoop can
> understand.
> Since your files are compressed TextFiles, you have to create a table
> with TextFile format, in order to load the data without any
> conversion.
> (Compression is detected automatically for both TextFile and
> SequenceFile - you don't need to specify it when creating a table)

This really clears things up. I guess adding a note in the Wiki will put an end to the confusion permanently. A little note on which approach (compressed textfile vs compressed sequencefile) gives the best performance would also be appreciated.

Saurabh.
-
Re: Re: bz2 Splits.Zheng Shao 2009-07-25, 11:00
Hi Saurabh,
Can you help put that information into the appropriate place on the wiki (where you see fit)? Thanks for the help.

By the way, I guess we need to debug what went wrong with the "count(1)" queries. There is definitely something going wrong.

For the timing, how many mapper slots do you have in your cluster?

I think you might want to consider this:

Approach #3:
a) import gzip files into textfile table
b) set hive.exec.compress.output to true
c) insert into sequencefile table

This will create bigger sequencefiles, which will help reduce the overhead. This is better than Approach #2 because jobs from the sequencefile tables will have more mappers.

Zheng
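Steps b) and c) of Approach #3 could be scripted along these lines. This is only a sketch under assumptions: the helper name approach3_hql and its arguments are hypothetical, the source table is assumed to be an existing TextFile table with a single "line" column partitioned by dt, and the create table statement should be dropped if the target table already exists.

```shell
#!/bin/sh
# Print HiveQL that copies one partition from a TextFile table into a
# compressed SequenceFile table. Helper and argument names are hypothetical.
# usage: approach3_hql SOURCE_TABLE TARGET_TABLE DT
approach3_hql() {
  src=$1
  dst=$2
  dt=$3
  cat <<EOF
create table ${dst}(line string) partitioned by(dt string) stored as sequencefile;
set hive.exec.compress.output=true;
insert overwrite table ${dst} partition(dt='${dt}') select line from ${src} where dt='${dt}';
EOF
}

# To run it (assumes the hive CLI is on PATH):
#   hive -e "$(approach3_hql raw raw_compressed 2009-06-01)"
```

Generating the statements in a function keeps the table names and the partition date in one place, which helps when the same conversion is repeated for each day's logs.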
-
Re: Re: bz2 Splits.Saurabh Nanda 2009-07-27, 05:05
> Can you help put that information into appropriate place on the wiki
> (where you see fit)?
> Thanks for the help.

Will do.

> By the way, I guess we need to debug what went wrong with the
> "count(1)" queries. There is definitely something going wrong.

My bad here. I think I forgot to import some files when running the queries earlier. The counts are exactly the same. However, the timings for "select count(1)" queries are very different.

#1 Uncompressed logs in textfile tables: 106sec (filesize of 7,686 MB over 8 uncompressed files)
#2 Compressed logs in textfile tables: 60sec (filesize of 736 MB over 8 compressed files)
#3 Compressed logs in sequencefile tables: 101sec (filesize of 4,773 MB over 126 compressed files)

> For the timing, how much mapper slots do you have in your cluster?

I have a 4-node cluster with mapred.reduce.tasks=17. Is that what you mean by mapper slots?

> Approach #3:
> a) import gzip files into textfile table
> b) set hive.exec.compress.output to true
> c) inserted into sequencefile table
> This will create bigger sequencefiles which will help reducing the
> overhead. This is better than Approach #2 because jobs from the
> sequencefile tables will have more mappers.

This is exactly what I did in #3 above. But from those benchmarks, #2 seems to give the best results, both in terms of file size and speed. Is that not what you were expecting?

Saurabh.
-
Re: Re: bz2 Splits.Zheng Shao 2009-07-27, 05:13
If you follow Approach #3, you should have 8 big compressed sequencefiles instead of 126 small files.

By the way, you probably didn't set the compression type to BLOCK compression; otherwise sequencefile compression won't perform like that. Try setting this in your hive-site.xml or hadoop-site.xml:

<property>
  <name>io.seqfile.compression.type</name>
  <value>BLOCK</value>
</property>

See http://blog.foofactory.fi/2006/12/my-fellow-nutch-developer-andrzej.html

Zheng
-
Re: Re: bz2 Splits.Saurabh Nanda 2009-07-27, 08:29
> #1 Uncompressed logs in textfile tables: 106sec (filesize of 7,686 MB over 8 uncompressed files)
> #2 Compressed logs in textfile tables: 60sec (filesize of 736 MB over 8 compressed files)
> #3 Compressed logs in sequencefile tables: 101sec (filesize of 4,773 MB over 126 compressed files)

Some more stats, if anyone's interested. I ran all the three tables (described above) through my ETL query (as described in http://nandz.blogspot.com/2009/07/using-hive-for-weblog-analysis.html).

#1: 699sec with 1,561,633 rows in the final table
#2: 563sec with 1,561,633 rows in the final table
#3: 697sec with 1,654,291 rows in the final table (!)

For #3 I've got a different row count. I tried importing the gzipped files and putting them through ETL again and ended up with 1,743,377 rows the second time! I will spend some more time to see where I'm going wrong. However, with these stats it seems that approach #2 gives the best results with complex queries.

#1 = Uncompressed log files into uncompressed textfile tables
#2 = Inserting #1 with compression on into sequencefile tables
#3 = Compressed log files (gzip) into textfile tables

Saurabh.
-
Re: Re: bz2 Splits.Saurabh Nanda 2009-07-27, 10:06
One last question here. If both TextFile and SequenceFile can be compressed, then what's the advantage of the SequenceFile format? Is it that a compressed file can be split into chunks only if it is stored as a SequenceFile?

Saurabh.
-
Re: Re: bz2 Splits.Saurabh Nanda 2009-07-27, 10:51
> Can you help put that information into appropriate place on the wiki
> (where you see fit)?
> Thanks for the help.

http://wiki.apache.org/hadoop/CompressedStorage (please QC and correct where wrong)
http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL?action=diff
http://wiki.apache.org/hadoop/Hive/LanguageManual/DML?action=diff

Saurabh.
-
Re: Re: bz2 Splits.Saurabh Nanda 2009-07-27, 15:38
> #2 Compressed logs in textfile tables: 60sec (filesize of 736 MB over 8 compressed files)
> #3 Compressed logs in sequencefile tables: 101sec (filesize of 4,773 MB over 126 compressed files)

Why is there such a *big* difference in compression ratios between the gzip utility and Hive?

Uncompressed file size: approx 3500 MB
Gzip utility: approx 250 MB
org.apache.hadoop.io.compress.GzipCodec (BLOCK): approx 1600 MB
org.apache.hadoop.io.compress.DefaultCodec (BLOCK): approx 1700 MB

Saurabh.
-
Re: bz2 Splits.Prasad Chakka 2009-07-27, 17:25
SequenceFile block compression happens on smaller chunks (around 1MB, I think), so the compression ratio would be smaller than when compressing the complete file.
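The effect described here can be reproduced on a small scale with the gzip utility alone. The sketch below is an illustration only and does not use Hadoop's SequenceFile writer or codecs: it compares the compressed size of a file gzipped whole against the summed size of the same file gzipped in independent chunks. The function names whole_size and chunked_size are hypothetical.

```shell
#!/bin/sh
# Compare gzip output size for a whole file versus the same file compressed
# in independent chunks. Illustration only; not Hadoop's SequenceFile writer.

# usage: whole_size FILE  (prints compressed size in bytes)
whole_size() {
  gzip -c < "$1" | wc -c | tr -d ' '
}

# usage: chunked_size FILE LINES_PER_CHUNK  (prints summed compressed size)
chunked_size() {
  dir=$(mktemp -d) || return 1
  split -l "$2" "$1" "$dir/chunk_"
  total=0
  for chunk in "$dir"/chunk_*; do
    # Each chunk is compressed on its own, so it cannot reuse patterns
    # already seen in earlier chunks and pays its own header overhead.
    size=$(gzip -c < "$chunk" | wc -c | tr -d ' ')
    total=$((total + size))
  done
  rm -rf "$dir"
  echo "$total"
}
```

Running `chunked_size access.log 1` compresses every line separately, which is loosely analogous to record-level compression, while a larger chunk size is loosely analogous to block compression. On repetitive log data the per-line total is typically far larger than the whole-file size.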
-
Re: bz2 Splits.Saurabh Nanda 2009-07-28, 04:13
> Sequence Block compression happens on smaller chunks (around 1MB I think)
> so the compression ration would be smaller than compressing complete file.

Is there a configuration parameter which controls this? Is it io.seqfile.compress.blocksize? It was set to 1,000,000 in hadoop-default.xml, which is approx 1MB.

Saurabh.
-
Re: Re: bz2 Splits.Zheng Shao 2009-07-28, 04:53
I cannot imagine there is such a huge compression ratio difference. On our side, the compression ratios of gzip and GzipCodec (BLOCK) are within a 10% relative difference of each other.

Log file compression ratio is usually 5x to 15x, so 250MB looks like a good one.
The 1600MB number looks like record-level compression. Are you sure you've turned on block compression?

Zheng
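The difference between record-level and block-level compression can be illustrated with a small Python sketch. It is an illustration only, under these assumptions: the records are synthetic log lines, zlib stands in for the codec, "record-level" is modeled as compressing each record on its own, and "block-level" as compressing all records together. It does not reproduce how SequenceFile lays out data.

```python
import zlib

# Synthetic log records (an assumption for illustration only).
records = [
    f'10.0.0.{i % 256} - - [27/Jul/2009:10:{i % 60:02d}:00] "GET /item/{i % 1000} HTTP/1.1" 200\n'.encode("ascii")
    for i in range(10000)
]

raw = sum(len(r) for r in records)

# Record-level: every record is compressed independently, so repetition
# across records cannot be exploited and per-record overhead adds up.
record_level = sum(len(zlib.compress(r)) for r in records)

# Block-level: many records are compressed together, so repeated
# substrings across records are encoded compactly.
block_level = len(zlib.compress(b"".join(records)))

print(f"raw:          {raw} bytes")
print(f"record-level: {record_level} bytes")
print(f"block-level:  {block_level} bytes")
```

Short records carry little redundancy on their own, so compressing them one at a time saves very little, which is consistent with a result like 3500 MB shrinking only to about 1600 MB.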
-
Re: Re: bz2 Splits.Saurabh Nanda 2009-07-28, 05:02
> The 1600MB number looks like record-level compression. Are you sure
> you've turned on block compression?

Here's the exact snippet from my shell script. Do I have to set these configuration parameters directly in the Hadoop configuration file?

${HIVE_COMMAND} -e "set hive.exec.compress.output=true;
  set io.seqfile.compression.type=BLOCK;
  set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
  set io.seqfile.compress.blocksize=50000000;
  insert overwrite table raw_compressed partition(dt='${D}') select line from raw where dt='${D}'"

Saurabh.
--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com
-
Re: Re: bz2 Splits.Zheng Shao 2009-07-28, 05:34
Hi Saurabh,

The right configuration parameter is:
set mapred.output.compression.type=BLOCK;

Sorry about pointing you to the wrong configuration parameter.

Zheng
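With that correction, the statement list passed to Hive might be assembled as in the sketch below. The helper build_hive_script and the partition date are hypothetical; the setting names and the query come from the messages in this thread. io.seqfile.compress.blocksize is left out here on the assumption that the default is acceptable.

```python
def build_hive_script(settings: dict, query: str) -> str:
    """Join 'set key=value;' statements and a query into one script string."""
    set_statements = [f"set {key}={value};" for key, value in settings.items()]
    return " ".join(set_statements + [query])


# Settings named in the thread, using the corrected compression-type parameter.
settings = {
    "hive.exec.compress.output": "true",
    "mapred.output.compression.type": "BLOCK",
    "mapred.output.compression.codec": "org.apache.hadoop.io.compress.GzipCodec",
}

partition_date = "2009-07-27"  # hypothetical value standing in for ${D}
query = (
    f"insert overwrite table raw_compressed partition(dt='{partition_date}') "
    f"select line from raw where dt='{partition_date}'"
)

script = build_hive_script(settings, query)
print(script)
```

The resulting string is what would be handed to the Hive command line after -e, in place of the quoted string in the original shell snippet.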
-
Re: Re: bz2 Splits.Saurabh Nanda 2009-07-28, 05:38
> The right configuration parameter is:
> set mapred.output.compression.type=BLOCK;

I've set mapred.output.compression.type and changed io.seqfile.compress.blocksize to 100,000,000 (100MB) and now 3600 MB files are down to 260MB!

Is such high compression recommended?

Saurabh.
--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com
-
Re: Re: bz2 Splits.Zheng Shao 2009-07-28, 05:55
In our setup, we didn't change io.seqfile.compress.blocksize (1MB) and it's still fairly good.
You are free to try 100MB for a better compression ratio, but I would recommend keeping the default setting to minimize the possibility of hitting unknown bugs.

Zheng
-
Re: Re: bz2 Splits.Saurabh Nanda 2009-07-28, 06:08
> In our setup, we didn't change io.seqfile.compress.blocksize (1MB) and
> it's still fairly good.
> You are free to try 100MB for better compression ratio, but I would
> recommend to keep the default setting to minimize the possibilities of
> hitting unknown bugs.

Makes sense. Better compression brought down a count(1) query from 100+ sec down to 40sec. The ETL phase is now taking 510sec as opposed to 700sec earlier.

Do you also compress all tables, not just the raw ones? Would you recommend it?

Saurabh.
--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com
-
Re: Re: bz2 Splits.Zheng Shao 2009-07-28, 06:22
Yes, we do compress all tables.

Zheng
-
Re: Re: bz2 Splits.Edward Capriolo 2009-07-28, 15:02
On Tue, Jul 28, 2009 at 2:22 AM, Zheng Shao<[EMAIL PROTECTED]> wrote:
> Yes we do compress all tables.

Saurabh,

Thank you for the wiki page on this. Keep up the good work and please post all your findings about compression. Many people (including me) will benefit from an explanation about the different types of compression available and the trade-offs of different codecs and options. I am really excited as I have (shamefully) had some large tables with multiple text files building up, and the thought of smaller data and faster queries is giving me goosebumps.

Edward
-
Re: Re: bz2 Splits.Edward Capriolo 2009-07-28, 16:39
On a related note..

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.IllegalArgumentException: SequenceFile doesn't work with
GzipCodec without native-hadoop code!

:(

I have an 18.3 (cloudera) system in production.
hadoop-native-0.18.3-7.cloudera.CH0_3.i386.rpm

Is there any Java-based codec I could use that does not require external native libraries?
-
Re: Re: bz2 Splits.Saurabh Nanda 2009-07-29, 05:01
> Thank you for the wiki page on this. Keep up the good work and please
> post all your findings about compression. Many people (including me)
> will benefit from an explanation about the different types of
> compression available and the trade offs of different codecs and
> options.

Thanks, Edward. I'm glad that it helped someone.

Saurabh.
--
http://nandz.blogspot.com
http://foodieforlife.blogspot.com