-
Lzo vs SequenceFile for big file
Young-Geun Park 2012-09-06, 23:25
Hi all,

I have tested which of LZO and SequenceFile performs better as input for a big file. The file is 10 GiB and the job is WordCount MR. The three inputs are:

- lzo: an LZO-compressed file indexed with LzoIndexTool
- seq: a SequenceFile with block-level Snappy compression
- none: the uncompressed original file

Map output is compressed in every case except the uncompressed one. The final MapReduce output is not compressed in any case.

WordCount MR running times:

none    lzo     seq
248s    243s    1410s

Test environment:

- OS : CentOS 5.6 (x64) (kernel = 2.6.18)
- # of cores : 8 (CPU = Intel(R) Xeon(R) CPU E5504 @ 2.00GHz)
- RAM : 18GB
- Java version : 1.6.0_26
- Hadoop version : CDH3U2
- # of datanodes (tasktrackers) : 8

According to the result, the running time with the SequenceFile input is much longer than with the others. Before testing, I had expected the SequenceFile and LZO results to be about the same.

Why is the performance of the Snappy-compressed sequence file so bad? Am I missing anything in the tests?

Regards,
Park
-
Re: Lzo vs SequenceFile for big file
Ruslan Al-Fakikh 2012-09-07, 08:26
Hi,

It would be interesting to see the jobs' statistics (counters).

Thanks

--
Best Regards,
Ruslan Al-Fakikh
-
Re: Lzo vs SequenceFile for big file
박영근 2012-09-07, 09:06
Ruslan,
Thanks for your reply.

The jobs' statistics are as follows:

case 1 : uncompressed data (none)
12/08/09 16:12:44 INFO mapred.JobClient: Job complete: job_201208021633_0049
12/08/09 16:12:44 INFO mapred.JobClient: Counters: 23
12/08/09 16:12:44 INFO mapred.JobClient: Job Counters
12/08/09 16:12:44 INFO mapred.JobClient: Launched reduce tasks=1
12/08/09 16:12:44 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3623053
12/08/09 16:12:44 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/09 16:12:44 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/08/09 16:12:44 INFO mapred.JobClient: Rack-local map tasks=1
12/08/09 16:12:44 INFO mapred.JobClient: Launched map tasks=166
12/08/09 16:12:44 INFO mapred.JobClient: Data-local map tasks=165
12/08/09 16:12:44 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=220786
12/08/09 16:12:44 INFO mapred.JobClient: FileSystemCounters
12/08/09 16:12:44 INFO mapred.JobClient: FILE_BYTES_READ=1852424288
12/08/09 16:12:44 INFO mapred.JobClient: HDFS_BYTES_READ=10644581454
12/08/09 16:12:44 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1894096220
12/08/09 16:12:44 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=211440
12/08/09 16:12:44 INFO mapred.JobClient: Map-Reduce Framework
12/08/09 16:12:44 INFO mapred.JobClient: Reduce input groups=13661
12/08/09 16:12:44 INFO mapred.JobClient: Combine output records=69055428
12/08/09 16:12:44 INFO mapred.JobClient: Map input records=158156100
12/08/09 16:12:44 INFO mapred.JobClient: Reduce shuffle bytes=33143186
12/08/09 16:12:44 INFO mapred.JobClient: Reduce output records=13661
12/08/09 16:12:44 INFO mapred.JobClient: Spilled Records=122916251
12/08/09 16:12:44 INFO mapred.JobClient: Map output bytes=15704921900
12/08/09 16:12:44 INFO mapred.JobClient: Combine input records=1332132129
12/08/09 16:12:44 INFO mapred.JobClient: Map output records=1265248800
12/08/09 16:12:44 INFO mapred.JobClient: SPLIT_RAW_BYTES=19716
12/08/09 16:12:44 INFO mapred.JobClient: Reduce input records=2172099

case 2 : lzo
12/08/09 15:58:11 INFO mapred.JobClient: Job complete: job_201208021633_0048
12/08/09 15:58:11 INFO mapred.JobClient: Counters: 23
12/08/09 15:58:11 INFO mapred.JobClient: Job Counters
12/08/09 15:58:11 INFO mapred.JobClient: Launched reduce tasks=1
12/08/09 15:58:11 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3361287
12/08/09 15:58:11 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/09 15:58:11 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/08/09 15:58:11 INFO mapred.JobClient: Rack-local map tasks=4
12/08/09 15:58:11 INFO mapred.JobClient: Launched map tasks=65
12/08/09 15:58:11 INFO mapred.JobClient: Data-local map tasks=61
12/08/09 15:58:11 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=183529
12/08/09 15:58:11 INFO mapred.JobClient: FileSystemCounters
12/08/09 15:58:11 INFO mapred.JobClient: FILE_BYTES_READ=568178351
12/08/09 15:58:11 INFO mapred.JobClient: HDFS_BYTES_READ=3860287251
12/08/09 15:58:11 INFO mapred.JobClient: FILE_BYTES_WRITTEN=576095398
12/08/09 15:58:11 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=211440
12/08/09 15:58:11 INFO mapred.JobClient: Map-Reduce Framework
12/08/09 15:58:11 INFO mapred.JobClient: Reduce input groups=13661
12/08/09 15:58:11 INFO mapred.JobClient: Combine output records=66734193
12/08/09 15:58:11 INFO mapred.JobClient: Map input records=158156100
12/08/09 15:58:11 INFO mapred.JobClient: Reduce shuffle bytes=4752406
12/08/09 15:58:11 INFO mapred.JobClient: Reduce output records=13661
12/08/09 15:58:11 INFO mapred.JobClient: Spilled Records=132612729
12/08/09 15:58:11 INFO mapred.JobClient: Map output bytes=15704921900
12/08/09 15:58:11 INFO mapred.JobClient: Combine input records=1331190655
12/08/09 15:58:11 INFO mapred.JobClient: Map output records=1265248800
12/08/09 15:58:11 INFO mapred.JobClient: SPLIT_RAW_BYTES=7366
12/08/09 15:58:11 INFO mapred.JobClient: Reduce input records=792338

case 3 : sequence file with block-level Snappy compression
12/09/05 18:33:00 INFO mapred.JobClient: Job complete: job_201209051652_0008
12/09/05 18:33:00 INFO mapred.JobClient: Counters: 23
12/09/05 18:33:00 INFO mapred.JobClient: Job Counters
12/09/05 18:33:00 INFO mapred.JobClient: Launched reduce tasks=1
12/09/05 18:33:00 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=5885897
12/09/05 18:33:00 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/09/05 18:33:00 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/09/05 18:33:00 INFO mapred.JobClient: Rack-local map tasks=2
12/09/05 18:33:00 INFO mapred.JobClient: Launched map tasks=68
12/09/05 18:33:00 INFO mapred.JobClient: Data-local map tasks=66
12/09/05 18:33:00 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=1320075
12/09/05 18:33:00 INFO mapred.JobClient: FileSystemCounters
12/09/05 18:33:00 INFO mapred.JobClient: FILE_BYTES_READ=3706936196
12/09/05 18:33:00 INFO mapred.JobClient: HDFS_BYTES_READ=4419150507
12/09/05 18:33:00 INFO mapred.JobClient: FILE_BYTES_WRITTEN=4581439981
12/09/05 18:33:00 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=211440
12/09/05 18:33:00 INFO mapred.JobClient: Map-Reduce Framework
12/09/05 18:33:00 INFO mapred.JobClient: Reduce input groups=13661
12/09/05 18:33:00 INFO mapred.JobClient: Combine output records=0
12/09/05 18:33:00 INFO mapred.JobClient: Map input records=158156100
12/09/05 18:33:00 INFO mapred.JobClient: Reduce shuffle bytes=857964933
12/09/05 18:33:00 INFO mapred.JobClient: Reduce output records=13661
12/09/05 18:33:00 INFO mapred.JobClient: Spilled Records=6232725043
12/09/05 18:
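
For readers comparing the three reports, the counters can be put side by side with a little arithmetic. The following is only a sketch in Python: the numbers are copied from the reports in this thread, the wall-clock times come from the first post, and the names `COUNTERS`, `summarize` and `summary` are made up for this illustration. The case 3 report is cut off, so only counters that appear for all three jobs are used. Note also that, going by the job IDs and timestamps, case 3 ran on a different day and under a different JobTracker start than cases 1 and 2.

```python
# Counter values copied from the job reports in this thread; wall-clock
# seconds come from the first post. The case 3 report is truncated after
# "Spilled Records", so only counters present for all three jobs are used.
COUNTERS = {
    "none": {
        "wall_s": 248,
        "hdfs_bytes_read": 10644581454,
        "map_input_records": 158156100,
        "combine_output_records": 69055428,
        "spilled_records": 122916251,
        "reduce_shuffle_bytes": 33143186,
        "slots_millis_reduces": 220786,
    },
    "lzo": {
        "wall_s": 243,
        "hdfs_bytes_read": 3860287251,
        "map_input_records": 158156100,
        "combine_output_records": 66734193,
        "spilled_records": 132612729,
        "reduce_shuffle_bytes": 4752406,
        "slots_millis_reduces": 183529,
    },
    "seq": {
        "wall_s": 1410,
        "hdfs_bytes_read": 4419150507,
        "map_input_records": 158156100,
        "combine_output_records": 0,
        "spilled_records": 6232725043,
        "reduce_shuffle_bytes": 857964933,
        "slots_millis_reduces": 1320075,
    },
}


def summarize(c):
    """Derive a few comparable figures from one job's raw counters."""
    return {
        "spills_per_input_record": c["spilled_records"] / c["map_input_records"],
        "shuffle_mib": c["reduce_shuffle_bytes"] / 2**20,
        "reduce_slot_s": c["slots_millis_reduces"] / 1000,
        "combiner_emitted_records": c["combine_output_records"] > 0,
    }


summary = {name: summarize(c) for name, c in COUNTERS.items()}

for name, s in summary.items():
    print(
        f"{name:>4}: wall={COUNTERS[name]['wall_s']}s "
        f"spills/input={s['spills_per_input_record']:.2f} "
        f"shuffle={s['shuffle_mib']:.1f}MiB "
        f"reduce_slots={s['reduce_slot_s']:.0f}s "
        f"combiner_output={s['combiner_emitted_records']}"
    )
```

On these numbers, the seq job spills roughly 39 records per map input record, against fewer than one for the other two jobs, and it reports Combine output records=0. That pattern would be consistent with the combiner not taking effect in the seq run, but nothing in the thread confirms the cause, so it is best treated as something to check in the job setup rather than as a diagnosis.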
-
Re: Lzo vs SequenceFile for big file
Young-Geun Park 2012-09-10, 02:29
Has anyone tested the performance of the sequence file format against LZO?

Regards,
Park
-
Re: Lzo vs SequenceFile for big file
Harsh J 2012-09-10, 04:40
A few things:

- Storing simple, single-line text records in sequence files isn't optimal, as you're just adding overhead for every line of text stored in it as a Text value.
- If you have typed data and can benefit from type-based serialization of each record, go for a container format like SequenceFiles (with whatever serialization technique) or Avro DataFiles (which have embedded schema support, among other niceties).
- When comparing the result with LZO, also factor in the indexing time, as that's part of what it takes to process the file in parallel. (I think the newer libs auto-index, but that's just what I heard was the plan; I don't know if it's already available.)

--
Harsh J