Compression is independent of YARN.
If you want to store files with compression, you should compress the files
when they are loaded into HDFS.
Hadoop recognizes compressed files on HDFS by their file extension, using the
codecs listed in the parameter "io.compression.codecs", which is set in
core-site.xml.
If you want to specify a different compression format for a Hive table, you
need to set "STORED AS INPUTFORMAT" to the class that handles that format,
such as "com.hadoop.mapred.DeprecatedLzoTextInputFormat".
1, You should compress each file in the dir rather than the whole dir.
2, Consider the compression ratio: bzip2 > gzip > lzo. However, the
decompression speed is in just the opposite order, so you need to strike a
balance. gzip is a popular choice as far as I know.
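3, No, that is not needed.
4, Yes, the input files are decompressed before the mappers run, and the
process is transparent to users. The output is compressed again only if
output compression is enabled for the job.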
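
To illustrate point 1, here is a minimal sketch of compressing each file in a
directory separately with gzip before loading it into HDFS. It uses only the
Python standard library; the function name gzip_each_file and the directory
names are hypothetical, chosen just for this example.

```python
import gzip
import shutil
from pathlib import Path


def gzip_each_file(src_dir, dst_dir):
    """Write one .gz file per regular file in src_dir; return the new paths.

    Hypothetical helper for illustration: each input file gets its own
    archive, so Hadoop can pick the codec from the ".gz" extension.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.iterdir()):
        if not path.is_file():
            continue  # skip subdirectories; only plain files are compressed
        target = dst / (path.name + ".gz")
        # Stream the file through gzip so large inputs are not read into memory.
        with open(path, "rb") as fin, gzip.open(target, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(target)
    return written
```

Calling gzip_each_file("input", "input_gz") would leave one .gz file per input
file in "input_gz", and that directory could then be uploaded with
"hadoop fs -put" and used as the job input.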
2013/10/16 xeon <[EMAIL PROTECTED]>
> I want execute the wordcount in yarn with compression enabled with a dir
> with several files, but for that I must compress the input.
> 1 - Should I compress the whole dir or each file in the dir?
> 2 - Should I use gzip or bzip2?
> 3 - Do I need to setup any yarn configuration file?
> 4 - when the job is running, the files are decompressed before running the
> mappers and compressed again after reducers executed?