Hi, I have a question about how mappers are generated for input files in HDFS. I understand the concepts of splits and blocks in HDFS, but my original understanding is that one mapper will only process data from one file in HDFS, no matter how small that file is. Is that correct?
The reason I ask is that in some ETL jobs I have seen logic that identifies the data set from the file naming convention. In the mapper, before processing the first KV pair, we can add logic in the map() method to get the file name of the current input and do some initialization based on it. After that, we don't need to worry that data could come from another file later, since one mapper task will only handle data from one file, even when the file is very small. So small files not only cause memory pressure on the NN, they also waste map tasks, since each map task may consume too little data.
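For illustration, the file-name-based initialization I mean looks roughly like the sketch below. It is plain Java with no Hadoop dependency; the class name, the setup/map method shapes, and the "<dataset>_<date>" naming convention are all made up for the example. In a real job the path would come from the task's input split (for example FileSplit.getPath()), not from a string argument.

```java
import java.util.Locale;

// Sketch only: hypothetical naming convention "<dataset>_<yyyymmdd>.<ext>",
// e.g. "orders_20130101.txt". Not taken from any real ETL job.
final class FileNameInit {
    private String dataset; // set once, before the first record is processed

    /** Returns the dataset prefix of a path such as "/data/in/orders_20130101.txt". */
    static String datasetOf(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        int cut = name.indexOf('_');
        if (cut < 0) {
            // No underscore: fall back to everything before the extension.
            int dot = name.lastIndexOf('.');
            cut = dot < 0 ? name.length() : dot;
        }
        return name.substring(0, cut).toLowerCase(Locale.ROOT);
    }

    /** Stand-in for per-task initialization; assumes the task reads exactly one file. */
    void setup(String inputPath) {
        dataset = datasetOf(inputPath);
    }

    /** Stand-in for map(): tags each record with the dataset chosen in setup(). */
    String map(String line) {
        return dataset + "\t" + line;
    }

    public static void main(String[] args) {
        FileNameInit mapper = new FileNameInit();
        mapper.setup("/data/in/orders_20130101.txt");
        System.out.println(mapper.map("row-1"));
    }
}
```

This only stays correct if every record the task sees really comes from the file passed to setup(), which is exactly the assumption I am asking about.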
But today, when I ran the following Hive query (Hadoop 1.0.4 and Hive 0.9.1):
select partition_column, count(*) from test_table group by partition_column
It only generated 2 mappers in the MR job. This is an external Hive table, and the input for this MR job is only 338M, but the table has more than 100 data files in HDFS, many of them very small. This is a one-node cluster, but it is configured in full cluster mode, not local mode. Shouldn't the MR job generated here trigger at least 100 mappers? Is it because, in Hive, my original assumption no longer holds?
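To make my expectation concrete, here is a rough sketch of the arithmetic I had in mind. It is a simplified model, not Hadoop's actual split code: it ignores any slop factor and empty files, and the file sizes and the 64 MB split size are made-up numbers, not my real data. The second method is only there for comparison, showing what the count would be if small files were allowed to share a split.

```java
// Simplified split-count model. All sizes are hypothetical.
final class SplitEstimate {
    /** Splits for one file when splits never cross files: ceil(size / splitSize). */
    static long splitsForFile(long fileBytes, long splitBytes) {
        if (fileBytes <= 0) {
            return 0; // simplification: empty files are ignored in this model
        }
        return (fileBytes + splitBytes - 1) / splitBytes;
    }

    /** My assumption: each file is split on its own, so every file costs at least one mapper. */
    static long perFileSplits(long[] fileSizes, long splitBytes) {
        long total = 0;
        for (long size : fileSizes) {
            total += splitsForFile(size, splitBytes);
        }
        return total;
    }

    /** Comparison case: files are packed together, so only the total byte count matters. */
    static long packedSplits(long[] fileSizes, long splitBytes) {
        long totalBytes = 0;
        for (long size : fileSizes) {
            totalBytes += Math.max(size, 0);
        }
        return splitsForFile(totalBytes, splitBytes);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] sizes = new long[100];
        java.util.Arrays.fill(sizes, 3 * mb); // 100 small files of 3 MB each
        System.out.println("per-file: " + perFileSplits(sizes, 64 * mb));
        System.out.println("packed:   " + packedSplits(sizes, 64 * mb));
    }
}
```

Under the per-file model, 100 small files can never give fewer than 100 mappers, which is why seeing only 2 surprised me.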