You can use Flume for ingestion into HDFS. Flume takes care of the file sizes for you: it combines the incoming data and stores it as one large file. This is the better approach.
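To make that concrete, here is a hedged sketch of a Flume NG HDFS sink configured to roll by size only, so many small events end up in one large file per partition directory. The agent/channel/sink names (agent1, k1) and the path are hypothetical; the hdfs.* property names are the sink's standard ones:

```
# roll a new HDFS file only when it reaches ~256 MB
agent1.sinks.k1.type = hdfs
agent1.sinks.k1.hdfs.path = /user/hive/warehouse/logs/dt=%Y-%m-%d
agent1.sinks.k1.hdfs.fileType = DataStream
agent1.sinks.k1.hdfs.rollSize = 268435456
agent1.sinks.k1.hdfs.rollCount = 0
agent1.sinks.k1.hdfs.rollInterval = 0
```

Setting rollCount and rollInterval to 0 disables event-count and time-based rolling, leaving only the size threshold.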
You can also run a custom MR job to merge files that are already in HDFS: use CombineFileInputFormat and start a map-only job with an identity mapper, with the split size set to the desired large-file size.
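The idea behind CombineFileInputFormat is that many small files are packed into a few large splits; with an identity mapper and zero reducers, each split then becomes one large output file. A toy Python sketch of that packing logic (a hypothetical simplification for illustration, not Hadoop's actual code):

```python
# Greedy packing of small files into splits bounded by a max split size,
# mimicking what CombineFileInputFormat does with small input files.
def pack_into_splits(file_sizes, max_split_size):
    """Group file sizes into splits whose total stays <= max_split_size."""
    splits, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > max_split_size:
            splits.append(current)          # close the current split
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits

# 24 hourly ~10 MB files, packed with a 256 MB split size:
hourly = [10 * 1024**2] * 24
splits = pack_into_splits(hourly, 256 * 1024**2)
print(len(splits))  # 1 split instead of 24 -> one merged output file
```

With the real job, you would set the max split size to your target file size, so the 24 hourly files collapse into a single map task and a single output file.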
From: Cheng Su <[EMAIL PROTECTED]>
Date: Thu, 15 Nov 2012 16:03:44
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: Can I merge files after I loaded them into hive?
Can I merge files after I loaded them into hive?
This is my situation:
There is a log table partitioned by date, which stores the nginx access logs.
The raw log files are loaded into hive every hour.
Right now a single log file is small, say 10 MB or even less.
So there are 24 small files in one partition.
This is inefficient in my opinion, and all those small files consume extra NameNode heap.
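The heap cost can be put in rough numbers. A commonly cited ballpark (an assumption here, not an exact figure) is that each file and each block costs on the order of 150 bytes of NameNode heap, so 24 hourly files cost about 24x the metadata of one merged daily file:

```python
# Rough NameNode-heap arithmetic. BYTES_PER_OBJECT is a ballpark
# assumption (~150 bytes per file/block object), not an exact figure.
BYTES_PER_OBJECT = 150

def namenode_bytes(files_per_day, days, blocks_per_file=1):
    # each file costs one file object plus its block objects
    objects = files_per_day * days * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

small = namenode_bytes(files_per_day=24, days=365)   # 24 hourly files
merged = namenode_bytes(files_per_day=1, days=365)   # 1 merged daily file
print(small, merged, small // merged)
```

The absolute numbers are tiny for one table, but the 24x ratio compounds across tables and years, which is why the small-files problem matters.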
That's why I want to merge the small files.
Can Hive merge those files automatically?
Or does Hive provide some tool to merge files?
Or should I just use hadoop dfs -cat to do that?
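One Hive-side option worth noting: Hive will not merge files brought in by LOAD DATA, but it can merge the small outputs of its own jobs via the hive.merge.* settings, so rewriting a partition with INSERT OVERWRITE would leave it merged. A hedged sketch of the relevant session settings (sizes illustrative):

```
-- enable merging of small output files for map-only and MR jobs
SET hive.merge.mapfiles = true;
SET hive.merge.mapredfiles = true;
-- target merged-file size and the avg size below which merging triggers
SET hive.merge.size.per.task = 256000000;
SET hive.merge.smallfiles.avgsize = 16000000;
-- then: INSERT OVERWRITE the date partition, selecting its own rows,
-- and Hive writes it back as a few large files.
```

This rewrites the data rather than merging in place, so it costs one extra job per partition.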