Hive >> mail # user >> best way to load millions of gzip files in hdfs to one table in hive?


zuohua zhang 2012-10-02, 19:53
Alexander Pivovarov 2012-10-02, 20:16
Edward Capriolo
Re: best way to load millions of gzip files in hdfs to one table in hive?
You may want to use:

https://github.com/edwardcapriolo/filecrush

We use this to deal with pathological cases, although the best idea is
to avoid ending up with that many files altogether.

Edward
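
A related, Hive-only way to get fewer, larger files (this is not the filecrush
tool itself, just a sketch with made-up table names): copy the data into a
second table with Hive's small-file merge settings turned on, so the job's
output is merged into files closer to the block size.

  -- hypothetical table names; assumes events_raw is already defined over the gzip files
  SET hive.merge.mapfiles=true;                -- merge small files from map-only jobs
  SET hive.merge.mapredfiles=true;             -- merge small files from map-reduce jobs
  SET hive.merge.size.per.task=268435456;      -- aim for roughly 256 MB per merged output file
  SET hive.merge.smallfiles.avgsize=134217728; -- merge when the average output file is under 128 MB

  CREATE TABLE events_compacted LIKE events_raw;

  INSERT OVERWRITE TABLE events_compacted
  SELECT * FROM events_raw;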

On Tue, Oct 2, 2012 at 4:16 PM, Alexander Pivovarov
<[EMAIL PROTECTED]> wrote:
> Options
> 1. create a table and put the files under the table dir
>
> 2. create an external table and point it at the files dir (see the
> CREATE EXTERNAL TABLE sketch after the quoted thread)
>
> 3. if the files are small then I recommend creating a new set of files using a
> simple MR program and specifying the number of reduce tasks. The goal is to make
> each file larger than the HDFS block size (it saves NN memory and reads will be faster)
>
>
> On Tue, Oct 2, 2012 at 3:53 PM, zuohua zhang <[EMAIL PROTECTED]> wrote:
>>
>> I have millions of gzip files in HDFS (all with the same fields) and would like
>> to load them into one table in Hive with a specified schema.
>> What is the most efficient way to do that?
>> Given that my data is already in HDFS, and gzipped, does that mean I
>> could simply set the table up on top of it, bypassing some unnecessary
>> overhead of the typical load approach?
>>
>> Thanks!
>
>
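
For the external-table route (option 2 above), a minimal sketch with a made-up
schema, delimiter, and HDFS path: Hive's default text input format decompresses
.gz files transparently based on the file extension, so pointing an external
table at the existing directory avoids copying or re-loading the data.

  -- hypothetical column names, delimiter, and location
  CREATE EXTERNAL TABLE events_raw (
    event_time STRING,
    user_id    STRING,
    payload    STRING
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
  LOCATION '/data/events/gzipped/';

One caveat worth noting: gzip files are not splittable, so each file is read by
a single mapper regardless of its size.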
Abhishek 2012-10-02, 23:31