zuohua zhang 2012-10-02, 19:53
Alexander Pivovarov 2012-10-02, 20:16
Re: best way to load millions of gzip files in hdfs to one table in hive?
Edward Capriolo 2012-10-02, 21:45
You may want to use:
https://github.com/edwardcapriolo/filecrush

We use this to deal with pathological cases, although the best idea is to avoid big files altogether.

Edward

On Tue, Oct 2, 2012 at 4:16 PM, Alexander Pivovarov <[EMAIL PROTECTED]> wrote:
> Options
> 1. create table and put files under the table dir
>
> 2. create external table and point it to files dir
>
> 3. if files are small then I recommend creating a new set of files using a
> simple MR program and specifying the number of reduce tasks. Goal is to make
> file size > hdfs block size (it saves NN memory and reads will be faster)
>
>
> On Tue, Oct 2, 2012 at 3:53 PM, zuohua zhang <[EMAIL PROTECTED]> wrote:
>>
>> I have millions of gzip files in hdfs (with the same fields), would like
>> to load them into one table in hive with a specified schema.
>> What is the most efficient way to do that?
>> Given that my data is only in hdfs, and also gzipped, does that mean I
>> could just simply set up the table somehow bypassing some unnecessary
>> overhead of the typical approach?
>>
>> Thanks!
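
For option 3 in the quoted message, the number of reduce tasks decides how large each rewritten file ends up. A rough sketch of that arithmetic follows. The function name reducer_count, its parameters, and the "two blocks per file" target are illustrative assumptions, not anything from the thread, and it assumes records spread roughly evenly across reducers and that total_bytes is the expected size of the rewritten output.

```python
def reducer_count(total_bytes: int, block_bytes: int, blocks_per_file: int = 2) -> int:
    """Pick a reduce-task count so the average output file spans at least
    blocks_per_file HDFS blocks.

    Hypothetical helper: assumes output is spread evenly across reducers.
    """
    if total_bytes < 0 or block_bytes <= 0 or blocks_per_file < 1:
        raise ValueError("sizes must be positive")
    target_file_bytes = block_bytes * blocks_per_file
    # Floor division keeps the average file size at or above the target;
    # always run at least one reducer.
    return max(1, total_bytes // target_file_bytes)


if __name__ == "__main__":
    one_tib = 2 ** 40
    block = 128 * 2 ** 20  # assumed block size; check dfs.blocksize on your cluster
    print(reducer_count(one_tib, block))
```

With 1 TiB of output and an assumed 128 MiB block size this gives 4096 reducers, so each file averages 256 MiB. The block size differs between clusters and Hadoop versions, so it is passed in rather than hard-coded.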
Abhishek 2012-10-02, 23:31