Hive >> mail # user >> best way to load millions of gzip files in hdfs to one table in hive?


zuohua zhang 2012-10-02, 19:53
Alexander Pivovarov 2012-10-02, 20:16
Re: best way to load millions of gzip files in hdfs to one table in hive?
You may want to use:

https://github.com/edwardcapriolo/filecrush

We use this to deal with pathological cases, although the best idea is
to avoid ending up with that many small files in the first place.

Edward
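
If the small files are already readable through a Hive table, a similar consolidation can also be done from Hive itself rather than with filecrush or a custom MR job: rewrite the rows through a fixed number of reducers so the output becomes a handful of large gzip files. A minimal sketch; the table names (events_raw, events_compacted), the reducer count, and the settings shown are illustrative, not from this thread:

  -- keep the rewritten output gzip-compressed
  SET hive.exec.compress.output=true;
  SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
  -- choose so that each output file ends up at least one HDFS block in size
  SET mapred.reduce.tasks=64;

  -- hypothetical target table with the same schema as the source
  CREATE TABLE events_compacted LIKE events_raw;

  -- DISTRIBUTE BY rand() forces a reduce stage and spreads rows evenly,
  -- so the job writes ~64 large files instead of millions of small ones
  INSERT OVERWRITE TABLE events_compacted
  SELECT * FROM events_raw
  DISTRIBUTE BY rand();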

On Tue, Oct 2, 2012 at 4:16 PM, Alexander Pivovarov
<[EMAIL PROTECTED]> wrote:
> Options
> 1. create table and put files under the table dir
>
> 2. create external table and point it to files dir
>
> 3. if the files are small then I recommend creating a new set of files using a
> simple MR program and specifying the number of reduce tasks. The goal is to make
> the file size > hdfs block size (it saves NN memory and reads will be faster)
>
>
> On Tue, Oct 2, 2012 at 3:53 PM, zuohua zhang <[EMAIL PROTECTED]> wrote:
>>
>> I have millions of gzip files in hdfs (with the same fields), and would like
>> to load them into one table in hive with a specified schema.
>> What is the most efficient way to do that?
>> Given that my data is only in hdfs, and also gzipped, does that mean I
>> could simply set up the table somehow, bypassing some of the unnecessary
>> overhead of the typical approach?
>>
>> Thanks!
>
>
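
To make options 1 and 2 from the quoted reply concrete (and to answer the original question: nothing needs to be decompressed or re-loaded, since Hive's default TextInputFormat reads gzip text files transparently), here is a minimal external-table sketch; the column names, delimiter, and HDFS path are hypothetical:

  -- point an external table at the directory that already holds the .gz files;
  -- the files stay where they are and only metadata is written to the metastore
  CREATE EXTERNAL TABLE events_raw (
    user_id  STRING,
    event_ts STRING,
    payload  STRING
  )
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
  LOCATION '/data/events/gz/';

  -- queries decompress on the fly; note that gzip is not splittable,
  -- so each .gz file is read by a single map task
  SELECT COUNT(*) FROM events_raw;
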
Abhishek 2012-10-02, 23:31