Hive user mailing list: Load Data into Two Partitions of a Hive Table in Parallel


Earlier messages in this thread:
  selva (2013-05-03, 07:04)
  Nitin Pawar (2013-05-03, 07:36)
  selva (2013-05-03, 08:21)

Re: Load Data into Two Partitions of a Hive Table in Parallel
Why are you using the LOAD DATA syntax? Are these Hive-managed tables? LOAD DATA will actually copy the files into the table's directory in HDFS.

I would recommend using an EXTERNAL table and

ALTER TABLE processedlogs ADD PARTITION (logdate='2013-04-01') LOCATION '/logs/processed/2013-04-01';

This just makes entries in the metastore (MySQL) and is a lot faster.
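
A minimal sketch of that approach, assuming a single-column schema for illustration (the real processedlogs columns are not shown anywhere in this thread):

-- External table: Hive only tracks metadata; the files stay where they are in HDFS.
CREATE EXTERNAL TABLE processedlogs (line STRING)
PARTITIONED BY (logdate STRING)
LOCATION '/logs/processed';

-- Register each day's directory as a partition; each statement is a metastore-only operation.
ALTER TABLE processedlogs ADD PARTITION (logdate='2013-04-01') LOCATION '/logs/processed/2013-04-01';
ALTER TABLE processedlogs ADD PARTITION (logdate='2013-04-02') LOCATION '/logs/processed/2013-04-02';
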
From: selva <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Date: Friday, May 3, 2013 1:21 AM
To: [EMAIL PROTECTED]
Subject: Re: Load Data into Two Partitions of a Hive Table in Parallel

The only thing I was confused about was whether the MySQL metastore update might take a lock and cause data loss.

Now I am clear on this. Thanks a lot, Nitin.
On Fri, May 3, 2013 at 1:06 PM, Nitin Pawar <[EMAIL PROTECTED]> wrote:

Why don't you load all of your data into a temporary table and then insert from there into your current tables?

Hive will take care of adding the dynamic partitions, and that will take the overhead off you.
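
A rough sketch of that staging pattern, assuming the staged data carries a logdate column (the staging table name and schema here are illustrative, not from the thread):

-- Enable dynamic partitioning for the insert.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Stage all of the data in one unpartitioned table first.
CREATE TABLE processedlogs_staging (line STRING, logdate STRING);

-- A single insert routes each row to its partition by the logdate value.
INSERT OVERWRITE TABLE processedlogs PARTITION (logdate)
SELECT line, logdate FROM processedlogs_staging;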

To answer your question: you can always load data into different partitions in parallel, as long as you have resources available on the Hive CLI machine.

On May 3, 2013 12:35 PM, "selva" <[EMAIL PROTECTED]> wrote:
>
> Hi All,
>
> I need to load a month's worth of processed data into a Hive table. The table has 10 partitions. Each day has many files to load, each file consistently takes about two seconds, and I have ~3000 files, so it will take days to complete 30 days' worth of data.
>
> I planned to load each day's data in parallel into its respective partition so that I can finish in a short time.
>
> But I need clarification before proceeding.
>
> Question:
>
> 1. Will loading in parallel into different partitions of the same Hive table cause data loss/corruption?
>
> For example, assume I am doing the following:
>
> Table: processedlogs
> Partition column: logdate
>
> Running the commands below in parallel:
> LOAD DATA INPATH '/logs/processed/2013-04-01' OVERWRITE INTO TABLE processedlogs PARTITION(logdate='2013-04-01');
> LOAD DATA INPATH '/logs/processed/2013-04-02' OVERWRITE INTO TABLE processedlogs PARTITION(logdate='2013-04-02');
> LOAD DATA INPATH '/logs/processed/2013-04-03' OVERWRITE INTO TABLE processedlogs PARTITION(logdate='2013-04-03');
> LOAD DATA INPATH '/logs/processed/2013-04-04' OVERWRITE INTO TABLE processedlogs PARTITION(logdate='2013-04-04');
> .....
> LOAD DATA INPATH '/logs/processed/2013-04-30' OVERWRITE INTO TABLE processedlogs PARTITION(logdate='2013-04-30');
>
> Thanks
> Selva

--
-- selva