

RE: Modifying data before importation into Hive
How are you generating the value in the YEAR column? Is it a static value or something that gets computed from the data?

Ashish

________________________________
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 24, 2009 3:24 PM
To: [EMAIL PROTECTED]
Subject: Modifying data before importation into Hive

Hello,

I'm using Cloudera's hive-0.4.0+14.tar.gz with hadoop-0.20.1+152.tar.gz on a CentOS machine.

I've been able to load syslog files into Hive using the RegexSerDe class - this works great. But what if your log files are missing a column, or the data needs to be manipulated in some way before being put in the table? In our case, we'd like to add a YEAR column as it's not included in the log files. We'd like to avoid having to rewrite all the logs to put them in that format though.

One suggestion from Ashish to a user was to do something like a left outer join with data staged in another table and to target the results into a table with the desired structure. But the lines of our log file don't have a unique key we could use to do such a join - just things like host, day, month, etc.

Is there any other way to add information in conjunction with doing LOAD DATA INPATH, given that we can't add data after it's in the table?

Thanks
Ken
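
Editor's sketch (not part of the original thread): one way to add a constant column without rewriting the log files is to load them unchanged into a staging table and pass the rows through a small streaming script while copying them into the target table. The script below is only an illustration. The file name add_year.py, the function add_year, and the choice to pass the year as a command-line argument are all assumptions, and it assumes rows arrive on stdin as tab-separated text, one row per line.

```python
import sys


def add_year(rows, year):
    """Yield each tab-separated row with the year prepended as a new first column."""
    for row in rows:
        line = row.rstrip("\n")
        if not line:
            continue  # skip blank lines rather than emit a year-only row
        yield f"{year}\t{line}"


# Hypothetical command-line use: python add_year.py 2009 < rows.tsv
# The script does nothing when no year argument is given.
if __name__ == "__main__" and len(sys.argv) > 1:
    for out in add_year(sys.stdin, int(sys.argv[1])):
        print(out)
```

In Hive, a script like this could be shipped with ADD FILE and called from a SELECT TRANSFORM (...) USING 'python add_year.py 2009' AS (...) query that reads the staging table and inserts into a table with the desired structure. The table and column names would be your own. This only covers the case where the year is a single known value per load; a year that has to be computed from the data would need different logic in the script.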
