Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # user - Storing data in TSV with changing headers


Copy link to this message
-
Re: Storing data in TSV with changing headers
Mark Grover 2012-11-30, 15:12
Hi Marc,
While what Dean said is true for different schemas in general, there is a
way to do it all in the same table if the schema changes to the TSV file
are just additions of new tab-separated columns at the very end of each row
and no existing columns are being deleted.

Let's say your TSV file looked like
a_val1    b_val1    c_val1
a_val2    b_val2    c_val2

Your table definition could look like:
CREATE TABLE t1(a STRING, b STRING, c STRING) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\t';

Now say you insert more data into this table (externally through HDFS
commands) which looks like
a_val3    b_val3    c_val3    d_val3
a_val4    b_val4    c_val4    d_val4

Now, if you haven't changed the table definition just yet, it still says 3
columns, so a select * from t1 would still give you 3 columns. In order
words, the newly added 4th column doesn't get read and doesn't have any
adverse effect on the table if the table definition is still old.
a_val1    b_val1    c_val1
a_val2    b_val2    c_val2
a_val3    b_val3    c_val3
a_val4    b_val4    c_val4

Now let's go ahead and change the table definition. You could use one of
the ALTER TABLE commands for this. I personally like to create external
tables, drop the tables and recreate them on top of the same data.

Your new table create statement may look something like:
CREATE TABLE t1(a STRING, b STRING, c STRING,d STRING) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

A select * from t1 in this case would reveal:
a_val1    b_val1    c_val1    NULL
a_val2    b_val2    c_val2    NULL
a_val3    b_val3    c_val3    d_val3
a_val4    b_val4    c_val4    d_val4

New columns show up correctly. Values from old data for which those columns
didn't exist show up as NULLs.

So, to summarize, a single table may work, if the following 3 conditions
are met.
1. New columns get added to the very end of the existing columns
2. No existing columns are deleted
3. You are ok with NULL showing up as value for old records that don't have
a given new column.

I tested the above with CSV files, TSV should be the same as well.

Mark

On Fri, Nov 30, 2012 at 5:40 AM, Dean Wampler <
[EMAIL PROTECTED]> wrote:

> You'll have to define separate tables for the different schemas. You can
> "unify" them in a query with the union feature. You should also remove the
> header lines in the files, if you still have them, because Hive does not
> ignore them, but treats them as "data".
>
> dean
>
>
> On Fri, Nov 30, 2012 at 2:59 AM, Marc Canaleta <[EMAIL PROTECTED]>wrote:
>
>> Hi all!
>>
>> We want to use hive to analyze our logs. Our logs will be TSV files, one
>> per hour, and as it is possible that we add/remove more columns in the
>> future, we will include headers (column names) in each file.
>>
>> So it is possible that two TSV files for different days/hours have
>> different headers.
>>
>> Is it possible to do this with Hive?
>>
>> Thanks!
>>
>
>
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330
>
>
>