Hive, mail # user - adding filenames as new columns via Hive


RE: adding filenames as new columns via Hive
Ashish Thusoo 2009-09-16, 19:11
You could also do this as a simple UDF instead of a virtual column. Virtual columns do get shown in the describe command, and I don't think it makes sense to show this one there. So instead of
Select FILENAME, xyz from T

We could just do

Select Filename(), xyz from T

Thoughts?

Ashish
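
For reference, a minimal HiveQL sketch of the grouping query quoted further down this thread, written against the INPUT__FILE__NAME virtual column that later Hive releases expose (neither the FILENAME column nor the Filename() UDF discussed here is assumed to exist). Hive's substr(str, pos, len) is 1-based and takes a length as its third argument, so the sketch assumes the intent was characters 4-7 and 8-10 of the file's base name; that reading and the alias fname are assumptions, while FOO and x are taken from the thread.

```sql
-- Sketch only: assumes a Hive release that provides the INPUT__FILE__NAME
-- virtual column. "fname" is a made-up alias; FOO and x come from the thread.
SELECT
  substr(fname, 4, 4) AS class_A,   -- characters 4-7  (start 4, length 4)
  substr(fname, 8, 3) AS class_B,   -- characters 8-10 (start 8, length 3)
  count(x)            AS cnt
FROM (
  -- INPUT__FILE__NAME holds the full file URI, so keep only the part
  -- after the last '/' before slicing by position.
  SELECT regexp_extract(INPUT__FILE__NAME, '[^/]+$', 0) AS fname, x
  FROM FOO
) t
GROUP BY
  substr(fname, 4, 4),
  substr(fname, 8, 3);
```

The subquery keeps the base-name extraction in one place, so the select list and the group by clause repeat only the short substr expressions.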

-----Original Message-----
From: Edward Capriolo [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 16, 2009 12:05 PM
To: [EMAIL PROTECTED]
Subject: Re: adding filenames as new columns via Hive

I just put in a related thread about this. This would be really nice.
It is just a virtual column; we don't need it in the metadata if we also have a command like 'show files in partition' so we can inspect what is there as well.

On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain <[EMAIL PROTECTED]> wrote:
> I don't think it is a good idea to make it a part of table metadata in
> any way.
>
> What happens if the filename changes ? It will be very difficult to
> maintain.
>
> But, we can definitely add some virtual columns (FILENAME can be one
> of them to start with) - it should not show up in describe, select *
> etc.
>
> But, the user can query based on them - this is mostly for advanced
> users and can be used for pruning etc. also
>
> I will open a new jira, and we can continue the discussion there.
>
> -namit
>
> From: Avram Aelony [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, September 16, 2009 11:39 AM
> To: [EMAIL PROTECTED]
> Subject: RE: adding filenames as new columns via Hive
>
> Very cool. Looking forward to seeing this feature in action. :)
>
> Thanks,
> -A
>
> From: Prasad Chakka [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, September 16, 2009 11:33 AM
> To: [EMAIL PROTECTED]
> Subject: Re: adding filenames as new columns via Hive
>
> FYI, all partition columns can be used like any regular column in
> select queries. So it should be fine.
>
> ________________________________
>
> From: Avram Aelony <[EMAIL PROTECTED]>
> Reply-To: <[EMAIL PROTECTED]>
> Date: Wed, 16 Sep 2009 11:23:45 -0700
> To: <[EMAIL PROTECTED]>
> Subject: RE: adding filenames as new columns via Hive
>
> Sounds great, Prasad.
>
> As long as I can further parse the filename field to piece out (new)
> derived fields, I will be happy. :) For example, in a later query I'd
> like to be able to do something like:
>
> select
>   substr(filename, 4, 7) as class_A,
>   substr(filename, 8, 10) as class_B,
>   count(x) as cnt
> from FOO
> group by
>   substr(filename, 4, 7),
>   substr(filename, 8, 10);
>
>
> thanks,
> -A
>
>
>
> From: Prasad Chakka [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, September 16, 2009 11:10 AM
> To: [EMAIL PROTECTED]
> Subject: Re: adding filenames as new columns via Hive
>
> I think this can be a good feature though I would like the filename to
> be a partition column (one of such) instead of a separate type of
> column. Would that work?
>
> create external table FOO ( <list of fields and types> )
> row format delimited fields terminated by ','
> partitioned by (file_name FILENAME)
> stored as textfile
> location 's3:/somebucket/';
>
> Or table partitioned by datestamp and filename
>
> create external table FOO ( <list of fields and types> )
> row format delimited fields terminated by ','
> partitioned by (ds STRING, file_name FILENAME)
> stored as textfile
> location 's3:/somebucket/';
>
>
> So FILENAME becomes a new type. I like this because partition columns
> are virtual columns just like the filename column and do not exist
> along with data on the disk.
>
> Prasad
>
> ________________________________
>
> From: Avram Aelony <[EMAIL PROTECTED]>
> Reply-To: <[EMAIL PROTECTED]>
> Date: Wed, 16 Sep 2009 10:48:33 -0700
> To: <[EMAIL PROTECTED]>
> Subject: adding filenames as new columns via Hive
>
> Dear Hive list,
>
> I am processing a large volume of files (many files, roughly 500M