-
Ignore subdirectories when querying external table
Dave 2011-08-18, 22:53
Hi,
I have a partitioned external table in Hive, and in the partition directories there are other subdirectories that are not related to the table itself. Hive seems to want to scan those directories, as I am getting an error message when trying to do a SELECT on the table:
Failed with exception java.io.IOException:java.io.IOException: Not a file: hdfs://path/to/partition/path/to/subdir
Also, it seems to ignore directories prefixed by an underscore (_directory).
I am using Hive 0.7.1 on Hadoop 0.20.2.
Is there a way to force Hive to ignore all subdirectories in external tables and only look at files?
Thanks in advance, -Dave
-
Re: Ignore subdirectories when querying external table
Dave 2011-08-19, 20:22
I solved my own problem. For anyone who's curious:
It turns out that subclassing an InputFormat lets you override the listStatus method, which returns the list of files for Hive (or MapReduce in general) to process. All I had to do was subclass org.apache.hadoop.mapred.TextInputFormat and override listStatus, and voila: it now ignores directories. Here's the Java code that I used:
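    package com.example.mapreduce.input;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class TextFileInputFormatIgnoreSubDir extends TextInputFormat {
        // Take the listing the parent class produces and keep only plain files,
        // so subdirectories inside a partition directory are never read.
        @Override
        protected FileStatus[] listStatus(JobConf job) throws IOException {
            FileStatus[] files = super.listStatus(job);
            List<FileStatus> newFiles = new ArrayList<FileStatus>();
            for (FileStatus file : files) {
                if (!file.isDir()) {
                    newFiles.add(file);
                }
            }
            return newFiles.toArray(new FileStatus[newFiles.size()]);
        }
    }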
And the HiveQL code I used to define the table:
    CREATE EXTERNAL TABLE users (id STRING, user_name STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS INPUTFORMAT 'com.example.mapreduce.input.TextFileInputFormatIgnoreSubDir'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION '/data/test/users';
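For anyone who wants to try the filtering idea without a Hadoop classpath, here is a minimal sketch of the same "keep the files, drop the directories" step against the local filesystem using java.nio. It is only an illustration: the class name PlainFileFilterSketch and the method listPlainFiles are made up for this example and are not part of Hadoop or Hive, and the real InputFormat works on HDFS FileStatus objects rather than local paths.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not a Hadoop or Hive class.
final class PlainFileFilterSketch {

    // Returns the direct children of dir that are not directories, sorted by path.
    // This mirrors the !file.isDir() check in the listStatus override.
    static List<Path> listPlainFiles(Path dir) throws IOException {
        List<Path> files = new ArrayList<Path>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                if (!Files.isDirectory(child)) {
                    files.add(child);
                }
            }
        }
        Collections.sort(files);
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway "partition" directory with one data file and one subdirectory.
        Path dir = Files.createTempDirectory("partition");
        Files.createFile(dir.resolve("part-00000"));
        Files.createDirectory(dir.resolve("subdir"));

        // Only the data file should be listed; the subdirectory is skipped.
        for (Path p : listPlainFiles(dir)) {
            System.out.println(p.getFileName());
        }
    }
}
```

Like the override, the sketch only looks at direct children and does not descend into the subdirectories it skips.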
Hope this saves someone else the trouble of figuring it out...
-Dave
-
Re: Ignore subdirectories when querying external table
Sam William 2011-08-19, 23:54
Along similar lines, I want Hive to include subdirectories. That is:
I have an external table partitioned by month (the data for each month sits under its own folder). Under the current month's folder I want to keep adding a new folder every day. Is this possible without having to subclass InputFormat?
Sam William [EMAIL PROTECTED]
-
Re: Ignore subdirectories when querying external table
Sam William 2011-08-29, 21:49
Dave, where do you specify the classpath before starting the Hive shell when you introduce a custom class like this?
Sam
Sam William [EMAIL PROTECTED]