Re: Ignore subdirectories when querying external table
Dave,
Where do you specify the classpath before starting the Hive shell when you introduce a custom class like this?

Sam
On Aug 19, 2011, at 1:22 PM, Dave wrote:

> I solved my own problem. For anyone who's curious:
>
> It turns out that subclassing an InputFormat lets you override the listStatus method, which returns the list of files for Hive (or MapReduce in general) to process. All I had to do was subclass org.apache.hadoop.mapred.TextInputFormat and override listStatus, and voilà, I was able to make it ignore directories. Here's the Java code that I used:
>
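> package com.example.mapreduce.input;
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.TextInputFormat;
>
> // TextInputFormat variant that skips subdirectories in the input paths.
> public class TextFileInputFormatIgnoreSubDir extends TextInputFormat {
>     @Override
>     protected FileStatus[] listStatus(JobConf job) throws IOException {
>         // Let the parent class expand the input paths first.
>         FileStatus[] files = super.listStatus(job);
>         List<FileStatus> newFiles = new ArrayList<FileStatus>();
>         for (FileStatus file : files) {
>             // Keep regular files only; drop directory entries.
>             if (!file.isDir()) {
>                 newFiles.add(file);
>             }
>         }
>         return newFiles.toArray(new FileStatus[newFiles.size()]);
>     }
> }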
>
> And the HiveQL code I used to define the table:
>
> CREATE EXTERNAL TABLE users (id STRING, user_name STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS INPUTFORMAT 'com.example.mapreduce.input.TextFileInputFormatIgnoreSubDir'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION '/data/test/users';
>
> Hope this saves someone else the trouble of figuring it out...
>
> -Dave
>
> On Thu, Aug 18, 2011 at 3:53 PM, Dave <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have a partitioned external table in Hive, and in the partition directories there are other subdirectories that are not related to the table itself. Hive seems to want to scan those directories, as I am getting an error message when trying to do a SELECT on the table:
>
> Failed with exception java.io.IOException:java.io.IOException: Not a file: hdfs://path/to/partition/path/to/subdir
>
> Also, it seems to ignore directories prefixed by an underscore (_directory).
>
> I am using Hive 0.7.1 on Hadoop 0.20.2.
>
> Is there a way to force Hive to ignore all subdirectories in external tables and only look at files?
>
> Thanks in advance,
> -Dave
>

Sam William
[EMAIL PROTECTED]
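
For anyone who wants to try the filtering idea from Dave's listStatus override without a Hadoop cluster, here is a minimal sketch of the same step using only java.nio.file. It is not Dave's class and not Hadoop API: the class name IgnoreSubDirListing, the method listFilesOnly, and the file names in main are all hypothetical, and a local directory stands in for the HDFS partition directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical stand-in for the listStatus override: it works on a local
// directory via java.nio.file instead of Hadoop's FileStatus.
final class IgnoreSubDirListing {

    // Returns the direct children of dir that are not directories, sorted by path.
    static List<Path> listFilesOnly(Path dir) throws IOException {
        List<Path> files = new ArrayList<>();
        // Files.list holds an open directory handle, so close it when done.
        try (Stream<Path> children = Files.list(dir)) {
            children.filter(p -> !Files.isDirectory(p))
                    .sorted()
                    .forEach(files::add);
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway "partition" with one data file and one unrelated subdirectory.
        Path partition = Files.createTempDirectory("partition");
        Files.createFile(partition.resolve("part-00000"));
        Files.createDirectory(partition.resolve("unrelated_subdir"));

        // Only the data file should be listed; the subdirectory is skipped.
        System.out.println(listFilesOnly(partition));
    }
}
```

As in the override above, the directory entries are simply dropped from the listing rather than recursed into, so files inside the subdirectories are never read.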