I am a newbie and I don't want to break any layered abstractions.
I am in the situation where I want to be able to examine
the predicate in the query and if it's a filter that I recognize
then I would like to use it to cut down on the number of records
processed. In particular I would like to make sure that in such a case the records
aren't even read, so they don't need to be filtered. At the moment
I can see that by providing my own InputFormat I can arrange that
the splits that I return are filtered down to the subset that I want.
However this means that the InputFormat needs to know something
about the table to be able to parse the predicate and see if it matches
the filtering criteria. But I have learned that the InputFormat doesn't have access
to the table properties. So I have a problem.
OK, the serde has access to the table properties but it's in no position
to be able to perform the filtering. By the time it sees a record it's too late.
Similarly by the time the recordReader is invoked the record has been read.
I would use a facility like indexing, but I want this to work when the query
does not perform a Map/Reduce and my understanding is that Hive will
not use an index if there is no Map/Reduce. So indexing is a non-starter.
Also there are cases where creating an index seems massive overkill for
what I am trying to achieve.
So where is the Hive hook that allows me to do what I would like to do?
Which of the layers allows me to examine the table properties and
the predicate and to (pre-)filter the records returned?
Or are you saying that what I am trying to do doesn't make sense?
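For concreteness, the split pruning described above might look like the following. This is only a sketch in plain Java with no Hadoop or Hive dependency: RangeSplit, EqualsPredicate and SplitPruner are hypothetical names standing in for a real InputSplit, a parsed filter, and the logic inside getSplits, and it assumes each split knows the minimum and maximum key it contains.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an InputSplit that knows the key range it covers.
final class RangeSplit {
    final String path;
    final long minKey;
    final long maxKey;

    RangeSplit(String path, long minKey, long maxKey) {
        this.path = path;
        this.minKey = minKey;
        this.maxKey = maxKey;
    }
}

// Hypothetical stand-in for a recognized filter of the form "key = value".
final class EqualsPredicate {
    final long value;

    EqualsPredicate(long value) {
        this.value = value;
    }
}

final class SplitPruner {
    // Keep only the splits whose key range could contain a matching record.
    // A null predicate means "filter not recognized", so nothing is pruned.
    static List<RangeSplit> prune(List<RangeSplit> splits, EqualsPredicate predicate) {
        if (predicate == null) {
            return new ArrayList<>(splits);
        }
        List<RangeSplit> kept = new ArrayList<>();
        for (RangeSplit split : splits) {
            if (split.minKey <= predicate.value && predicate.value <= split.maxKey) {
                kept.add(split);
            }
        }
        return kept;
    }
}
```

The open question is unchanged: building the predicate requires knowing which column is the key, and that lives in the table properties.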
From: Edward Capriolo [mailto:[EMAIL PROTECTED]]
Sent: 28 May 2013 16:45
To: [EMAIL PROTECTED]
Cc: Peter Marron
Subject: Re: Accessing Table Properies from InputFormat
That does not really make sense. You're breaking the layered approach. InputFormats read/write data; serdes interpret data based on the table definition. It's like asking "Why can't my input format run assembly code?"
On Tue, May 28, 2013 at 11:42 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
On Tue, May 28, 2013 at 7:59 AM, Peter Marron <[EMAIL PROTECTED]> wrote:
Hive 0.10.0 over Hadoop 1.0.4.
Further to my earlier filtering questions.
I would like to be able to access the table properties from inside my custom InputFormat.
I've done searches and there seem to be some other people who have had a similar problem.
The closest I can see to a solution is to use
MapredWork mrwork = Utilities.getMapRedWork(configuration);
but this fails for me with the error below.
I'm not truly surprised because I am trying to make sure that my query
runs without a map/reduce and some of the e-mails suggest that in this case:
"...no mapred job is
run, so this trick doesn't work (and instead, the Configuration object
can be used, since it's local)."
Any pointers would be very much appreciated.
Yeah, as you discovered, that only works in the MapReduce case and breaks on cases like "select count(*)" that don't run in MapReduce.
I haven't tried it, but it looks like the best you can do with the current interface is to implement a SerDe which is passed the table properties in initialize. In terms of passing it to the InputFormat, I'd try a thread local variable. It looks like the getRecordReader is called soon after the serde.initialize although I didn't do a very deep search of the code.
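A minimal sketch of the thread-local idea, untested against Hive itself. To keep it self-contained it uses plain Java only: SketchSerDe and SketchInputFormat are hypothetical stand-ins for a real SerDe and InputFormat, TablePropertiesHolder is a made-up helper, and "my.filter.column" is an invented property name. It assumes initialize and getRecordReader run on the same thread, in that order.

```java
import java.util.Properties;

// Hypothetical helper: holds the table properties for the current thread only.
final class TablePropertiesHolder {
    private static final ThreadLocal<Properties> CURRENT = new ThreadLocal<>();

    static void set(Properties tableProperties) {
        CURRENT.set(tableProperties);
    }

    static Properties get() {
        return CURRENT.get();
    }

    static void clear() {
        CURRENT.remove();
    }
}

// Hypothetical stand-in for a custom SerDe, whose initialize is handed
// the table properties.
final class SketchSerDe {
    void initialize(Properties tableProperties) {
        // Stash the properties where code later on this thread can find them.
        TablePropertiesHolder.set(tableProperties);
    }
}

// Hypothetical stand-in for a custom InputFormat.
final class SketchInputFormat {
    // Returns the property value stashed by the SerDe, or the fallback when
    // no SerDe has been initialized on this thread.
    String lookup(String key, String fallback) {
        Properties props = TablePropertiesHolder.get();
        return props == null ? fallback : props.getProperty(key, fallback);
    }
}
```

The fallback matters: if the call order or threading assumption does not hold, the InputFormat should return all splits unfiltered rather than fail.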