Right, Hive discarding the key is rather annoying. I have a series of
key+value input formats, key-only input formats, etc. Having Hive return both
the key and the value would be a breaking change, but not be very
The question we are diving into is how much of Hive is going to be designed
around edge cases. Hive really was not made for columnar formats or
self-describing data types. For the most part it handles them fairly well.
I am not sure what I believe about refactoring all of Hive's guts. How much
refactoring and complexity are we going to add to support special cases? I
do not think we can justify sweeping API changes for the sake of one new
input format, or something that can be done in some other way.
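To make the key-discarding point concrete, here is a minimal sketch of the two behaviours. Everything in it is a hypothetical stand-in (KeyValue, RowDeserializer, readValueOnly, readKeyAndValue); none of these are Hive or Hadoop classes, and the "SerDe" is just a comma splitter. It only illustrates why surfacing the key changes the row shape that existing table definitions expect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LayeredReadSketch {

    // Hypothetical stand-in for one key/value record produced by an input format.
    static final class KeyValue {
        final String key;
        final String value;

        KeyValue(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Hypothetical stand-in for a SerDe: turns a raw value into columns.
    interface RowDeserializer {
        List<String> deserialize(String rawValue);
    }

    // Value-only read: the key is dropped before the deserializer sees anything.
    static List<List<String>> readValueOnly(List<KeyValue> records, RowDeserializer serde) {
        List<List<String>> rows = new ArrayList<>();
        for (KeyValue kv : records) {
            rows.add(serde.deserialize(kv.value));
        }
        return rows;
    }

    // Key-and-value read: the key becomes a leading column, so every row is one column wider.
    static List<List<String>> readKeyAndValue(List<KeyValue> records, RowDeserializer serde) {
        List<List<String>> rows = new ArrayList<>();
        for (KeyValue kv : records) {
            List<String> row = new ArrayList<>();
            row.add(kv.key);
            row.addAll(serde.deserialize(kv.value));
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<KeyValue> records = Arrays.asList(
                new KeyValue("k1", "a,b"),
                new KeyValue("k2", "c,d"));
        RowDeserializer csv = raw -> Arrays.asList(raw.split(","));

        System.out.println(readValueOnly(records, csv));   // [[a, b], [c, d]]
        System.out.println(readKeyAndValue(records, csv)); // [[k1, a, b], [k2, c, d]]
    }
}
```

The second method returns rows with an extra column, which is the sense in which returning both key and value would break tables defined against the value-only layout.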
On Tue, May 28, 2013 at 12:10 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
> On Tue, May 28, 2013 at 8:45 AM, Edward Capriolo <[EMAIL PROTECTED]> wrote:
>> That does not really make sense. You're breaking the layered approach.
>> InputFormats read/write data, serdes interpret data based on the table
>> definition. It's like asking "Why can't my input format run assembly code?"
> The current model of:
> does well for text formats, but otherwise limits the input/output formats
> to doing binary data. That creates problems if the Input/OutputFormat has
> an integrated serialization mechanism. For example, ORC requires its SerDe
> and the OrcSerde just passes along the values through serialize and
> Also note that other formats like SequenceFile are restricted because the
> SerDe is placed above the FileFormat. Hive's SequenceFile input format
> discards the key and requires the value to be Text or BytesWritable. That
> covers many cases, but certainly not all. On the other hand, if it was
> Hive's SequenceFile InputFormat that was creating the ObjectInspector, it
> could actually handle more complex types and let Hive usefully read a wider
> range of SequenceFiles.
> I would propose that it would be better to push SerDes down into the
> Input/OutputFormats that can be parameterized by the serialization. Using
> them for TextInput/OutputFormat and HBaseTableInput/OutputFormat makes a
> lot of sense, but in general that isn't true.
> -- Owen