Drill >> mail # dev >> Re: Writer interface start

Re: Writer interface start
I see, I didn't know we had plans to write results into various formats. If
we can do that, it would even let other data processing tools
integrate with Drill (which is probably the aim too?)

So if we're just writing results to disk, I wonder why we need a writer
interface that has to consider row/column major differences?

Can't we just take the in-memory vv that is being produced and write one
RecordBatch at a time directly to the formats we want?
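
As a rough illustration of the question above, here is a minimal sketch of writing one batch at a time to a row-major text format. `ColumnBatch` and `CsvBatchWriter` are hypothetical stand-ins invented for this example and are not Drill's actual RecordBatch or value vector API; the sketch also assumes values need no CSV quoting or escaping.

```java
import java.io.PrintWriter;
import java.util.List;

// Hypothetical stand-in for one batch of in-memory column vectors.
// Not Drill's RecordBatch API; every column is assumed to have the same length.
final class ColumnBatch {
    final List<Object[]> columns; // one array of values per column

    ColumnBatch(List<Object[]> columns) {
        this.columns = columns;
    }

    int rowCount() {
        return columns.isEmpty() ? 0 : columns.get(0).length;
    }
}

final class CsvBatchWriter {
    // Pivots a single columnar batch into rows and appends them to the output.
    // No quoting or escaping is done, so this is only valid for simple values.
    static void writeBatch(ColumnBatch batch, PrintWriter out) {
        for (int row = 0; row < batch.rowCount(); row++) {
            for (int col = 0; col < batch.columns.size(); col++) {
                if (col > 0) {
                    out.print(',');
                }
                out.print(String.valueOf(batch.columns.get(col)[row]));
            }
            out.print('\n');
        }
    }
}
```

Even in this simple case the writer has to pivot from columns to rows, which is one place where the row/column major difference shows up.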

On Fri, Oct 4, 2013 at 11:24 AM, Jason Altekruse

> Tim,
> Answers to your questions are below. I am almost always available after 2pm
> your time, feel free to send me some dates/times that work for you.
> - Maybe a bit more context? A writer interface doesn't seem to suggest what
> it really is about. Also, if this is focused on writing (from record reader
> into drill vv), why are there so many comments about reading in your
> consideration?
> - I don't see any writer interface proposed?
> There actually isn't a writer interface written yet. The document I shared
> contains some thoughts I'm compiling about what the writer interface needs to
> handle. I hope to gather as much information as I can about various formats before
> proposing a hard interface. I believe there could be a lot of value in
> trying to generalize the readers and writers, even across formats. I'm
> hoping it will minimize the burden of maintaining support for formats as
> they evolve, as well as of updating the readers and writers as the value
> vectors become more complex (compressed representations of data in memory,
> dictionary encodings, etc.)
> The reader interface was included for reference in the document, because I
> believe we should work on both the reader and writer together, as both have
> many similar properties and really just perform a translation in opposite
> directions.
> For clarity the writer interface is what will allow us to enable a create
> table operation and store results to disk. Obviously we will want to
> support a variety of formats, as most users will likely want to export in
> formats they are used to working with, because Drill will likely not be the
> only tool they use to analyze their data.
> As Drill is not designed for batch jobs, this really is not designed for
> converting large volumes of data between formats, because long running
> queries can die and are not recoverable in Drill.
> - Some of the considerations you're putting in columnar also apply to row
> major as well:
>     - compression (e.g. Avro compresses per block).
>     - schema changes can happen in both
> - What are we writing to disk? And why does columnar require a larger
> in-memory structure to be written to disk?
> Compression in row major is definitely an important consideration. When
> this is the case, we will have to buffer a large number of records in memory
> before writing to disk. With simple formats like csv we can buffer
> as many or as few records in memory as we like before actually writing.
> Buffering more will likely be better, to reduce disk overhead.
> While schema changes can happen in both, we don't have to worry about them
> when writing values to disk, except for values with defined schemas per
> block. In a csv, it is entirely possible to have additional columns in
> one of the rows (although the format is very limited: you couldn't really leave
> out a column without causing a problem). While the value vectors would
> not handle a schema change on every value during reading, the reality is
> that this arrangement of data is unlikely to come out of Drill because of
> the single schema per batch design. A rapid change in schema every few
> records could only be represented by a series of very short batches,
> something we will try to avoid.
> This does speak to the consideration I brought up in the document about how
> to handle frequent schema changes, as it might make sense to go back and
> re-write some data if we figure out that the next batch has an additional