Re: Writer interface start
Timothy Chen 2013-10-02, 18:43

Tim,

Answers to your questions are below. I am almost always available after 2pm
your time; feel free to send me some dates/times that work for you.

- Maybe a bit more context? A writer interface doesn't seem to suggest what
it really is about. Also if this is focused on writing (from record reader
into drill vv), why are there many comments around reading in your
consideration?
- I don't see any writer interface proposed?

There actually isn't a writer interface written yet. The document I shared
contains some thoughts I'm compiling about what the writer interface needs
to handle. I hope to gather as much information as I can about the various
formats before proposing a hard interface. I believe there could be a lot
of value in trying to generalize the readers and writers, even across
formats. I'm hoping it will minimize the burden of maintaining support for
formats as they evolve, as well as of updating the readers and writers as
the value vectors become more complex (compressed representations of data
in memory, dictionary encodings, etc.).
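
Just to make the discussion concrete, a generalized writer could end up
looking roughly like the sketch below. To be clear, nothing like this is
actually proposed yet; the FormatWriter name and all of the methods are
made up for illustration, and I'm only assuming Drill's existing
RecordBatch and BatchSchema types for the inputs.

    // Strawman only - not a proposed interface.
    import java.io.IOException;

    import org.apache.drill.exec.record.BatchSchema;
    import org.apache.drill.exec.record.RecordBatch;

    public interface FormatWriter {
      // Called when a (new) schema is seen; block-oriented formats like
      // Parquet may need to start a new file or row group here.
      void setup(BatchSchema schema) throws IOException;

      // Translate one batch of value vectors into the target format,
      // buffering rows or columns as the format requires.
      void writeBatch(RecordBatch batch) throws IOException;

      // Flush anything still buffered and release resources.
      void cleanup() throws IOException;
    }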

The reader interface was included for reference in the document, because I
believe we should work on both the reader and writer together, as both have
many similar properties and really just perform a translation in opposite
directions.

For clarity, the writer interface is what will allow us to enable a create
table operation and store results to disk. Obviously we will want to
support a variety of formats, as most users will likely want to export in
formats they are used to working with, because Drill will likely not be the
only tool they use to analyze their data.

As Drill is not designed for batch jobs, this really is not designed for
converting large volumes of data between formats, because long-running
queries can die and are not recoverable in Drill.

- Some of the considerations you're putting in columnar also apply to row
major as well:
    - compression (ie: Avro compresses per block).
    - schema changes can happen in both
- What are we writing to disk? And why does columnar require a larger
in-memory structure to be written to disk?

The compression in row major is definitely an important consideration. When
this is the case we will have to buffer a large number of records in memory
before writing to disk. With simple formats like CSV we can buffer as many
or as few records in memory as we like before actually writing; buffering
more will likely be better to reduce disk overhead.
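
As a minimal sketch of the buffering idea for a simple row-major format
like CSV (the class name and the flush threshold are arbitrary, just for
illustration):

    import java.io.IOException;
    import java.io.Writer;

    // Buffers rows in memory and writes them out in large chunks instead
    // of touching the disk once per row.
    public class BufferedCsvSketch {
      private static final int FLUSH_EVERY_N_ROWS = 65536; // arbitrary

      private final StringBuilder buffer = new StringBuilder();
      private final Writer out;
      private int bufferedRows = 0;

      public BufferedCsvSketch(Writer out) {
        this.out = out;
      }

      public void writeRow(String[] columns) throws IOException {
        for (int i = 0; i < columns.length; i++) {
          if (i > 0) {
            buffer.append(',');
          }
          buffer.append(columns[i]);
        }
        buffer.append('\n');
        if (++bufferedRows >= FLUSH_EVERY_N_ROWS) {
          flush();
        }
      }

      public void flush() throws IOException {
        out.write(buffer.toString()); // one large write instead of many
        buffer.setLength(0);
        bufferedRows = 0;
      }
    }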

While schema changes can happen in both, we don't have to worry about them
when writing values to disk, except for formats that define a schema per
block. In a CSV it is completely possible to have additional columns in one
of the rows (although the format is very limited, you couldn't really leave
out a column without there being a problem). While the value vectors would
not handle a change in schema on every value during reading, the reality is
that this arrangement of data is unlikely to come out of Drill because of
the single-schema-per-batch design. A rapid change in schema every few
records could only be represented by a series of very short batches,
something we will try to avoid.

This does speak to the consideration I brought up in the document about how
to handle frequent schema changes, as it might make sense to go back and
re-write some data if we figure out that the next batch has an additional
field. This type of scenario would otherwise require us to start a new
Parquet file, for example.
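
As a very rough illustration of that decision, the writer could react to a
schema change between batches along these lines (every name below is
hypothetical, nothing here exists in Drill today):

    import org.apache.drill.exec.record.BatchSchema;

    // Hypothetical sketch of handling a schema change between batches.
    public abstract class SchemaChangeSketch {
      private BatchSchema currentSchema;

      protected void onNewBatchSchema(BatchSchema incoming) throws Exception {
        if (currentSchema != null && !currentSchema.equals(incoming)) {
          if (onlyAddsFields(currentSchema, incoming)) {
            // the idea discussed above: go back and re-write what was
            // already written under the wider schema
            rewriteExistingData(incoming);
          } else {
            // otherwise a format like Parquet forces us to close this
            // file and start a fresh one
            closeCurrentFile();
            startNewFile(incoming);
          }
        }
        currentSchema = incoming;
      }

      // Placeholders for format-specific logic.
      protected abstract boolean onlyAddsFields(BatchSchema from, BatchSchema to);
      protected abstract void rewriteExistingData(BatchSchema newSchema) throws Exception;
      protected abstract void closeCurrentFile() throws Exception;
      protected abstract void startNewFile(BatchSchema schema) throws Exception;
    }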

- I don't quite get why row major requires additional objects to be passed
to the writer?

Many existing interfaces are written with Java conventions in mind; object
passing is common for representing a series of values in a single row. If
we create new objects for each row we pass into their writing interface,
there would be a lot of object allocation and garbage collection. This is
obviously something we want to avoid.

When we are considering the reader interface, it is possible that an
existing interface will hand us back an object each time it reads a record,
and in some of these cases the library is allocating a brand new object on
every call. We will want to go in and add new methods that allow for
passing existing objects in and having the libraries for the various
readers just populate them; this will also prevent excessive garbage
collection.
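
The pattern we would be adding is roughly the following; the RowSource
interface and the Row type are only for illustration and don't refer to
any particular library:

    import java.io.IOException;

    // Sketch of the object-reuse pattern: the second next() lets the
    // caller supply the object, so the library only fills in its fields
    // instead of allocating a new record on every call.
    public interface RowSource {
      final class Row {
        public Object[] values;
      }

      // Convenient, but creates a new Row (and garbage) for every record.
      Row next() throws IOException;

      // The kind of method we would want to add: returns false at end of
      // input, otherwise populates the caller-provided Row in place.
      boolean next(Row reuse) throws IOException;
    }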

Hi Jason,

Have some questions around your considerations:

- Maybe a bit more context? A writer interface doesn't seem to suggest what
it really is about. Also if this is focused on writing (from record reader
into drill vv), why are there many comments around reading in your
consideration?
- I don't see any writer interface proposed?
- Some of the considerations you're putting in columnar also apply to row
major as well:
    - compression (ie: Avro compresses per block).
    - schema changes can happen in both
- What are we writing to disk? And why does columnar require a larger
in-memory structure to be written to disk?
- I don't quite get why row major requires additional objects to be passed
to the writer?

Tim

On Wed, Oct 2, 2013 at 11:19 AM, Jason Altekruse
<[EMAIL PROTECTED]> wrote:
