Apache Drill dev mailing list - Schemaless Schema Management: Pass per record, Per batch, or ?


Re: Schemaless Schema Management: Pass per record, Per batch, or ?
Timothy Chen 2012-11-15, 20:02
Hi Jacques,

I'll add a simple IDL that we can iterate on.

About the filtering discussion, do you want to move it to a Google Doc?

Tim

Sent from my iPhone

On Nov 14, 2012, at 9:42 PM, Jacques Nadeau <[EMAIL PROTECTED]> wrote:

> Hey Timothy,
>
> It's great that you started pulling something together.  Thanks for taking
> the initiative!  Do you want to spend some time looking at trying to define
> an IDL for MsgPack for schema information and add that to your work?
>
> We also need to come up with a standard selection/filter
> vocabulary/approach.  It would preferably cover things like:
>
>   - Support simple field/tree inclusion lists and wildcards.
>      - Classic relational like {column1, column2, column3}
>      - Nested like {arrayColumn1.[*], mapColumn.foo}
>   - Support some kind of filter that could prune records, leaves, or
>   branches (a rough sketch of evaluating one of these follows below):
>      - only include the first three sub-elements
>      - only include map keys that start with "user%"
>      - only include records where at least one
>      arrayColumn.phone-number starts with "415%"
>
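> To make that vocabulary concrete, here's a rough sketch in plain Java of
> how one of those path-plus-predicate filters might be evaluated against
> a nested record (purely illustrative; every name here is made up):
>
>     import java.util.Arrays;
>     import java.util.List;
>     import java.util.Map;
>     import java.util.function.Predicate;
>
>     // A record is a nested Map/List structure, as parsed from JSON.  A
>     // selection is a dotted path plus an optional predicate on leaves.
>     public class FilterSketch {
>
>         // True if any leaf reached by walking 'path' through the record
>         // satisfies the predicate.  Arrays are treated as an implicit
>         // wildcard, i.e. [*].
>         static boolean anyMatch(Object node, List<String> path, int depth,
>                                 Predicate<String> pred) {
>             if (node == null) return false;
>             if (node instanceof List) {
>                 for (Object child : (List<?>) node)
>                     if (anyMatch(child, path, depth, pred)) return true;
>                 return false;
>             }
>             if (depth == path.size())   // walked the whole path: a leaf
>                 return node instanceof String && pred.test((String) node);
>             if (node instanceof Map)    // descend one map level
>                 return anyMatch(((Map<?, ?>) node).get(path.get(depth)),
>                                 path, depth + 1, pred);
>             return false;
>         }
>     }
>
> The last filter above would then be something like
> anyMatch(record, Arrays.asList("arrayColumn", "phone-number"), 0,
> s -> s.startsWith("415")).
>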
> One idea might be to conceive of a fourth concept on top of the classic
> (table|scalar|aggregate) functions called tree functions and generate a set
> of primitives for that.  Then allow scalar functions inside tree function
> evaluation.  (I haven't thought a great deal about what this means.)
> I've also thought that xpath might be a good place to look for conceptual
> inspiration.  (But I don't think we have any interest to go to that
> level...)
>
> Does any of this sound interesting?   (That also goes for anyone out there
> who is lurking...)
>
> Thanks again,
> Jacques
>
>
> On Wed, Nov 14, 2012 at 5:45 PM, Timothy Chen <[EMAIL PROTECTED]> wrote:
>
>> I don't have much to add to the options you've suggested, but I do agree
>> that storing the schema and sending diffs would be the ideal way to go.
>>
>> And since we already need to look at every row, we can build the schema
>> diffs pretty easily.
>>
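>> As a purely hypothetical sketch of that (assuming a Jackson-style tree
>> model; this isn't from the actual code), each record could be reduced
>> to a structure signature and diffed against the schema seen so far:
>>
>>     import java.util.LinkedHashMap;
>>     import java.util.Map;
>>     import com.fasterxml.jackson.databind.JsonNode;
>>
>>     public class SchemaDiffSketch {
>>
>>         // Reduce a record to path -> MAP/ARRAY/SCALAR entries.
>>         static void signature(String path, JsonNode node,
>>                               Map<String, String> out) {
>>             if (node.isObject()) {
>>                 out.put(path, "MAP");
>>                 node.fields().forEachRemaining(e ->
>>                     signature(path + "." + e.getKey(), e.getValue(), out));
>>             } else if (node.isArray()) {
>>                 out.put(path, "ARRAY");
>>                 if (node.size() > 0)
>>                     signature(path + ".[*]", node.get(0), out);
>>             } else {
>>                 out.put(path, "SCALAR");
>>             }
>>         }
>>
>>         // Emit only entries that differ from what we've already seen,
>>         // and fold them into the running schema.
>>         static Map<String, String> diff(Map<String, String> current,
>>                                         Map<String, String> seen) {
>>             Map<String, String> delta = new LinkedHashMap<>();
>>             for (Map.Entry<String, String> e : current.entrySet())
>>                 if (!e.getValue().equals(seen.get(e.getKey())))
>>                     delta.put(e.getKey(), e.getValue());
>>             seen.putAll(delta);
>>             return delta;
>>         }
>>     }
>>
>> For a stable source the delta would be empty for almost every record,
>> so almost no schema data travels with the batch.
>>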
>> I currently have a simple JSON -> MsgPack impl using Yajl here:
>> https://github.com/tnachen/incubator-drill/tree/executor/sandbox/executor
>>
>> Depending on the parser we use, most already have basic type detection,
>> and we can extend discovery to more data types later on as extensions.
>>
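>> For illustration only (this is not the Yajl-based code linked above),
>> the repacking can be sketched with Jackson plus the msgpack-java core
>> API, assuming both libraries are on the classpath:
>>
>>     import com.fasterxml.jackson.databind.JsonNode;
>>     import com.fasterxml.jackson.databind.ObjectMapper;
>>     import org.msgpack.core.MessageBufferPacker;
>>     import org.msgpack.core.MessagePack;
>>
>>     public class JsonToMsgPack {
>>
>>         // Walk the JSON tree and emit the equivalent MsgPack value.
>>         static void pack(JsonNode node, MessageBufferPacker packer)
>>                 throws Exception {
>>             if (node.isObject()) {
>>                 packer.packMapHeader(node.size());
>>                 java.util.Iterator<String> names = node.fieldNames();
>>                 while (names.hasNext()) {
>>                     String name = names.next();
>>                     packer.packString(name);
>>                     pack(node.get(name), packer);
>>                 }
>>             } else if (node.isArray()) {
>>                 packer.packArrayHeader(node.size());
>>                 for (JsonNode child : node) pack(child, packer);
>>             } else if (node.isTextual()) packer.packString(node.asText());
>>             else if (node.isIntegralNumber()) packer.packLong(node.asLong());
>>             else if (node.isNumber()) packer.packDouble(node.asDouble());
>>             else if (node.isBoolean()) packer.packBoolean(node.asBoolean());
>>             else packer.packNil();
>>         }
>>
>>         public static void main(String[] args) throws Exception {
>>             JsonNode record = new ObjectMapper().readTree(
>>                 "{\"user\":{\"phone-numbers\":[\"415-555-0100\"]}}");
>>             MessageBufferPacker packer = MessagePack.newDefaultBufferPacker();
>>             pack(record, packer);
>>             byte[] bytes = packer.toByteArray();  // one encoded record
>>         }
>>     }
>>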
>> Tim
>>
>>
>>
>> On Wed, Nov 14, 2012 at 3:17 PM, Jacques Nadeau <[EMAIL PROTECTED]> wrote:
>>
>>
>>> One of the goals we've talked about for Drill is the ability to consume
>>> "schemaless" data.  What this really means to me is data such as JSON
>>> where the schema of the data could change from record to record (and
>>> isn't known until query execution).  I'd suggest that in most cases,
>>> the schema within a JSON 'source' (collection of similar files) is
>>> mostly stable.  The default JSON format passes this schema data with
>>> each record.  This would be the simplest way to manage this data.
>>> However, if Drill operated in this manner we'd likely have to maintain
>>> fairly different code paths for data with schema versus data without.
>>> It also seems like there would be substantial processing and
>>> message-size overhead in interacting with all the schema information
>>> for each record.  A couple of notes:
>>>
>>>   - By schema here I mean more the structure of the key names and the
>>>   nesting of the data, as opposed to value data types...
>>>   - A simple example: we have a user table and one of the query
>>>   expressions is user.phone-numbers.  If we query that without schema,
>>>   we don't know if that is a scalar, a map or an array.  Thus... we
>>>   can't figure out the number of "fields" in the output stream.  (See
>>>   the two sample records below.)
>>>
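>>> For instance (made-up data), these two records imply different shapes
>>> for the same expression:
>>>
>>>    {"user": {"phone-numbers": "415-555-0100"}}
>>>    {"user": {"phone-numbers": ["415-555-0100", "206-555-0199"]}}
>>>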
>>>
>>> Separately, we've also talked before about having all the main
>>> execution components operate on batches of records as a single work
>>> unit (probably in MsgPack streaming format or similar).
>>>
>>> One way to manage schemaless data within these parameters is to generate