Drill >> mail # dev >> Schemaless Schema Management: Pass per record, Per batch, or ?


Re: Schemaless Schema Management: Pass per record, Per batch, or ?
Hi Jacques,

I'll add a simple IDL that we can iterate on.

As for the filtering discussion, do you want to move it to a Google doc?

Tim

Sent from my iPhone
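The "simple IDL" Tim offers to add might start as something like the following — purely a hypothetical sketch to iterate on; the syntax and all field names are assumptions, not anything in MsgPack or Drill. Per the discussion below, it only tries to capture structure, not value types:

```text
# Hypothetical structural IDL for MsgPack-encoded records (sketch only).
# Captures key names and nesting -- not value data types.
record User {
    name:          scalar;
    address:       map { city: scalar; zip: scalar; };
    phone-numbers: array<scalar>;
}
```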

On Nov 14, 2012, at 9:42 PM, Jacques Nadeau <[EMAIL PROTECTED]> wrote:

> Hey Timothy,
>
> It's great that you started pulling something together.  Thanks for taking
> the initiative!  Do you want to spend some time looking at trying to define
> an IDL for MsgPack for schema information and add that to your work?
>
> We also need to come up with a standard selection/filter
> vocabulary/approach.  It would preferably cover things like:
>
>   - Support simple field/tree inclusion lists and wildcards.
>      - Classic relational like {column1, column2, column3}
>      - Nested like {arrayColumn1.[*], mapColumn.foo}
>   - Support some kind of filters that could prune records, leaves, or
>   branches
>      - only include the first three sub elements
>      - only include map keys that start with "user%"
>      - only include this record where at least one
>      arrayColumn.phone-number starts with "415%"
>
> One idea might be to conceive of a fourth concept on top of the classic
> (table|scalar|aggregate) functions called tree functions and generate a set
> of primitives for that.  Then allow scalar functions inside tree function
> evaluation.  (I haven't thought a great deal about what this means.)
> I've also thought that xpath might be a good place to look for conceptual
> inspiration.  (But I don't think we have any interest to go to that
> level...)
>
> Does any of this sound interesting?   (That also goes for anyone out there
> who is lurking...)
>
> Thanks again,
> Jacques
>
>
> On Wed, Nov 14, 2012 at 5:45 PM, Timothy Chen <[EMAIL PROTECTED]> wrote:
>
>> I don't have much to add to the options you've suggested, but I do agree
>> that storing the schema and sending diffs is the best way to go.
>>
>> And since we already need to look at every row, we can build the schema
>> diffs pretty easily.
>>
>> I currently have a simple JSON -> MsgPack impl using Yajl here:
>> https://github.com/tnachen/incubator-drill/tree/executor/sandbox/executor
>>
>> Depending on the parser we use, most already have basic type detection,
>> and we can add richer data-type discovery later as extensions.
>>
>> Tim
>>
>>
>>
>> On Wed, Nov 14, 2012 at 3:17 PM, Jacques Nadeau <[EMAIL PROTECTED]> wrote:
>>
>>> One of the goals we've talked about for Drill is the ability to consume
>>> "schemaless" data.  What this really means to me is data such as JSON,
>>> where the schema could change from record to record (and isn't known
>>> until query execution).  I'd suggest that in most cases, the schema
>>> within a JSON 'source' (a collection of similar files) is mostly stable.
>>> The default JSON format passes this schema data with each record, which
>>> would be the simplest way to manage it.  However, if Drill operated in
>>> this manner we'd likely have to maintain fairly different code paths for
>>> data with schema versus data without.  There also seems to be substantial
>>> processing and message-size overhead in interacting with all the schema
>>> information for each record.  A couple of notes:
>>>
>>>   - By schema here I mean the structure of the key names and the nesting
>>>   of the data, as opposed to value data types...
>>>   - A simple example: we have a user table and one of the query
>>>   expressions is user.phone-numbers.  If we query that without schema,
>>>   we don't know whether that is a scalar, a map, or an array.  Thus we
>>>   can't figure out the number of "fields" in the output stream.
>>>
>>> Separately, we've also talked before about having all the main execution
>>> components operate on batches of records as a single work unit (probably
>>> in MsgPack streaming format or similar).
>>>
>>> One way to manage schemaless data within these parameters is to generate
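The per-batch approach the thread converges on — infer each record's structural schema, and emit the schema once per batch rather than once per record, switching batches only when the schema changes — could be sketched roughly like this. This is a toy illustration; none of the function names come from Drill or MsgPack:

```python
def schema_of(record):
    """Extract the structural schema of a record: key names and nesting
    only, not value data types (every leaf collapses to 'scalar')."""
    if isinstance(record, dict):
        return {k: schema_of(v) for k, v in record.items()}
    if isinstance(record, list):
        return [schema_of(record[0])] if record else []
    return "scalar"

def batch_stream(records):
    """Group consecutive records that share a schema into batches,
    yielding (schema, rows) so the schema is sent once per batch."""
    batch, current = [], None
    for r in records:
        s = schema_of(r)
        if s != current and batch:
            yield current, batch
            batch = []
        current = s
        batch.append(r)
    if batch:
        yield current, batch

rows = [
    {"name": "a", "phones": ["1"]},
    {"name": "b", "phones": ["2", "3"]},
    {"name": "c", "address": {"city": "SF"}},
]
batches = list(batch_stream(rows))
# The first two rows share a schema and land in one batch; the third
# row's structure differs and starts a new batch, so len(batches) == 2.
```

Since every row already has to be inspected during parsing (as Tim notes), computing `schema_of` per row and diffing against the previous batch's schema adds little extra work.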
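Jacques' selection/filter vocabulary — inclusion lists like `mapColumn.foo`, wildcards like `arrayColumn.[*]`, and record-level predicates like "at least one phone number starts with 415" — might reduce to one small path-matching primitive. A hypothetical sketch under those assumptions, not a proposed Drill API:

```python
def select(value, path):
    """Project a nested value down a path of steps. '[*]' maps over
    every array element and '*' matches any map key; returns the list
    of matched leaves (empty if the path doesn't apply)."""
    if not path:
        return [value]
    head, rest = path[0], path[1:]
    if head == "[*]" and isinstance(value, list):
        return [m for item in value for m in select(item, rest)]
    if isinstance(value, dict):
        if head == "*":
            return [m for v in value.values() for m in select(v, rest)]
        if head in value:
            return select(value[head], rest)
    return []

user = {"phones": [{"num": "415-555"}, {"num": "206-555"}]}
matches = select(user, ["phones", "[*]", "num"])
# matches == ["415-555", "206-555"]
# A record-level filter such as 'keep records where some phones.[*].num
# starts with "415%"' then becomes an existential test over the matches:
keep = any(m.startswith("415") for m in matches)
```

Scalar functions (like the prefix test above) applied inside such a tree traversal is roughly what the "tree functions on top of table/scalar/aggregate functions" idea would formalize.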