Drill >> mail # dev >> Introduction



Stefan,

> Glad that I can help. May I suggest that I continue creating use cases and the respective types of query profiles:
> * Wikipedia Edit History: At first glance, the history is made up of 40 or so tables. I would design some user stories using join-like queries across multiple tables, or whatever they are called in Drill.
> * I did not have an opportunity to check the Enron data, but here I would design user stories as if building an email client, which would lead to heavy use of full-text search.
>
> There are some additional data-sets I would like to suggest: http://aws.amazon.com/datasets
>
> * Freebase.com: Simulate a visualization that jumps from topic to topic as user stories. This would lead to queries on a random and very small rowset.
> * Wikipedia Page Traffic Statistics: Simulate a log analysis. Heavy aggregation and date functions on a large number of rows.
> * Global Weather Measurements: Design user stories based on geographic and chronological aggregation of climate data to visualize trends.

That sounds great! I reckon that as soon as we hear back from Ted re the Wiki, we can work there. For the time being, let's continue the discussion here.

Cheers,
Michael

--
Michael Hausenblas
Ireland, Europe
http://mhausenblas.info/

On 11 Jan 2013, at 00:18, "Siprell, Stefan" <[EMAIL PROTECTED]> wrote:

> Hi,
> Glad that I can help. May I suggest that I continue creating use cases and the respective types of query profiles:
> * Wikipedia Edit History: At first glance, the history is made up of 40 or so tables. I would design some user stories using join-like queries across multiple tables, or whatever they are called in Drill.
> * I did not have an opportunity to check the Enron data, but here I would design user stories as if building an email client, which would lead to heavy use of full-text search.
>
> There are some additional data-sets I would like to suggest: http://aws.amazon.com/datasets
>
> * Freebase.com: Simulate a visualization that jumps from topic to topic as user stories. This would lead to queries on a random and very small rowset.
> * Wikipedia Page Traffic Statistics: Simulate a log analysis. Heavy aggregation and date functions on a large number of rows.
> * Global Weather Measurements: Design user stories based on geographic and chronological aggregation of climate data to visualize trends.
>
>
> Regards
> Stefan
>
> ________________________________________
> From: Michael Hausenblas [[EMAIL PROTECTED]]
> Sent: Thursday, 10 January 2013 19:54
> To: [EMAIL PROTECTED]
> Subject: Re: Introduction
>
>> Michael Hausenblas is beginning to collect data sets and query examples for
>> different plausible use cases ranging from small to large.  He should show
>> up on the mailing list shortly and you could coordinate with him.
>
>
> Welcome, Stefan - great to have you on board!
>
> So the idea would be to compile a list of datasets along with typical (interesting) queries formulated in natural language. One thing we need to get this off the ground is the Wiki, but I gather Ted is on that.
>
> Datasets that might be of interest include, but are not restricted to:
>
> * Wikipedia edit history from [1]
> * Census data (US, Eurostat, etc.)
> * AOL search logs
> * Enron emails [2]
>
> Feel free to come up with additional ones as well.
>
> I suppose we can continue the discussion (who looks into what) here on the list, and once the Wiki is available we can also coordinate through it.
>
> Cheers,
>                Michael
>
> [1] http://en.wikipedia.org/wiki/Wikipedia:Database_download
> [2] http://www.cs.cmu.edu/~enron/
>
> --
> Michael Hausenblas
> Ireland, Europe
> http://mhausenblas.info/
>
> On 10 Jan 2013, at 10:19, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>> Stefan,
>>
>> One of the key things to do right now is to work on use cases.
>>
>> Michael Hausenblas is beginning to collect data sets and query examples for