On Mon, May 13, 2013 at 9:34 AM, Nalin Khosla <[EMAIL PROTECTED]> wrote:
> Had a quick question about querying HADOOP data:
> 1. What tools are available to Query Hadoop data in real time vs batch?
The line between real time and batch isn't that clear. We are working on
substantially speeding up the performance of Hive (
The better question is whether you have small enough data so that it can
fit in RAM on your cluster. If so, you should look at Shark (
or a proprietary MPP database such as Teradata or Impala.
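As a rough way to apply the fits-in-RAM test, here is a back-of-the-envelope
sketch in Python. The function name, the 50% usable-memory fraction, and the
example cluster sizes are all assumptions for illustration, not figures from
any of the projects mentioned.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4


def fits_in_cluster_ram(data_bytes, nodes, ram_per_node_bytes, usable_fraction=0.5):
    """Return True if the data plausibly fits in aggregate cluster memory.

    usable_fraction is an assumed allowance for the OS, JVM overhead and
    other services sharing the nodes; tune it for your own cluster.
    """
    if nodes <= 0 or ram_per_node_bytes <= 0:
        raise ValueError("nodes and ram_per_node_bytes must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable_bytes = nodes * ram_per_node_bytes * usable_fraction
    return data_bytes <= usable_bytes


if __name__ == "__main__":
    # Hypothetical cluster: 20 nodes with 64 GiB each, about 640 GiB usable at 50%.
    print(fits_in_cluster_ram(500 * GIB, 20, 64 * GIB))  # 500 GiB of data
    print(fits_in_cluster_ram(2 * TIB, 20, 64 * GIB))    # 2 TiB of data
```

This only compares raw sizes. In-memory representations can be larger or
smaller than the on-disk files, so treat the result as a first filter rather
than a guarantee.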
> 2. I believe HIVE provides a batch interface, not sure on what tools
> within HIVE support the query capabilities against HADOOP ?
Hive currently uses MapReduce to run the queries. We plan on extending to
use Tez, which is a new Apache project that provides a richer framework for
> 3. Besides HIVE, are there any other Query tools to query HADOOP data
> (ad-hoc queries) ?
Pig and Cascading are the main open source ones for large data. Shark does
the smaller ad-hoc queries. Drill plans to fit into the ad-hoc space, but
hasn't made a release yet.
> 4. Finally, what skill set is required to use HIVE or other alternate tools
> ? Can business users use these tools?
Hive has a learning curve. Business users will be able to run queries
against the data, but someone with more of an engineering background will
need to design the table layouts and the update scheme.