Re: How to apply data mining on Hive?
Mark Grover 2012-06-08, 23:55
Hive does expose a JDBC interface which can be used by tools and applications.
You should check out individual tools to see if they support Hadoop (I use the
word Hadoop and not Hive since an application doesn't need Hive to run
MapReduce jobs on data in HDFS).
Apache Mahout, as Sreenath mentioned, is also an interesting open source
project which combines canonical machine learning algorithms with the power
of Hadoop. That might fit the bill too.
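To make the clustering task concrete, here is a minimal single-machine sketch of k-means (Lloyd's algorithm) in plain Python. It is not Mahout's implementation or API, only an illustration of the kind of algorithm such tools run in parallel. The feature columns and sample rows are hypothetical stand-ins for values exported from t_customer_info, and seeding the centroids with the first k rows is an assumption made to keep the run deterministic.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means over a list of equal-length numeric tuples."""
    # Assumption: seed the centroids with the first k points (deterministic).
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignment


# Hypothetical features per customer: (orders per month, average basket value)
customers = [
    (1.0, 20.0), (9.0, 210.0), (1.5, 25.0),
    (8.0, 190.0), (2.0, 22.0), (10.0, 200.0),
]
centroids, labels = kmeans(customers, k=2)
print(labels)
print(centroids)
```

In practice the feature columns would usually be scaled first, since otherwise the column with the largest range dominates the distance. At cluster scale, the assignment step maps naturally onto the map phase and the centroid update onto the reduce phase.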
On Fri, Jun 8, 2012 at 1:25 AM, jason Yang <[EMAIL PROTECTED]> wrote:
> Hi, Mark.
> Thank you for your reply.
> I have read the User Guide, but I'm still wondering what I can do in the
> following scenario:
> 1. Suppose I have a table t_customer_info in Hive, which includes lots of
> information about our customers.
> 2. Now I would like to cluster those customers into different groups so
> that customers within a group have high similarity, but are very dissimilar
> to customers in other groups.
> 3. This is a classical clustering problem in the data mining field. I thought
> such a job cannot be done by a query language, instead of some data mining
> When we look "back" to the traditional DBMS, there are lots of data mining
> tools or BI tools which can connect to the DBMS and apply some canonical
> algorithms to the data in the DBMS. So I started to wonder whether there are
> similar tools for Hive.
> If not, what's the most common way to do data mining over Hadoop?
> 2012/6/8 Mark Grover <[EMAIL PROTECTED]>
>> Hi Jason,
>> Hive is a data warehouse system that sits on top of Hadoop. The key
>> selling point here is that it allows users to write SQL-like queries to
>> query their large-scale data. These queries get compiled into MapReduce
>> jobs, which are then run on the Hadoop cluster just like any other MapReduce job.
>> Hadoop does all the parallel processing for you. All you have to do is
>> set up a Hadoop cluster, install Hive on the cluster and run your Hive
>> queries. All underlying processing will happen in parallel where possible.
>> This is a good place to get started and learn more about Hive:
>> Welcome and good luck!
>> On Thu, Jun 7, 2012 at 10:10 PM, jason Yang <[EMAIL PROTECTED]> wrote:
>>> Hi, dear friends.
>>> I was wondering what's the popular way to do data mining on Hive?
>>> Since the data in Hive is distributed over the cluster, is there any
>>> tool or solution that could parallelize the data mining?
>>> Any suggestion would be appreciated.
>>> YANG, Lin
> YANG, Lin