Re: How to apply data mining on Hive?
Sukhendu Chakraborty 2012-06-09, 00:50
If you are interested, you can also look at Apache Hama, which provides an
MPI-like interface on top of Hadoop MapReduce.
On Jun 8, 2012 4:55 PM, "Mark Grover" <[EMAIL PROTECTED]> wrote:
> Hi Jason,
> Hive does expose a JDBC interface which can be used by tools and applications. You
> would need to check out individual tools to see if they support Hadoop (I use the
> word Hadoop and not Hive since an application doesn't need Hive to run Map
> Reduce jobs on data in HDFS).
> Apache Mahout, as Sreenath mentioned, is also an interesting open source
> project which combines canonical machine learning algorithms with the power
> of Hadoop. That might fit your bill too.
> Good luck,
> On Fri, Jun 8, 2012 at 1:25 AM, jason Yang <[EMAIL PROTECTED]> wrote:
>> Hi, Mark.
>> Thank you for your reply.
>> I have read the User Guide, but I'm still wondering what I can do in the
>> following scenario:
>> 1. Suppose I have a table t_customer_info in Hive, which includes lots
>> of information about our customers.
>> 2. Now I would like to cluster those customers into different groups so
>> that customers within a group have high similarity, but are very dissimilar
>> to customers in other groups.
>> 3. This is a classical clustering problem in the data mining field. I thought
>> such a job cannot be done with a query language and needs some data mining
>> algorithm instead.
>> When we look "back" at traditional DBMSs, there are lots of data mining
>> tools or BI tools which can connect to the DBMS and apply some canonical
>> algorithms to the data in it. So I started to wonder: are there similar
>> tools for Hive?
>> If not, what's the most common way to do data mining over Hadoop?
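To make the clustering scenario above concrete, here is a minimal sketch, in plain Python rather than on a cluster, of how one k-means iteration could be phrased as a map step (assign each customer to the nearest centroid) and a reduce step (average each group). The feature columns (age, monthly spend), the sample rows, and all function names are hypothetical; this only illustrates the general idea and is not Mahout's or Hive's API.

```python
from collections import defaultdict
from math import dist  # Python 3.8+


def assign(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def recompute(points):
    """Reduce step: average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        groups = defaultdict(list)
        for p in points:  # map phase: independent per point, so parallelizable
            key, value = assign(p, centroids)
            groups[key].append(value)
        # reduce phase: one average per centroid; an empty group keeps its old centroid
        centroids = [
            recompute(groups[i]) if groups[i] else centroids[i]
            for i in range(len(centroids))
        ]
    return centroids


# Hypothetical customer features: (age, monthly_spend)
customers = [(25, 40.0), (27, 42.0), (26, 38.0), (60, 200.0), (62, 210.0), (58, 190.0)]
centers = kmeans(customers, centroids=[customers[0], customers[3]])
print(centers)
```

Because each point is assigned independently and each centroid is averaged independently, both phases can in principle be spread across many machines, which is what makes this kind of algorithm a natural fit for Hadoop.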
>> 2012/6/8 Mark Grover <[EMAIL PROTECTED]>
>>> Hi Jason,
>>> Hive is a data warehouse system that sits on top of Hadoop. The key
>>> selling point here is that it allows users to write SQL-like queries to
>>> query their large scale data. These queries get compiled into Map Reduce
>>> jobs, which are then run on the Hadoop cluster just like any other Map Reduce job.
>>> Hadoop does all the parallel processing for you. All you have to do is
>>> set up a Hadoop cluster, install Hive on the cluster and run your Hive
>>> queries. All underlying processing will happen in parallel where possible.
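As a rough illustration of the point above, a query such as `SELECT city, COUNT(*) FROM t_customer_info GROUP BY city` corresponds conceptually to a map, shuffle, and reduce sequence. The sketch below simulates those three phases in plain Python; the `city` column, the sample rows, and the function names are assumptions made for illustration, and the plan Hive actually generates may differ.

```python
from collections import defaultdict

# Hypothetical rows of t_customer_info: (customer_id, city)
rows = [(1, "Beijing"), (2, "Shanghai"), (3, "Beijing"), (4, "Shenzhen"), (5, "Beijing")]


def map_phase(row):
    """Emit (grouping key, 1) for each input row."""
    _, city = row
    yield city, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate all values for one key (here: COUNT(*))."""
    return key, sum(values)


pairs = (pair for row in rows for pair in map_phase(row))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

On a real cluster the map calls run on different nodes over different blocks of the table, and each reduce call handles a subset of the keys, which is where the parallelism comes from.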
>>> This is a good place to get started and learn more about Hive:
>>> Welcome and good luck!
>>> On Thu, Jun 7, 2012 at 10:10 PM, jason Yang <[EMAIL PROTECTED]> wrote:
>>>> Hi, dear friends.
>>>> I was wondering what the popular way to do data mining on Hive is.
>>>> Since the data in Hive is distributed over the cluster, is there any
>>>> tool or solution that could parallelize the data mining?
>>>> Any suggestion would be appreciated.
>>>> YANG, Lin
>> YANG, Lin