Hive, mail # user - How to apply data mining on Hive?


Re: How to apply data mining on Hive?
Sukhendu Chakraborty 2012-06-09, 00:50
If you are interested, you can also look at Apache Hama, which provides an
MPI-like interface on top of Hadoop MapReduce.

http://incubator.apache.org/hama/
On Jun 8, 2012 4:55 PM, "Mark Grover" <[EMAIL PROTECTED]> wrote:

> Hi Jason,
> Hive does expose a JDBC interface which can be used by tools and applications. You
> would need to check individual tools to see if they support Hadoop (I use the
> word Hadoop and not Hive since an application doesn't need Hive to run Map
> Reduce jobs on data in HDFS).
>
> Apache Mahout, as Sreenath mentioned, is also an interesting open source
> project which combines canonical machine learning algorithms with the power
> of Hadoop. That might fit the bill too.
>
> Good luck,
> Mark
>
> On Fri, Jun 8, 2012 at 1:25 AM, jason Yang <[EMAIL PROTECTED]> wrote:
>
>> Hi, Mark.
>>
>> Thank you for your reply.
>>
>> I have read the User Guide, but I'm still wondering what I can do in the
>> following scenario:
>> ----
>> 1. Suppose I have a table t_customer_info in Hive, which includes lots
>> of information about our customers.
>> 2. Now I would like to cluster those customers into different groups so
>> that customers within a group have high similarity, but are very dissimilar
>> to customers in other groups.
>> 3. This is a classical clustering problem in the data mining field. I thought
>> such a job cannot be done with a query language and instead needs some data
>> mining algorithms.
>> ----
>>
>> When we look "back" at traditional DBMSs, there are lots of data mining
>> tools or BI tools which can connect to the DBMS and apply some canonical
>> algorithms to the data in it. So I started to wonder: are there similar
>> tools for Hive?
>>
>> If not, what's the most commonly used way to do data mining over Hadoop?
>>
>> 2012/6/8 Mark Grover <[EMAIL PROTECTED]>
>>
>>> Hi Jason,
>>> Hive is a data warehouse system that sits on top of Hadoop. The key
>>> selling point here is that it allows users to write SQL-like queries to
>>> query their large scale data. These queries get compiled into Map Reduce
>>> jobs, which are then run on the Hadoop cluster just like any other Map
>>> Reduce job.
>>>
>>> Hadoop does all the parallel processing for you. All you have to do is
>>> set up a Hadoop cluster, install Hive on the cluster and run your Hive
>>> queries. All underlying processing will happen in parallel where possible.
>>>
>>> This is a good place to get started and learn more about Hive:
>>> https://cwiki.apache.org/confluence/display/Hive/GettingStarted
>>>
>>> Welcome and good luck!
>>>
>>> Mark
>>>
>>>
>>> On Thu, Jun 7, 2012 at 10:10 PM, jason Yang <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi, dear friends.
>>>>
>>>> I was wondering what's the popular way to do data mining on Hive?
>>>>
>>>> Since the data in Hive is distributed over the cluster, is there any
>>>> tool or solution that could parallelize the data mining?
>>>>
>>>> Any suggestion would be appreciated.
>>>>
>>>> --
>>>> YANG, Lin
>>>>
>>>>
>>>
>>
>>
>> --
>> YANG, Lin
>>
>>
>
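
For readers following the clustering scenario quoted above (grouping rows of
t_customer_info by similarity), here is a minimal single-machine sketch of
k-means, written so that each iteration is split into a "map" step (assign
each row to its nearest centroid) and a "reduce" step (average the rows in
each group). That split is one common way to think about how such an
algorithm could be parallelized over Hadoop. It is only an illustration: it
does not use the Mahout, Hama, or Hive APIs, and the feature columns (age,
monthly spend), the sample rows, and the function names are all hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def assign(point, centroids):
    """'Map' step: return (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def recompute(assigned, old_centroids):
    """'Reduce' step: average the points grouped under each centroid index."""
    groups = defaultdict(list)
    for idx, point in assigned:
        groups[idx].append(point)
    new_centroids = []
    for i, old in enumerate(old_centroids):
        members = groups.get(i)
        if not members:
            # An empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centroids


def kmeans(points, centroids, max_iter=20):
    """Repeat map + reduce until the centroids stop changing."""
    for _ in range(max_iter):
        assigned = [assign(p, centroids) for p in points]
        updated = recompute(assigned, centroids)
        if updated == centroids:
            break
        centroids = updated
    labels = [assign(p, centroids)[0] for p in points]
    return centroids, labels


# Hypothetical customer features: (age, monthly_spend).
customers = [(25, 40.0), (27, 42.0), (26, 38.0),
             (55, 200.0), (60, 210.0), (58, 190.0)]

# Seed the two clusters with one row from each apparent group.
centroids, labels = kmeans(customers, centroids=[customers[0], customers[3]])
print(labels)
```

On a real cluster the assignment step would run in parallel over the table's
rows and the averaging step would run once per cluster, which is the part that
libraries such as Mahout are meant to handle for you.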