Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Rank(): here's how I did it, for better or worse...

Copy link to this message
Rank(): here's how I did it, for better or worse...
I agree, it's probably best to use a better engineered approach such as Edward's.  In the meantime, if anyone would benefit from a walk-through of my direct approach, here it is.  It combines Ritesh's direct ultra-simplistic method with Edward's correct HiveQL syntax.  As would be expected, it is sensitive to shell vagaries that would be better managed by a combo system like git and maven...but it works.

Create Rank.java:
package com.example.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
public final class Rank extends UDF{
    private int counter;
    private String last_key;
    public int evaluate(final String key){
      if ( !key.equalsIgnoreCase(this.last_key) ) {
         this.counter = 0;
         this.last_key = key;
      return this.counter++;

Compile Rank.java to Rank.class.  Then bundle to Rank.jar.  Observe that the jar command is highly sensitive to the relative path to the .class file when naming the package inside the resulting .jar file:
$ mkdir ./RankTempDir
$ javac -classpath $HIVE_HOME/lib/hive-serde-0.8.1.jar:$HIVE_HOME/lib/hive-exec-0.8.1.jar:$HADOOP_HOME/hadoop-core.jar -d ./RankTempDir Rank.java
$ cd RankTempDir;
$ jar -cf ../Rank.jar ./com
$ cd ..

You will have to verify (via echo) HADOOP_HOME and HIVE_HOME and then will have to verify the name of the serde, exec, and core files in each directory.  The exact filenames are probably version specific.

Verify the package path in Rank.jar:
$ jar -tvf Rank.jar

You should see 'com/example/hive/udf/Rank.class'.  If you see a different path, the package has not been properly represented in the jar w.r.t. its designation in the .java file.

Run hive and prepare the session to use the UDF:

$ hive
hive> add jar Rank.jar;
hive> create temporary function rank as 'com.example.hive.udf.Rank';

You must either run hive from a directory containing Rank.jar or specify an alternate path to it in the "add" command.  Note that the Rank class's full package is specified in the "create" command and therefore must, logically, match the package in both the .java and the .jar files.

Consider a table named 'test' consisting of columns 'user', 'category', and 'value', containing the following data:

hive> select * from test;
user1 catA 1
user1 catB 11
user1 catC 111
user2 catA 222
user2 catB 22
user2 catC 2
user3 catA 3
user3 catB 5
user3 catC 4

So the top category for user1 is catC, for user2 is catA and for user3 is catB.  Say we want the top N valued categories for each user.  In the example below, N is 2 (it is indicated in the final WHERE clause).  Here is the format of the corresponding ranked query, and its result:

hive> SELECT user, category, value, ranked_col
    SELECT user, category, value, rank(user) ranked_col
        FROM (
            SELECT user, category, value
            FROM test
        DISTRIBUTE BY user
        SORT BY user, value desc
    ) a
) b
WHERE ranked_col < 2
ORDER BY user, ranked_col;
[wait for Hive query and MapReduce job(s) to finish]
user1 catC 111 0
user1 catB 11 1
user2 catA 222 0
user2 catB 22 1
user3 catB 5 0
user3 catC 4 1

Note that ranks are 0-indexed (of course I suppose that's a property of the specific .java we wrote above, and therefore is easily amenable to 1-indexing, which would more canonically connote the notion of a "rank").

Anyway, that's what I came up with.  I don't by any means claim it's the best approach.  Edward is surely right that the best method would be to use the powerful tools made available by the large developer community such as git and maven.


Keith Wiley     [EMAIL PROTECTED]     keithwiley.com    music.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
                                           --  Homer Simpson