I have a problem and I hope someone has an idea how to solve it.
My dataset consists of simple key-value pairs of strings, imported
from PostgreSQL with Sqoop.
1) I need to count how often a key occurs -> Easy
2) I need to count how often a key-value pair occurs -> Easy
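To make the two counts concrete, here is a small illustration in Python on a few made-up pairs. In Hive these are plain GROUP BY counts; the sample data and the variable names below are hypothetical and only show the intended result.

```python
from collections import Counter

# Hypothetical sample of (key, value) string pairs.
pairs = [("color", "red"), ("color", "blue"), ("color", "red"), ("size", "L")]

# 1) How often each key occurs.
key_counts = Counter(key for key, _ in pairs)

# 2) How often each (key, value) pair occurs.
pair_counts = Counter(pairs)
```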
I need to write the results back to PostgreSQL, into two tables:
a) "keys" with the columns: id, key_name, count
b) "values" with the columns: id, key_id, value_name, count
The ids I'm referring to don't exist yet, and I'm looking for a way
to generate them. They have to be integers/longs, but they don't
have to follow any order or pattern. I'm not concerned about
performance either, as this query will be run monthly at most.
Do you have any idea how I could introduce this new column into the
output of query 1)? I could easily introduce it into 2) with a join
then. I thought about using a custom reducer script, but I have never
written one, and it would require running with only one reducer so
that it can simulate an auto-incrementer. My current best idea is to
write a regular MR job that processes the Hive output, but I'd love to
do everything in Hive if possible.
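To show what I mean by the custom reducer script, here is a minimal, untested-in-Hive sketch in Python. It assumes the script receives the output of query 1) as tab-separated "key<TAB>count" lines, and that all rows pass through a single reducer; the function name assign_ids and the starting id of 1 are my own placeholders.

```python
def assign_ids(lines, start=1):
    """Yield 'id<TAB>key<TAB>count' for each 'key<TAB>count' input line.

    The ids are only unique if every row passes through this one
    process, i.e. the job runs with a single reducer (assumption).
    """
    next_id = start
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the count is always the final field.
        key, count = line.rsplit("\t", 1)
        yield "%d\t%s\t%s" % (next_id, key, count)
        next_id += 1

# As a streaming script this would be driven from standard input, e.g.:
#   import sys
#   for row in assign_ids(sys.stdin):
#       print(row)
```

The resulting id column could then be joined onto the output of query 2) by key to fill in key_id. The weakness is the one mentioned above: with more than one reducer, each process would start counting from the same value and the ids would collide.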
I may well be approaching this problem completely wrong, so don't
hesitate to propose a better solution or bash me for my poor
understanding of Hive :)
Thanks for any input and help.