Saurabh Bhatnagar 2013-09-30, 15:32
Saurabh B 2013-09-30, 18:03
Nitin Pawar 2013-09-30, 18:41
Saurabh B 2013-09-30, 18:54
Check out these presentations from Data Science Maryland back in May.
1. working with Tweets in Hive:
2. then pulling stuff out of Hive to use with Mahout:
The Mahout talk didn't have a directly useful outcome (largely because it
tried to work with the tweets as individual text documents), but it does
get through all the mechanics of exactly what you state you want.
The meetup page also has links to video, if the slides don't give enough
On Mon, Sep 30, 2013 at 11:54 AM, Saurabh B <[EMAIL PROTECTED]> wrote:
> Hi Nitin,
> No offense taken. Thank you for your response. Part of this is also trying
> to find the right tool for the job.
> I am running queries to determine the cuts of tweets that I want, then doing
> some modest normalization (through a Python script), and then I want to
> create SequenceFiles from that.
> So far Hive seems to be the most convenient way to do this, but I can take
> a look at Pig too. It looked like "STORED AS SEQUENCEFILE" gets me 99% of
> the way there, so I was wondering if there was a way to get those ids in
> there as well. The last piece is always the stumbler :)
> Thanks again,
> On Mon, Sep 30, 2013 at 2:41 PM, Nitin Pawar <[EMAIL PROTECTED]> wrote:
>> Are you using Hive just to convert your text files to sequence files?
>> If that's the case then you may want to look at the purpose why Hive was
>> If you want to modify data or process data which does not involve any
>> kind of analytics functions on a routine basis.
>> If you want to do data manipulation or enrichment and do not want to
>> code a lot of MapReduce jobs, you can take a look at Pig scripts.
>> Basically, what you want to do is generate a UUID for each of your tweets
>> and then feed them to the Mahout algorithms.
>> Sorry if I understood it wrong or it sounds rude.
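For the "generate a UUID for each tweet" step, here is a minimal sketch of a Python streaming script that could be plugged into a Hive TRANSFORM clause. It is only an illustration under assumptions: the script name (normalize_tweets.py), the function names, and the choice of normalization (lowercasing and collapsing whitespace) are hypothetical, and it assumes Hive passes a single column holding the tweet text. Hive streams rows to the script on stdin as tab-separated fields and reads tab-separated fields back from stdout.

```python
import re
import sys
import uuid


def normalize(text):
    """Collapse all whitespace runs (including tabs) to one space, then lowercase.

    Removing tabs matters because Hive uses tab as the column separator
    for rows coming back from the script.
    """
    return re.sub(r"\s+", " ", text).strip().lower()


def transform(lines, make_id=lambda: str(uuid.uuid4())):
    """Yield 'id<TAB>normalized text' for every non-empty input line.

    make_id is injectable only so the function is easy to test;
    by default each row gets a random (version 4) UUID.
    """
    for line in lines:
        text = normalize(line.rstrip("\n"))
        if text:
            yield make_id() + "\t" + text


# Run as a streaming script; skipped when stdin is an interactive terminal.
if __name__ == "__main__" and not sys.stdin.isatty():
    for row in transform(sys.stdin):
        print(row)
```

Assuming a table named tweets with a column named text (both hypothetical), the script would be registered with ADD FILE normalize_tweets.py; and called as SELECT TRANSFORM(text) USING 'python normalize_tweets.py' AS (id, text) FROM tweets; with the result inserted into a table declared STORED AS SEQUENCEFILE. Note that this puts the id in a column of each row, not in the SequenceFile key; as far as I know Hive keeps the row data in the value, so whatever reads the files for Mahout would still need to split the id back out.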
Saurabh B 2013-09-30, 19:55