HDFS, mail # user - Re: Reading json format input


Re: Reading json format input
Russell Jurney 2013-05-29, 22:13
Seriously consider Pig (free answer, 4 LOC):

my_data = LOAD 'my_data.json'
    USING com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
words = FOREACH my_data GENERATE $0#'author' AS author,
    FLATTEN(TOKENIZE($0#'text')) AS word;
word_counts = FOREACH (GROUP words BY word)
    GENERATE group AS word, COUNT_STAR(words) AS word_count;
STORE word_counts INTO '/tmp/word_counts.txt';

It will be faster than the Java you'll likely write.
On Wed, May 29, 2013 at 2:54 PM, jamal sasha <[EMAIL PROTECTED]> wrote:

> Hi,
>    I am stuck again. :(
> My input data is in HDFS. I am again trying to do wordcount, but there is a
> slight difference.
> The data is in json format.
> So each line of data is:
>
> {"author":"foo", "text": "hello"}
> {"author":"foo123", "text": "hello world"}
> {"author":"foo234", "text": "hello this world"}
>
> So I want to do wordcount on the "text" part.
> I understand that in the mapper I just have to parse each line as JSON and
> extract "text", and the rest of the code stays the same, but I am trying to
> switch from Python to Java Hadoop.
> How do I do this?
> Thanks
>

--
Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
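
For reference, the plain MapReduce route the question describes would look roughly like the sketch below. It is only a minimal sketch, not the answer given in the thread: the class name JsonWordCountMapper is illustrative, and it assumes a JSON parser such as org.json is bundled on the job classpath (for example via -libjars).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONObject;

// Minimal sketch: parse each input line as a JSON object and emit (word, 1)
// for every token in the "text" field. The reducer is the usual word-count
// sum and is omitted here.
public class JsonWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each line is a JSON document such as {"author":"foo", "text":"hello"}
        JSONObject json = new JSONObject(value.toString());
        String text = json.optString("text", "");
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

The driver and reducer are the standard word-count wiring; the only change from the plain-text version is parsing each line as JSON inside the mapper.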