HDFS >> mail # user >> Reading json format input


Re: Reading json format input
For that, you only have to write intermediate data if the word equals "text":

String[] words = line.split("\\W+");

for (String word : words) {
    if (word.equals("text")) {
        context.write(new Text(word), new IntWritable(1));
    }
}
I am assuming you have a huge volume of data; otherwise MapReduce will be
overkill and a simple regex will do.
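As a rough sketch of that non-MapReduce route (assuming the data fits on one machine; the class and method names here are just illustrative), counting whole-word occurrences with a plain regex in Java could look like:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexCount {
    // Count whole-word occurrences of `word` in `input`,
    // using \b word boundaries so "context" does not match "text".
    static int count(String input, String word) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b")
                           .matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String line = "{\"author\":\"foo\", \"text\": \"hello text text\"}";
        System.out.println(count(line, "text"));
    }
}
```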

On Wed, May 29, 2013 at 4:45 PM, jamal sasha <[EMAIL PROTECTED]> wrote:

> Hi Rishi,
>    But I don't want the word count of all the words.
> In the JSON there is a field "text", and those are the words I wish to count.
>
>
> On Wed, May 29, 2013 at 4:43 PM, Rishi Yadav <[EMAIL PROTECTED]>wrote:
>
>> Hi Jamal,
>>
>> I took your input, put it in a sample wordcount program, and it's working
>> just fine, giving this output:
>>
>> author 3
>> foo234 1
>> text 3
>> foo 1
>> foo123 1
>> hello 3
>> this 1
>> world 2
>>
>>
>> When we split using
>>
>> String[] words = input.split("\\W+");
>>
>> it takes care of all non-alphanumeric characters.
>>
>> Thanks and Regards,
>>
>> Rishi Yadav
>>
>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <[EMAIL PROTECTED]>wrote:
>>
>>> Hi,
>>>    I am stuck again. :(
>>> My input data is in HDFS. I am again trying to do wordcount, but there is
>>> a slight difference.
>>> The data is in json format.
>>> So each line of data is:
>>>
>>> {"author":"foo", "text": "hello"}
>>> {"author":"foo123", "text": "hello world"}
>>> {"author":"foo234", "text": "hello this world"}
>>>
>>> So I want to do wordcount for the "text" part.
>>> I understand that in the mapper I just have to parse each line as JSON,
>>> extract "text", and the rest of the code is the same, but I am trying to
>>> switch from Python to Java Hadoop.
>>> How do I do this?
>>> Thanks
>>>
>>
>>
>
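
Putting the pieces of the thread together, here is a minimal, self-contained sketch of the mapper logic the original question asks for: pull the "text" field out of each JSON line and count its words. For brevity it extracts the field with a regex that only handles the flat, one-level records shown above; a real job should use a proper JSON library (e.g. Jackson), and inside a Hadoop mapper the `counts.merge` line would instead be a `context.write`. The class and method names are illustrative, not from the thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextFieldWordCount {
    // Matches the value of the "text" field in one flat JSON record.
    // This is only a sketch; use a JSON parser for real input.
    private static final Pattern TEXT_FIELD =
        Pattern.compile("\"text\"\\s*:\\s*\"([^\"]*)\"");

    static Map<String, Integer> countWords(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = TEXT_FIELD.matcher(line);
            if (!m.find()) {
                continue; // skip records without a "text" field
            }
            for (String word : m.group(1).split("\\W+")) {
                if (word.isEmpty()) {
                    continue;
                }
                // In the Hadoop mapper this would be:
                // context.write(new Text(word), new IntWritable(1));
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] input = {
            "{\"author\":\"foo\", \"text\": \"hello\"}",
            "{\"author\":\"foo123\", \"text\": \"hello world\"}",
            "{\"author\":\"foo234\", \"text\": \"hello this world\"}"
        };
        System.out.println(countWords(input));
    }
}
```

Run against the thread's sample input, this counts only the words inside "text" (hello, world, this), not the author values, which is the behavior Jamal asked for.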