MapReduce >> mail # user >> Re: Reading json format input


Pramod N 2013-05-30, 09:02
Re: Reading json format input
You have the entire string.
If you tokenize on commas ...

Starting with :
>> {"author":"foo", "text": "hello"}
>> {"author":"foo123", "text": "hello world"}
>> {"author":"foo234", "text": "hello this world"}

Splitting the first line, you end up with two tokens:
{"author":"foo"    and    "text": "hello"}

So you can ignore the first token, then split the second token on the colon (':').

This gives you "text" and "hello"}

You can again ignore the first token and you now have "hello"}

And now you can parse out the stuff within the quotes.
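A minimal sketch of the split-and-strip steps above in plain Java (no Hadoop; the class and variable names are illustrative, and it only handles flat one-line records with no commas or quotes inside the values):

```java
public class ManualJsonSplit {
    public static void main(String[] args) {
        String line = "{\"author\":\"foo\", \"text\": \"hello\"}";

        // Step 1: split on the comma -> {"author":"foo"  and  "text": "hello"}
        String[] fields = line.split(",", 2);

        // Step 2: ignore the first token, split the second on the colon
        String[] kv = fields[1].split(":", 2);

        // Step 3: ignore the key, keep only what's inside the quotes
        String raw = kv[1];
        String text = raw.substring(raw.indexOf('"') + 1, raw.lastIndexOf('"'));

        System.out.println(text);  // hello
    }
}
```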

HTH
On May 29, 2013, at 6:44 PM, jamal sasha <[EMAIL PROTECTED]> wrote:

> Hi,
>   For some reason, this has to be in Java :(
> I am trying to use the org.json library, something like (in the mapper)
> JSONObject jsn = new JSONObject(value.toString());
>
> String text = (String) jsn.get("text");
> StringTokenizer itr = new StringTokenizer(text);
>
> But it's not working :(
> It would be better to do this properly, but I wouldn't mind using a hack as well :)
>
>
> On Wed, May 29, 2013 at 4:30 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
> Yeah,
> I have to agree with Russell. Pig is definitely the way to go on this.
>
> If you want to do it as a Java program you will have to do some work on the input string, but that too should be trivial.
> How formal do you want to go?
> Do you want to strip it down or just find the quote after the text part?
>
>
> On May 29, 2013, at 5:13 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
>
>> Seriously consider Pig (free answer, 4 LOC):
>>
>> my_data = LOAD 'my_data.json' USING com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
>> words = FOREACH my_data GENERATE $0#'author' as author, FLATTEN(TOKENIZE($0#'text')) as word;
>> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word, COUNT_STAR(words) AS word_count;
>> STORE word_counts INTO '/tmp/word_counts.txt';
>>
>> It will be faster than the Java you'll likely write.
>>
>>
>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <[EMAIL PROTECTED]> wrote:
>> Hi,
>>    I am stuck again. :(
>> My input data is in HDFS. I am again trying to do wordcount but there is a slight difference.
>> The data is in json format.
>> So each line of data is:
>>
>> {"author":"foo", "text": "hello"}
>> {"author":"foo123", "text": "hello world"}
>> {"author":"foo234", "text": "hello this world"}
>>
>> So I want to do wordcount for text part.
>> I understand that in the mapper I just have to parse each line as JSON and extract "text", and the rest of the code is just the same, but I am trying to switch from Python to Java Hadoop.
>> How do I do this?
>> Thanks
>>
>>
>>
>> --
>> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
>
>
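The org.json mapper quoted above is close to working. A hedged sketch of a fixed version — assuming one JSON object per input line, with org.json and the Hadoop MapReduce API on the job classpath (the class name is illustrative) — swaps the cast for getString and skips malformed lines instead of failing the job:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONException;
import org.json.JSONObject;

public class JsonWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            // One JSON object per line, e.g. {"author":"foo", "text":"hello"}
            JSONObject jsn = new JSONObject(value.toString());
            String text = jsn.getString("text");  // safer than (String) jsn.get("text")

            // Emit (word, 1) for each token in the "text" field
            StringTokenizer itr = new StringTokenizer(text);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        } catch (JSONException e) {
            // Skip lines that are not valid JSON rather than failing the job
        }
    }
}
```

The rest of the job (reducer summing the counts, driver class) is the standard wordcount boilerplate; only the mapper changes.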

jamal sasha 2013-05-30, 20:35