Re: Reading json format input
For some reason, this has to be in Java :(
I am trying to use the org.json library, something like this (in the mapper):
JSONObject jsn = new JSONObject(value.toString());

String text = (String) jsn.get("text");
StringTokenizer itr = new StringTokenizer(text);

But it's not working :(
It would be better to do this properly, but I wouldn't mind using a
hack as well :)
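
For the "hack" route, here is a minimal sketch that pulls the "text" value out of each line with a regular expression instead of a JSON parser, then tokenizes it the same way as above. It uses only the standard library so it can be tried outside Hadoop. The class and method names (JsonTextWordCount, extractText, countWords) are made up for illustration, and the sketch assumes one flat JSON object per line with a string-valued "text" field; escape sequences inside the value are left undecoded.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of Hadoop or org.json.
class JsonTextWordCount {

    // Matches "text" : "<value>" and captures the value, allowing
    // backslash-escaped characters (such as \") inside the string.
    private static final Pattern TEXT_FIELD =
            Pattern.compile("\"text\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    // Returns the raw "text" value of one JSON line, or null if there is none.
    // Assumption: escape sequences are returned as-is, not decoded.
    static String extractText(String jsonLine) {
        Matcher m = TEXT_FIELD.matcher(jsonLine);
        return m.find() ? m.group(1) : null;
    }

    // Adds the whitespace-separated words of one line's "text" value to counts.
    static void countWords(String jsonLine, Map<String, Integer> counts) {
        String text = extractText(jsonLine);
        if (text == null) {
            return; // skip lines without a "text" field
        }
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        // The three sample lines from the original question.
        String[] lines = {
            "{\"author\":\"foo\", \"text\": \"hello\"}",
            "{\"author\":\"foo123\", \"text\": \"hello world\"}",
            "{\"author\":\"foo234\", \"text\": \"hello this world\"}"
        };
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            countWords(line, counts);
        }
        System.out.println(counts); // prints {hello=3, this=1, world=2}
    }
}
```

In an actual mapper, the loop body would presumably emit each token with a count of one (as in the stock WordCount example) rather than update a local map. A real JSON parser is still the safer choice if the input can contain nested objects or escaped text.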
On Wed, May 29, 2013 at 4:30 PM, Michael Segel <[EMAIL PROTECTED]> wrote:

> Yeah,
> I have to agree with Russell. Pig is definitely the way to go on this.
> If you want to do it as a Java program, you will have to do some work on
> the input string, but that too should be trivial.
> How formal do you want to be?
> Do you want to strip it down or just find the quote after the text part?
> On May 29, 2013, at 5:13 PM, Russell Jurney <[EMAIL PROTECTED]>
> wrote:
> Seriously consider Pig (free answer, 4 LOC):
> my_data = LOAD 'my_data.json' USING
> com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
> words = FOREACH my_data GENERATE $0#'author' as author,
> FLATTEN(TOKENIZE($0#'text')) as word;
> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word,
> COUNT_STAR(words) AS word_count;
> STORE word_counts INTO '/tmp/word_counts.txt';
> It will be faster than the Java you'll likely write.
> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <[EMAIL PROTECTED]> wrote:
>> Hi,
>>    I am stuck again. :(
>> My input data is in HDFS. I am again trying to do a wordcount, but there is
>> a slight difference.
>> The data is in JSON format.
>> So each line of data is:
>> {"author":"foo", "text": "hello"}
>> {"author":"foo123", "text": "hello world"}
>> {"author":"foo234", "text": "hello this world"}
>> So I want to do a wordcount on the text part.
>> I understand that in the mapper I just have to parse this data as JSON and
>> extract "text", and the rest of the code is just the same, but I am trying to
>> switch from Python to Java Hadoop.
>> How do I do this?
>> Thanks
> --
> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com