Hive >> mail # user >> Combine multiple row values based upon a condition.


Martijn van Leeuwen 2013-02-02, 19:21
Dean Wampler 2013-02-03, 14:07
John Omernik 2013-02-03, 12:05
Martijn van Leeuwen 2013-02-03, 18:59
Re: Combine multiple row values based upon a condition.
Is there something akin to a document ID, so we can ensure all rows
belonging to the same document are sent to one mapper?
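
A minimal sketch of that idea, in Python rather than Hive. It assumes each row carries a hypothetical `doc_id` column, which the data shown in this thread does not have, and the sample rows are invented for illustration. Rows are grouped per document and sorted by offset, so adjacent names can be compared within one document only. In Hive, the comparable tools would be DISTRIBUTE BY on the document ID together with SORT BY, which route rows with the same key to the same reducer.

```python
from collections import defaultdict

# Hypothetical rows: (doc_id, name, offset). The doc_id column and the
# values below are assumptions made for this sketch.
rows = [
    ("doc-b", "Klaas", 7),
    ("doc-a", "Janssen", 104),
    ("doc-a", "Jan", 100),
    ("doc-b", "Piet", 0),
]


def group_by_document(rows):
    """Return {doc_id: [(name, offset), ...]}, each list sorted by offset."""
    grouped = defaultdict(list)
    for doc_id, name, offset in rows:
        grouped[doc_id].append((name, offset))
    for entities in grouped.values():
        # Sort numerically by offset so neighbours in the text are neighbours here.
        entities.sort(key=lambda entity: entity[1])
    return dict(grouped)


by_doc = group_by_document(rows)
```

Each value in `by_doc` can then be processed independently, which is what sending all rows of one document to a single task would achieve.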
On Feb 3, 2013 1:00 PM, "Martijn van Leeuwen" <[EMAIL PROTECTED]> wrote:

> Hi John,
>
> Here is some background about my data and what I want as output.
>
> I have 215K documents containing text. From those text files I extract
> names of persons, organisations and locations by using the Stanford NER
> library. (see http://nlp.stanford.edu/software/CRF-NER.shtml)
>
> Looking at the following line:
>
> Jan Janssen was on his way to Klaas to sell the vehicle Jan Janssen stole
> from his father.
>
> When the classifier is done annotating, the line looks like this:
>
> <PERSON>Jan<PERSON><OFFSET>0<OFFSET>
> <PERSON>Janssen<PERSON><OFFSET>5<OFFSET> was on his way
> to <PERSON>Klaas<PERSON><OFFSET>26<OFFSET> to sell the vehicle
> <PERSON>Jan<PERSON><OFFSET>48<OFFSET>
> <PERSON>Janssen<PERSON><OFFSET>50<OFFSET> stole from his father.
>
> When looping through this annotated line you can save the persons and their
> offsets (note that the offset is a LONG value) inside a Map, for example:
>
> MAP<STRING, LONG> entities
>
> Jan, 0
> Janssen, 5
> Klaas, 26
> Jan, 48
> Janssen, 50
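
The extraction step described above could be sketched as follows in Python. This assumes the annotated text uses exactly the tag layout shown in the message (the same `<PERSON>` and `<OFFSET>` tag on both sides of the value); the function name `extract_entities` is hypothetical. A list of pairs is used instead of a map, because a map keyed by name would keep only one offset for the repeated "Jan".

```python
import re

# Tag layout copied from the message; it is an assumption that real
# classifier output looks exactly like this.
ENTITY_PATTERN = re.compile(r"<PERSON>([^<]+)<PERSON><OFFSET>(\d+)<OFFSET>")

annotated = (
    "<PERSON>Jan<PERSON><OFFSET>0<OFFSET> "
    "<PERSON>Janssen<PERSON><OFFSET>5<OFFSET> was on his way "
    "to <PERSON>Klaas<PERSON><OFFSET>26<OFFSET> to sell the vehicle "
    "<PERSON>Jan<PERSON><OFFSET>48<OFFSET> "
    "<PERSON>Janssen<PERSON><OFFSET>50<OFFSET> stole from his father."
)


def extract_entities(text):
    """Return (name, offset) pairs in the order they appear in the text."""
    return [(name, int(offset)) for name, offset in ENTITY_PATTERN.findall(text)]


entities = extract_entities(annotated)
```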
>
> Jan Janssen in the line is actually one person, not two. Jan occurs at
> offset 0. To determine whether Janssen belongs to Jan, I could subtract the
> length of Jan (3) + 1 (whitespace) from Janssen's offset (5); if the outcome
> isn't greater than 1, combine the two persons into one person.
>
> (offset Janssen) - (offset Jan + length Jan + whitespace) not greater than 1
>
> If this is true, combine the two persons and save the result inside a new
> MAP<STRING, LONG[]> like
> Jan Janssen, [ 0 ].
>
> The next time we come across Jan Janssen in the text, just save
> the offset, which produces the following MAP<STRING, LONG[]>:
>
> Jan Janssen, [0, 48]
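
One possible reading of this combining rule, sketched in Python. The function name `combine_adjacent` and the `max_gap` parameter are hypothetical. The gap is computed as the next offset minus (previous offset + previous name length + 1 for the whitespace), and names are merged when the gap is not greater than 1. Taken literally, the rule also merges when the gap is negative, as happens with offsets 48 and 50 in the example. Keeping unmerged names such as Klaas in the result is an assumption; the message only shows the Jan Janssen entry.

```python
def combine_adjacent(entities, max_gap=1):
    """Merge consecutive names whose gap is at most max_gap.

    entities: list of (name, offset) pairs sorted by position in the text.
    Returns {full_name: [start_offset, ...]}.
    """
    merged = {}
    i = 0
    while i < len(entities):
        full_name, start = entities[i]
        prev_name, prev_offset = entities[i]
        # Keep absorbing the following names while they sit right after the previous one.
        while i + 1 < len(entities):
            next_name, next_offset = entities[i + 1]
            gap = next_offset - (prev_offset + len(prev_name) + 1)
            if gap > max_gap:
                break
            full_name += " " + next_name
            prev_name, prev_offset = next_name, next_offset
            i += 1
        # Record the offset of the first name for every occurrence of the full name.
        merged.setdefault(full_name, []).append(start)
        i += 1
    return merged


sample = [("Jan", 0), ("Janssen", 5), ("Klaas", 26), ("Jan", 48), ("Janssen", 50)]
combined = combine_adjacent(sample)
```

With the offsets from the message, `combined` maps "Jan Janssen" to `[0, 48]` and "Klaas" to `[26]`.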
>
> I hope this clarifies my question.
> If things are still unclear please don't hesitate to ask me to clarify my
> question further.
>
> Kind regards,
> Martijn
>
> On Feb 3, 2013, at 1:05 PM, John Omernik <[EMAIL PROTECTED]> wrote:
>
> Well there are some methods that may work, but I'd have to understand your
> data and your constraints more. You want to be able to (as it sounds) sort
> by offset, and then look at one row, and then the next row, to
> determine if the two items should be joined. It "looks" like you are
> doing a string comparison between numbers: from "100" to "104", only one
> "position" out of three is different (0 vs 4). Trouble is, look at id
> 3 and id 4. 150 to 160 is only one position different as well; are you
> looking for Klaas Jan? Also, are the ID fields filled from the first match?
> It seems like you have some very odd data here. I don't think you've
> provided enough information on the data for us to be able to help you.
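
The ambiguity raised here can be made concrete with a small Python sketch. Both helper names are hypothetical. Counting differing character positions treats 100/104 and 150/160 the same, while the numeric rule from the follow-up message (next offset minus previous offset, name length, and one whitespace) separates them.

```python
def differing_positions(a, b):
    """Count character positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def gap_after(prev_name, prev_offset, next_offset):
    """Characters between the end of prev_name (plus one space) and next_offset."""
    return next_offset - (prev_offset + len(prev_name) + 1)


# String view: both pairs differ in exactly one position.
string_view = (differing_positions("100", "104"), differing_positions("150", "160"))

# Numeric view: Jan -> Janssen is adjacent, Klaas -> Jan is not.
numeric_view = (gap_after("Jan", 100, 104), gap_after("Klaas", 150, 160))
```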
>
>
>
> On Sat, Feb 2, 2013 at 1:21 PM, Martijn van Leeuwen <[EMAIL PROTECTED]> wrote:
>
>> Hi all,
>>
>> I am new to Apache Hive and I am doing some tests to see if it fits my
>> needs. One of the questions I have is whether it is possible to "peek" at
>> the next row in order to find out if the values should be combined. Let me
>> explain with an example.
>>
>> Let's say my data looks like this:
>>
>> Id name offset
>> 1 Jan 100
>> 2 Janssen 104
>> 3 Klaas 150
>> 4 Jan 160
>> 5 Janssen 164
>>
>> And my output to another table should be this:
>>
>> Id fullname offsets
>> 1 Jan Janssen [ 100, 160 ]
>>
>> I would like to combine the name values from two rows where the offsets of
>> the two rows are no more than 1 character apart.
>>
>> Is this type of data manipulation possible, and if it is, could someone
>> point me in the right direction, hopefully with some explanation?
>>
>> Kind regards
>> Martijn
>
>
>
>
Martijn van Leeuwen 2013-02-03, 19:27
Edward Capriolo 2013-02-03, 19:36
John Omernik 2013-02-03, 22:54
Martijn van Leeuwen 2013-02-04, 07:47