Yes there is. Each document has a UUID as its identifier. The actual output of my MapReduce job that produces the list of person names looks like this:
docId Name Type length offset
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4 Lea PERSON 3 10858
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4 Lea PERSON 3 11063
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4 Ken PERSON 3 11186
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4 Marottoli PERSON 9 11234
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4 Berkowitz PERSON 9 17073
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4 Lea PERSON 3 17095
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4 Stephanie PERSON 9 17330
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4 Putt PERSON 4 17340
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4 Stephanie PERSON 9 17347
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4 Stephanie PERSON 9 17480
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4 Putt PERSON 4 17490
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4 Berkowitz PERSON 9 19498
f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4 Stephanie PERSON 9 19530
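Within one document, rows sorted by offset can be merged into full names by checking whether consecutive tokens are exactly one space apart (e.g. Stephanie at 17330 with length 9, then Putt at 17340). A minimal sketch in Java; the Token type, class name, and sample values below are illustrative, not taken from the actual job:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch (not the actual job): merge PERSON tokens that are adjacent in the
 *  text (separated by exactly one space) into one multi-token name. Input rows
 *  must belong to a single document and be sorted by offset. */
public class NameMerger {

    /** One extracted row: (name, length, offset). Hypothetical helper type. */
    static class Token {
        final String name; final int len; final long offset;
        Token(String name, int len, long offset) {
            this.name = name; this.len = len; this.offset = offset;
        }
    }

    /** Returns merged (fullName, startOffset) pairs. */
    static List<Map.Entry<String, Long>> merge(List<Token> tokens) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        StringBuilder current = null;
        long start = 0;
        long end = 0;                      // offset just past the current run
        for (Token t : tokens) {
            if (current != null && t.offset - end == 1) {
                current.append(' ').append(t.name);   // exactly one space apart
            } else {
                if (current != null) out.add(new SimpleEntry<>(current.toString(), start));
                current = new StringBuilder(t.name);
                start = t.offset;
            }
            end = t.offset + t.len;
        }
        if (current != null) out.add(new SimpleEntry<>(current.toString(), start));
        return out;
    }

    public static void main(String[] args) {
        List<Token> rows = List.of(
                new Token("Stephanie", 9, 17330L),
                new Token("Putt", 4, 17340L),
                new Token("Berkowitz", 9, 19498L));
        for (Map.Entry<String, Long> e : merge(rows))
            System.out.println(e.getKey() + " @ " + e.getValue());
        // prints "Stephanie Putt @ 17330" then "Berkowitz @ 19498"
    }
}
```

The same comparison works for the sample rows above: 17340 - (17330 + 9) = 1, so Stephanie and Putt combine; Berkowitz at 19498 does not.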
Use the following code to produce the table inside Hive:
DROP TABLE IF EXISTS entities_extract;
CREATE TABLE entities_extract (doc_id STRING, name STRING, type STRING, len INT, offset BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/home/researcher/hadoop-runnables/files/entitie_extract_by_doc.txt' OVERWRITE INTO TABLE entities_extract;
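With the table loaded, one way to make sure all rows of one document reach the same task is Hive's DISTRIBUTE BY with a streaming reducer. A sketch extending the table above; the merge_names.py script and the output column names are hypothetical:

```sql
-- Sketch: DISTRIBUTE BY routes every row of a document to the same reducer;
-- SORT BY orders the rows by offset so adjacent tokens arrive consecutively.
FROM (
  SELECT doc_id, name, len, offset
  FROM entities_extract
  WHERE type = 'PERSON'
  DISTRIBUTE BY doc_id
  SORT BY doc_id, offset
) t
SELECT TRANSFORM (t.doc_id, t.name, t.len, t.offset)
USING 'merge_names.py'   -- hypothetical script that joins adjacent tokens
AS doc_id, full_name, first_offset;
```

This is the standard Hive streaming pattern: the subquery fixes the partitioning and ordering, and the TRANSFORM script sees one document's rows as a contiguous, offset-sorted run.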
On Feb 3, 2013, at 8:07 PM, John Omernik <[EMAIL PROTECTED]> wrote:
> Is there something akin to a document ID so we can ensure all rows belonging to the same document are sent to one mapper?
> On Feb 3, 2013 1:00 PM, "Martijn van Leeuwen" <[EMAIL PROTECTED]> wrote:
> Hi John,
> Here is some background about my data and what I want as output.
> I have 215K documents containing text. From those text files I extract names of persons, organisations and locations using the Stanford NER library (see http://nlp.stanford.edu/software/CRF-NER.shtml).
> Looking at the following line:
> Jan Janssen was on his way to Klaas to sell the vehicle Jan Janssen stole from his father.
> When the classifier is done annotating, the line looks like this:
> <PERSON>Jan</PERSON><OFFSET>0</OFFSET> <PERSON>Janssen</PERSON><OFFSET>5</OFFSET> was on his way to <PERSON>Klaas</PERSON><OFFSET>26</OFFSET> to sell the vehicle <PERSON>Jan</PERSON><OFFSET>48</OFFSET> <PERSON>Janssen</PERSON><OFFSET>50</OFFSET> stole from his father.
> When looping through this annotated line you can save each person and its offsets (please note that the offset is a LONG value) inside a Map, for example:
> MAP<STRING, LONG> entities (note that Jan occurs twice, so in practice this needs a multimap or a list of (name, offset) pairs):
> Jan, 0
> Janssen, 5
> Klaas, 26
> Jan, 48
> Janssen, 50
> Jan Janssen in the line is actually one person, not two. Jan occurs at offset 0; to determine whether Janssen belongs to Jan, I can subtract the length of Jan (3) plus 1 (for the whitespace) from Janssen's offset (5), and if the outcome is not greater than 1, combine the two persons into one.
> (offset Janssen) - (offset Jan + length Jan + whitespace) not greater than 1
> If this is true then combine the two persons and save the result inside a new MAP<STRING, LIST<LONG>> like
> Jan Janssen, [ 0 ].
> The next time we come across Jan Janssen in the text, just append the offset, which produces the following MAP<STRING, LIST<LONG>>:
> Jan Janssen, [0, 48]
> I hope this clarifies my question.
> If things are still unclear please don't hesitate to ask me to clarify my question further.
> Kind regards,
> On Feb 3, 2013, at 1:05 PM, John Omernik <[EMAIL PROTECTED]> wrote:
>> Well there are some methods that may work, but I'd have to understand your data and your constraints more. You want to be able to (as it sounds) sort by offset, then look at one row and the next row to determine if the two items should be joined. It "looks" like you are doing a string comparison between numbers: from "100" to "104" there is only one "position" out of three that is different (0 vs 4). Trouble is, look at id 3 and id 4: 150 to 160 is only one position different as well; are you looking for Klaas Jan? Also, is the ID field filled from the first match? It seems like you have some very odd data here. I don't think you've provided enough information on the data for us to be able to help you.