Hive >> mail # user >> Combine multiple row values based upon a condition.


Martijn van Leeuwen 2013-02-02, 19:21
Dean Wampler 2013-02-03, 14:07
John Omernik 2013-02-03, 12:05
Martijn van Leeuwen 2013-02-03, 18:59
John Omernik 2013-02-03, 19:07
Martijn van Leeuwen 2013-02-03, 19:27
Edward Capriolo 2013-02-03, 19:36

Re: Combine multiple row values based upon a condition.
Yes, I agree with this. If you did a Hive transform to, say, a Python script
that collected your offsets per doc id, and used "distribute by" to ensure
that the script you sent the data to had all the data to work with, you
could then do the logic to join what you need to join together and emit
the resultant set.
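
A minimal sketch of what such a script could look like, assuming the rows reach it grouped by doc id and sorted by offset, and assuming that two names belong together when the second starts exactly one character after the first ends (as with Stephanie at 17330, length 9, and Putt at 17340 in the data quoted below). The function names and the one-character rule are illustrative choices, not something from the thread.

```python
from itertools import groupby

DOC = "f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4"

# Three tab-separated rows taken from the sample output quoted in this thread.
SAMPLE = [
    DOC + "\tStephanie\tPERSON\t9\t17330",
    DOC + "\tPutt\tPERSON\t4\t17340",
    DOC + "\tStephanie\tPERSON\t9\t17347",
]


def parse(lines):
    """Yield (doc_id, name, type, length, offset) tuples from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        doc_id, name, etype, length, offset = line.split("\t")
        yield doc_id, name, etype, int(length), int(offset)


def merge_adjacent(rows, gap=1):
    """Merge same-type names that are separated by exactly `gap` characters.

    Assumption: rows are grouped by doc_id and sorted by offset within each doc_id.
    """
    for doc_id, group in groupby(rows, key=lambda r: r[0]):
        current = None
        for _, name, etype, length, offset in group:
            if current is not None:
                _, cur_name, cur_type, cur_len, cur_off = current
                if etype == cur_type and offset == cur_off + cur_len + gap:
                    # Extend the current entity so it covers both names.
                    merged_len = offset + length - cur_off
                    current = (doc_id, cur_name + " " + name, etype, merged_len, cur_off)
                    continue
                yield current
            current = (doc_id, name, etype, length, offset)
        if current is not None:
            yield current


if __name__ == "__main__":
    for row in merge_adjacent(parse(SAMPLE)):
        print("\t".join(str(field) for field in row))
```

To run this under Hive, the loop at the bottom would read `sys.stdin` instead of `SAMPLE`, and the script (say, a hypothetical `merge_names.py`) would be shipped with `ADD FILE` and called through `SELECT TRANSFORM(...) USING 'python merge_names.py' AS (...)` over a subquery ending in `DISTRIBUTE BY doc_id SORT BY doc_id, offset`, so that every row for a doc id reaches the same script instance in offset order.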

On Sun, Feb 3, 2013 at 1:36 PM, Edward Capriolo <[EMAIL PROTECTED]> wrote:

> You may want to look at sort by, distribute by, and cluster by. This
> syntax controls which Reducers the data end up on and how it is sorted
> on each reducer.
>
> On Sun, Feb 3, 2013 at 2:27 PM, Martijn van Leeuwen
> <[EMAIL PROTECTED]> wrote:
> > Yes, there is. Each document has a UUID as its identifier. The actual output
> > of my map reduce job that produces the list of person names looks like this:
> >
> > docId                                    Name         Type     length   offset
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Lea          PERSON   3        10858
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Lea          PERSON   3        11063
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Ken          PERSON   3        11186
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Marottoli    PERSON   9        11234
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Berkowitz    PERSON   9        17073
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Lea          PERSON   3        17095
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie    PERSON   9        17330
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Putt         PERSON   4        17340
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie    PERSON   9        17347
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie    PERSON   9        17480
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Putt         PERSON   4        17490
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Berkowitz    PERSON   9        19498
> > f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4     Stephanie    PERSON   9        19530
> >
> > Use the following code to produce a table inside Hive.
> >
> > DROP TABLE IF EXISTS entities_extract;
> >
> >     CREATE TABLE entities_extract (doc_id STRING, name STRING, type STRING,
> >     len INT, offset BIGINT)
> >     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> >     LINES TERMINATED BY '\n'
> >     STORED AS TEXTFILE
> >     LOCATION '/research/45924/hive/entities_extract';
> >
> > LOAD DATA LOCAL INPATH
> > '/home/researcher/hadoop-runnables/files/entitie_extract_by_doc.txt'
> > OVERWRITE INTO TABLE entities_extract;
> >
> >
> >
> > On Feb 3, 2013, at 8:07 PM, John Omernik <[EMAIL PROTECTED]> wrote:
> >
> > Is there something akin to a document id so we can ensure all rows
> > belonging to the same document can be sent to one mapper?
> >
> > On Feb 3, 2013 1:00 PM, "Martijn van Leeuwen" <[EMAIL PROTECTED]> wrote:
> >>
> >> Hi John,
> >>
> >> Here is some background about my data and what I want as output.
> >>
> >> I have 215K documents containing text. From those text files I extract
> >> names of persons, organisations and locations by using the Stanford NER
> >> library. (see http://nlp.stanford.edu/software/CRF-NER.shtml)
> >>
> >> Looking at the following line:
> >>
> >> Jan Janssen was on this way to Klaas to sell the vehicle Jan Janssen stole
> >> from his father.
> >>
> >> When the classifier is done annotating, the line looks like this:
> >>
> >> <PERSON>Jan<PERSON><OFFSET>0<OFFSET>
> >> <PERSON>Janssen<PERSON><OFFSET>5<OFFSET> was on this way to
> >> <PERSON>Klaas<PERSON><OFFSET>26<OFFSET> to sell the vehicle
> >> <PERSON>Jan<PERSON><OFFSET>48<OFFSET>
> >> <PERSON>Janssen<PERSON><OFFSET>50<OFFSET> stole from his father.
> >>
> >> When looping through this annotated line you can save the persons and
> >> their offsets (please note that offset is a LONG value) inside a Map,
> >> for example:
> >>
> >> MAP<STRING, LONG> entities
> >>
> >> Jan, 0
> >> Janssen, 5
> >> Klaas, 26
> >> Jan, 48
> >> Janssen, 50
> >>
> >> Jan Janssen in the line is actually one person and not two. Jan occurs
> >> at offset 0; to determine if Janssen belongs to Jan I could subtract the
Martijn van Leeuwen 2013-02-04, 07:47