Pig >> mail # user >> CROSS/Self-Join Bug - Please Help :(


Re: CROSS/Self-Join Bug - Please Help :(
If you store immediately after the CROSS, it works. If you add another
FOREACH/GENERATE or similar step after the CROSS, it does not.
On Wed, Dec 4, 2013 at 1:41 PM, Pradeep Gollakota <[EMAIL PROTECTED]> wrote:

> I tried the following script (not exactly the same) and it worked correctly
> for me.
>
> businesses = LOAD 'dataset' using PigStorage(',') AS (a, b, c,
> business_id: chararray, lat: double, lng: double);
> locations = FOREACH businesses GENERATE business_id, lat, lng;
> STORE locations INTO 'locations.tsv';
> locations2 = LOAD 'locations.tsv' AS (business_id, lat, long);
> loc_com = CROSS locations2, locations;
> dump loc_com;
>
> I’m wondering if your problem has something to do with the way that
> JsonStorage works. Another thing you can try is to load ‘locations.tsv’
> twice and do a self-cross on that.
>
>
> On Wed, Dec 4, 2013 at 1:21 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:
>
> > I have this bug that is killing me, where I can't self-join/cross a
> > dataset with itself. It's blocking my work :(
> >
> > The script is like this:
> >
> > businesses = LOAD
> > 'yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json' using
> > com.twitter.elephantbird.pig.load.JsonLoader() as json:map[];
> >
> > /* {open=true, neighborhoods={}, review_count=14, stars=4.0, name=Drybar,
> > business_id=LcAamvosJu0bcPgEVF-9sQ, state=AZ, full_address=3172 E Camelback Rd
> > Phoenix, AZ85018, categories={(Hair Salons),(Hair Stylists),(Beauty &
> > Spas)}, longitude=-112.0131927, latitude=33.5107772, type=business,
> > city=Phoenix} */
> > locations = FOREACH businesses GENERATE $0#'business_id' AS business_id,
> >                                       $0#'longitude' AS longitude,
> >                                       $0#'latitude' AS latitude;
> > STORE locations INTO 'yelp_phoenix_academic_dataset/locations.tsv';
> > locations_2 = LOAD 'yelp_phoenix_academic_dataset/locations.tsv' AS
> > (business_id:chararray, longitude:double, latitude:double);
> > location_comparisons = CROSS locations_2, locations;
> >
> > distances = FOREACH businesses GENERATE locations.business_id AS business_id_1,
> >                                         locations_2.business_id AS business_id_2,
> >                                         udfs.haversine(locations.longitude,
> >                                                        locations.latitude,
> >                                                        locations_2.longitude,
> >                                                        locations_2.latitude) AS distance;
> > STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';
> >
> >
> > I have also tried converting this to a self-join using JOIN BY '1', and
> > also locations_2 = locations, and I get the same error:
> >
> > *org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has
> > more than one row in the output. 1st :
> > (rncjoVoEFUJGCUoC1JgnUA,-112.241596,33.581867), 2nd
> > :(0FNFSzCFP_rGUoJx8W7tJg,-112.105933,33.604054)*
> >
> > at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)
> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:336)
> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:438)
> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:298)
> > at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
> > at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
> > at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)

Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
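
A possible explanation, offered as a sketch rather than a confirmed diagnosis: in the quoted script the final FOREACH iterates over businesses and refers to locations.business_id. In Pig Latin, relation.field used inside a FOREACH over a different relation is read as a scalar projection, which requires that relation to contain exactly one row. That would match both the "Scalar has more than one row" message and the ReadScalars frame at the top of the trace, and it would explain why storing directly after the CROSS works. Assuming that is the cause, the last step could iterate over the crossed relation and pick each side's fields with the :: prefix. All aliases and udfs.haversine come from the quoted script; the UDF is assumed to be registered the same way it was there.

```pig
-- Assumes locations and locations_2 are defined exactly as in the quoted script.
location_comparisons = CROSS locations_2, locations;

-- Iterate over the crossed relation (not businesses), and use '::' rather
-- than '.' so each field is read from the current tuple instead of being
-- treated as a single-row scalar relation.
distances = FOREACH location_comparisons GENERATE
    locations::business_id   AS business_id_1,
    locations_2::business_id AS business_id_2,
    udfs.haversine(locations::longitude,
                   locations::latitude,
                   locations_2::longitude,
                   locations_2::latitude) AS distance;

STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';
```

Running DESCRIBE location_comparisons; before the FOREACH should show the prefixed field names that Pig actually generated, which is a quick way to check the assumption about the :: names.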