Pig >> mail # user >> CROSS/Self-Join Bug - Please Help :(


Re: CROSS/Self-Join Bug - Please Help :(
I tried the following script (not exactly the same as yours), and it worked
correctly for me.

businesses = LOAD 'dataset' USING PigStorage(',') AS (a, b, c,
    business_id: chararray, lat: double, lng: double);
locations = FOREACH businesses GENERATE business_id, lat, lng;
STORE locations INTO 'locations.tsv';
locations2 = LOAD 'locations.tsv' AS (business_id, lat, lng);
loc_com = CROSS locations2, locations;
DUMP loc_com;

I wonder whether your problem has something to do with the way that
JsonLoader works. Another thing you can try is to load 'locations.tsv'
twice and do a self-cross on that.
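
A minimal sketch of that second suggestion, assuming 'locations.tsv' has already been written by the STORE in your script and that udfs.haversine is registered as in your original script. The alias locations_1 is a name introduced here for illustration only.

```pig
-- Assumption: locations.tsv already exists from the earlier STORE step.
locations_1 = LOAD 'yelp_phoenix_academic_dataset/locations.tsv' AS
    (business_id:chararray, longitude:double, latitude:double);
locations_2 = LOAD 'yelp_phoenix_academic_dataset/locations.tsv' AS
    (business_id:chararray, longitude:double, latitude:double);

-- Every row of locations_1 paired with every row of locations_2.
location_comparisons = CROSS locations_1, locations_2;

-- Iterate over the crossed relation itself; '::' selects the field
-- that came from each side of the CROSS.
distances = FOREACH location_comparisons GENERATE
    locations_1::business_id AS business_id_1,
    locations_2::business_id AS business_id_2,
    udfs.haversine(locations_1::longitude, locations_1::latitude,
                   locations_2::longitude, locations_2::latitude) AS distance;
STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';
```

Note that the FOREACH here runs over location_comparisons and uses the '::' prefix. A reference such as locations.business_id inside a FOREACH over a different relation is treated by Pig as a scalar, which requires that relation to have exactly one row, so that may be where the "Scalar has more than one row" error comes from.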
On Wed, Dec 4, 2013 at 1:21 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:

> I have this bug that is killing me, where I can't self-join/cross a dataset
> with itself. It's blocking my work :(
>
> The script is like this:
>
> businesses = LOAD
> 'yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json' using
> com.twitter.elephantbird.pig.load.JsonLoader() as json:map[];
>
> /* {open=true, neighborhoods={}, review_count=14, stars=4.0, name=Drybar,
> business_id=LcAamvosJu0bcPgEVF-9sQ, state=AZ, full_address=3172 E Camelback
> Rd
> Phoenix, AZ85018, categories={(Hair Salons),(Hair Stylists),(Beauty &
> Spas)}, longitude=-112.0131927, latitude=33.5107772, type=business,
> city=Phoenix} */
> locations = FOREACH businesses GENERATE $0#'business_id' AS business_id,
>                                       $0#'longitude' AS longitude,
>                                       $0#'latitude' AS latitude;
> STORE locations INTO 'yelp_phoenix_academic_dataset/locations.tsv';
> locations_2 = LOAD 'yelp_phoenix_academic_dataset/locations.tsv' AS
> (business_id:chararray, longitude:double, latitude:double);
> location_comparisons = CROSS locations_2, locations;
>
> distances = FOREACH businesses GENERATE locations.business_id AS business_id_1,
>                                         locations_2.business_id AS business_id_2,
>                                         udfs.haversine(locations.longitude,
>                                                        locations.latitude,
>                                                        locations_2.longitude,
>                                                        locations_2.latitude) AS distance;
> STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';
>
>
> I have also tried converting this to a self-join using JOIN BY '1', and
> also locations_2 = locations, and I get the same error:
>
> *org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has
> more than one row in the output. 1st :
> (rncjoVoEFUJGCUoC1JgnUA,-112.241596,33.581867), 2nd
> :(0FNFSzCFP_rGUoJx8W7tJg,-112.105933,33.604054)*
>
> at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:336)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:438)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:298)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>
> This makes no sense! What am I to do? I can't self-join :(