Pig, mail # user - JOIN not printing properly


Re: JOIN not printing properly
Jacob Perkins 2011-11-04, 14:57
Have you taken a look at Pygmalion
(http://github.com/jeromatron/pygmalion)? It makes it MUCH easier to
work with tabular data from Cassandra, like you're trying to do.

For example:

what_cassandrastorage_should_really_produce = FOREACH rows GENERATE
    key AS key,
    FromCassandraBag('url,cache_hit', columns) AS (url:chararray, cache_hit:chararray);

DUMP what_cassandrastorage_should_really_produce;

(key1, http://www.google.com, hit)
(key2, http://www.google.com, miss)
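
To make the reshaping concrete, here is a rough Python sketch of the pivot idea: each (key, bag of (name, value)) row becomes one flat record with the requested columns in order. This is only a model of the data flow, not Pygmalion's implementation; the `pivot` function and the sample `rows` list are hypothetical, with values taken from the quoted message below.

```python
# Hypothetical stand-in for rows loaded via CassandraStorage:
# each row is (key, bag of (name, value) tuples).
rows = [
    ("key1", [("url", "http://www.google.com"), ("cache_hit", "hit")]),
    ("key2", [("url", "http://www.google.com"), ("cache_hit", "miss")]),
]


def pivot(rows, wanted):
    """Turn each (key, bag) row into a flat tuple (key, col1, col2, ...).

    Columns that are missing from a row's bag come out as None.
    """
    out = []
    for key, columns in rows:
        lookup = dict(columns)  # name -> value for this row
        out.append((key,) + tuple(lookup.get(name) for name in wanted))
    return out


table = pivot(rows, ["url", "cache_hit"])
for record in table:
    print(record)
```

Once the data is in this one-record-per-key shape, no grouping or self-join is needed to get url and cache_hit side by side.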

Does that work for your use case?

--jacob
@thedatachef
On Fri, 2011-11-04 at 08:51 -0400, AD wrote:
> Hello,
>
>  I am pulling data from Cassandra into Pig, which means it ends up like key,
> bag { (name,value),(name,value) }.  The data is log files, so each column is
> a field in a server log file (like Apache).  I have the following Pig script to
> combine 2 fields and count them, but the GENERATE after the JOIN is not
> printing the right field.  Is there an easier way to solve this, and does
> anyone know why the join output is broken?
>
> rows = LOAD 'cassandra://Keyspace1/Logs' USING CassandraStorage() AS (key,
> columns: bag {T: tuple(name, value)});
>
>  A = FOREACH rows GENERATE $0, flatten($1) ; -- FLATTEN
> *(key1,url,http://www.google.com)*
> *(key1,cache_hit,hit)*
> *(key2,url,http://www.google.com)*
> *(key2,cache_hit,miss)*
>
>  B = group A by key ; -- Combine url and cache_hit into one record
> *(key1,{(key1,url,http://www.google.com),(key1,cache_hit,hit)})*
> *(key2,{(key2,url,http://www.google.com),(key2,cache_hit,miss)})*
>
>  -- Create 2 lists and then JOIN them
>
>  C = FOREACH B {
>  u = FILTER A by name == 'url';
>  GENERATE FLATTEN(u.(key,value)) ;
>  }
> * (key1,http://www.google.com)*
> * (key2,http://www.google.com)*
>
>  D = FOREACH B {
>  u2 = FILTER A by name == 'cache_hit';
>  GENERATE FLATTEN(u2.(key,value));
>  }
>  *(key1,hit)*
> * (key2,miss)*
>
>  E = join C by key, D by key ;
> *(key1,http://www.google.com,key1,hit)*
> *(key2,http://www.google.com,key2,miss)*
>
> describe E ;
> E: {C::u::key: chararray,C::u::value: chararray,D::u2::key:
> chararray,D::u2::value: chararray}
>
> F = FOREACH E GENERATE C::u::value, D::u2::value ;
>
> *dump F ;*
> *(http://www.google.com,http://www.google.com)  ?? Why not www.google.com,
> hit ????*
> *(http://www.google.com,http://www.google.com)*
>
> Any help appreciated.
> AD
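
For reference, the result AD expected from F can be worked out from the four flattened tuples in step A. The Python below is a small model of steps C through F under that assumption. It mirrors only the intended data flow (filter by column name, join on key, project the two values), not how Pig resolves aliases such as C::u::value, so it does not explain why the real script printed the url twice.

```python
# The four flattened tuples shown after step A in the quoted message.
A = [
    ("key1", "url", "http://www.google.com"),
    ("key1", "cache_hit", "hit"),
    ("key2", "url", "http://www.google.com"),
    ("key2", "cache_hit", "miss"),
]

# C and D: keep (key, value) for one column name each.
C = [(key, value) for key, name, value in A if name == "url"]
D = [(key, value) for key, name, value in A if name == "cache_hit"]

# E: inner join of C and D on key.
E = [(ck, cv, dk, dv) for ck, cv in C for dk, dv in D if ck == dk]

# F: project the value field from each side of the join.
F = [(cv, dv) for _, cv, _, dv in E]

for record in F:
    print(record)
```

The intended F therefore pairs each url with its cache_hit value, one record per key.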