Renormalisation of Cassandra CQL maps
Hi all,

I sent this mail to the Cassandra users list with no luck, so maybe the Pig
users list is a better place.

I am solving an issue with Pig integration with Cassandra using CqlStorage. I
don't know whether the problem lies in CqlStorage, in my limited understanding
of Pig (I hope this is actually the case), or in some bug in the combination
of Pig and CqlStorage.

I have a table using CQL maps:

CREATE TABLE test (
  name text PRIMARY KEY,
  sources map<text, text>
)

I need to denormalise the map in order to perform some sanity checks on
the rest of the DB (an outer join using values from the map with other
tables in the Cassandra keyspace). I want to create triples containing the
table key, map key and map value for further joining. The size of the map is
anything between null and tens of entries. The table test itself is pretty
small.
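
To make the target shape concrete, here is a minimal Python sketch (not Pig)
of the transformation described above. The function name to_triples and the
sample rows are illustrative assumptions, not part of the real data set; a
null map is assumed to contribute no triples.

```python
def to_triples(rows):
    """Expand (name, map) rows into (name, map_key, map_value) triples."""
    triples = []
    for name, sources in rows:
        # A null (None) map is treated like an empty one: no triples emitted.
        for map_key, map_value in (sources or {}).items():
            triples.append((name, map_key, map_value))
    return triples


# Hypothetical sample rows mirroring the table definition above.
rows = [("name1", {"k1": "s1", "k2": "s2"}), ("name2", None)]
for triple in to_triples(rows):
    print(triple)
```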

This is what I do:

grunt> data = LOAD 'cql://keyspace/test' USING CqlStorage();
grunt> describe data;
data: {name: chararray,sources: ()}
grunt> data1 = filter data by sources is not null;
grunt> dump data1;
(name1,((k1,s1),(k2,s2)))
grunt> data2 = foreach data1 generate name, flatten(sources);
grunt> dump data2;
(name1,(k1,s1),(k2,s2))
grunt> describe data2;
Schema for data2 unknown.
grunt> data3 = FOREACH data2 generate $0 as name, FLATTEN(TOBAG($1..$100));
// I know there will be max tens of records in the map
grunt> dump data3;
(name1,k1,s1)
(name1,k2,s2)
(name1,)
(name1,)
... 95 more lines here ...
grunt> data4 = FILTER data3 BY $1 IS NOT null;
grunt> dump data4;
(name1,k1,s1)
(name1,k2,s2)
grunt> describe data4;
data4: {name: bytearray,bytearray}
grunt> data5 = foreach data4 generate $0, $1;
grunt> dump data5;
(name1,k1)
(name1,k2)
grunt> p = foreach data4 generate $0, $2;
Details at logfile: /..../pig_xxx.log
From the log file:
Pig Stack Trace
---------------
ERROR 1000:
<line 28, column 33> Out of bound access. Trying to access non-existent
column: 2. Schema name:bytearray,:bytearray has 2 column(s).

org.apache.pig.impl.plan.PlanValidationException: ERROR 1000:
<line 28, column 33> Out of bound access. Trying to access non-existent
column: 2. Schema name:bytearray,:bytearray has 2 column(s).
	at org.apache.pig.newplan.logical.expression.ProjectExpression.findColNum(ProjectExpression.java:197)
	at org.apache.pig.newplan.logical.expression.ProjectExpression.setColumnNumberFromAlias(ProjectExpression.java:174)

Given the schema, this is no surprise. What is strange is that I can see the
map values in the dump output (see dump data4), but I have no way to access
them from Pig Latin.

I tried to simulate the situation using the PigStorage loader. This is the
closest I got (not exactly the same, but roughly):

grunt> data = load 'test.csv' using PigStorage(',');
grunt> dump data;
(key1,mk1,mv1,mk2,mv2)
(key2)
(key3,mk1,mv3,mk2,mv4)
grunt> data1 = foreach data generate $0, TOTUPLE($1, $2), TOTUPLE($3, $4);
grunt> dump data1;
(key1,(mk1,mv1),(mk2,mv2))
(key2,(,),(,))
(key3,(mk1,mv3),(mk2,mv4))
grunt> data2 = FOREACH data1 generate $0 as name, FLATTEN(TOBAG($1..$2));
grunt> dump data2;
(key1,mk1,mv1)
(key1,mk2,mv2)
(key2,,)
(key2,,)
(key3,mk1,mv3)
(key3,mk2,mv4)
grunt> describe data2;
data2: {name: bytearray,bytearray,bytearray}

This is exactly what I need. The only problem is that this simulation does not
let me specify an arbitrarily high upper bound in the FLATTEN(TOBAG()) call:
I need to know the size of the row in advance.
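
One possible way around the fixed upper bound might be a small Jython UDF that
turns the tuple of (key, value) tuples into a bag of whatever length the row
has. The following is only a sketch and has not been tested against
CqlStorage: it assumes the map arrives as a tuple of two-field tuples, as the
dump of data1 suggests. The file name maputil.py, the function name
tuple_to_bag and the schema field names are hypothetical. The outputSchema
decorator is the one Pig makes available to Jython UDF scripts; the shim at
the top only exists so the file can also be run outside Pig.

```python
# maputil.py (hypothetical file name)
try:
    outputSchema  # expected to be provided by Pig for registered Jython UDFs
except NameError:
    # Outside Pig: a no-op stand-in so the function can be exercised directly.
    def outputSchema(schema):
        return lambda func: func


@outputSchema("pairs:bag{pair:tuple(map_key:chararray,map_value:chararray)}")
def tuple_to_bag(sources):
    """Convert a tuple of (key, value) tuples into a list of pairs.

    A Python list of tuples returned from a Jython UDF is handed back
    to Pig as a bag, so its length does not need to be known in advance.
    """
    if sources is None:
        return []  # null map: empty bag
    return [(pair[0], pair[1])
            for pair in sources
            if pair is not None and len(pair) >= 2]


# Intended use from grunt (an assumption, not verified with CqlStorage):
#   REGISTER 'maputil.py' USING jython AS maputil;
#   data2 = FOREACH data1 GENERATE name, FLATTEN(maputil.tuple_to_bag(sources));
```

Because the UDF declares an output schema, the flattened relation should
expose map_key and map_value as named columns rather than a single bytearray.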

Questions:

- Is this the correct way to renormalise the data? (I am a Pig newbie.)
- Couldn't there be a problem with the internal data representation returned
from CqlStorage? See the difference between the data loaded from the file and
the data loaded from Cassandra.

Versions: Cassandra 1.2.11, Pig 0.12.

Thanks in advance,

Ondrej Cernos