Re: Pig + Cassandra Example !!
Storing to Cassandra requires a key -> column -> value data structure from Pig.
Here is one possible approach, using a UDF to reshape the Pig data into the
form CassandraStorage expects:

-- sample pig script
-- load (key, name, value) records from a file
A = LOAD 'foo' USING PigStorage()
    AS (key:chararray, name:chararray, value:chararray);
-- reshape each record into (key, {(name, value)}) with the UDF below
B = FOREACH A GENERATE
    FLATTEN(com.to.udf.ToCassandraUDF(
        TOTUPLE('PotentialKeyManipulation/', $0, '/toSomethingElse'),
        TOTUPLE($1),
        $2));
-- write one Cassandra row per tuple to the given keyspace/column family
STORE B INTO 'cassandra://cassandraNamespace/myColumnFamilyName'
    USING org.apache.cassandra.hadoop.pig.CassandraStorage();
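
For reference, CassandraStorage wants one tuple per Cassandra row: the row key
first, then a bag of (name, value) columns. Roughly, B ends up looking like
this (the example row is made up):

-- shape of B after the FLATTEN (example values are illustrative only)
-- B: {key: chararray, columns: {(name: chararray, value: bytearray)}}
-- e.g. ('PotentialKeyManipulation/foo1/toSomethingElse', {(name1, value1)})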

// sample ToCassandraUDF.java
package com.to.udf;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class ToCassandraUDF extends EvalFunc<Tuple> {

    @Override
    public Tuple exec(Tuple input) throws IOException {
        Tuple row = TupleFactory.getInstance().newTuple(2);

        // Concatenate all key parts (first argument tuple) into the row key.
        Tuple keyParts = (Tuple) input.get(0);
        StringBuilder buf = new StringBuilder();
        for (Object o : keyParts.getAll()) {
            buf.append(o != null ? o.toString() : "null");
        }
        String key = buf.toString();

        // Concatenate all column-name parts (second argument tuple).
        Tuple columnParts = (Tuple) input.get(1);
        buf = new StringBuilder();
        for (Object o : columnParts.getAll()) {
            buf.append(o != null ? o.toString() : "null");
        }
        String columnName = buf.toString();

        // Optional third argument is the column value; default to empty bytes.
        byte[] columnValueBytes = new byte[0];
        if (input.size() > 2 && input.get(2) != null) {
            columnValueBytes = input.get(2).toString().getBytes();
        }

        // Build the (name, value) column tuple.
        Tuple column = TupleFactory.getInstance().newTuple(2);
        column.set(0, new DataByteArray(columnName.getBytes()));
        column.set(1, new DataByteArray(columnValueBytes));

        // CassandraStorage expects the columns wrapped in a bag.
        DataBag bagOfColumns = BagFactory.getInstance().newDefaultBag();
        bagOfColumns.add(column);

        row.set(0, key);
        row.set(1, bagOfColumns);

        return row;
    }

    @Override
    public Schema outputSchema(Schema input) {
        try {
            Schema.FieldSchema keyField = new Schema.FieldSchema("key",
                    DataType.CHARARRAY);
            Schema.FieldSchema nameField = new Schema.FieldSchema("name",
                    DataType.CHARARRAY);
            Schema.FieldSchema valueField = new Schema.FieldSchema("value",
                    DataType.BYTEARRAY);

            // Inner bag of (name, value) columns.
            Schema bagSchema = new Schema();
            bagSchema.add(nameField);
            bagSchema.add(valueField);

            Schema.FieldSchema columnsField = new Schema.FieldSchema("columns",
                    bagSchema, DataType.BAG);

            // Output tuple: (key, {columns}).
            Schema innerSchema = new Schema();
            innerSchema.add(keyField);
            innerSchema.add(columnsField);

            Schema.FieldSchema cassandraTuple = new Schema.FieldSchema(
                    "cassandra_tuple", innerSchema, DataType.TUPLE);

            Schema schema = new Schema(cassandraTuple);
            schema.setTwoLevelAccessRequired(true);
            return schema;
        } catch (Exception e) {
            return null;
        }
    }
}
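
To wire this into the script above, you would compile the UDF into a jar and
REGISTER it at the top of the Pig script (the jar name here is only a
placeholder):

-- register the compiled UDF jar before using com.to.udf.ToCassandraUDF
REGISTER 'to-cassandra-udf.jar';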

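For the loading direction, CassandraStorage works with LOAD as well; a minimal
sketch, reusing the same keyspace and column family names as placeholders:

-- minimal sketch: loading rows back out of Cassandra into Pig
rows = LOAD 'cassandra://cassandraNamespace/myColumnFamilyName'
    USING org.apache.cassandra.hadoop.pig.CassandraStorage()
    AS (key, columns: bag {T: tuple(name, value)});
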
Hope this helps.

-Dan

On Mon, Mar 18, 2013 at 11:15 AM, Mohammed Abdelkhalek <[EMAIL PROTECTED]> wrote:

> Hi,
> I'm using Hadoop 1.0.4, Cassandra 1.2.2, and Pig 0.11.0.
> Can anyone help me with an example of how to use Pig, either for storing
> to Cassandra using CassandraStorage, or for loading rows from Cassandra in
> order to use them with Pig?
> Thanks.
>
> --
> Mohammed ABDELKHALEK
> State Engineer, École Mohammadia d'Ingénieurs
> Information technologies and services
> Phone: +212 6 45 64 65 68
>
> Please consider the environment before printing this e-mail.
>