Pig >> mail # user >> FOREACH nested block aliases and output schema field names


FOREACH nested block aliases and output schema field names
Hey all,

I have the following FOREACH with nested block:

```
node_in = FOREACH (GROUP edge BY destination_id) {
  in_degree = COUNT(edge);
  in_edges = edge.(source_id, weight);
  in_edges_sorted = ORDER in_edges BY weight DESC;
  in_indices = SomeUDF(in_edges_sorted.source_id);
  in_weights = AnotherUDF(in_edges_sorted.weight);
  GENERATE group AS id, in_degree, in_indices, in_weights;
}
```

It seems as though Pig doesn't reuse the aliases within the block
(e.g. "in_degree") when reporting the resulting schema for node_in.
This means fields $1.. have no names and must be referenced by
position in subsequent statements (d'oh).

One simple, though verbose, workaround is to name every field explicitly with AS:

```
node_in = FOREACH (GROUP edge BY destination_id) {
  ...
  GENERATE group AS id, in_degree AS in_degree,
           in_indices AS in_indices, in_weights AS in_weights;
}
```
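
To check which field names actually survive, DESCRIBE can be run on the result. This is only a minimal sketch: the input file `edges.tsv` and the field types are assumptions for illustration, and the two UDF fields are left out so that only the built-in COUNT is needed.

```
-- Hypothetical input: one edge per line (path and types are assumed).
edge = LOAD 'edges.tsv' AS (source_id:long, destination_id:long, weight:double);

node_in = FOREACH (GROUP edge BY destination_id) {
  in_degree = COUNT(edge);
  -- Explicit AS, as in the workaround above.
  GENERATE group AS id, in_degree AS in_degree;
}

-- Prints the schema Pig reports for node_in, including any field names.
DESCRIBE node_in;
```

Dropping the `AS in_degree` and running DESCRIBE again gives a direct comparison of the two reported schemas.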

So, what would it take to get Pig to reuse the nested block aliases as
field names in the output schema?

Cheers,
Andy