Pig, mail # user - Selecting fields from records with varying spaces?

RE: Selecting fields from records with varying spaces?
Santhosh Srinivasan 2009-06-11, 18:09
Hi Marco,

1. I opened a JIRA that addresses the request for multi-byte delimiters
in PigStorage (https://issues.apache.org/jira/browse/PIG-842). Other
users have made a similar request.

2. TOKENIZE produces a bag of tuples; each tuple contains a single
string. Bags contain unordered tuples, and the language does not support
accessing the tuples in a bag by position. As a result, you will not be
able to do things like:

frm2date = foreach frm2words generate $0.$3, $0.$4, $0.$6;

In order to access the elements of the tuple, you should flatten the
bag. However, this does not suit your use case.
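
To make the shape of the data concrete, here is a rough Python model (not Pig) of what TOKENIZE returns for one line. The sample address is made up, and a Python list is ordered whereas a Pig bag is not, so treat this only as an illustration of why each tuple has a single column and what flattening would give you.

```python
# Model of one TOKENIZE result: a bag holding single-field tuples.
# Caveat: a Python list is ordered; a Pig bag makes no such promise.
# The address below is a hypothetical stand-in for the real sender.
bag = [
    ("From",), ("someone@example.org",), ("Thu",), ("Apr",),
    ("23",), ("10:28:54",), ("2009",),
]

# Every tuple in the bag has exactly one column, so asking for
# "column 3" of a tuple is out of bounds.
arities = {len(t) for t in bag}

# Flattening the bag yields one row per token instead of one row per line,
# which is why it does not help when you want several tokens side by side.
rows = [t[0] for t in bag]

print(arities)
print(len(rows))
```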

3. I am glad to see that streaming helped you solve the problem.

Was it the performance of streaming that left you unsatisfied, or was
it the fact that you had to use streaming and step outside the
language?

We would like to hear your feedback.

Thanks,
Santhosh

-----Original Message-----
From: Marco Nicosia [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 11, 2009 2:09 AM
To: [EMAIL PROTECTED]
Subject: Selecting fields from records with varying spaces?

I believe that my question/problem stems primarily from my inability
to access fields within a bag_of_tokenTuples. Here's an example:

> grunt> cc = load 'cloud-computing' using TextLoader() as line:chararray;
> grunt> frm = filter cc by ($0 matches '^From .*');
> grunt> frm2 = limit frm 2;
> grunt> frm2words = foreach frm2 generate TOKENIZE(line);
> grunt> dump frm2words
> ({(From),(grbounce-npTeJAUAAACwIMcQBPj4db4Q5Z5lpOBJ=marco=escape.org@googlegroups.com),(Thu),(Apr),(23),(10:28:54),(2009)})
> ({(From),(grbounce-npTeJAUAAACwIMcQBPj4db4Q5Z5lpOBJ=marco=escape.org@googlegroups.com),(Thu),(Apr),(23),(10:29:54),(2009)})
> grunt> frm2date = foreach frm2words generate $0.$3, $0.$4, $0.$6;
> 2009-06-11 08:46:55,977 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Out of bound access. Trying to access non-existent column: 3. Schema {tuple_of_tokens: (token: chararray)} has 1 column(s).
> Details at logfile: /home/marco/pig_1244709828265.log

I don't know quite why there's only one column. I guess that column
could be the bag of tokens, but I've tried dereferencing into that,
and gotten nowhere. I must be missing something fundamental?

Eventually I gave up fussing with bags of tokens which contain
tuples, and turned to PigStorage (way less efficient to split all
records before filtering!), which yielded a totally different problem.

Is it possible to get PigStorage to use anything other than a single
character as a field separator? Using PigStorage(' '), the two
strings "Jan 23" and "Jan  9" are interpreted as two and three
fields, respectively.
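
The effect is easy to reproduce outside Pig. The sketch below is Python, not Pig, and only assumes that a single-character delimiter behaves like splitting on exactly one space; it contrasts that with splitting on runs of whitespace, which is what awk does by default.

```python
# The two sample strings from the message; the second has two spaces
# before the single-digit day.
aligned = "Jan 23"
padded = "Jan  9"

# Splitting on exactly one space (analogous to a single-character
# delimiter) produces an empty field between consecutive spaces.
single_char = [s.split(" ") for s in (aligned, padded)]

# Splitting on runs of whitespace collapses the extra space.
whitespace_runs = [s.split() for s in (aligned, padded)]

print(single_char)      # [['Jan', '23'], ['Jan', '', '9']]
print(whitespace_runs)  # [['Jan', '23'], ['Jan', '9']]
```

The empty string in the middle of the second result is the extra field that shows up in the arity counts.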

Here's proof:
> grunt> cc = load 'cloud-computing' using PigStorage(' ');
> grunt> frm = filter cc by ($0 == 'From');
> grunt> flds = group frm by (ARITY(*));
> grunt> frmarity = foreach flds generate $0, COUNT($1);
> grunt> dump frmarity
> (8,531L)
> (9,314L)

Each line really has the same number of fields; it's just that some
have an extra space, which trips PigStorage up.

If you've read this far, you might as well see how I finally "solved"
this and, unsatisfied, decided to write this e-mail:

> grunt> cc = load 'cloud-computing' using TextLoader() as line:chararray;
> grunt> frm = filter cc by ($0 matches '^From .*');
> grunt> frm2 = limit frm 2;
> grunt> frmdates = stream frm2 through `awk '{print $4,$5,$7}'`;
> grunt> dump frmdates
> (Apr 23 2009)
> (Apr 23 2009)

Terrible!
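
For reference, what the awk one-liner does can be sketched in a few lines of Python (again, not Pig). The function name and the sample lines are hypothetical; the only assumption is awk's default behaviour of splitting on runs of whitespace and numbering fields from 1.

```python
def select_date_fields(line):
    """Return fields 4, 5 and 7 (1-based, as in awk) of a whitespace-split line."""
    fields = line.split()  # runs of spaces count as one separator
    return fields[3], fields[4], fields[6]


# Hypothetical "From " lines; the second has extra padding spaces.
sample = "From someone@example.org Thu Apr 23 10:28:54 2009"
padded = "From someone@example.org  Thu Apr  9 10:28:54 2009"

print(select_date_fields(sample))  # ('Apr', '23', '2009')
print(select_date_fields(padded))  # ('Apr', '9', '2009')
```

Both lines yield the same three positions regardless of padding, which is the property the single-character PigStorage delimiter lacks.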

_______________________________________________________________________
Marco E. Nicosia  |  http://www.escape.org/~marco/  |  [EMAIL PROTECTED]