Leo 2013-08-16, 02:17
Probably these can help:
http://pig.apache.org/docs/r0.11.1/func.html#pigstorage (look for
I've never tried this, but probably you could group by tagsource and then
On Fri, Aug 16, 2013 at 6:17 AM, Leo <[EMAIL PROTECTED]> wrote:
> Hi, I want to add a row/line number to the data I read from multiple CSVs.
> However, I want the running number to reflect the line number *per input file*,
> not overall.
> I am happy to write a Python UDF for this. So far I have in the UDF:
> --- Python file udf.py ---
> lineNum = 0
> @outputSchema("lnum:int, f1:chararray")
> def makeData(line):
>     global lineNum
>     lineNum += 1
>     return lineNum, line.tostring()
> which is called from Pig:
> --- Pig file use-udf.pig ---
> register 'udf.py' using jython as udfs;
> data = load 'datadir' using TextLoader() as line;
> udfified = foreach data generate udfs.makeData(line);
> dump udfified;
> This approach works, *but* the running number increases over multiple
> files in the directory "datadir". That is *not* what I want! I need the row
> number starting with 1 for each file in datadir. Maybe I can reset the
> lineNum variable per input file?
> Any idea how to achieve this? Either with plain Pig or with Python UDFs?
> Many thanks, Leo
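A rough, untested sketch of the "group by source file, then number within the group" idea mentioned above. It assumes each record already carries its source file name (for example via the tagsource option on the linked PigStorage page), that the relation has been grouped by that field, and that the UDF receives the bag of line tuples for one file. The name numberLines and the schema string are made up for illustration. Pig does not promise that tuples inside a grouped bag keep their file order, so the numbers may not match the physical line numbers unless the order is enforced some other way.

```python
# Sketch only: restart the numbering for every group (one group per input file).
try:
    outputSchema  # available when Pig registers the script with jython
except NameError:
    def outputSchema(schema):
        # No-op stand-in so the module can also be run outside Pig.
        def decorator(func):
            return func
        return decorator


@outputSchema("numbered:{t:(lnum:int, line:chararray)}")
def numberLines(lines):
    # 'lines' is assumed to be the bag for ONE source file, seen from Jython
    # as a list of 1-field tuples. Because the counter is local to the call,
    # it starts at 1 again for each group; no global state is involved.
    return [(i + 1, t[0]) for i, t in enumerate(lines)]
```

Unlike the global counter in the quoted UDF, nothing here survives between calls, so the numbering cannot run on from one file into the next. The cost is that all lines of a single file have to fit into one bag.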
Pradeep Gollakota 2013-08-21, 15:32