Pig user mailing list: dev How can I add a row number per input file to the data


Leo 2013-08-16, 02:17
Re: dev How can I add a row number per input file to the data
Hi!

These will probably help:
http://pig.apache.org/docs/r0.11.1/basic.html#rank
http://pig.apache.org/docs/r0.11.1/func.html#pigstorage (look for the
-tagsource option)

I've never tried this, but you could probably group by the tagsource field and
then apply RANK.

Ruslan
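
A rough sketch of the group-then-number idea, written as a Jython UDF instead
of RANK, since it is untested whether RANK can be applied per group. It assumes
the rows were loaded with the -tagsource option and then grouped by the
file-name field, so the UDF is called once per file with that file's bag. The
name enumerate_bag and the schema field names are made up for illustration.
One caveat: a bag produced by GROUP has no guaranteed order, so the numbers are
unique within each file but may not follow the original line order.

```python
# Pig's Jython runtime supplies the outputSchema decorator. This fallback
# exists only so the file can also be run with plain Python for testing.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("numbered:{t:(lnum:int, line:chararray)}")
def enumerate_bag(rows):
    # Assumption: 'rows' is the bag for ONE file, passed in as a list of
    # tuples whose last field holds the line text.
    # enumerate() starts at 0, so add 1 to number the rows from 1.
    return [(i + 1, row[-1]) for i, row in enumerate(rows)]
```

Called on the bag of each group, this would give every file its own numbering
starting at 1.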
On Fri, Aug 16, 2013 at 6:17 AM, Leo <[EMAIL PROTECTED]> wrote:

> Hi, I want to add a row/line number to the data I read from multiple CSVs.
> However, I want the running number to reflect the line number *per input file*,
> not overall.
>
> I am happy to write a Python UDF for this. So far I have in the UDF:
>
>     --- Python file udf.py ---
>     lineNum = 0
>
>     @outputSchema("lnum:int, f1:chararray")
>     def makeData(line):
>         global lineNum
>         lineNum += 1
>         return lineNum, line.tostring()
>
> which is called from Pig:
>
>     --- Pig file use-udf.pig ---
>     register 'udf.py' using jython as udfs;
>
>     data = load 'datadir' using TextLoader() as line;
>     udfified = foreach data generate udfs.makeData(line);
>
>     dump udfified;
>
> This approach works, *but* the running number keeps increasing across the
> files in the directory "datadir". That is *not* what I want! I need the row
> number to start at 1 for each file in datadir. Maybe I can reset the
> lineNum variable per input file?
>
> Any idea how to achieve this? Either with plain Pig or with Python UDFs?
>
> Many thanks, Leo
>
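
One way to reset the counter per input file, as asked above, is to pass the
file name into the UDF and restart the count whenever it changes. This is only
a sketch under assumptions: the file name is taken to arrive as an extra field
(for example from the -tagsource option mentioned in the reply), the name
number_per_file is hypothetical, and each file is assumed to be read start to
finish by a single map task. A file that is split across several map tasks
would restart at 1 in each split.

```python
# Pig's Jython runtime supplies the outputSchema decorator. This fallback
# exists only so the file can also be run with plain Python for testing.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


_current_file = None  # file name seen on the previous call
_line_num = 0         # running line number within that file


@outputSchema("lnum:int, f1:chararray")
def number_per_file(source, line):
    global _current_file, _line_num
    # A different file name means a new input file: restart the count.
    if source != _current_file:
        _current_file = source
        _line_num = 0
    _line_num += 1
    return _line_num, line
```

The counter lives in module-level state, exactly like lineNum in the original
UDF; the only change is the reset when the file name differs from the previous
call.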
Pradeep Gollakota 2013-08-21, 15:32