Pig >> mail # user >> dev How can I add a row number per input file to the data

Re: dev How can I add a row number per input file to the data
That's an interesting approach! However, I'm not sure whether RANK is
supported inside a nested FOREACH block. If it is, this approach would work,
but the documentation doesn't list RANK among the operators allowed in a
nested FOREACH.

http://pig.apache.org/docs/r0.11.1/basic.html#foreach
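
Since Leo asked whether the counter could be reset per input file, here is a
minimal, untested sketch of that idea as a Jython UDF. It assumes the file name
is passed in as the first argument (for example from the field that
PigStorage's -tagsource option prepends), that each file is read in order by a
single map task, and that the UDF runs in the map phase. The name
numberPerFile and the output schema are my own inventions, and the try/except
fallback is only there so the file can also be run outside Pig, where the
outputSchema decorator is not predefined.

```python
# Fallback (assumption): outside Pig the outputSchema decorator does not
# exist, so define a no-op stand-in to keep the file runnable on its own.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap

# Module-level state, kept per UDF instance (that is, per task).
_last_source = None
_line_num = 0

# Hypothetical schema: file name, per-file line number, original line.
@outputSchema("t:(source:chararray, lnum:int, f1:chararray)")
def numberPerFile(source, line):
    """Return (source, n, line), where n restarts at 1 when source changes."""
    global _last_source, _line_num
    if source != _last_source:
        # A new input file has started: restart the counter.
        _last_source = source
        _line_num = 0
    _line_num += 1
    return source, _line_num, line
```

If a large file is split across several map tasks, the numbering would restart
in every split, so this only fits files that are each read as a single split.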
On Wed, Aug 21, 2013 at 11:03 AM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:

> Hi!
>
> Probably these can help:
> http://pig.apache.org/docs/r0.11.1/basic.html#rank
> http://pig.apache.org/docs/r0.11.1/func.html#pigstorage (look for
> -tagsource)
>
> I've never tried this, but probably you could group by tagsource and then
> apply RANK
>
> Ruslan
>
>
> On Fri, Aug 16, 2013 at 6:17 AM, Leo <[EMAIL PROTECTED]> wrote:
>
> > Hi, I want to add a row/line number to the data I read from multiple
> > CSVs. However, I want the running number to reflect the line number
> > *per input file*, not overall.
> >
> > I am happy to write a Python UDF for this. So far I have in the UDF:
> >
> >     --- Python file udf.py ---
> >     lineNum = 0
> >
> >     @outputSchema("lnum:int, f1:chararray")
> >     def makeData(line):
> >         global lineNum
> >         lineNum += 1
> >         return lineNum, line.tostring()
> >
> > which is called from Pig:
> >
> >     --- Pig file use-udf.pig ---
> >     register 'udf.py' using jython as udfs;
> >
> >     data = load 'datadir' using TextLoader() as line;
> >     udfified = foreach data generate udfs.makeData(line);
> >
> >     dump udfified;
> >
> > This approach works, *but* the running number keeps increasing across
> > all files in the directory "datadir". That is *not* what I want! I need
> > the row number to start at 1 for each file in datadir. Maybe I can reset
> > the lineNum variable per input file?
> >
> > Any idea how to achieve this? Either with plain Pig or with Python UDFs?
> >
> > Many thanks, Leo
> >
>