-
Displaying source log file names in pig logs
Guy Bayes 2010-10-21, 16:57
We have a job that processes several hundred files in a directory.
We generally glob the directory in a single load statement.
Sometimes the job chokes on a bad row in a single file.
I could have sworn that Pig printed the file name of the chunks it is processing in the task log, but I cannot see it.
Does anyone know under what conditions file names are printed, or how to find the file that is causing the issue?
Thanks,
Guy
-
Re: Displaying source log file names in pig logs
Romain Rigaux 2010-10-25, 16:02
Hi,
I don't think that file names are directly available, but I do something like this to get them (I have not tried it with Pig 0.7+ yet):
Create a new loader inheriting from PigStorage and get the "location" path of the data. Then either:
- print it, if everything happens in the same task
- append it to each record
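The second option, carrying the source path along with each record, can be illustrated outside Pig. The sketch below is plain Python and not Pig's loader API: the names `tagged_records` and `find_bad_rows`, the tab delimiter, and the definition of a "bad row" as one with the wrong number of fields are all assumptions made for illustration.

```python
import glob
import os
import tempfile


def tagged_records(pattern, delimiter="\t"):
    """Yield (path, line_number, fields) for every row of every file matching pattern."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                # Each record carries the path of the file it came from.
                yield path, line_number, line.rstrip("\n").split(delimiter)


def find_bad_rows(pattern, expected_fields, delimiter="\t"):
    """Return (path, line_number) for each row whose field count is not expected_fields."""
    return [
        (path, line_number)
        for path, line_number, fields in tagged_records(pattern, delimiter)
        if len(fields) != expected_fields
    ]


if __name__ == "__main__":
    # Demo on two throwaway files (hypothetical names), one containing a malformed row.
    with tempfile.TemporaryDirectory() as directory:
        with open(os.path.join(directory, "part-00000"), "w", encoding="utf-8") as f:
            f.write("a\tb\nc\td\n")
        with open(os.path.join(directory, "part-00001"), "w", encoding="utf-8") as f:
            f.write("e\tf\nbroken\n")
        for path, line_number in find_bad_rows(os.path.join(directory, "part-*"), 2):
            print(f"bad row: {path} line {line_number}")
```

Because the path travels with every record, a failure can be reported with the file name attached, which is the same effect a custom loader would give inside the job.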
Hope this helps,
Romain
-
Re: Displaying source log file names in pig logs
Guy Bayes 2010-10-25, 16:09
I'm pretty sure they are supposed to be in the Input-split lines of the tasktracker logs, aren't they?
For some reason all the input splits are null:
Input-split file: null
Input-split start-offset: -1
Input-split length: -1
Thanks,
Guy
-- you may be acquainted with the night but i have seen the darkness in the day and you must know it is a terrifying sight...