Re: Number of records in an HDFS file
I am just spitballing here.

You might want to override the FileOutputCommitter's commitJob() method so
that, while committing the job, it writes the value of the job's output record
counter (I think there is a standard counter that gives the number of records
output by the job) to a file in HDFS.

Not sure if we can plug a custom FOC into a Pig workflow, though.
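
Roughly what I have in mind, as a rough, untested sketch: a FileOutputCommitter
subclass whose commitJob() asks the cluster for the job's counters and writes
the record count next to the output. The class name and the "_record_count"
file name are made up, and I have not verified that the counter totals are
already queryable while commitJob() is still running.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskCounter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

public class RecordCountingCommitter extends FileOutputCommitter {

    private final Path outputPath;

    public RecordCountingCommitter(Path outputPath, TaskAttemptContext context)
            throws IOException {
        super(outputPath, context);
        this.outputPath = outputPath;
    }

    @Override
    public void commitJob(JobContext context) throws IOException {
        super.commitJob(context);  // do the normal output commit first

        try {
            // Ask the cluster for this job and read its counters; for a
            // map-only job, MAP_OUTPUT_RECORDS would be the one to read.
            Cluster cluster = new Cluster(context.getConfiguration());
            Job job = cluster.getJob(context.getJobID());
            Counters counters = job.getCounters();
            long records =
                counters.findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS).getValue();

            // Write the count next to the job output ("_record_count" is made up).
            Path countFile = new Path(outputPath, "_record_count");
            FileSystem fs = countFile.getFileSystem(context.getConfiguration());
            FSDataOutputStream out = fs.create(countFile, true);
            out.writeBytes(records + "\n");
            out.close();
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}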

Another thing is, you can add statements in Pig (in the same Pig script that
we are talking about) to get the count of the final bag, e.g. a GROUP ... ALL
followed by COUNT, and then STORE that count in a file. Can you not?

Thanks,
Rahul
On Mon, May 13, 2013 at 11:46 PM, Mix Nin <[EMAIL PROTECTED]> wrote:

> Ok, let me restate my requirement. I should have specified it in the beginning
> itself.
>
> I need to get the count of records in an HDFS file created by a Pig script and
> then store that count in a text file. This should be done automatically on a
> daily basis, without manual intervention.
>
>
> On Mon, May 13, 2013 at 11:13 AM, Rahul Bhattacharjee <
> [EMAIL PROTECTED]> wrote:
>
>> How about the second approach: get the application/job id which Pig
>> creates and submits to the cluster, and then look up the job's output
>> record counter for that job from the JT.
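
For what it's worth, here is a minimal sketch of that lookup using the Java
client API; the class name is made up, and the job id would be the one Pig
logs when it submits the job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.mapreduce.TaskCounter;

public class JobRecordCount {
    public static void main(String[] args) throws Exception {
        // args[0] is the job id, e.g. something like "job_201305131234_0042"
        // (made-up id).
        Cluster cluster = new Cluster(new Configuration());
        Job job = cluster.getJob(JobID.forName(args[0]));
        Counters counters = job.getCounters();

        // Records written by the reducers; for a map-only job,
        // TaskCounter.MAP_OUTPUT_RECORDS would be the one to read.
        long records =
            counters.findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS).getValue();
        System.out.println(records);
    }
}

The printed count could then be redirected into a text file, or written to
HDFS the same way as in the committer sketch above.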
>>
>> Thanks,
>> Rahul
>>
>>
>> On Mon, May 13, 2013 at 11:37 PM, Mix Nin <[EMAIL PROTECTED]> wrote:
>>
>>> It is a text file.
>>>
>>> If we want to use wc, we need to copy the file from HDFS and then run wc,
>>> and this may take time. Is there a way to do it without copying the file
>>> from HDFS to a local directory?
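
One way to avoid the copy (the shell equivalent would be piping hadoop fs -cat
into wc -l) is to stream the file through the HDFS client API and count the
lines. A minimal sketch, assuming one record per line and a made-up class name:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLineCount {
    public static void main(String[] args) throws Exception {
        // args[0] is the HDFS path of the text file, one record per line.
        Configuration conf = new Configuration();  // picks up the cluster config
        Path path = new Path(args[0]);
        FileSystem fs = path.getFileSystem(conf);

        long lines = 0;
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(fs.open(path), "UTF-8"));
        try {
            while (reader.readLine() != null) {
                lines++;
            }
        } finally {
            reader.close();
        }
        System.out.println(lines);
    }
}

This still reads the whole file once, but nothing is copied to the local disk.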
>>>
>>> Thanks
>>>
>>>
>>> On Mon, May 13, 2013 at 11:04 AM, Rahul Bhattacharjee <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> A few pointers.
>>>>
>>>> What kind of files are we talking about? For text you can use wc; for
>>>> Avro data files you can use avro-tools.
>>>>
>>>> Or get the job that Pig generates and fetch the counters for that job
>>>> from the JT of your Hadoop cluster.
>>>>
>>>> Thanks,
>>>>  Rahul
>>>>
>>>>
>>>> On Mon, May 13, 2013 at 11:21 PM, Mix Nin <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> What is the best way to get the count of records in an HDFS file
>>>>> generated by a Pig script?
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>
>>>
>>
>