MapReduce >> mail # user >> Re: Number of records in an HDFS file
Re: Number of records in an HDFS file
If it is just a matter of counting the number of records in a file, how about
a short three-liner:
LOGS = LOAD 'log';
LOGS_GROUP = GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);

It did the trick for me.
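
The follow-up requirement quoted below asks for the count to be written to a text file every day without manual steps. A minimal sketch of one way to do that, assuming the `pig` launcher is on the PATH: the three statements above get a STORE appended and are driven by a small Python wrapper. The parameter names `input_path` and `output_dir`, the helper names, and the dated directory layout are all hypothetical choices for illustration, not something from the thread.

```python
import datetime
import subprocess
import tempfile

# The three statements from the message plus a STORE.
# $input_path and $output_dir are placeholders filled in by Pig's own
# parameter substitution (-param name=value); the names are assumptions.
PIG_SCRIPT = """\
LOGS = LOAD '$input_path';
LOGS_GROUP = GROUP LOGS ALL;
LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);
STORE LOG_COUNT INTO '$output_dir';
"""


def dated_output_dir(output_root, day):
    """Return a per-day output directory such as /counts/20130513.

    STORE refuses to write into a location that already exists, so each
    daily run gets its own directory.
    """
    return f"{output_root}/{day:%Y%m%d}"


def build_pig_command(script_path, input_path, output_dir):
    """Build the argument list for running the script with parameters."""
    return [
        "pig",
        "-param", f"input_path={input_path}",
        "-param", f"output_dir={output_dir}",
        "-f", script_path,
    ]


def run_daily_count(input_path, output_root, day=None):
    """Write the script to a temporary file and run it once with Pig."""
    day = day or datetime.date.today()
    output_dir = dated_output_dir(output_root, day)
    with tempfile.NamedTemporaryFile("w", suffix=".pig", delete=False) as fh:
        fh.write(PIG_SCRIPT)
    # check=True raises CalledProcessError if Pig exits with an error.
    subprocess.run(build_pig_command(fh.name, input_path, output_dir), check=True)
    return output_dir
```

Calling `run_daily_count("/data/log", "/counts")` from a scheduler such as cron would leave the count as a single line in a part file under the dated directory. Note that COUNT skips records whose first field is null; COUNT_STAR counts every record if that matters for the data.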

Warm Regards,
Tariq
cloudfront.blogspot.com
On Mon, May 13, 2013 at 11:57 PM, Shahab Yunus <[EMAIL PROTECTED]> wrote:

> Not terribly efficient, but off the top of my head: GROUP ALL and then do a
> COUNT (or COUNT(*)). You can implement a follow-up script or add this to
> the existing script once the file has been generated.
>
> Regards,
> Shahab
>
>
> On Mon, May 13, 2013 at 2:16 PM, Mix Nin <[EMAIL PROTECTED]> wrote:
>
>> OK, let me restate my requirement. I should have specified this at the
>> beginning.
>>
>> I need to get the count of records in an HDFS file created by a Pig script
>> and then store the count in a text file. This should be done automatically
>> on a daily basis, without manual intervention.
>>
>>
>> On Mon, May 13, 2013 at 11:13 AM, Rahul Bhattacharjee <
>> [EMAIL PROTECTED]> wrote:
>>
>>> How about the second approach: get the application/job ID that Pig
>>> creates and submits to the cluster, and then look up the output counter
>>> for that job in the JobTracker (JT).
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>> On Mon, May 13, 2013 at 11:37 PM, Mix Nin <[EMAIL PROTECTED]> wrote:
>>>
>>>> It is a text file.
>>>>
>>>> If we want to use wc, we need to copy the file from HDFS and then run
>>>> wc, and this may take time. Is there a way to do it without copying the
>>>> file from HDFS to a local directory?
>>>>
>>>> Thanks
>>>>
>>>>
>>>> On Mon, May 13, 2013 at 11:04 AM, Rahul Bhattacharjee <
>>>> [EMAIL PROTECTED]> wrote:
>>>>
>>>>> A few pointers.
>>>>>
>>>>> What kind of files are we talking about? For text you can use wc; for
>>>>> Avro data files you can use avro-tools.
>>>>>
>>>>> Or find the job that Pig generates and get the counters for that job
>>>>> from the JT of your Hadoop cluster.
>>>>>
>>>>> Thanks,
>>>>>  Rahul
>>>>>
>>>>>
>>>>> On Mon, May 13, 2013 at 11:21 PM, Mix Nin <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> What is the best way to get the count of records in an HDFS file
>>>>>> generated by a Pig script?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
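
On the question raised in the thread about using wc without first copying the file to a local directory: for plain text, one option is to stream the file through `hadoop fs -cat` and count newlines on the fly, which is the equivalent of piping into `wc -l`. The sketch below assumes the `hadoop` command is on the PATH; the function names are hypothetical. The data still travels to the client over the network, but nothing is written to local disk.

```python
import subprocess


def count_newlines(stream, chunk_size=1 << 20):
    """Count newline bytes in a binary stream, reading it in chunks.

    Like `wc -l`, a final line with no trailing newline is not counted.
    """
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += chunk.count(b"\n")
    return total


def count_hdfs_records(hdfs_path):
    """Stream an HDFS text file (or glob) through `hadoop fs -cat` and count lines."""
    proc = subprocess.Popen(
        ["hadoop", "fs", "-cat", hdfs_path],
        stdout=subprocess.PIPE,
    )
    try:
        total = count_newlines(proc.stdout)
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError(f"hadoop fs -cat exited with status {returncode}")
    return total
```

Because a Pig STORE produces a directory of part files rather than a single file, the path passed in would typically be a glob over those part files. This approach reads everything through one client process, so for very large outputs the GROUP ALL and COUNT script or the job counters mentioned above are likely to scale better.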