MapReduce, mail # user - Re: Number of records in an HDFS file


Re: Number of records in an HDFS file
Mohammad Tariq 2013-05-13, 19:07
Agree with Shahab.

Warm Regards,
Tariq
cloudfront.blogspot.com
On Tue, May 14, 2013 at 12:32 AM, Shahab Yunus <[EMAIL PROTECTED]> wrote:

> The count file will be a very small file, right? Once it is generated on
> HDFS, you can automate downloading it or moving it anywhere you want. This
> should not take much time.
>
> Regards,
> Shahab
>
>
> On Mon, May 13, 2013 at 2:58 PM, Mix Nin <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> The final count file should reside in a local directory, not in an HDFS
>> directory. The above scripts will store the text file in an HDFS directory.
>> The count file needs to be sent to another team that does not work with
>> HDFS.
>>
>> Thanks
>>
>>
>>
>> On Mon, May 13, 2013 at 11:36 AM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>
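>>> If it is just counting the number of records in a file, then how about
>>> a short three-liner:
>>> -- load the file, put every record into a single group, count the group
>>> -- (note: COUNT skips tuples whose first field is null)
>>> LOGS = LOAD 'log';
>>> LOGS_GROUP = GROUP LOGS ALL;
>>> LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);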
>>>
>>> It did the trick for me.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Mon, May 13, 2013 at 11:57 PM, Shahab Yunus <[EMAIL PROTECTED]> wrote:
>>>
>>>> Not terribly efficient, but off the top of my head: GROUP ALL and then do
>>>> a COUNT (or COUNT(*)). You can implement a follow-up script or add this to
>>>> the existing script once the file has been generated.
>>>>
>>>> Regards,
>>>> Shahab
>>>>
>>>>
>>>> On Mon, May 13, 2013 at 2:16 PM, Mix Nin <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> OK, let me restate my requirement. I should have specified this at the
>>>>> beginning.
>>>>>
>>>>> I need to get the count of records in an HDFS file created by a Pig script
>>>>> and then store the count in a text file. This should be done automatically
>>>>> on a daily basis, without manual intervention.
>>>>>
>>>>>
>>>>> On Mon, May 13, 2013 at 11:13 AM, Rahul Bhattacharjee <
>>>>> [EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> How about the second approach: get the application/job ID of the job that
>>>>>> Pig creates and submits to the cluster, and then look up the output counter
>>>>>> for that job in the JobTracker (JT)?
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>> On Mon, May 13, 2013 at 11:37 PM, Mix Nin <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> It is a text file.
>>>>>>>
>>>>>>> If we want to use wc, we need to copy the file from HDFS and then run
>>>>>>> wc on it, and this may take time. Is there a way to do it without copying
>>>>>>> the file from HDFS to a local directory?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>> On Mon, May 13, 2013 at 11:04 AM, Rahul Bhattacharjee <
>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> A few pointers.
>>>>>>>>
>>>>>>>> What kind of files are we talking about? For text you can use wc;
>>>>>>>> for Avro data files you can use avro-tools.
>>>>>>>>
>>>>>>>> Or find the job that Pig generates and get the counters for that
>>>>>>>> job from the JT of your Hadoop cluster.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>  Rahul
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, May 13, 2013 at 11:21 PM, Mix Nin <[EMAIL PROTECTED]> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> What is the best way to get the count of records in an HDFS file
>>>>>>>>> generated by a Pig script?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
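
On the question of using wc without first copying the file to a local directory: one option is to stream the file out of HDFS and count newlines as the bytes arrive, so that only the small count file ends up on local disk. The following is a minimal sketch of that idea, not anything posted in the thread. It assumes a text file with one record per line, that the `hadoop` command is on the PATH, and that `hadoop fs -cat` is the right way to read the file on your cluster. The function names and the `cat_cmd` parameter are hypothetical, introduced only for illustration.

```python
import subprocess


def count_records(stream, chunk_size=1 << 20):
    """Count newline characters in a binary stream, reading it in chunks.

    Like `wc -l`, a final line with no trailing newline is not counted.
    """
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes object means end of stream
            break
        total += chunk.count(b"\n")
    return total


def write_hdfs_record_count(hdfs_path, local_count_file,
                            cat_cmd=("hadoop", "fs", "-cat")):
    """Stream hdfs_path through cat_cmd and write its line count to a local file.

    cat_cmd is an assumption: any command that writes the file contents to
    stdout when given the path as its last argument will work.
    """
    proc = subprocess.Popen([*cat_cmd, hdfs_path], stdout=subprocess.PIPE)
    try:
        count = count_records(proc.stdout)
    finally:
        proc.stdout.close()
    if proc.wait() != 0:
        raise RuntimeError(f"{cat_cmd[0]} exited with status {proc.returncode}")
    # Only the count itself is written locally, never the data.
    with open(local_count_file, "w") as out:
        out.write(f"{count}\n")
    return count
```

For the daily requirement, a call such as `write_hdfs_record_count("/path/to/pig/output/part-*", "/local/dir/count.txt")` could be scheduled from cron once the Pig job has finished. Both paths are placeholders. The data still travels over the network once, but it is never stored on the local machine.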