MapReduce >> mail # user >> Re: Number of records in an HDFS file


Re: Number of records in an HDFS file
Agree with Shahab.

Warm Regards,
Tariq
cloudfront.blogspot.com
On Tue, May 14, 2013 at 12:32 AM, Shahab Yunus <[EMAIL PROTECTED]> wrote:

> The count file will be a very small file, right? Once it is generated on
> HDFS, you can automate its downloading or movement anywhere you want. This
> should not take much time.
>
> Regards,
> Shahab
>
>
> On Mon, May 13, 2013 at 2:58 PM, Mix Nin <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> The final count file should reside in a local directory, not in HDFS.
>> The above scripts store the text file in an HDFS directory. The count
>> file needs to be sent to another team who do not work on HDFS.
>>
>> Thanks
>>
>>
>>
>> On Mon, May 13, 2013 at 11:36 AM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>
>>> If it is just counting the number of records in a file, then how about
>>> a short script:
>>> LOGS = LOAD 'log';
>>> LOGS_GROUP = GROUP LOGS ALL;
>>> LOG_COUNT = FOREACH LOGS_GROUP GENERATE COUNT(LOGS);
>>> -- Pig runs nothing until an output operator such as STORE or DUMP:
>>> STORE LOG_COUNT INTO 'log_count';
>>>
>>> It did the trick for me.
>>>
>>> Warm Regards,
>>> Tariq
>>> cloudfront.blogspot.com
>>>
>>>
>>> On Mon, May 13, 2013 at 11:57 PM, Shahab Yunus <[EMAIL PROTECTED]> wrote:
>>>
>>>> Not terribly efficient, but off the top of my head: GROUP ALL and then
>>>> do a COUNT (or COUNT_STAR, which also counts tuples whose first field is
>>>> null). You can implement a follow-up script, or add this to the existing
>>>> script once the file has been generated.
>>>>
>>>> Regards,
>>>> Shahab
>>>>
>>>>
>>>> On Mon, May 13, 2013 at 2:16 PM, Mix Nin <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> OK, let me modify my requirement. I should have specified this in the
>>>>> beginning.
>>>>>
>>>>> I need to get the count of records in an HDFS file created by a Pig
>>>>> script and then store the count in a text file. This should happen
>>>>> automatically on a daily basis, without manual intervention.
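The requirement above (daily, automated, count written to a local text file) can be sketched as a small shell helper. The helper name save_count and every path shown are assumptions for illustration, not details from this thread:

```shell
#!/bin/sh
# save_count appends "<date> <count>" to a local text file. The producer
# command given after the output path is expected to print the count; for
# HDFS that would be "hadoop fs -cat <path-to-the-count-part-file>".
save_count() {
  out="$1"; shift
  printf '%s %s\n' "$(date +%Y-%m-%d)" "$("$@")" >> "$out"
}

# Hypothetical daily run, after the Pig script has stored its count on HDFS:
#   save_count /home/etl/counts.txt hadoop fs -cat /user/etl/log_count/part-r-00000
```

Scheduling the wrapper from a daily cron entry would remove the manual step.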
>>>>>
>>>>>
>>>>> On Mon, May 13, 2013 at 11:13 AM, Rahul Bhattacharjee <
>>>>> [EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> How about the second approach: get the id of the job that Pig creates
>>>>>> and submits to the cluster, and then read that job's output record
>>>>>> counter from the JobTracker.
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul
>>>>>>
>>>>>>
>>>>>> On Mon, May 13, 2013 at 11:37 PM, Mix Nin <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> It is a text file.
>>>>>>>
>>>>>>> If we want to use wc, we need to copy the file from HDFS first, and
>>>>>>> that may take time. Is there a way to do it without copying the file
>>>>>>> from HDFS to a local directory?
>>>>>>>
>>>>>>> Thanks
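One answer to the question above is to stream the file instead of copying it: `hadoop fs -cat` writes the file's contents to stdout, so the bytes never have to land on local disk as a file. A minimal sketch, in which the helper name count_lines and the HDFS path in the comment are hypothetical:

```shell
# count_lines counts the lines emitted by whatever producer command it is
# given; no local copy of the data is written anywhere.
count_lines() {
  "$@" | wc -l | tr -d '[:space:]'
}

# With HDFS the producer is "hadoop fs -cat" (the path is an assumption):
#   count_lines hadoop fs -cat /user/etl/logs/part-r-00000
```

This only gives a record count for line-oriented text data; for other formats (e.g. Avro), a line count is not a record count.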
>>>>>>>
>>>>>>>
>>>>>>> On Mon, May 13, 2013 at 11:04 AM, Rahul Bhattacharjee <
>>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> A few pointers:
>>>>>>>>
>>>>>>>> What kind of files are we talking about? For text you can use wc;
>>>>>>>> for Avro data files you can use avro-tools.
>>>>>>>>
>>>>>>>> Or get the job that Pig generates, and read the counters for that
>>>>>>>> job from the JobTracker of your Hadoop cluster.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Rahul
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, May 13, 2013 at 11:21 PM, Mix Nin <[EMAIL PROTECTED]> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> What is the best way to get the count of records in an HDFS file
>>>>>>>>> generated by a Pig script?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>