Flume >> mail # user >> sleep() in script doesn't work when called by exec Source


RE: sleep() in script doesn't work when called by exec Source
Yes, I am curious what you mean as well. When testing, I dropped a few 15 GB files into the spoolDir, and while they processed slowly, they did complete. In fact, my only issue with that test was that the last-hop HDFS sinks couldn't keep up, and I had to add a couple more to keep the upstream channels from filling up.

Thanks,
Paul
From: Brock Noland [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, August 20, 2013 7:59 AM
To: [EMAIL PROTECTED]
Subject: Re: sleep() in script doesn't work when called by exec Source

Hi,

Can you share the details of this?  It shouldn't die with large files.

On Tue, Aug 20, 2013 at 3:43 AM, Wang, Yongkun | Yongkun | BDD <[EMAIL PROTECTED]> wrote:
Thanks Brock.

I tried the spooling directory source. If the file dropped into spoolDir was too large, Flume also died. There should be some kind of blocking.
I will start a standalone script process to drop small files instead.

Best Regards,
Yongkun Wang
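
A minimal sketch of the "standalone script that drops small files" idea, written for Python 3. The function name `drop_in_chunks`, the `staging_dir` argument, and the `.partNNNNN` naming are hypothetical choices for illustration, not anything from Flume. It assumes the staging directory is on the same filesystem as spoolDir, so each chunk appears in spoolDir only after it is completely written; the spooling directory source expects files to stay unchanged once they are placed there.

```python
import os


def _publish(handle, tmp_path, spool_dir):
    """Close a finished chunk and rename it into the spool directory."""
    handle.close()
    final_path = os.path.join(spool_dir, os.path.basename(tmp_path))
    # os.rename is atomic only when both paths are on the same filesystem.
    os.rename(tmp_path, final_path)
    return final_path


def drop_in_chunks(src_path, staging_dir, spool_dir, lines_per_file=100000):
    """Split src_path into files of at most lines_per_file lines each.

    Every chunk is written in staging_dir first and moved into spool_dir
    only when complete. Returns the list of final paths in order.
    """
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(spool_dir, exist_ok=True)
    base = os.path.basename(src_path)
    dropped = []
    part = 0
    out = None
    tmp_path = None
    with open(src_path) as infile:
        for i, line in enumerate(infile):
            if i % lines_per_file == 0:
                # Starting a new chunk: publish the previous one first.
                if out is not None:
                    dropped.append(_publish(out, tmp_path, spool_dir))
                part += 1
                tmp_path = os.path.join(
                    staging_dir, "%s.part%05d" % (base, part))
                out = open(tmp_path, "w")
            out.write(line)
    if out is not None:
        dropped.append(_publish(out, tmp_path, spool_dir))
    return dropped
```

To pace the drops as well, a caller could sleep between chunks; that is left out here to keep the sketch short.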

On 2013/08/19, at 22:08, Brock Noland wrote:
In your case I would look at the spooling directory source.

On Sun, Aug 18, 2013 at 9:29 PM, Wang, Yongkun | Yongkun | BDD <[EMAIL PROTECTED]> wrote:
Hi,

I am testing with apache-flume-1.4.0-bin.
I wrote a naive Python script for the exec source that throttles by calling the sleep() function.
But sleep() doesn't seem to work when the script is called by the exec source.
Any ideas about this, or do you have a simple solution for throttling that doesn't require a custom source?

Flume config:
agent.sources = src1
agent.sources.src1.type = exec
agent.sources.src1.command = read-file-throttle.py

read-file-throttle.py:
#!/usr/bin/python

import time

count=0
pre_time=time.time()
with open("apache.log") as infile:
    for line in infile:
        line = line.strip()
        print line
        count += 1
        if count % 50000 == 0:
            now_time = time.time()
            diff = now_time - pre_time
            if diff < 10:
                #print "sleeping %s seconds ..." % (diff)
                time.sleep(diff)
                pre_time = now_time
Thank you very much.

Best Regards,
Yongkun Wang
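
One possible reading of the script above, offered as a guess rather than a confirmed diagnosis: `time.sleep(diff)` sleeps for the time already elapsed instead of the remainder of the 10-second window, so a batch that prints in a fraction of a second also sleeps for only a fraction of a second. In addition, `pre_time` is reset to a timestamp taken before the sleep, and Python block-buffers stdout when it is a pipe, so lines can reach the reader in bursts. A minimal Python 3 sketch of a corrected throttle follows; `copy_throttled` and its parameters are hypothetical names, and the `clock` and `sleep` arguments exist only so the pacing can be checked without waiting.

```python
import sys
import time


def copy_throttled(lines, out, batch_size=50000, window=10.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Write stripped lines to out, at most batch_size lines per window seconds.

    Returns the number of lines written.
    """
    window_start = clock()
    count = 0
    for count, line in enumerate(lines, start=1):
        out.write(line.strip() + "\n")
        if count % batch_size == 0:
            # Push the batch through the pipe before pausing.
            out.flush()
            elapsed = clock() - window_start
            if elapsed < window:
                # Sleep for the rest of the window, not the elapsed time.
                sleep(window - elapsed)
            # Start the next window only after any sleep has finished.
            window_start = clock()
    out.flush()
    return count


# Example use as the exec source command (file name taken from the script above):
#     with open("apache.log") as infile:
#         copy_throttled(infile, sys.stdout)
```

With the defaults this emits at most 50000 lines per 10 seconds, which is the rate the original script appears to be aiming for.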

--
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org