Hi Mahak,

To quickly answer your questions:
Scenario 1 : A feed instance runs at 17:30 for replication but a file
ending in 1730 isn't available yet. So, the instance is rescheduled for a
later time and this keeps on happening until the file is found or the late
arrival cut off time (an hour in this case) is reached.
- Assuming it's a feed with *frequency minutes(10)*, this scenario has
nothing to do with late data. When the availability flag is present, the
replication kicks off; otherwise the 17:30 replication instance stays in
the "Waiting" state. Once the availability flag is found, the instance goes
to the "Running" state, replicates the data to the target cluster, and the
17:30 instance is considered a "Success".

Scenario 2: A feed instance runs at 17:30 for replication and finds that a
file ending in 1720 is now available which wasn't available when the last
replication instance ran(at 17:20). So, now it copies both the files (the
one ending in 1730 and the one ending in 1720).
- No, it won't copy data from both instances. Since the 17:20 data is
available for the first time, the 17:20 instance simply copies 17:20's data
alone, and the feed instance for 17:30 checks for data under the 17:30
directory alone. Both are independent instances.
Late arrival works for both Feeds and Processes, and the details of the
functionality are available in the Falcon documentation. Please check the
"Late Data" section.
Since your question is about Feed replication (late data), I will try
to answer here:
1. From the Feed definition, let's say we have

<late-arrival cut-off="hours(6)"/>

2. From Falcon's runtime.properties
A feed cut-off policy is required for late-data handling for Feeds.
Allowed policies: periodic, exp-backoff (exponential backoff) and final.
Ex: periodic with delay=hours(2)

Here, Falcon would replicate the feed once every hour: 17:00, 18:00 and so
on. late-arrival specifies *how long this feed should be checked for
late data changes in the source cluster*. In this case, 6 hours.
So, the instance 17:00 is honoured until 23:00 (17 + 6), the instance
18:00 until 00:00 (next day), and so on.

*When to check?* This is specified by the cut-off policy. Here it says
periodic, hours(2), so Falcon checks for changes in the source cluster
every 2 hours. So, Falcon would check the instance 17:00 at 19:00 for the
data in the source cluster, followed by 21:00 and finally 23:00.
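
To make the arithmetic above concrete, here is a small Python sketch. It is
not Falcon code; the function name late_check_times and its parameters are
made up for illustration. It only assumes a periodic policy that re-checks
every "delay" after the instance time until the cut-off is reached.

```python
from datetime import datetime, timedelta


def late_check_times(instance_time, cut_off, delay):
    """Hypothetical helper: list the late-data check times for one instance.

    Assumes a periodic policy: the first check happens one `delay` after the
    instance time, and checks repeat until instance_time + cut_off.
    """
    deadline = instance_time + cut_off
    checks = []
    t = instance_time + delay
    while t <= deadline:
        checks.append(t)
        t += delay
    return checks


# The 17:00 instance with cut-off hours(6) and periodic delay hours(2).
instance = datetime(2015, 6, 25, 17, 0)
checks = late_check_times(instance, timedelta(hours=6), timedelta(hours=2))
print([c.strftime("%H:%M") for c in checks])  # ['19:00', '21:00', '23:00']
```

For the 18:00 instance the same call gives 20:00, 22:00 and 00:00 on the
next day.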

*How are changes detected?* Falcon maintains the data size for every
instance run, so it records the size of the data at the first run (17:00).
If it detects a different size in the source input at the next periodic
check (19:00), it simply reruns the entire replication, *overwriting* the
previously replicated data.
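
The size comparison can be sketched roughly as below. Again, this is only an
illustration in Python and not how Falcon is implemented internally; the
function check_instance, the recorded_sizes dictionary and the byte counts
are all assumptions made up for the example.

```python
def check_instance(instance, current_size, recorded_sizes):
    """Hypothetical size-based late-data check for one feed instance.

    recorded_sizes maps an instance name to the source data size (in bytes)
    seen at its last replication. Returns the action the check would take.
    """
    previous = recorded_sizes.get(instance)
    if previous is None:
        # First run: replicate and remember the size.
        recorded_sizes[instance] = current_size
        return "replicate"
    if previous != current_size:
        # Size changed since the last run: rerun the whole replication,
        # overwriting what was copied before.
        recorded_sizes[instance] = current_size
        return "rerun"
    return "skip"


sizes = {}
first = check_instance("17:00", 1024, sizes)   # first run at 17:00
second = check_instance("17:00", 1024, sizes)  # check at 19:00, no change
third = check_instance("17:00", 2048, sizes)   # check at 21:00, late data
print(first, second, third)  # replicate skip rerun
```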

Hope this answers your question.


On Thu, Jun 25, 2015 at 10:02 PM, Mahak Mukhi <[EMAIL PROTECTED]> wrote: