Hi Mahak,

To quickly answer your questions:
Scenario 1 : A feed instance runs at 17:30 for replication but a file
ending in 1730 isn't available yet. So, the instance is rescheduled for a
later time and this keeps on happening until the file is found or the late
arrival cut off time (an hour in this case) is reached.
- Assuming it's a feed with frequency minutes(10), this scenario has
nothing to do with late data. When the availability flag is ready, the
replication kicks off; otherwise the 17:30 replication instance stays in the
"Waiting" state. Once the availability flag is found, the instance goes to the
"Running" state, replicates the data to the target cluster, and the 17:30
instance is considered a "Success".

Scenario 2: A feed instance runs at 17:30 for replication and finds that a
file ending in 1720 is now available which wasn't available when the last
replication instance ran(at 17:20). So, now it copies both the files (the
one ending in 1730 and the one ending in 1720).
- No, it won't copy data from both instances. Since the 17:20 data is
available for the first time, the 17:20 instance simply copies 17:20's data
alone, and the feed instance for 17:30 checks for data under the 17:30
directory alone. Both are independent instances.
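
To illustrate the independence of instances, here is a minimal Python sketch. It is not Falcon code: the base path, the directory layout, and the function names are all assumptions made up for the example. The only point is that an instance's nominal time maps to exactly one directory.

```python
from datetime import datetime

# Hypothetical base path and layout; a real feed defines its own location pattern.
BASE = "/data/feed"


def instance_dir(nominal: datetime) -> str:
    """Map an instance's nominal time to its own data directory."""
    return f"{BASE}/{nominal:%Y%m%d}/{nominal:%H%M}"


def dirs_read_by(nominal: datetime) -> list:
    """An instance reads only its own directory, never a neighbour's."""
    return [instance_dir(nominal)]


print(dirs_read_by(datetime(2015, 6, 25, 17, 20)))
print(dirs_read_by(datetime(2015, 6, 25, 17, 30)))
```

So the 17:30 instance never picks up the 17:20 directory; the 17:20 instance handles that on its own once its data shows up.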

Late arrival works for both Feed and Process, and the details of the
functionality are available in the Falcon documentation.
Please check the "Late Data" section of
http://falcon.apache.org/0.6-incubating/EntitySpecification.html#Feed_Specification
Since your question is related to Feed replication (late data), I will try
to answer here:
1. From the Feed definition, let's say we have
 <frequency>hours(1)</frequency>

<late-arrival cut-off="hours(6)"/>

2. From Falcon's runtime.properties
A feed cut-off policy is required for late-data handling for Feeds.
Allowed policies: periodic, exp-backoff (exponential backoff) and final.
Ex: periodic with delay=hours(2).

Here, Falcon would replicate the feed once every hour: 17:00, 18:00 and so
on.
late-arrival specifies *how long this feed should be checked for
late data changes in the source cluster*. In this case, 6 hours.
So the 17:00 instance is honoured until 23:00 (17 + 6), the 18:00
instance until 00:00 (next day), and so on.
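
The window arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation, not Falcon's implementation; the function name and the dates are made up for the example.

```python
from datetime import datetime, timedelta


def late_arrival_deadline(nominal: datetime, cut_off: timedelta) -> datetime:
    """Last moment at which late data for this instance is still honoured."""
    return nominal + cut_off


cut_off = timedelta(hours=6)  # corresponds to cut-off="hours(6)"
for hour in (17, 18):
    nominal = datetime(2015, 6, 25, hour, 0)
    deadline = late_arrival_deadline(nominal, cut_off)
    print(f"{nominal:%H:%M} -> {deadline:%Y-%m-%d %H:%M}")
# 17:00 -> 2015-06-25 23:00
# 18:00 -> 2015-06-26 00:00
```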

*When to check?* This is specified by the cut-off policy. Here it is periodic
with hours(2), so Falcon checks the source cluster input for changes every
2 hours.
So, Falcon would check the 17:00 instance's data in the source cluster at
19:00, then at 21:00, and finally at 23:00.
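
As a rough sketch of how the periodic policy produces those check times (an illustration under the assumptions of this example, not Falcon's scheduler; the function name is hypothetical):

```python
from datetime import datetime, timedelta


def periodic_check_times(nominal, cut_off, delay):
    """List the late-data check times for one instance under a periodic policy.

    Checks start one delay after the nominal time and repeat every delay
    until the late-arrival cut-off is reached (inclusive).
    """
    deadline = nominal + cut_off
    check = nominal + delay
    times = []
    while check <= deadline:
        times.append(check)
        check += delay
    return times


checks = periodic_check_times(
    datetime(2015, 6, 25, 17, 0),  # instance 17:00
    timedelta(hours=6),            # late-arrival cut-off
    timedelta(hours=2),            # periodic delay
)
print([f"{t:%H:%M}" for t in checks])  # ['19:00', '21:00', '23:00']
```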

*How are changes detected?* Falcon maintains the data size for every
instance run, so it records the size of the data at the first run (17:00).
If it detects a different size in the source input at the next periodic
check (19:00), it simply reruns the entire replication, *overwriting* the
previously replicated data.
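
The size comparison can be pictured roughly like this. It is a simplified sketch, not Falcon's actual code; the dictionary of recorded sizes and the function name are assumptions for illustration only.

```python
def late_data_detected(recorded_sizes, instance, current_size):
    """Compare the source size now with the size recorded at the last run.

    Returns True (and updates the record) when the size differs, which is
    the signal to rerun the whole replication for that instance.
    """
    if recorded_sizes.get(instance) == current_size:
        return False
    recorded_sizes[instance] = current_size
    return True


sizes = {"17:00": 1000}  # size recorded at the first run (hypothetical bytes)
print(late_data_detected(sizes, "17:00", 1000))  # False: nothing changed at 19:00
print(late_data_detected(sizes, "17:00", 1500))  # True: late data arrived, rerun
```

Note that a pure size comparison would miss a change that leaves the total size identical, so treat it as a heuristic.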

Hope this answers your question.

Thanks,
-Idris

On Thu, Jun 25, 2015 at 10:02 PM, Mahak Mukhi <[EMAIL PROTECTED]> wrote: