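Preface
This article looks, from the server's side, at how Eureka Server periodically evicts expired instances.
A client instance that shuts down normally calls unregister() before the application exits, sending an explicit deregistration request to Eureka Server.
A Eureka client instance that goes down abnormally (out of memory, a killed process, a crashed host, and so on) never gets to call unregister() before it stops.
Eureka therefore has the server actively remove client instances that can no longer serve traffic. This mechanism is called expired-lease eviction.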
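Expired-lease eviction
Eureka Server runs a background task every 60s (by default) that removes clients from which it has received no heartbeat renewal within 90s (by default).
Keep that 90s figure in mind: is an instance really evicted once 90s pass without a heartbeat renewal?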
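Key parameters
- Server side
This parameter sets how often the server-side eviction task runs. If it is not configured, the task runs every 60s; set it in the configuration to use a different interval.
eureka:
  server:
    eviction-interval-timer-in-ms: 60000 # default (60s)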
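- Client side
The client sets this parameter and sends it to the server in the POST request made at registration. If the client does not configure it, the server falls back to its default of 90s.
The official documentation notes:
If the value is too large, traffic may keep being routed to an instance even though it can no longer serve requests.
If the value is too small, a brief network glitch can cause the server to miss a heartbeat renewal within the configured time and evict an instance that is in fact healthy, making the service appear to be down.
eureka:
  instance:
    lease-expiration-duration-in-seconds: 90 # default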
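The fallback described for the client-side parameter can be sketched as below. This is only an illustration of the rule stated above (use the client's value when one was sent, otherwise 90s); the class and method names are hypothetical and are not part of Eureka's API.

```java
class LeaseDurationSketch {

    // Mirrors the 90s default mentioned in the text.
    static final int DEFAULT_DURATION_IN_SECS = 90;

    /**
     * Resolves the lease duration for a registration.
     * clientDurationSecs is the value carried by the registration request,
     * or null / non-positive when the client did not supply one.
     */
    static int resolve(Integer clientDurationSecs) {
        if (clientDurationSecs != null && clientDurationSecs > 0) {
            return clientDurationSecs;
        }
        return DEFAULT_DURATION_IN_SECS;
    }

    public static void main(String[] args) {
        System.out.println("client sent 30 -> " + resolve(30));
        System.out.println("client sent nothing -> " + resolve(null));
    }
}
```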
Sequence diagram

A brief walkthrough of how the eviction task is started
- Version
spring-cloud-netflix-eureka-server-2.1.1.RELEASE.jar
-
The META-INF/spring.factories file
org.springframework.boot.autoconfigure.EnableAutoConfiguration=\
  org.springframework.cloud.netflix.eureka.server.EurekaServerAutoConfiguration
-
Spring loads this auto-configuration class, which in turn @Imports the EurekaServerInitializerConfiguration configuration class
@Configuration
@Import(EurekaServerInitializerConfiguration.class)
@ConditionalOnBean(EurekaServerMarkerConfiguration.Marker.class)
@EnableConfigurationProperties({ EurekaDashboardProperties.class,
        InstanceRegistryProperties.class })
@PropertySource("classpath:/eureka/server.properties")
public class EurekaServerAutoConfiguration extends WebMvcConfigurerAdapter {
    ...
}
-
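EurekaServerInitializerConfiguration implements SmartLifecycle, so once the Spring container has finished loading and initializing its beans, its start() method is invoked, as is the case for every bean that implements the Lifecycle interface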
-
The remaining steps, creating the EvictionTask and scheduling it, are shown in the sequence diagram
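The last step of this startup flow, scheduling a repeating task on a daemon timer, can be tried in isolation. The following is a minimal sketch using only java.util.Timer and java.util.TimerTask; the class name, the thread name, and the latch used to observe the runs are assumptions for illustration and do not appear in Eureka.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class EvictionTimerSketch {

    /**
     * Schedules a repeating task on a daemon Timer and waits until it has
     * run `runs` times. Returns false if that does not happen in timeoutMs.
     */
    static boolean runScheduled(long intervalMs, int runs, long timeoutMs) {
        // Second argument true = daemon thread, so the timer never keeps the JVM alive.
        Timer timer = new Timer("Sketch-EvictionTimer", true);
        CountDownLatch latch = new CountDownLatch(runs);
        TimerTask task = new TimerTask() {
            @Override
            public void run() {
                latch.countDown(); // stand-in for the real eviction work
            }
        };
        // Initial delay and period are the same value, as in the eviction task.
        timer.schedule(task, intervalMs, intervalMs);
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            timer.cancel(); // stop the timer thread once we are done observing
        }
    }

    public static void main(String[] args) {
        System.out.println("ran three times: " + runScheduled(50L, 3, 5_000L));
    }
}
```

Because the initial delay equals the period, the first run happens one full interval after scheduling, which is why a freshly started server does not evict anything immediately.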
Source code
- UML class diagram

- TimerTask + Timer
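TimerTask is an abstract class, available since JDK 1.3, for a task that a timer can schedule to run once or repeatedly. Combined with the Timer class it provides scheduled execution, and Eureka's EvictionTask relies on the two to run in the background
- EvictionTask
EvictionTask is an inner class of AbstractInstanceRegistry and extends the abstract class java.util.TimerTask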
public abstract class AbstractInstanceRegistry implements InstanceRegistry {
    // Eviction timer; runs on a daemon (background) thread
    private Timer evictionTimer = new Timer("Eureka-EvictionTimer", true);
    // Starts the eviction timer
    protected void postInit() {
        renewsLastMin.start();
        if (evictionTaskRef.get() != null) {
            evictionTaskRef.get().cancel();
        }
        evictionTaskRef.set(new EvictionTask());
        // Both the initial delay and the period are evictionIntervalTimerInMs
        evictionTimer.schedule(evictionTaskRef.get(),
                serverConfig.getEvictionIntervalTimerInMs(),
                serverConfig.getEvictionIntervalTimerInMs());
    }
    // Eviction task inner class; extends TimerTask and overrides run()
    class EvictionTask extends TimerTask {
        // System.nanoTime() reading taken at the previous run
        private final AtomicLong lastExecutionNanosRef = new AtomicLong(0l);
        @Override
        public void run() {
            try {
                // Work out the compensation time
                long compensationTimeMs = getCompensationTimeMs();
                logger.info("Running the evict task with compensationTime {}ms", compensationTimeMs);
                // Run the eviction
                evict(compensationTimeMs);
            } catch (Throwable e) {
                logger.error("Could not run the evict task", e);
            }
        }
        /**
         * compute a compensation time defined as the actual time this task was executed since the prev iteration,
         * vs the configured amount of time for execution. This is useful for cases where changes in time (due to
         * clock skew or gc for example) causes the actual eviction task to execute later than the desired time
         * according to the configured cycle.
         * In short: take the time elapsed since the previous run; if it does not exceed the configured interval
         * (60s by default) return 0, otherwise return the excess as the compensation time.
         */
        long getCompensationTimeMs() {
            // Current time in nanoseconds
            long currNanos = getCurrentTimeNano();
            // getAndSet returns the previous run's reading (0 on the very first run)
            // and stores the current reading in lastExecutionNanosRef
            long lastNanos = lastExecutionNanosRef.getAndSet(currNanos);
            if (lastNanos == 0l) {
                // First run of the eviction task
                return 0l;
            }
            // Time elapsed between the previous run and this one
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
            // Compare it with the configured eviction interval (60s by default)
            long compensationTime = elapsedMs - serverConfig.getEvictionIntervalTimerInMs();
            // Return 0 if the interval was not exceeded, otherwise the excess
            return compensationTime <= 0l ? 0l : compensationTime;
        }
        // Current time in nanoseconds
        long getCurrentTimeNano() { // for testing
            return System.nanoTime();
        }
    }
    // The method that actually performs the eviction
    public void evict(long additionalLeaseMs) {
        logger.debug("Running the evict task");
        // 1. Check whether lease expiration is enabled; eviction proceeds only if it is, otherwise return.
        // isLeaseExpirationEnabled() is implemented by PeerAwareInstanceRegistryImpl and returns true in two cases:
        //   1: self-preservation mode is disabled on the server, so lease expiration is always enabled
        //   2: self-preservation mode is enabled but has not been triggered, i.e.
        //      numberOfRenewsPerMinThreshold (expected minimum renewals per minute) > 0 and
        //      renewals received in the last minute > numberOfRenewsPerMinThreshold
        //
        // How self-preservation is triggered:
        // numberOfRenewsPerMinThreshold (the self-preservation threshold) =
        //     expectedNumberOfClientsSendingRenews (incremented by 1 for each client that registers) *
        //     renewals per minute per client (60.0 / the server's expectedClientRenewalIntervalSeconds, 30s by default) *
        //     renewal percent threshold (default 0.85)
        // When the actual renewals per minute <= numberOfRenewsPerMinThreshold, self-preservation kicks in
        // and expired instances are no longer evicted
        if (!isLeaseExpirationEnabled()) {
            logger.debug("DS: lease expiration is currently disabled.");
            return;
        }
        // We collect first all expired items, to evict them in random order. For large eviction sets,
        // if we do not that, we might wipe out whole apps before self preservation kicks in. By randomizing it,
        // the impact should be evenly distributed across all applications.
        // 2. Collection that will hold the expired leases
        List<Lease<InstanceInfo>> expiredLeases = new ArrayList<>();
        // 2.1 Walk every lease in the registry
        for (Entry<String, Map<String, Lease<InstanceInfo>>> groupEntry : registry.entrySet()) {
            Map<String, Lease<InstanceInfo>> leaseMap = groupEntry.getValue();
            if (leaseMap != null) {
                for (Entry<String, Lease<InstanceInfo>> leaseEntry : leaseMap.entrySet()) {
                    Lease<InstanceInfo> lease = leaseEntry.getValue();
                    // 2.2 Check whether the lease has expired
                    if (lease.isExpired(additionalLeaseMs) && lease.getHolder() != null) {
                        // 2.3 Add the expired lease to the collection
                        expiredLeases.add(lease);
                    }
                }
            }
        }
        // To compensate for GC pauses or drifting local time, we need to use current registry size as a base for
        // triggering self-preservation. Without that we would wipe out full registry.
        // 3. Current registry size
        int registrySize = (int) getLocalRegistrySize();
        // 4. Registry size threshold: registry size * renewal percent threshold (default 0.85)
        int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());
        // 5. Eviction limit: current registry size - registry size threshold
        int evictionLimit = registrySize - registrySizeThreshold;
        // 6. Number to evict: the smaller of the expired-lease count and the eviction limit
        int toEvict = Math.min(expiredLeases.size(), evictionLimit);
        if (toEvict > 0) {
            logger.info("Evicting {} items (expired={}, evictionLimit={})", toEvict, expiredLeases.size(), evictionLimit);
            // 6.1 Random number generator
            Random random = new Random(System.currentTimeMillis());
            for (int i = 0; i < toEvict; i++) {
                // Pick a random item (Knuth shuffle algorithm)
                // 6.2 Swap a randomly chosen expired lease into position i
                int next = i + random.nextInt(expiredLeases.size() - i);
                Collections.swap(expiredLeases, i, next);
                Lease<InstanceInfo> lease = expiredLeases.get(i);
                // 6.3 appName and instanceId of the instance holding the expired lease
                String appName = lease.getHolder().getAppName();
                String id = lease.getHolder().getId();
                // 6.4 Increment the expired-instance counter
                EXPIRED.increment();
                logger.warn("DS: Registry: expired lease for {}/{}", appName, id);
                // 6.5 Call internalCancel to deregister, equivalent to the client deregistering itself
                internalCancel(appName, id, false);
            }
        }
    }
}
- Summary of the flow inside the eviction task
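The arithmetic inside the eviction task is easier to follow with concrete numbers. The sketch below restates three calculations from the quoted source as plain static methods: the compensation time, the self-preservation threshold, and the per-run eviction quota with its random selection. The class and method names are hypothetical, and the renewal interval is passed in as a plain parameter rather than read from any Eureka configuration object.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class EvictionMathSketch {

    /** Extra lease time granted when the task ran later than scheduled; never negative. */
    static long compensationMs(long elapsedSinceLastRunMs, long evictionIntervalMs) {
        return Math.max(0L, elapsedSinceLastRunMs - evictionIntervalMs);
    }

    /** Threshold = expected clients * renewals per minute per client * percent factor. */
    static int renewsPerMinThreshold(int expectedClients, int renewalIntervalSecs, double percent) {
        return (int) (expectedClients * (60.0 / renewalIntervalSecs) * percent);
    }

    /** Upper bound on evictions in one run: registry size minus (size * percent). */
    static int evictionLimit(int registrySize, double percent) {
        int registrySizeThreshold = (int) (registrySize * percent);
        return registrySize - registrySizeThreshold;
    }

    /** Number of leases actually evicted in one run. */
    static int toEvict(int expiredCount, int registrySize, double percent) {
        return Math.min(expiredCount, evictionLimit(registrySize, percent));
    }

    /** Partial Fisher-Yates (Knuth) shuffle: picks `count` distinct items at random. */
    static <T> List<T> pickRandom(List<T> expired, int count, Random random) {
        List<T> copy = new ArrayList<>(expired); // leave the caller's list untouched
        for (int i = 0; i < count; i++) {
            int next = i + random.nextInt(copy.size() - i);
            Collections.swap(copy, i, next);
        }
        return new ArrayList<>(copy.subList(0, count));
    }

    public static void main(String[] args) {
        // Task fired 75s after the previous run with a 60s interval.
        System.out.println("compensation ms: " + compensationMs(75_000L, 60_000L));
        // 10 clients, one heartbeat every 30s, factor 0.85.
        System.out.println("renews/min threshold: " + renewsPerMinThreshold(10, 30, 0.85));
        // 20 registered instances, 10 of them expired.
        System.out.println("to evict: " + toEvict(10, 20, 0.85));
    }
}
```

With 20 registered instances and a factor of 0.85, at most 3 leases are removed per run even if 10 have expired; the rest wait for later runs, which gives self-preservation a chance to engage before the registry is emptied.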

How a lease in the registry is judged to be expired
- The Lease class
/**
 * Lease class: describes the time-based availability of a T (a registered InstanceInfo)
 */
public class Lease<T> {
    // Enum describing lease actions (register, cancel, renew)
    enum Action {
        Register, Cancel, Renew
    };
    // Default lease duration: 90 seconds
    public static final int DEFAULT_DURATION_IN_SECS = 90;
    // Instance information held by the lease
    private T holder;
    // Eviction time
    private long evictionTimestamp;
    // Time the instance registered
    private long registrationTimestamp;
    // Time the service came up
    private long serviceUpTimestamp;
    // Make it volatile so that the expiration task would see this quicker
    // Time of the last heartbeat update; volatile guarantees cross-thread visibility
    private volatile long lastUpdateTimestamp;
    // Lease duration in milliseconds
    private long duration;
    public Lease(T r, int durationInSecs) {
        holder = r;
        registrationTimestamp = System.currentTimeMillis();
        lastUpdateTimestamp = registrationTimestamp;
        duration = (durationInSecs * 1000);
    }
    /**
     * Cancels the lease by updating the eviction time.
     * When the lease is cancelled, evictionTimestamp is set to the current time.
     */
    public void cancel() {
        if (evictionTimestamp <= 0) {
            evictionTimestamp = System.currentTimeMillis();
        }
    }
    /**
     * Renews the lease: sets lastUpdateTimestamp to the current timestamp plus the lease duration in milliseconds.
     *
     * Renew the lease, use renewal duration if it was specified by the
     * associated {@link T} during registration, otherwise default duration is
     * {@link #DEFAULT_DURATION_IN_SECS}.
     */
    public void renew() {
        lastUpdateTimestamp = System.currentTimeMillis() + duration;
    }
    /**
     * Checks if the lease of a given {@link com.netflix.appinfo.InstanceInfo} has expired or not.
     */
    public boolean isExpired() {
        return isExpired(0l);
    }
    /**
     * Checks if the lease of a given {@link com.netflix.appinfo.InstanceInfo} has expired or not.
     *
     * Note that due to renew() doing the 'wrong" thing and setting lastUpdateTimestamp to +duration more than
     * what it should be, the expiry will actually be 2 * duration. This is a minor bug and should only affect
     * instances that ungracefully shutdown. Due to possible wide ranging impact to existing usage, this will
     * not be fixed.
     *
     * Note: because of the compensation time, that extra time has to be added when checking for expiry.
     *
     * @param additionalLeaseMs any additional lease time to add to the lease evaluation in ms.
     */
    public boolean isExpired(long additionalLeaseMs) {
        return (evictionTimestamp > 0 || System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs));
    }
}
-
When a client sends a heartbeat renewal, Lease.renew() is invoked, which sets lastUpdateTimestamp to the current timestamp plus the lease duration
-
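How expiry is decided:
The lease is expired if evictionTimestamp is greater than 0, which means Lease.cancel() has been called,
or if the current timestamp is greater than the last update time + the lease duration + the compensation time
The real expiry time is not the default 90s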
/**
 * Checks if the lease of a given {@link com.netflix.appinfo.InstanceInfo} has expired or not.
 *
 * Note that due to renew() doing the 'wrong" thing and setting lastUpdateTimestamp to +duration more than
 * what it should be, the expiry will actually be 2 * duration. This is a minor bug and should only affect
 * instances that ungracefully shutdown. Due to possible wide ranging impact to existing usage, this will
 * not be fixed.
 *
 * @param additionalLeaseMs any additional lease time to add to the lease evaluation in ms.
 */
public boolean isExpired(long additionalLeaseMs) {
    return (evictionTimestamp > 0 || System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs));
}
the expiry will actually be 2 * duration. This is a minor bug and should only affect instances that ungracefully shutdown.
Due to possible wide ranging impact to existing usage, this will not be fixed
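The effective expiry period is actually twice the duration: renew() already adds duration to lastUpdateTimestamp, and isExpired() then adds duration a second time. With the default of 90s, a renewed lease expires 180s after its last heartbeat, and it is only removed at the next run of the eviction task after that.
The method comment calls this a minor bug that only affects instances that shut down ungracefully (clients that did not send a cancel request before the application stopped). Because a fix could have a wide-ranging impact on existing usage, the maintainers state that it will not be fixed.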
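The doubled expiry window can be reproduced with a stripped-down lease. This is a sketch, not Eureka's Lease class: it keeps only the two timestamp calculations quoted above, drops evictionTimestamp, and takes the current time as a parameter (an assumption made here so the behaviour is deterministic).

```java
class LeaseExpirySketch {

    private final long durationMs;
    private long lastUpdateTimestamp;

    /** Registration: lastUpdateTimestamp is just "now", no duration added. */
    LeaseExpirySketch(long nowMs, long durationMs) {
        this.durationMs = durationMs;
        this.lastUpdateTimestamp = nowMs;
    }

    /** Same arithmetic as the quoted renew(): now + duration. */
    void renew(long nowMs) {
        lastUpdateTimestamp = nowMs + durationMs;
    }

    /** Same comparison as the quoted isExpired(long), minus the evictionTimestamp check. */
    boolean isExpired(long nowMs, long additionalLeaseMs) {
        return nowMs > lastUpdateTimestamp + durationMs + additionalLeaseMs;
    }

    public static void main(String[] args) {
        long duration = 90_000L;

        // Never renewed: expires one duration after registration.
        LeaseExpirySketch fresh = new LeaseExpirySketch(0L, duration);
        System.out.println("unrenewed, t=90.001s expired? " + fresh.isExpired(90_001L, 0L));

        // Renewed once at t=30s: expiry moves to 30s + 2 * 90s = 210s.
        LeaseExpirySketch renewed = new LeaseExpirySketch(0L, duration);
        renewed.renew(30_000L);
        System.out.println("renewed, t=120.001s expired? " + renewed.isExpired(120_001L, 0L));
        System.out.println("renewed, t=210.001s expired? " + renewed.isExpired(210_001L, 0L));
    }
}
```

A lease that was registered but never renewed still expires after a single duration, since the constructor does not add the extra duration; the doubling only applies once at least one heartbeat has been received.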
The eviction action
- internalCancel(appName, id, false)
/**
 * {@link #cancel(String, String, boolean)} method is overridden by {@link PeerAwareInstanceRegistry}, so each
 * cancel request is replicated to the peers. This is however not desired for expires which would be counted
 * in the remote peers as valid cancellations, so self preservation mode would not kick-in.
 */
protected boolean internalCancel(String appName, String id, boolean isReplication) {
    try {
        // 1. Acquire the read lock
        read.lock();
        // 2. Increment the cancel counter
        CANCEL.increment(isReplication);
        // 3. Get the sub-map registered under appName
        Map<String, Lease<InstanceInfo>> gMap = registry.get(appName);
        Lease<InstanceInfo> leaseToCancel = null;
        if (gMap != null) {
            // 3.1 Remove this instance's lease from the sub-map and keep the removed lease
            leaseToCancel = gMap.remove(id);
        }
        // 4. Add the instance to recentCanceledQueue under synchronization
        synchronized (recentCanceledQueue) {
            recentCanceledQueue.add(new Pair<Long, String>(System.currentTimeMillis(), appName + "(" + id + ")"));
        }
        InstanceStatus instanceStatus = overriddenInstanceStatusMap.remove(id);
        if (instanceStatus != null) {
            logger.debug("Removed instance id {} from the overridden map which has value {}", id, instanceStatus.name());
        }
        if (leaseToCancel == null) {
            CANCEL_NOT_FOUND.increment(isReplication);
            logger.warn("DS: Registry: cancel failed because Lease is not registered for: {}/{}", appName, id);
            return false;
        } else {
            // 5. Call cancel() on the lease, which sets its evictionTimestamp to the current timestamp
            leaseToCancel.cancel();
            // 6. Get the instance information held by the lease
            InstanceInfo instanceInfo = leaseToCancel.getHolder();
            String vip = null;
            String svip = null;
            if (instanceInfo != null) {
                instanceInfo.setActionType(ActionType.DELETED);
                recentlyChangedQueue.add(new RecentlyChangedItem(leaseToCancel));
                instanceInfo.setLastUpdatedTimestamp();
                vip = instanceInfo.getVIPAddress();
                svip = instanceInfo.getSecureVipAddress();
            }
            // 7. Invalidate the Guava cache entries for this instance
            invalidateCache(appName, vip, svip);
            logger.info("Cancelled instance {}/{} (replication={})", appName, id, isReplication);
            return true;
        }
    } finally {
        // 8. Release the read lock
        read.unlock();
    }
}
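In brief
-
Look up the sub-map for the application in the registry map and remove the instance from it
-
Call cancel() on the removed lease, which sets its evictionTimestamp to the current timestamp and so records when the instance was evicted
-
Invalidate the responseCache entries for the instance, so that other clients no longer receive the evicted instance when they fetch the registry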
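These three steps can be sketched with a nested map and a read/write lock. The code below is a simplified model under stated assumptions: the Lease type, the counter standing in for response-cache invalidation, and all names are hypothetical, and the queues, metrics, and status overrides of the real method are left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class CancelSketch {

    /** Minimal lease: only remembers when it was cancelled (0 = still active). */
    static final class Lease {
        volatile long evictionTimestamp;
    }

    // appName -> (instanceId -> lease), the same two-level shape as the registry map
    private final Map<String, Map<String, Lease>> registry = new ConcurrentHashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Stand-in for invalidating the response cache
    final AtomicInteger cacheInvalidations = new AtomicInteger();

    void register(String appName, String id) {
        registry.computeIfAbsent(appName, k -> new ConcurrentHashMap<>()).put(id, new Lease());
    }

    boolean isRegistered(String appName, String id) {
        Map<String, Lease> leases = registry.get(appName);
        return leases != null && leases.containsKey(id);
    }

    /** Returns true if a lease was found and cancelled, false otherwise. */
    boolean cancel(String appName, String id) {
        lock.readLock().lock(); // shared lock, as in the quoted method
        try {
            // Step 1: find the application's sub-map and remove the instance
            Map<String, Lease> leases = registry.get(appName);
            Lease removed = (leases == null) ? null : leases.remove(id);
            if (removed == null) {
                return false;
            }
            // Step 2: record when the lease was evicted
            removed.evictionTimestamp = System.currentTimeMillis();
            // Step 3: drop cached responses so clients stop seeing the instance
            cacheInvalidations.incrementAndGet();
            return true;
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        CancelSketch sketch = new CancelSketch();
        sketch.register("ORDER-SERVICE", "host-1:8080");
        System.out.println("first cancel: " + sketch.cancel("ORDER-SERVICE", "host-1:8080"));
        System.out.println("second cancel: " + sketch.cancel("ORDER-SERVICE", "host-1:8080"));
    }
}
```

Here the lock is taken before the try block so that unlock() only runs once the lock is actually held; the quoted source acquires it inside the try.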
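Closing thoughts
To finish, here are a few questions for myself and for other readers. If you can answer them, you understand how Eureka Server handles expired-lease eviction.
1: What is expired-lease eviction, and why is an eviction task needed?
2: How often does the eviction task run by default, and which parameter changes the interval?
3: What is the self-preservation mechanism, why is it needed, and under what conditions does the server trigger it?
4: Is an instance really evicted after the default 90s? Why or why not?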