Eureka Lease Renewal: Periodic Eviction


Author: 0爱上1 | Published 2019-05-01 21:37

Preface

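This article explains, from the Server's point of view, how expired instances are evicted periodically.

A Client instance that shuts down normally triggers unregister() before the application exits, sending an explicit deregistration request to the Eureka Server.

A eureka client instance that goes down abnormally (out of memory, the process is killed, the host crashes, and so on) never gets the chance to call unregister() before it stops.

So eureka removes client instances that can no longer serve requests on its own initiative. This mechanism is called expired-lease eviction.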

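Expired-lease eviction

By default, the eureka server runs a background task every 60s that removes any Client from which it has not received a heartbeat renewal within 90s? (the default).

A hint of what is to come: is an instance really evicted as soon as no heartbeat has arrived for 90s?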

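Key parameters

  • Server side

This parameter sets the interval at which the Server's eviction timer task runs. If it is not configured, the task runs every 60s; set it to use a different interval.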

  eureka:
    server:
      eviction-interval-timer-in-ms: 60000 # default (60s)
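  • Client side

This parameter is set on the client and sent to the Server in the POST request made at registration. If the Client does not configure it, the server's default of 90s applies.

According to the official documentation:

If the value is too large, traffic may still be routed to an instance even though it can no longer serve requests
If the value is too small, network jitter can cause the server to miss heartbeats within the configured time and evict an instance that is actually healthy, making the service falsely appear to be offline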

  eureka:
     instance:
        lease-expiration-duration-in-seconds: 90 # default

Sequence diagram

Figure: Eureka eviction task

Here is roughly how the eviction task gets started.

  • Version

spring-cloud-netflix-eureka-server-2.1.1.RELEASE.jar

  1. The META-INF/spring.factories file

     org.springframework.boot.autoconfigure.EnableAutoConfiguration=\
     org.springframework.cloud.netflix.eureka.server.EurekaServerAutoConfiguration
    
  2. Spring loads this auto-configuration class, which in turn @Imports the EurekaServerInitializerConfiguration configuration class

@Configuration
@Import(EurekaServerInitializerConfiguration.class)
@ConditionalOnBean(EurekaServerMarkerConfiguration.Marker.class)
@EnableConfigurationProperties({ EurekaDashboardProperties.class,
    InstanceRegistryProperties.class })
@PropertySource("classpath:/eureka/server.properties")
public class EurekaServerAutoConfiguration extends WebMvcConfigurerAdapter {
...
}
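  3. EurekaServerInitializerConfiguration implements SmartLifecycle, so once the Spring container has finished loading and initializing its beans, Spring calls the start() method of its Lifecycle beans, including this one

  4. The remaining steps, which create and schedule the EvictionTask, are the ones drawn in the sequence diagram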

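Source code

  • UML class diagram
Figure: EvictionTask
  • TimerTask + Timer

TimerTask is an abstract class, available since JDK 1.3, representing a task that a timer can schedule for one-time or repeated execution. It works together with the Timer class to run scheduled tasks, and Eureka's EvictionTask relies on the pair to run in the background.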
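As a rough illustration of the Timer + TimerTask pairing, the sketch below schedules a repeating task on a daemon timer with the same initial delay and period, which is the pattern the eviction timer follows. The class name `TimerSketch`, the method `runRepeatedly`, and the timer name are made up for this example and are not part of Eureka.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Eureka.
final class TimerSketch {

    // Schedules a repeating task on a daemon Timer and waits until it has
    // run `runs` times. Returns true if that happened before the timeout.
    static boolean runRepeatedly(int runs, long periodMs, long timeoutMs) {
        CountDownLatch latch = new CountDownLatch(runs);
        Timer timer = new Timer("Sketch-EvictionTimer", true); // true = daemon thread
        TimerTask task = new TimerTask() {
            @Override
            public void run() {
                latch.countDown(); // stand-in for the real work
            }
        };
        // First run after periodMs, then again every periodMs.
        timer.schedule(task, periodMs, periodMs);
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            timer.cancel(); // stop the timer thread once we are done
        }
    }

    public static void main(String[] args) {
        System.out.println("task ran 3 times: " + runRepeatedly(3, 50L, 2000L));
    }
}
```

Because the timer thread is a daemon, it does not keep the JVM alive on its own, which suits a housekeeping job like eviction.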

  • EvictionTask

EvictionTask is an inner class of AbstractInstanceRegistry and extends the abstract class java.util.TimerTask

        public abstract class AbstractInstanceRegistry implements InstanceRegistry {

    // Eviction timer, running on a daemon thread
    private Timer evictionTimer = new Timer("Eureka-EvictionTimer", true);

    // Start the eviction timer
    protected void postInit() {
        renewsLastMin.start();
        if (evictionTaskRef.get() != null) {
            evictionTaskRef.get().cancel();
        }
        evictionTaskRef.set(new EvictionTask());
        // Both the initial delay and the period between runs are evictionIntervalTimerInMs
        evictionTimer.schedule(evictionTaskRef.get(),
                serverConfig.getEvictionIntervalTimerInMs(),
                serverConfig.getEvictionIntervalTimerInMs());
    }

    // Eviction task inner class; extends TimerTask and overrides run()
    class EvictionTask extends TimerTask {

        // Nanosecond timestamp of the previous execution
        private final AtomicLong lastExecutionNanosRef = new AtomicLong(0l);

        @Override
        public void run() {
            try {
                // Compute the compensation time
                long compensationTimeMs = getCompensationTimeMs();
                logger.info("Running the evict task with compensationTime {}ms", compensationTimeMs);
                
                // Run the eviction
                evict(compensationTimeMs);
            } catch (Throwable e) {
                logger.error("Could not run the evict task", e);
            }
        }

        /**
        * compute a compensation time defined as the actual time this task was executed since the prev iteration,
        * vs the configured amount of time for execution. This is useful for cases where changes in time (due to
        * clock skew or gc for example) causes the actual eviction task to execute later than the desired time
        * according to the configured cycle.
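        * Computes the time elapsed between this run and the previous run. Returns 0 if it does not exceed the eviction interval (60s by default); otherwise returns the excess as the compensation time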
        */
        long getCompensationTimeMs() {
            // Current time in nanoseconds
            long currNanos = getCurrentTimeNano();
            
            // AtomicLong.getAndSet returns the previous run's nanosecond timestamp (0 on the first run) and stores the current one in lastExecutionNanosRef
            long lastNanos = lastExecutionNanosRef.getAndSet(currNanos);

            if (lastNanos == 0l) {
                // Reached only on the first run of the eviction task
                return 0l;
            }

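            // Time elapsed between this run and the previous one
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
            
            // How far the elapsed time exceeds the eviction interval (60s by default)
            long compensationTime = elapsedMs - serverConfig.getEvictionIntervalTimerInMs();
            
            // Return 0 if the interval was not exceeded; otherwise return the excess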
            return compensationTime <= 0l ? 0l : compensationTime;
        }

        // Current time in nanoseconds
        long getCurrentTimeNano() {  // for testing
            return System.nanoTime();
        }

    }

    // The method that actually performs the eviction
    public void evict(long additionalLeaseMs) {

        logger.debug("Running the evict task");

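        // 1. Check whether lease expiration is enabled. Eviction proceeds only if it is (true); otherwise return immediately
        // isLeaseExpirationEnabled() is implemented by PeerAwareInstanceRegistryImpl and returns true in two cases:
        // 1: Self-preservation mode is disabled on the Server, so lease expiration is always enabled
        // 2: Self-preservation mode is enabled but has not been triggered, i.e. numberOfRenewsPerMinThreshold > 0 and
        // the number of renewals in the last minute > numberOfRenewsPerMinThreshold
        // 
        // 
        // A note on when self-preservation is triggered:
        // expected minimum renewals per minute, i.e. the self-preservation threshold (numberOfRenewsPerMinThreshold) = 
        // number of clients (expectedNumberOfClientsSendingRenews, incremented by 1 each time a client registers) * 
        // renewals per minute per client (computed as 60.0 / the client renewal interval in seconds) * 
        // renewal percent threshold (0.85 by default)
        // When the actual renewals per minute <= numberOfRenewsPerMinThreshold, self-preservation kicks in and expired instances are no longer evicted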
        if (!isLeaseExpirationEnabled()) {
            logger.debug("DS: lease expiration is currently disabled.");
            return;
        }

        // We collect first all expired items, to evict them in random order. For large eviction sets,
        // if we do not that, we might wipe out whole apps before self preservation kicks in. By randomizing it,
        // the impact should be evenly distributed across all applications.

        // 2. List that collects the expired leases
        List<Lease<InstanceInfo>> expiredLeases = new ArrayList<>();

        // 2.1 Iterate over every lease in the registry
        for (Entry<String, Map<String, Lease<InstanceInfo>>> groupEntry : registry.entrySet()) {
            Map<String, Lease<InstanceInfo>> leaseMap = groupEntry.getValue();
            if (leaseMap != null) {
                for (Entry<String, Lease<InstanceInfo>> leaseEntry : leaseMap.entrySet()) {
                    Lease<InstanceInfo> lease = leaseEntry.getValue();
                    // 2.2. Check whether the lease has expired
                    if (lease.isExpired(additionalLeaseMs) && lease.getHolder() != null) {
                        // 2.3. Add the expired lease to the list
                        expiredLeases.add(lease);
                    }
                }
            }
        }

        // To compensate for GC pauses or drifting local time, we need to use current registry size as a base for
        // triggering self-preservation. Without that we would wipe out full registry.

        // 3. Get the current registry size
        int registrySize = (int) getLocalRegistrySize();

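        // 4. Registry size threshold: registry size * renewal percent threshold (0.85 by default)
        int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());

        // 5. Eviction limit: current registry size - registry size threshold
        int evictionLimit = registrySize - registrySizeThreshold;

        // 6. Number of leases to evict: the smaller of the expired-lease count and the eviction limit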
        int toEvict = Math.min(expiredLeases.size(), evictionLimit);
        if (toEvict > 0) {
            logger.info("Evicting {} items (expired={}, evictionLimit={})", toEvict, expiredLeases.size(), evictionLimit);
            // 6.1. Create the random number generator
            Random random = new Random(System.currentTimeMillis());
            for (int i = 0; i < toEvict; i++) {
                // Pick a random item (Knuth shuffle algorithm)
                // 6.2. Swap a randomly chosen, not yet selected expired lease into position i
                int next = i + random.nextInt(expiredLeases.size() - i);
                Collections.swap(expiredLeases, i, next);
                Lease<InstanceInfo> lease = expiredLeases.get(i);

                // 6.3. Get the appName and instanceId of the instance held by the expired lease
                String appName = lease.getHolder().getAppName();
                String id = lease.getHolder().getId();
                
                // 6.4. Increment the expired-lease counter
                EXPIRED.increment();
                logger.warn("DS: Registry: expired lease for {}/{}", appName, id);
                
                // 6.5. Call internalCancel to remove the registration, as if the Client had deregistered on its own
                internalCancel(appName, id, false);
            }
        }
    }
}
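The compensation-time calculation in getCompensationTimeMs() can be tried out in isolation. The sketch below is a simplified model, not the Eureka class: `CompensationSketch` is a made-up name, and the current time is passed in as a parameter (an assumption made here so the arithmetic is deterministic) instead of being read from System.nanoTime().

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified model of the compensation-time calculation.
final class CompensationSketch {
    private final long intervalMs;
    private final AtomicLong lastExecutionNanos = new AtomicLong(0L);

    CompensationSketch(long intervalMs) {
        this.intervalMs = intervalMs;
    }

    // nowNanos replaces System.nanoTime() so the result is reproducible.
    long compensationTimeMs(long nowNanos) {
        // Read the previous timestamp and store the current one atomically.
        long lastNanos = lastExecutionNanos.getAndSet(nowNanos);
        if (lastNanos == 0L) {
            return 0L; // first run: nothing to compare against
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(nowNanos - lastNanos);
        // Only the part of the gap that exceeds the configured interval counts.
        return Math.max(0L, elapsedMs - intervalMs);
    }

    public static void main(String[] args) {
        CompensationSketch sketch = new CompensationSketch(60_000L);
        long oneSecond = TimeUnit.SECONDS.toNanos(1);
        System.out.println(sketch.compensationTimeMs(oneSecond));       // first run
        System.out.println(sketch.compensationTimeMs(61 * oneSecond));  // on schedule
        System.out.println(sketch.compensationTimeMs(126 * oneSecond)); // 5s late
    }
}
```

With a 60s interval, a run that starts 65s after the previous one yields a compensation of 5000 ms, which evict() then adds to every lease's allowed lifetime so that a delayed task does not penalize healthy instances.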
  • A summary of the flow inside the eviction task
Figure: flowchart of the eviction task's internal execution
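The flow can also be condensed into a small, self-contained model: check the self-preservation gate, cap the number of evictions, then pick the victims at random. Everything named here (`EvictionFlowSketch`, `leaseExpirationEnabled`, `evictionLimit`, `pickToEvict`) is hypothetical and only mirrors the logic of the evict() method quoted above; leases are represented by plain list elements.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical, simplified model of the eviction flow.
final class EvictionFlowSketch {

    // Gate: eviction runs if self-preservation is off, or on but not triggered.
    static boolean leaseExpirationEnabled(boolean selfPreservationOn,
                                          int renewsLastMin, int renewsPerMinThreshold) {
        if (!selfPreservationOn) {
            return true;
        }
        return renewsPerMinThreshold > 0 && renewsLastMin > renewsPerMinThreshold;
    }

    // Cap: at most (registrySize - registrySize * threshold) leases per run.
    static int evictionLimit(int registrySize, double renewalPercentThreshold) {
        int registrySizeThreshold = (int) (registrySize * renewalPercentThreshold);
        return registrySize - registrySizeThreshold;
    }

    // Picks min(expired, limit) elements at random using a partial
    // Fisher-Yates shuffle on a copy, leaving the input list untouched.
    static <T> List<T> pickToEvict(List<T> expired, int registrySize,
                                   double renewalPercentThreshold, Random random) {
        List<T> pool = new ArrayList<>(expired);
        int toEvict = Math.max(0,
                Math.min(pool.size(), evictionLimit(registrySize, renewalPercentThreshold)));
        for (int i = 0; i < toEvict; i++) {
            int next = i + random.nextInt(pool.size() - i);
            Collections.swap(pool, i, next);
        }
        return new ArrayList<>(pool.subList(0, toEvict));
    }

    public static void main(String[] args) {
        List<String> expired = new ArrayList<>();
        for (int i = 0; i < 30; i++) {
            expired.add("instance-" + i);
        }
        if (leaseExpirationEnabled(true, 90, 85)) {
            List<String> victims = pickToEvict(expired, 100, 0.85, new Random());
            System.out.println("evicting " + victims.size() + " of " + expired.size() + " expired leases");
        }
    }
}
```

In this model a registry of 100 instances with 30 expired leases loses at most 15 of them in one run; the rest wait for a later run, which gives self-preservation a chance to engage before whole applications disappear.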

How a lease in the registry is judged to be expired

  • The Lease class
    /**
* Lease class: describes the time-based availability of a T (the registered InstanceInfo)
*/
public class Lease<T> {

    // Enum describing the lease actions (register, cancel, renew)
    enum Action {
        Register, Cancel, Renew
    };

    // Default lease duration: 90 seconds
    public static final int DEFAULT_DURATION_IN_SECS = 90;

    // The instance info held by this lease
    private T holder;

    // Eviction time
    private long evictionTimestamp;
    
    // Instance registration time
    private long registrationTimestamp;
    
    // Time the service came up
    private long serviceUpTimestamp;
    
    // Make it volatile so that the expiration task would see this quicker
    // Time of the last heartbeat update; volatile so the eviction task sees new values promptly across threads
    private volatile long lastUpdateTimestamp;

    // Lease duration in milliseconds
    private long duration;

    public Lease(T r, int durationInSecs) {
        holder = r;
        registrationTimestamp = System.currentTimeMillis();
        lastUpdateTimestamp = registrationTimestamp;
        duration = (durationInSecs * 1000);

    }
    
  /**
 * Cancels the lease by updating the eviction time.
 * Called to cancel the lease: sets evictionTimestamp to the current time if it has not been set yet
 */
  public void cancel() {
    if (evictionTimestamp <= 0) {
        evictionTimestamp = System.currentTimeMillis();
    }
  }
    
    /**
    * Renews the lease by setting lastUpdateTimestamp to the current timestamp + the lease duration in milliseconds
    *
    * Renew the lease, use renewal duration if it was specified by the
    * associated {@link T} during registration, otherwise default duration is
    * {@link #DEFAULT_DURATION_IN_SECS}.
    */
    public void renew() {
        lastUpdateTimestamp = System.currentTimeMillis() + duration;

    }

    /**
    * Convenience overload: checks expiry with no additional lease time
    * Checks if the lease of a given {@link com.netflix.appinfo.InstanceInfo} has expired or not.
    */
    public boolean isExpired() {
        return isExpired(0l);
    }

    /**
    * Checks if the lease of a given {@link com.netflix.appinfo.InstanceInfo} has expired or not.
    *
    * Note that due to renew() doing the 'wrong" thing and setting lastUpdateTimestamp to +duration more than
    * what it should be, the expiry will actually be 2 * duration. This is a minor bug and should only affect
    * instances that ungracefully shutdown. Due to possible wide ranging impact to existing usage, this will
    * not be fixed.
    *
    * Note: because of the compensation time, it has to be added in when checking for expiry
    *
    *
    * @param additionalLeaseMs any additional lease time to add to the lease evaluation in ms.
    */
    public boolean isExpired(long additionalLeaseMs) {
        return (evictionTimestamp > 0 || System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs));
    }
}    
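  1. When a Client sends a heartbeat renewal, Lease.renew() is called, which sets lastUpdateTimestamp to the current timestamp + the lease duration

  2. A lease is considered expired if either of the following holds:

The eviction timestamp (evictionTimestamp) is greater than 0, meaning the Lease's cancel() has been called
or the current timestamp is greater than the last update time + the lease duration + the compensation time

The real eviction timeout is not the default 90s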

/**
 * Checks if the lease of a given {@link com.netflix.appinfo.InstanceInfo} has expired or not.
 *
 * Note that due to renew() doing the 'wrong" thing and setting lastUpdateTimestamp to +duration more than
 * what it should be, the expiry will actually be 2 * duration. This is a minor bug and should only affect
 * instances that ungracefully shutdown. Due to possible wide ranging impact to existing usage, this will
 * not be fixed.
 *
 * @param additionalLeaseMs any additional lease time to add to the lease evaluation in ms.
 */
public boolean isExpired(long additionalLeaseMs) {
    return (evictionTimestamp > 0 || System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs));
}

the expiry will actually be 2 * duration. This is a minor bug and should only affect instances that ungracefully shutdown.
Due to possible wide ranging impact to existing usage, this will not be fixed

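The effective expiry window is actually 2 × duration.

The method comment acknowledges this as a minor bug that only affects instances that shut down ungracefully (client instances that did not send a cancel request before the application stopped). Because a fix could have a wide-ranging impact on existing usage, the maintainers state that it will not be fixed.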
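To see where the factor of two comes from, here is a stripped-down model of the Lease timing logic. `LeaseSketch` is a made-up class that keeps only the fields involved in expiry, and the current time is passed in as `nowMs` (an assumption of this sketch) instead of calling System.currentTimeMillis(), so the timeline can be followed by hand.

```java
// Hypothetical, stripped-down model of the Lease timing logic.
final class LeaseSketch {
    private final long durationMs;
    private long evictionTimestamp;
    private volatile long lastUpdateTimestamp;

    LeaseSketch(long nowMs, int durationInSecs) {
        this.durationMs = durationInSecs * 1000L;
        this.lastUpdateTimestamp = nowMs; // registration time
    }

    // Mirrors renew(): the duration is already added here...
    void renew(long nowMs) {
        lastUpdateTimestamp = nowMs + durationMs;
    }

    void cancel(long nowMs) {
        if (evictionTimestamp <= 0) {
            evictionTimestamp = nowMs;
        }
    }

    // ...and added a second time here, hence 2 * duration after a renewal.
    boolean isExpired(long nowMs, long additionalLeaseMs) {
        return evictionTimestamp > 0
                || nowMs > lastUpdateTimestamp + durationMs + additionalLeaseMs;
    }

    public static void main(String[] args) {
        LeaseSketch lease = new LeaseSketch(0L, 90);
        lease.renew(100_000L); // last heartbeat at t = 100s
        System.out.println("t=191s expired? " + lease.isExpired(191_000L, 0L));
        System.out.println("t=281s expired? " + lease.isExpired(281_000L, 0L));
    }
}
```

In this model a lease last renewed at t = 100s is still considered alive at t = 191s and only expires after t = 280s, that is, 180s after the last heartbeat. A lease that was registered but never renewed expires after a single duration, because the constructor does not add the duration to lastUpdateTimestamp.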

The eviction action

  • internalCancel(appName, id, false)
/**
 * {@link #cancel(String, String, boolean)} method is overridden by {@link PeerAwareInstanceRegistry}, so each
 * cancel request is replicated to the peers. This is however not desired for expires which would be counted
 * in the remote peers as valid cancellations, so self preservation mode would not kick-in.
 */
protected boolean internalCancel(String appName, String id, boolean isReplication) {
    try {
        // 1. Acquire the read lock
        read.lock();

        // 2. Increment the cancel counter
        CANCEL.increment(isReplication);

        // 3. Get the sub-map registered under appName
        Map<String, Lease<InstanceInfo>> gMap = registry.get(appName);
        Lease<InstanceInfo> leaseToCancel = null;
        if (gMap != null) {

            // 3.1 Remove the instance's lease from the sub-map and keep the removed lease
            leaseToCancel = gMap.remove(id);
        }

        // 4. Record the cancellation in recentCanceledQueue (synchronized)
        synchronized (recentCanceledQueue) {
            recentCanceledQueue.add(new Pair<Long, String>(System.currentTimeMillis(), appName + "(" + id + ")"));
        }

        InstanceStatus instanceStatus = overriddenInstanceStatusMap.remove(id);
        if (instanceStatus != null) {
            logger.debug("Removed instance id {} from the overridden map which has value {}", id, instanceStatus.name());
        }
        if (leaseToCancel == null) {
            CANCEL_NOT_FOUND.increment(isReplication);
            logger.warn("DS: Registry: cancel failed because Lease is not registered for: {}/{}", appName, id);
            return false;
        } else {

            // 5. Call cancel() on the lease, which sets its evictionTimestamp to the current timestamp
            leaseToCancel.cancel();

            // 6. Get the instance info held by the lease
            InstanceInfo instanceInfo = leaseToCancel.getHolder();
            String vip = null;
            String svip = null;
            if (instanceInfo != null) {
                instanceInfo.setActionType(ActionType.DELETED);
                recentlyChangedQueue.add(new RecentlyChangedItem(leaseToCancel));
                instanceInfo.setLastUpdatedTimestamp();
                vip = instanceInfo.getVIPAddress();
                svip = instanceInfo.getSecureVipAddress();
            }
            // 7. Invalidate the (Guava-backed) response cache entries for this instance
            invalidateCache(appName, vip, svip);
            logger.info("Cancelled instance {}/{} (replication={})", appName, id, isReplication);
            return true;
        }
    } finally {
        // 8. Release the read lock
        read.unlock();
    }
}

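In brief:

  1. Look up the sub-map for the instance's app in the registry Map and remove the instance from it

  2. Call cancel() on the removed lease, which sets its evictionTimestamp to the current timestamp, recording when the instance was evicted

  3. Invalidate the responseCache entries for the instance, so that other clients no longer receive the evicted instance when they fetch the registry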
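The first two steps can be modelled with a two-level map, which is how the registry is laid out in the quoted code (appName, then instance id). The sketch below is a deliberately reduced illustration: `RegistrySketch` and its nested `Lease` are made-up types, and locking, metrics, the recent-change queues and cache invalidation are all left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, reduced model of the two-level registry.
final class RegistrySketch {

    static final class Lease {
        long evictionTimestamp; // 0 until the lease is cancelled
    }

    // appName -> (instanceId -> lease)
    private final Map<String, Map<String, Lease>> registry = new ConcurrentHashMap<>();

    void register(String appName, String id) {
        registry.computeIfAbsent(appName, k -> new ConcurrentHashMap<>())
                .put(id, new Lease());
    }

    boolean contains(String appName, String id) {
        Map<String, Lease> leases = registry.get(appName);
        return leases != null && leases.containsKey(id);
    }

    // Removes the lease from the app's sub-map and stamps the eviction time.
    // Returns false if no such lease was registered.
    boolean cancel(String appName, String id, long nowMs) {
        Map<String, Lease> leases = registry.get(appName);
        Lease removed = (leases == null) ? null : leases.remove(id);
        if (removed == null) {
            return false;
        }
        if (removed.evictionTimestamp <= 0) {
            removed.evictionTimestamp = nowMs;
        }
        return true;
    }

    public static void main(String[] args) {
        RegistrySketch registry = new RegistrySketch();
        registry.register("ORDER-SERVICE", "host-1:8080");
        System.out.println("cancelled: " + registry.cancel("ORDER-SERVICE", "host-1:8080", 1_000L));
        System.out.println("still registered: " + registry.contains("ORDER-SERVICE", "host-1:8080"));
    }
}
```

A second cancel for the same instance returns false, which corresponds to the "cancel failed because Lease is not registered" branch in the real method.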


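Closing thoughts

To finish, here are a few questions for myself and for other readers. If you can answer them, you understand how eviction works in Eureka Server.

1: What is expired-lease eviction, and why is an eviction task needed?

2: How often does the eviction task run by default, and which parameter changes that?

3: What is self-preservation, why is it needed, and under what conditions does the Server trigger it?

4: Is the real eviction timeout for an instance 90s by default? Why or why not?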


      Article link: https://www.haomeiwen.com/subject/eegdnqtx.html