Background
Flink 1.12.0, native Kubernetes (K8s) deployment, application mode.
After running for a while, the job fails with the following error:
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 95217669 (95251515)
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.12-1.12.0.jar:1.12.0]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.12-1.12.0.jar:1.12.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
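Solution:
Flink uses a Watcher to monitor Pod status changes. When the Watcher is closed abnormally, it triggers a fatal error, which in turn causes the JobManager to restart. The "too old resource version" message means the API server no longer retains the resource version the watch tried to resume from, so it closes the watch. One possible cause is an unstable network between the Pod and the API server, which would explain why the problem occurs so often.
This will be fixed in version 1.12.2.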
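As a rough illustration of why an abnormal Watcher close does not have to be fatal, here is a minimal sketch of a watch loop that re-lists and re-creates the watch when it is closed with HTTP 410 (Gone), the status Kubernetes uses for "too old resource version". This is not the actual Flink patch or the fabric8 API: PodApi, WatchClosedException and runWatchLoop are hypothetical names invented for the example.

```java
import java.net.HttpURLConnection;

// Hypothetical stand-in for an abnormal watch close; NOT the fabric8 exception class.
final class WatchClosedException extends RuntimeException {
    final int httpCode;

    WatchClosedException(int httpCode, String message) {
        super(message);
        this.httpCode = httpCode;
    }
}

// Hypothetical, minimal view of the API server used only by this sketch.
interface PodApi {
    // Performs a fresh list and returns the current resourceVersion.
    String listResourceVersion();

    // Watches from the given version; returns on a normal close,
    // throws WatchClosedException on an abnormal close.
    void watchFrom(String resourceVersion);
}

final class WatchRestartSketch {

    // Runs the watch and re-creates it after HTTP 410 instead of failing.
    // Returns how many times the watch had to be re-created.
    static int runWatchLoop(PodApi api, int maxRestarts) {
        int restarts = 0;
        String version = api.listResourceVersion();
        while (true) {
            try {
                api.watchFrom(version);
                return restarts; // watch ended normally
            } catch (WatchClosedException e) {
                boolean tooOld = e.httpCode == HttpURLConnection.HTTP_GONE;
                if (!tooOld || restarts >= maxRestarts) {
                    throw e; // anything else is still treated as fatal
                }
                restarts++;
                // The old version is gone on the server, so start from a fresh list.
                version = api.listResourceVersion();
            }
        }
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Fake API server that closes the watch with 410 twice, then behaves.
        PodApi flaky = new PodApi() {
            @Override
            public String listResourceVersion() {
                return "rv-" + calls[0];
            }

            @Override
            public void watchFrom(String resourceVersion) {
                if (calls[0]++ < 2) {
                    throw new WatchClosedException(
                            HttpURLConnection.HTTP_GONE, "too old resource version");
                }
            }
        };
        System.out.println("restarts=" + runWatchLoop(flaky, 5));
    }
}
```

The cap on restarts is a design choice in this sketch: it keeps a permanently broken connection from looping forever, while a transient 410 no longer takes down the whole process.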