This article explains how to configure the KPA (Knative Pod Autoscaler) and describes, from the perspective of a Revision, what each setting does.
Overview
There are global settings and per-Revision settings. If a Revision does not set a value but a global one exists, the global value is used. Global settings live in the config-autoscaler ConfigMap.
Algorithm
The Autoscaler scales based on the average number of concurrent requests per pod; the default target concurrency is 100. Number of pods = total concurrent requests / per-container target concurrency. For example, if the target concurrency is set to 10 and the service receives 50 concurrent requests, the Autoscaler creates 5 pods.
The Autoscaler implements a scaling algorithm with two modes of operation, stable mode and panic mode:
Stable mode.
In stable mode, the Autoscaler adjusts the size of the Deployment to achieve the desired average concurrency per pod. Per-pod concurrency is computed as the average over all requests received within a 60-second window.
Panic mode.
Because the Autoscaler averages concurrency over a 60-second window, the system needs a minute to settle at the desired concurrency level. The Autoscaler therefore also computes a 6-second panic window; if concurrency in that window reaches twice the target, it enters panic mode and works on this shorter, more sensitive window. Once the panic condition has persisted for 60 seconds, the Autoscaler returns to the original 60-second stable window. (The settings behind these numbers are shown after the diagram below.)
                                                       |
                                  Panic Target--->  +--| 20
                                                    |  |
                                                    | <------Panic Window
                                                    |  |
       Stable Target--->  +-------------------------|--| 10   CONCURRENCY
                          |                         |  |
                          |                      <-----------Stable Window
                          |                         |  |
--------------------------+-------------------------+--+ 0
120                       60                           0
                     TIME
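For reference, the 6-second panic window and the 2x threshold described above correspond to the following config-autoscaler values; these are the same values used in the example ConfigMap later in this article:
stable-window: "60s"                  # stable mode averages concurrency over a 60-second window
panic-window-percentage: "10.0"       # panic window = 60s * 10 / 100 = 6s
panic-threshold-percentage: "200.0"   # enter panic mode at 200% (2x) of the target concurrency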
Autoscaler type
The type of Autoscaler implementation (KPA or HPA) can be configured by using the class annotation.
- Global settings key: pod-autoscaler-class
- Per-revision annotation key: autoscaling.knative.dev/class
- Possible values: "kpa.autoscaling.knative.dev" or "hpa.autoscaling.knative.dev"
- Default: "kpa.autoscaling.knative.dev"
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
    spec:
      containers:
        - image: gcr.io/knative-samples/helloworld-go
Global control via the ConfigMap
Parameters and their purpose
- container-concurrency-target-percentage: the percentage of the container's concurrency limit to target. Effective target concurrency = container concurrency limit * this percentage. For example, if the Revision sets a container concurrency limit of 10 and this percentage is 70%, the effective target concurrency in the stable state is 7. Note that the container concurrency limit (the maximum number of requests being handled at the same instant), ContainerConcurrency, is defined on the Revision and defaults to 0, meaning concurrency is unlimited.
- container-concurrency-target-default: the default container concurrency target. When the Revision does not set a concurrency limit, this value is used as the target.
- requests-per-second-target-default: the default requests-per-second (RPS) target. When RPS is used as the metric, the Autoscaler bases its scaling decisions on this value.
- target-burst-capacity: the target burst capacity. Under bursty traffic, the Activator is put into the request path to control traffic. Valid values are in [-1, +∞): -1 keeps the Activator in the path permanently; 0 disables the burst-capacity feature (the Activator is then only used when the Revision is scaled to zero).
- stable-window: the stable window, i.e. the time window used in stable mode.
- panic-window-percentage: the panic window as a percentage of the stable window. Panic window = stable window * panic-window-percentage / 100.
- panic-threshold-percentage: the panic threshold percentage. When the current number of concurrent requests exceeds the target concurrency * this threshold within the panic window, the Autoscaler enters panic mode.
- max-scale-up-rate: the maximum scale-up rate per evaluation. Maximum pods this step = max-scale-up-rate * number of ready pods.
- max-scale-down-rate: the maximum scale-down rate.
- enable-scale-to-zero: whether scaling to zero is allowed.
- scale-to-zero-grace-period: the grace period before the last pod is removed when scaling to zero.
- scale-to-zero-pod-retention-period: the minimum amount of time the last pod is retained after the Autoscaler decides to scale to zero. This flag is for workloads whose pods are very slow to start and whose traffic is bursty (requiring small windows for fast action) but patchy. Together with scale-to-zero-grace-period, it effectively determines how long the last pod lingers after traffic stops.
- pod-autoscaler-class: the default autoscaler class.
- activator-capacity: the capacity of a single Activator task, measured in concurrent requests proxied by the Activator. The minimum is 1. The algorithm used to size the Activator subset is described at http://bit.ly/38XiCZ3.
The settings below do not tune the scaling algorithm itself, but they also live in this ConfigMap.
- initial-scale: cluster-wide default for the initial number of replicas when a Revision is created. Can be overridden per Revision with the "autoscaling.knative.dev/initialScale" annotation. Must be greater than 0 unless allow-zero-initial-scale is true.
- allow-zero-initial-scale: controls whether initial-scale and the per-Revision "autoscaling.knative.dev/initialScale" annotation may be set to 0.
- max-scale: cluster-wide default for the maximum number of replicas of a Revision; can be overridden per Revision with the "autoscaling.knative.dev/maxScale" annotation. 0 means no upper limit.
- scale-down-delay: how long a scale-down decision must hold before it is applied. If more requests arrive during the delay, the cold-start cost of launching new pods is avoided. The default is 0, i.e. scale down immediately.
- max-scale-limit: a global upper bound on a Revision's maximum scale. If set to a value greater than 0, the Revision's maxScale must be non-zero and no greater than this value.
Example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  container-concurrency-target-percentage: "70"
  container-concurrency-target-default: "100"
  requests-per-second-target-default: "200"
  target-burst-capacity: "200"
  stable-window: "60s"
  panic-window-percentage: "10.0"
  panic-threshold-percentage: "200.0"
  max-scale-up-rate: "1000.0"
  max-scale-down-rate: "2.0"
  enable-scale-to-zero: "false"
  scale-to-zero-grace-period: "30s"
  scale-to-zero-pod-retention-period: "0s"
  pod-autoscaler-class: "kpa.autoscaling.knative.dev"
  activator-capacity: "100.0"
  initial-scale: "1"
  allow-zero-initial-scale: "false"
  max-scale: "0"
  scale-down-delay: "0s"
  max-scale-limit: "0"
Per-Revision configuration
Each Revision can define its own scaling configuration through annotations. Most of the keys mirror the ConfigMap keys: convert the ConfigMap key to lowerCamelCase and add the autoscaling.knative.dev/ prefix to obtain the annotation key. A combined example follows the list below.
- autoscaling.knative.dev/class: the autoscaler class, kpa.autoscaling.knative.dev or hpa.autoscaling.knative.dev; defaults to KPA.
- autoscaling.knative.dev/minScale: minimum number of replicas.
- autoscaling.knative.dev/maxScale: maximum number of replicas.
- autoscaling.knative.dev/initialScale: initial number of replicas.
- autoscaling.knative.dev/scaleDownDelay: delay before a scale-down decision is applied.
- autoscaling.knative.dev/metric: the metric to scale on: concurrency, rps (requests per second), cpu, or memory.
- autoscaling.knative.dev/target: the target value of the scaling metric. Only applies to cpu (percentage of a single core) and memory (MiB).
- autoscaling.knative.dev/scaleToZeroPodRetentionPeriod: how long the last pod is retained before scaling to zero.
- autoscaling.knative.dev/metricAggregationAlgorithm: note that this is an alpha feature and may be changed or removed. Defines the algorithm the Autoscaler uses to average the metric. Only applies to services scaled with KPA. Values: linear or weightedExponential (exponentially decaying weights).
- autoscaling.knative.dev/window: the time window over which the metric is averaged; larger values give smoother but slower reactions. Only applies to KPA. Minimum 6s, maximum 1h.
- autoscaling.knative.dev/targetUtilizationPercentage: the desired target utilization of the Revision, in the range [1, 100]. Applies to both concurrency and rps. To improve availability, scale-out does not wait until the target itself is reached; it starts once this utilization is reached. For example, if concurrency is set to 10 and this value is 70 (percent), each pod can still handle up to 10 concurrent requests, but scale-out starts as soon as concurrency reaches 7.
- autoscaling.knative.dev/targetBurstCapacity: the Revision's target burst capacity: -1 means unlimited, 0 disables it, and a value greater than 0 is used as-is; any other value is rejected.
- autoscaling.knative.dev/panicWindowPercentage: an important setting, the panic window as a percentage of the stable window. Minimum 1, maximum 100. Because the Autoscaler evaluates every 2s (the tick-interval setting in config-autoscaler), a panic window shorter than 2s may miss data points. A value of 1 percent is so small that the metric window would have to be at least 3.4 minutes; values smaller than 1 have no effect.
- autoscaling.knative.dev/panicThresholdPercentage: the percentage of the target the metric must reach within the panic window to trigger panic mode. Minimum 110, maximum 1000. The smaller the value, the more sensitive the trigger.
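Putting several of these annotations together, a per-Revision configuration might look like the sketch below; the service name is reused from the earlier example and all values are purely illustrative:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"  # scale with the KPA
        autoscaling.knative.dev/metric: "concurrency"                 # scale on concurrent requests
        autoscaling.knative.dev/minScale: "1"                         # keep at least one replica
        autoscaling.knative.dev/maxScale: "10"                        # never exceed ten replicas
        autoscaling.knative.dev/window: "60s"                         # average the metric over 60 seconds
        autoscaling.knative.dev/targetUtilizationPercentage: "70"     # start scaling out at 70% of the target
    spec:
      containers:
        - image: gcr.io/knative-samples/helloworld-go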
Configuration in depth
target burst capacity (TBC)
Compared with the HPA, Knative considers more scenarios, an important one being traffic bursts. Besides the Autoscaler entering panic mode to scale out faster, the Activator and queue-proxy mentioned above can also buffer requests; the point of this is to buffer and then forward requests when a burst arrives faster than it can be handled. Outside of cold start, however, buffering requests is regarded as a last-resort fallback and adds request latency. Knative exposes many settings that can be tuned for bursty traffic. The common ones are:
- container concurrency: the container's concurrency limit; 0 means per-container concurrency is unlimited.
- target utilization: the target utilization; once it is reached the container is considered to have hit its concurrency target, which triggers scale-out.
- target burst capacity (TBC): the tolerable burst capacity in concurrent requests. This parameter is critical and not easy to understand.
As an example, suppose container concurrency is set to 50 and target utilization is 80%, so each container targets 50 * 80% = 40 concurrent requests, and TBC is set to 100. When 180 concurrent requests arrive, it is easy to work out that the service scales to 5 replicas (computed against the per-pod target of 40), but 5 replicas can actually handle up to 5 * 50 = 250 concurrent requests, leaving a tolerable burst of 250 - 180 = 70 concurrent requests.
When the remaining burst capacity is smaller than TBC, Knative routes traffic through the Activator instead of sending it directly to the backend. In this example, since 70 < TBC = 100, Knative considers the remaining burst capacity insufficient to absorb the tolerable burst (TBC), so all traffic goes through the Activator, which then load-balances the requests; the Activator knows which containers have already reached their request limit and which can still take more.
Moreover, if in this scenario another 100 requests suddenly arrive and scale-out cannot keep up, the Activator proxies 70 of them to the backend and buffers the remaining 30, forwarding them once the backend has more capacity.
This shows that the Activator does not buffer requests only during cold start: it plays the same role under bursty traffic, and cold start is merely the special case where the backend has zero replicas.
From the analysis above, the TBC parameter is clearly important: it determines when traffic passes through the Activator and when it goes straight from the gateway to the backend. If traffic also passes through the Activator outside of cold start, the extra hop may not be worth it for latency-sensitive services.
So what should TBC be set to? For many people this is a hard question.
Some guidelines:
- For non-CPU-intensive services such as ordinary web applications or static-asset services, set TBC to 0 and also set container concurrency to 0, i.e. do not limit per-container concurrency (see the sketch after this list). With TBC = 0 the remaining burst capacity is always greater than TBC, which means that except during cold start traffic never goes through the Activator. I also recommend setting the Knative Service's minimum replica count to 1; the Activator is then effectively never used, which removes one hop from the path. In practice many Serverless workloads are ordinary online services, and this configuration minimizes request latency and maximizes RPS.
- For heavily CPU-bound or single-threaded services that must strictly limit per-container concurrency, usually container concurrency < 5, set TBC = -1, because in this scenario scale-out usually cannot keep up with traffic bursts. With TBC = -1 every request goes through the Activator, which forwards requests or buffers the excess; this is where the Activator proves its value outside of cold start.
- For the remaining cases that still need a concurrency limit, typically container concurrency > 5, set TBC to an expected value such as 100. The exact number has to be evaluated by the operator based on the actual workload.
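For the first case (ordinary, non-CPU-bound web services), the recommendation above translates into a Service roughly like the following sketch, reusing the helloworld-go image from the earlier examples; the values are illustrative, and containerConcurrency is the Revision spec field mentioned earlier:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/targetBurstCapacity: "0"  # never route through the Activator except at scale zero
        autoscaling.knative.dev/minScale: "1"             # keep one replica so cold start (and the Activator) is avoided
    spec:
      containerConcurrency: 0  # do not limit per-container concurrency
      containers:
        - image: gcr.io/knative-samples/helloworld-go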
Appendix:
Comments from the ConfigMap
################################
# #
# EXAMPLE CONFIGURATION #
# #
################################
# This block is not actually functional configuration,
# but serves to illustrate the available configuration
# options and document them in a way that is accessible
# to users that `kubectl edit` this config map.
#
# These sample configuration options may be copied out of
# this example block and unindented to be in the data block
# to actually change the configuration.
# The Revision ContainerConcurrency field specifies the maximum number
# of requests the Container can handle at once. Container concurrency
# target percentage is how much of that maximum to use in a stable
# state. E.g. if a Revision specifies ContainerConcurrency of 10, then
# the Autoscaler will try to maintain 7 concurrent connections per pod
# on average.
# Note: this limit will be applied to container concurrency set at every
# level (ConfigMap, Revision Spec or Annotation).
# For legacy and backwards compatibility reasons, this value also accepts
# fractional values in (0, 1] interval (i.e. 0.7 ⇒ 70%).
# Thus minimal percentage value must be greater than 1.0, or it will be
# treated as a fraction.
# NOTE: that this value does not affect actual number of concurrent requests
# the user container may receive, but only the average number of requests
# that the revision pods will receive.
container-concurrency-target-percentage: "70"
# The container concurrency target default is what the Autoscaler will
# try to maintain when concurrency is used as the scaling metric for the
# Revision and the Revision specifies unlimited concurrency.
# When revision explicitly specifies container concurrency, that value
# will be used as a scaling target for autoscaler.
# When specifying unlimited concurrency, the autoscaler will
# horizontally scale the application based on this target concurrency.
# This is what we call "soft limit" in the documentation, i.e. it only
# affects number of pods and does not affect the number of requests
# individual pod processes.
# The value must be a positive number such that the value multiplied
# by container-concurrency-target-percentage is greater than 0.01.
# NOTE: that this value will be adjusted by application of
# container-concurrency-target-percentage, i.e. by default
# the system will target on average 70 concurrent requests
# per revision pod.
# NOTE: Only one metric can be used for autoscaling a Revision.
container-concurrency-target-default: "100"
# The requests per second (RPS) target default is what the Autoscaler will
# try to maintain when RPS is used as the scaling metric for a Revision and
# the Revision specifies unlimited RPS. Even when specifying unlimited RPS,
# the autoscaler will horizontally scale the application based on this
# target RPS.
# Must be greater than 1.0.
# NOTE: Only one metric can be used for autoscaling a Revision.
requests-per-second-target-default: "200"
# The target burst capacity specifies the size of burst in concurrent
# requests that the system operator expects the system will receive.
# Autoscaler will try to protect the system from queueing by introducing
# Activator in the request path if the current spare capacity of the
# service is less than this setting.
# If this setting is 0, then Activator will be in the request path only
# when the revision is scaled to 0.
# If this setting is > 0 and container-concurrency-target-percentage is
# 100% or 1.0, then activator will always be in the request path.
# -1 denotes unlimited target-burst-capacity and activator will always
# be in the request path.
# Other negative values are invalid.
target-burst-capacity: "200"
# When operating in a stable mode, the autoscaler operates on the
# average concurrency over the stable window.
# Stable window must be in whole seconds.
stable-window: "60s"
# When observed average concurrency during the panic window reaches
# panic-threshold-percentage the target concurrency, the autoscaler
# enters panic mode. When operating in panic mode, the autoscaler
# scales on the average concurrency over the panic window which is
# panic-window-percentage of the stable-window.
# Must be in the [1, 100] range.
# When computing the panic window it will be rounded to the closest
# whole second, at least 1s.
panic-window-percentage: "10.0"
# The percentage of the container concurrency target at which to
# enter panic mode when reached within the panic window.
panic-threshold-percentage: "200.0"
# Max scale up rate limits the rate at which the autoscaler will
# increase pod count. It is the maximum ratio of desired pods versus
# observed pods.
# Cannot be less or equal to 1.
# I.e with value of 2.0 the number of pods can at most go N to 2N
# over single Autoscaler period (2s), but at least N to
# N+1, if Autoscaler needs to scale up.
max-scale-up-rate: "1000.0"
# Max scale down rate limits the rate at which the autoscaler will
# decrease pod count. It is the maximum ratio of observed pods versus
# desired pods.
# Cannot be less or equal to 1.
# I.e. with value of 2.0 the number of pods can at most go N to N/2
# over single Autoscaler evaluation period (2s), but at
# least N to N-1, if Autoscaler needs to scale down.
max-scale-down-rate: "2.0"
# Scale to zero feature flag.
enable-scale-to-zero: "true"
# Scale to zero grace period is the time an inactive revision is left
# running before it is scaled to zero (must be positive, but recommended
# at least a few seconds if running with mesh networking).
# This is the upper limit and is provided not to enforce timeout after
# the revision stopped receiving requests for stable window, but to
# ensure network reprogramming to put activator in the path has completed.
# If the system determines that a shorter period is satisfactory,
# then the system will only wait that amount of time before scaling to 0.
# NOTE: this period might actually be 0, if activator has been
# in the request path sufficiently long.
# If there is necessity for the last pod to linger longer use
# scale-to-zero-pod-retention-period flag.
scale-to-zero-grace-period: "30s"
# Scale to zero pod retention period defines the minimum amount
# of time the last pod will remain after Autoscaler has decided to
# scale to zero.
# This flag is for the situations where the pod startup is very expensive
# and the traffic is bursty (requiring smaller windows for fast action),
# but patchy.
# The larger of this flag and `scale-to-zero-grace-period` will effectively
# determine how the last pod will hang around.
scale-to-zero-pod-retention-period: "0s"
# pod-autoscaler-class specifies the default pod autoscaler class
# that should be used if none is specified. If omitted, the Knative
# Horizontal Pod Autoscaler (KPA) is used by default.
pod-autoscaler-class: "kpa.autoscaling.knative.dev"
# The capacity of a single activator task.
# The `unit` is one concurrent request proxied by the activator.
# activator-capacity must be at least 1.
# This value is used for computation of the Activator subset size.
# See the algorithm here: http://bit.ly/38XiCZ3.
# TODO(vagababov): tune after actual benchmarking.
activator-capacity: "100.0"
# initial-scale is the cluster-wide default value for the initial target
# scale of a revision after creation, unless overridden by the
# "autoscaling.knative.dev/initialScale" annotation.
# This value must be greater than 0 unless allow-zero-initial-scale is true.
initial-scale: "1"
# allow-zero-initial-scale controls whether either the cluster-wide initial-scale flag,
# or the "autoscaling.knative.dev/initialScale" annotation, can be set to 0.
allow-zero-initial-scale: "false"
# max-scale is the cluster-wide default value for the max scale of a revision,
# unless overridden by the "autoscaling.knative.dev/maxScale" annotation.
# If set to 0, the revision has no maximum scale.
max-scale: "0"
# scale-down-delay is the amount of time that must pass at reduced
# concurrency before a scale down decision is applied. This can be useful,
# for example, to maintain replica count and avoid a cold start penalty if
# more requests come in within the scale down delay period.
# The default, 0s, imposes no delay at all.
scale-down-delay: "0s"
# max-scale-limit sets the maximum permitted value for the max scale of a revision.
# When this is set to a positive value, a revision with a maxScale above that value
# (including a maxScale of "0" = unlimited) is disallowed.
# A value of zero (the default) allows any limit, including unlimited.
max-scale-limit: "0"