Knative KPA

Posted by Taction on Tuesday, December 7, 2021

This post covers how to configure the KPA and explains, from the perspective of a Revision, what each setting does.

Overview

Scaling can be configured globally and per revision. If a revision does not set a value, the global value (if any) is used. Global settings live in the config-autoscaler ConfigMap.

Algorithm

The Autoscaler scales on the average number of in-flight requests (concurrency) per pod; the default target is 100 concurrent requests. Number of pods = total concurrent requests / per-pod concurrency target. For example, if the concurrency target is set to 10 and the service receives 50 concurrent requests, the Autoscaler creates 5 pods.
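The per-pod concurrency from this example can be expressed directly on the Service. Below is a minimal sketch (service name and image borrowed from the examples later in this post); containerConcurrency is the hard per-pod limit, while the soft target is normally set with the autoscaling.knative.dev/target annotation described further down:

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: helloworld-go
    spec:
      template:
        spec:
          containerConcurrency: 10   # hard limit: at most 10 in-flight requests per pod (0 = unlimited)
          containers:
            - image: gcr.io/knative-samples/helloworld-go

With this limit in place, 50 concurrent requests scale the service to roughly 5 pods, before the target-utilization headroom discussed later is applied.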

The Autoscaler implements two operating modes: stable mode and panic mode:

  • Stable mode.

    In stable mode, the Autoscaler sizes the Deployment to achieve the desired average concurrency per pod. The per-pod concurrency is computed as the average over a 60-second window of incoming requests.

  • Panic mode.

    The Autoscaler averages concurrency over a 60-second window, so it takes about a minute for the system to settle at the desired concurrency level. However, it also computes a 6-second panic window, and if concurrency in that window reaches twice the target, it enters panic mode. In panic mode the Autoscaler operates on this shorter, more reactive window. Once the panic conditions have subsided for 60 seconds, the Autoscaler returns to the original 60-second stable window.

                                                           |
                                      Panic Target--->  +--| 20
                                                        |  |
                                                        | <------Panic Window
                                                        |  |
           Stable Target--->  +-------------------------|--| 10   CONCURRENCY
                              |                         |  |
                              |                      <-----------Stable Window
                              |                         |  |
    --------------------------+-------------------------+--+ 0
    120                       60                           0
                         TIME
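
The thresholds in the diagram map onto keys in the config-autoscaler ConfigMap documented below. As a minimal sketch, the defaults pictured above (a 60-second stable window, a panic window of 10% of that, i.e. 6 seconds, and a panic threshold of 200% of the target) correspond to:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: config-autoscaler
      namespace: knative-serving
    data:
      stable-window: "60s"                  # averaging window used in stable mode
      panic-window-percentage: "10.0"       # panic window = 10% of the stable window = 6s
      panic-threshold-percentage: "200.0"   # enter panic mode at 2x the target concurrency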
    
Autoscaler types

The type of Autoscaler implementation (KPA or HPA) can be configured by using the class annotation.

  • Global settings key: pod-autoscaler-class
  • Per-revision annotation key: autoscaling.knative.dev/class
  • Possible values: "kpa.autoscaling.knative.dev" or "hpa.autoscaling.knative.dev"
  • Default: "kpa.autoscaling.knative.dev"

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: helloworld-go
      namespace: default
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
        spec:
          containers:
            - image: gcr.io/knative-samples/helloworld-go
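
The same default can also be set cluster-wide through the global pod-autoscaler-class key in config-autoscaler; a minimal sketch showing only that key:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: config-autoscaler
      namespace: knative-serving
    data:
      pod-autoscaler-class: "kpa.autoscaling.knative.dev"   # default class for revisions without the class annotation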

Global control via the ConfigMap

Parameters and what they do

  • container-concurrency-target-percentage: the percentage of a container's concurrency limit to target. Effective target concurrency = container concurrency limit * this percentage. For example, if the Revision sets a container concurrency limit of 10 and this value is 70%, the Autoscaler targets 7 concurrent requests per pod in the stable state. Note that ContainerConcurrency (the maximum number of requests a container handles at the same time) is defined on the Revision; its default is 0, which means no limit.
  • container-concurrency-target-default: the default concurrency target, used when the Revision does not specify a container concurrency limit.
  • requests-per-second-target-default: the default requests-per-second (RPS) target. When RPS is used as the scaling metric, the Autoscaler makes scaling decisions against this value.
  • target-burst-capacity: the burst capacity. Under bursty traffic, the Activator is placed in the request path to buffer requests. The valid range is [-1, +∞): -1 keeps the Activator in the path at all times; 0 disables burst handling.
  • stable-window: the averaging window used in stable mode.
  • panic-window-percentage: the panic window as a percentage of the stable window. Panic window = stable window * panic-window-percentage / 100.
  • panic-threshold-percentage: the panic threshold. When the observed concurrency exceeds the target concurrency * this percentage within the panic window, the Autoscaler enters panic mode.
  • max-scale-up-rate: the maximum scale-up rate per evaluation. Maximum desired pods = max-scale-up-rate * number of ready pods.
  • max-scale-down-rate: the maximum scale-down rate per evaluation.
  • enable-scale-to-zero: whether scaling to zero is allowed.
  • scale-to-zero-grace-period: how long an inactive revision is kept running before it is scaled to zero.
  • scale-to-zero-pod-retention-period: the minimum time the last pod is retained after the Autoscaler decides to scale to zero. It is meant for workloads whose pods are very slow to start and whose traffic is bursty but patchy (requiring small windows for fast reaction). Together with scale-to-zero-grace-period, it effectively determines how long the last pod lingers after traffic stops.
  • pod-autoscaler-class: the autoscaler class to use.
  • activator-capacity: the capacity of a single Activator task, in concurrent requests proxied by the Activator; must be at least 1. It is used to compute the Activator subset size; the algorithm is described at http://bit.ly/38XiCZ3.

--- The following settings do not tune the scaling algorithm itself, but they live in the same ConfigMap and control scale bounds and initial scale.

  • initial-scale: the cluster-wide default for the number of replicas a revision starts with when it is created. It can be overridden per revision with the "autoscaling.knative.dev/initialScale" annotation. It must be greater than 0 unless allow-zero-initial-scale is true.
  • allow-zero-initial-scale: controls whether initial-scale and the "autoscaling.knative.dev/initialScale" annotation may be set to 0.
  • max-scale: the cluster-wide default for a revision's maximum number of replicas. It can be overridden per revision with the "autoscaling.knative.dev/maxScale" annotation. 0 means no upper limit.
  • scale-down-delay: how long a scale-down decision must hold before it is actually applied. If more requests arrive during the delay, the cold-start cost of spinning up new pods is avoided. The default is 0, i.e. scale down immediately.
  • max-scale-limit: a global cap on any revision's maximum scale. If set to a value greater than 0, a revision's maxScale must be non-zero and no greater than this value.
Example:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: config-autoscaler
      namespace: knative-serving
    data:
      container-concurrency-target-percentage: "70"
      container-concurrency-target-default: "100"
      requests-per-second-target-default: "200"
      target-burst-capacity: "200"
      stable-window: "60s"
      panic-window-percentage: "10.0"
      panic-threshold-percentage: "200.0"
      max-scale-up-rate: "1000.0"
      max-scale-down-rate: "2.0"
      enable-scale-to-zero: "false"
      scale-to-zero-grace-period: "30s"
      scale-to-zero-pod-retention-period: "0s"
      pod-autoscaler-class: "kpa.autoscaling.knative.dev"
      activator-capacity: "100.0"
      initial-scale: "1"
      allow-zero-initial-scale: "false"
      max-scale: "0"
      scale-down-delay: "0s"
      max-scale-limit: "0"

Configuration on the Revision

Each revision can define its own scaling configuration through annotations. Most keys mirror the ConfigMap settings: take the ConfigMap key, convert it to lowerCamelCase, and prefix it with autoscaling.knative.dev/. A combined example follows the list below.

  • autoscaling.knative.dev/class: the autoscaler class, kpa.autoscaling.knative.dev or hpa.autoscaling.knative.dev; the default is KPA.
  • autoscaling.knative.dev/minScale: minimum number of replicas.
  • autoscaling.knative.dev/maxScale: maximum number of replicas.
  • autoscaling.knative.dev/initialScale: initial number of replicas.
  • autoscaling.knative.dev/scaleDownDelay: delay before a scale-down decision is applied.
  • autoscaling.knative.dev/metric: which metric to scale on: concurrency, rps (requests per second), cpu, or memory.
  • autoscaling.knative.dev/target: the target value for the scaling metric; for concurrency and rps it is the per-pod target, for cpu it is a percentage of a single core, and for memory it is in MiB.
  • autoscaling.knative.dev/scaleToZeroPodRetentionPeriod: how long the last pod is kept around after the decision to scale to zero.
  • autoscaling.knative.dev/metricAggregationAlgorithm: note: this is an alpha feature and may change or be removed. It selects the algorithm the autoscaler uses to average the metric: linear or weightedExponential (exponentially weighted decay). It only applies to services scaled by the KPA.
  • autoscaling.knative.dev/window: the time window over which the metric is averaged; larger values are smoother but react more slowly. KPA only. Minimum 6s, maximum 1h.
  • autoscaling.knative.dev/targetUtilizationPercentage: the target resource utilization for the revision, in the range [1, 100]. It applies to both concurrency and rps. To keep the service available, scale-out starts before the hard target is reached: with concurrency set to 10 and this set to 70 (percent), a pod can still take up to 10 concurrent requests, but scale-out begins once concurrency reaches 7.
  • autoscaling.knative.dev/targetBurstCapacity: the revision's burst capacity: -1 means unlimited, 0 disables it, and any value greater than 0 is used as-is; other values are rejected.
  • autoscaling.knative.dev/panicWindowPercentage: the panic window as a percentage of the stable window, in the range [1, 100]. Because the autoscaler evaluates every 2s (the tick-interval in config-autoscaler), a panic window shorter than 2s can miss data points; at 1% the metric window would have to be at least 3.4 minutes to keep the panic window above 2s. Values smaller than 1 have no effect.
  • autoscaling.knative.dev/panicThresholdPercentage: the percentage of the target that the panic-window metric must reach to trigger panic mode. Minimum 110, maximum 1000; smaller values are more sensitive.
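
A minimal sketch that pulls several of these annotations together on one Service (the specific values are illustrative, not recommendations):

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: helloworld-go
      namespace: default
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
            autoscaling.knative.dev/metric: "concurrency"   # scale on in-flight requests
            autoscaling.knative.dev/target: "10"            # soft per-pod concurrency target
            autoscaling.knative.dev/minScale: "1"           # never scale below one pod
            autoscaling.knative.dev/maxScale: "10"          # never scale above ten pods
            autoscaling.knative.dev/window: "60s"           # metric averaging window
        spec:
          containers:
            - image: gcr.io/knative-samples/helloworld-go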

Configuration details

target burst capacity (TBC)

Compared with the HPA, Knative takes more scenarios into account, and one of the more important ones is bursty traffic. Besides the Autoscaler entering panic mode to scale out faster, the Activator and queue-proxy mentioned earlier can also buffer requests: when a burst arrives faster than it can be handled, requests are queued and forwarded later. Outside of cold starts, however, buffering requests is a last-resort fallback and adds request latency. Knative exposes a number of settings for tuning burst scenarios; the most common ones are:

  • container concurrency: the per-container concurrency limit; 0 means unlimited.
  • target utilization: the utilization at which a container is considered to have reached its concurrency target, triggering scale-out.
  • target burst capacity (TBC): the amount of burst, in concurrent requests, that the system should be able to tolerate. This parameter is the most important and the least intuitive.

As an example, suppose container concurrency is set to 50, target utilization is 80%, so each pod targets 50 * 80% = 40 concurrent requests, and TBC is set to 100. When 180 concurrent requests arrive, it is easy to work out that the service ends up scaled to 5 replicas (the calculation uses the target of 40 per pod). Those 5 replicas can actually absorb 5 * 50 = 250 concurrent requests, so the remaining burst headroom is 250 - 180 = 70 concurrent requests.

When the remaining headroom is smaller than TBC, Knative routes traffic through the Activator rather than sending it straight to the backing pods. In this example 70 < TBC = 100, so Knative considers the spare capacity insufficient to absorb the tolerated burst (TBC), and all traffic flows through the Activator for load balancing, because the Activator knows which pods have already reached their limit and which can still accept more requests.

If, in the same scenario, another 100 requests arrive before scale-out can catch up, the Activator proxies 70 of them to the backend and buffers the remaining 30, forwarding them once the backend has more capacity. So the Activator does not buffer requests only during cold starts: it plays exactly the same role under bursty traffic, and a cold start is simply the special case where the backend has zero replicas.

This analysis shows why TBC matters so much: it decides when request traffic goes through the Activator and when it goes directly from the gateway to the backend. If traffic passes through the Activator outside of cold starts, it adds an extra hop, which may not be worth it for latency-sensitive services.

So what should TBC be set to? That question puzzles a lot of people. Here are some guidelines:

  1. For services that are not CPU-bound, such as ordinary web applications or static content services, set TBC to 0 and also set container concurrency to 0, i.e. do not limit per-container concurrency. With TBC = 0 the remaining burst headroom is always greater than TBC, which means that apart from cold starts traffic never goes through the Activator; I also recommend setting the Knative Service's minimum replica count to 1, so the Activator component is effectively never used and one hop is removed from the path. In practice most Serverless workloads are ordinary online services, and this configuration minimizes request latency and maximizes RPS. (A sketch of this setup follows this list.)
  2. For heavily CPU-bound or single-threaded services and other extreme cases where per-container concurrency must be strictly limited (typically container concurrency < 5), set TBC = -1, because in these scenarios scale-out usually cannot keep up with a traffic spike. With TBC = -1 every request goes through the Activator, which forwards requests or buffers the excess; this is where the Activator earns its keep outside of cold starts.
  3. For everything else that still needs some concurrency limiting (typically container concurrency > 5), set TBC to an expected value such as 100; the exact number has to be evaluated by the operator based on the actual workload.
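
As a sketch (the service names and the concurrency value of 4 are illustrative), the first two guidelines translate into the following annotations and fields:

    # Guideline 1: ordinary web service - keep the Activator out of the request path
    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: web-app                                           # illustrative name
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/targetBurstCapacity: "0"  # no burst handling: bypass the Activator except at scale zero
            autoscaling.knative.dev/minScale: "1"             # keep one warm replica, so cold starts never happen
        spec:
          containerConcurrency: 0                             # unlimited per-pod concurrency
          containers:
            - image: gcr.io/knative-samples/helloworld-go
    ---
    # Guideline 2: CPU-bound / single-threaded service - always go through the Activator
    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: cpu-heavy                                         # illustrative name
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/targetBurstCapacity: "-1" # Activator always in the request path
        spec:
          containerConcurrency: 4                             # strict per-pod limit (< 5)
          containers:
            - image: gcr.io/knative-samples/helloworld-go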

Appendix:

The explanatory comments shipped in the config-autoscaler ConfigMap:

    ################################
    #                              #
    #    EXAMPLE CONFIGURATION     #
    #                              #
    ################################

    # This block is not actually functional configuration,
    # but serves to illustrate the available configuration
    # options and document them in a way that is accessible
    # to users that `kubectl edit` this config map.
    #
    # These sample configuration options may be copied out of
    # this example block and unindented to be in the data block
    # to actually change the configuration.

    # The Revision ContainerConcurrency field specifies the maximum number
    # of requests the Container can handle at once. Container concurrency
    # target percentage is how much of that maximum to use in a stable
    # state. E.g. if a Revision specifies ContainerConcurrency of 10, then
    # the Autoscaler will try to maintain 7 concurrent connections per pod
    # on average.
    # Note: this limit will be applied to container concurrency set at every
    # level (ConfigMap, Revision Spec or Annotation).
    # For legacy and backwards compatibility reasons, this value also accepts
    # fractional values in (0, 1] interval (i.e. 0.7 ⇒ 70%).
    # Thus minimal percentage value must be greater than 1.0, or it will be
    # treated as a fraction.
    # NOTE: that this value does not affect actual number of concurrent requests
    #       the user container may receive, but only the average number of requests
    #       that the revision pods will receive.
    container-concurrency-target-percentage: "70"

    # The container concurrency target default is what the Autoscaler will
    # try to maintain when concurrency is used as the scaling metric for the
    # Revision and the Revision specifies unlimited concurrency.
    # When revision explicitly specifies container concurrency, that value
    # will be used as a scaling target for autoscaler.
    # When specifying unlimited concurrency, the autoscaler will
    # horizontally scale the application based on this target concurrency.
    # This is what we call "soft limit" in the documentation, i.e. it only
    # affects number of pods and does not affect the number of requests
    # individual pod processes.
    # The value must be a positive number such that the value multiplied
    # by container-concurrency-target-percentage is greater than 0.01.
    # NOTE: that this value will be adjusted by application of
    #       container-concurrency-target-percentage, i.e. by default
    #       the system will target on average 70 concurrent requests
    #       per revision pod.
    # NOTE: Only one metric can be used for autoscaling a Revision.
    container-concurrency-target-default: "100"

    # The requests per second (RPS) target default is what the Autoscaler will
    # try to maintain when RPS is used as the scaling metric for a Revision and
    # the Revision specifies unlimited RPS. Even when specifying unlimited RPS,
    # the autoscaler will horizontally scale the application based on this
    # target RPS.
    # Must be greater than 1.0.
    # NOTE: Only one metric can be used for autoscaling a Revision.
    requests-per-second-target-default: "200"

    # The target burst capacity specifies the size of burst in concurrent
    # requests that the system operator expects the system will receive.
    # Autoscaler will try to protect the system from queueing by introducing
    # Activator in the request path if the current spare capacity of the
    # service is less than this setting.
    # If this setting is 0, then Activator will be in the request path only
    # when the revision is scaled to 0.
    # If this setting is > 0 and container-concurrency-target-percentage is
    # 100% or 1.0, then activator will always be in the request path.
    # -1 denotes unlimited target-burst-capacity and activator will always
    # be in the request path.
    # Other negative values are invalid.
    target-burst-capacity: "200"

    # When operating in a stable mode, the autoscaler operates on the
    # average concurrency over the stable window.
    # Stable window must be in whole seconds.
    stable-window: "60s"

    # When observed average concurrency during the panic window reaches
    # panic-threshold-percentage the target concurrency, the autoscaler
    # enters panic mode. When operating in panic mode, the autoscaler
    # scales on the average concurrency over the panic window which is
    # panic-window-percentage of the stable-window.
    # Must be in the [1, 100] range.
    # When computing the panic window it will be rounded to the closest
    # whole second, at least 1s.
    panic-window-percentage: "10.0"

    # The percentage of the container concurrency target at which to
    # enter panic mode when reached within the panic window.
    panic-threshold-percentage: "200.0"

    # Max scale up rate limits the rate at which the autoscaler will
    # increase pod count. It is the maximum ratio of desired pods versus
    # observed pods.
    # Cannot be less or equal to 1.
    # I.e with value of 2.0 the number of pods can at most go N to 2N
    # over single Autoscaler period (2s), but at least N to
    # N+1, if Autoscaler needs to scale up.
    max-scale-up-rate: "1000.0"

    # Max scale down rate limits the rate at which the autoscaler will
    # decrease pod count. It is the maximum ratio of observed pods versus
    # desired pods.
    # Cannot be less or equal to 1.
    # I.e. with value of 2.0 the number of pods can at most go N to N/2
    # over single Autoscaler evaluation period (2s), but at
    # least N to N-1, if Autoscaler needs to scale down.
    max-scale-down-rate: "2.0"

    # Scale to zero feature flag.
    enable-scale-to-zero: "true"

    # Scale to zero grace period is the time an inactive revision is left
    # running before it is scaled to zero (must be positive, but recommended
    # at least a few seconds if running with mesh networking).
    # This is the upper limit and is provided not to enforce timeout after
    # the revision stopped receiving requests for stable window, but to
    # ensure network reprogramming to put activator in the path has completed.
    # If the system determines that a shorter period is satisfactory,
    # then the system will only wait that amount of time before scaling to 0.
    # NOTE: this period might actually be 0, if activator has been
    # in the request path sufficiently long.
    # If there is necessity for the last pod to linger longer use
    # scale-to-zero-pod-retention-period flag.
    scale-to-zero-grace-period: "30s"

    # Scale to zero pod retention period defines the minimum amount
    # of time the last pod will remain after Autoscaler has decided to
    # scale to zero.
    # This flag is for the situations where the pod startup is very expensive
    # and the traffic is bursty (requiring smaller windows for fast action),
    # but patchy.
    # The larger of this flag and `scale-to-zero-grace-period` will effectively
    # determine how the last pod will hang around.
    scale-to-zero-pod-retention-period: "0s"

    # pod-autoscaler-class specifies the default pod autoscaler class
    # that should be used if none is specified. If omitted, the Knative
    # Horizontal Pod Autoscaler (KPA) is used by default.
    pod-autoscaler-class: "kpa.autoscaling.knative.dev"

    # The capacity of a single activator task.
    # The `unit` is one concurrent request proxied by the activator.
    # activator-capacity must be at least 1.
    # This value is used for computation of the Activator subset size.
    # See the algorithm here: http://bit.ly/38XiCZ3.
    # TODO(vagababov): tune after actual benchmarking.
    activator-capacity: "100.0"

    # initial-scale is the cluster-wide default value for the initial target
    # scale of a revision after creation, unless overridden by the
    # "autoscaling.knative.dev/initialScale" annotation.
    # This value must be greater than 0 unless allow-zero-initial-scale is true.
    initial-scale: "1"

    # allow-zero-initial-scale controls whether either the cluster-wide initial-scale flag,
    # or the "autoscaling.knative.dev/initialScale" annotation, can be set to 0.
    allow-zero-initial-scale: "false"

    # max-scale is the cluster-wide default value for the max scale of a revision,
    # unless overridden by the "autoscaling.knative.dev/maxScale" annotation.
    # If set to 0, the revision has no maximum scale.
    max-scale: "0"

    # scale-down-delay is the amount of time that must pass at reduced
    # concurrency before a scale down decision is applied. This can be useful,
    # for example, to maintain replica count and avoid a cold start penalty if
    # more requests come in within the scale down delay period.
    # The default, 0s, imposes no delay at all.
    scale-down-delay: "0s"

    # max-scale-limit sets the maximum permitted value for the max scale of a revision.
    # When this is set to a positive value, a revision with a maxScale above that value
    # (including a maxScale of "0" = unlimited) is disallowed.
    # A value of zero (the default) allows any limit, including unlimited.
    max-scale-limit: "0"