Knative KPA

Posted by Taction on Tuesday, December 7, 2021

This post covers how to configure the KPA and explains, from the perspective of a Revision, what each setting does.

Overview

Scaling can be configured globally and per revision. If a revision does not set a value, the global value (if any) is used. Global settings live in the config-autoscaler ConfigMap.

Algorithm

The Autoscaler scales on the average number of in-flight requests (concurrency) per pod; the default target is 100 concurrent requests. Number of pods = total concurrent requests / per-pod concurrency target. For example, if the concurrency target is set to 10 and the service receives 50 concurrent requests, the Autoscaler creates 5 pods.
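The per-pod concurrency from this example can be expressed directly on the Service. Below is a minimal sketch (service name and image borrowed from the examples later in this post); containerConcurrency is the hard per-pod limit, while the soft target is normally set with the autoscaling.knative.dev/target annotation described further down:

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: helloworld-go
    spec:
      template:
        spec:
          containerConcurrency: 10   # hard limit: at most 10 in-flight requests per pod (0 = unlimited)
          containers:
            - image: gcr.io/knative-samples/helloworld-go

With this limit in place, 50 concurrent requests scale the service to roughly 5 pods, before the target-utilization headroom discussed later is applied.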

The Autoscaler implements two operating modes: stable mode and panic mode:

  • Stable mode.

    In stable mode, the Autoscaler sizes the Deployment to achieve the desired average concurrency per pod. The per-pod concurrency is computed as the average over a 60-second window of incoming requests.

  • Panic mode.

    The Autoscaler averages concurrency over a 60-second window, so it takes about a minute for the system to settle at the desired concurrency level. However, it also computes a 6-second panic window, and if concurrency in that window reaches twice the target, it enters panic mode. In panic mode the Autoscaler operates on this shorter, more reactive window. Once the panic conditions have subsided for 60 seconds, the Autoscaler returns to the original 60-second stable window.

                                                           |
                                      Panic Target--->  +--| 20
                                                        |  |
                                                        | <------Panic Window
                                                        |  |
           Stable Target--->  +-------------------------|--| 10   CONCURRENCY
                              |                         |  |
                              |                      <-----------Stable Window
                              |                         |  |
    --------------------------+-------------------------+--+ 0
    120                       60                           0
                         TIME
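
The thresholds in the diagram map onto keys in the config-autoscaler ConfigMap documented below. As a minimal sketch, the defaults pictured above (a 60-second stable window, a panic window of 10% of that, i.e. 6 seconds, and a panic threshold of 200% of the target) correspond to:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: config-autoscaler
      namespace: knative-serving
    data:
      stable-window: "60s"                  # averaging window used in stable mode
      panic-window-percentage: "10.0"       # panic window = 10% of the stable window = 6s
      panic-threshold-percentage: "200.0"   # enter panic mode at 2x the target concurrency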
    
Autoscaler types

The type of Autoscaler implementation (KPA or HPA) can be configured by using the class annotation.

  • Global settings key: pod-autoscaler-class
  • Per-revision annotation key: autoscaling.knative.dev/class
  • Possible values: "kpa.autoscaling.knative.dev" or "hpa.autoscaling.knative.dev"
  • Default: "kpa.autoscaling.knative.dev"

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: helloworld-go
      namespace: default
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
        spec:
          containers:
            - image: gcr.io/knative-samples/helloworld-go
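
The same default can also be set cluster-wide through the global pod-autoscaler-class key in config-autoscaler; a minimal sketch showing only that key:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: config-autoscaler
      namespace: knative-serving
    data:
      pod-autoscaler-class: "kpa.autoscaling.knative.dev"   # default class for revisions without the class annotation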

Global control via the ConfigMap

Parameters and what they do

  • container-concurrency-target-percentage: the percentage of a container's concurrency limit to target. Effective target concurrency = container concurrency limit * this percentage. For example, if the Revision sets a container concurrency limit of 10 and this value is 70%, the Autoscaler targets 7 concurrent requests per pod in the stable state. Note that ContainerConcurrency (the maximum number of requests a container handles at the same time) is defined on the Revision; its default is 0, which means no limit.
  • container-concurrency-target-default: the default concurrency target, used when the Revision does not specify a container concurrency limit.
  • requests-per-second-target-default: the default requests-per-second (RPS) target. When RPS is used as the scaling metric, the Autoscaler makes scaling decisions against this value.
  • target-burst-capacity: the burst capacity. Under bursty traffic, the Activator is placed in the request path to buffer requests. The valid range is [-1, +∞): -1 keeps the Activator in the path at all times; 0 disables burst handling.
  • stable-window: the averaging window used in stable mode.
  • panic-window-percentage: the panic window as a percentage of the stable window. Panic window = stable window * panic-window-percentage / 100.
  • panic-threshold-percentage: the panic threshold. When the observed concurrency exceeds the target concurrency * this percentage within the panic window, the Autoscaler enters panic mode.
  • max-scale-up-rate: the maximum scale-up rate per evaluation. Maximum desired pods = max-scale-up-rate * number of ready pods.
  • max-scale-down-rate: the maximum scale-down rate per evaluation.
  • enable-scale-to-zero: whether scaling to zero is allowed.
  • scale-to-zero-grace-period: how long an inactive revision is kept running before it is scaled to zero.
  • scale-to-zero-pod-retention-period: the minimum time the last pod is retained after the Autoscaler decides to scale to zero. It is meant for workloads whose pods are very slow to start and whose traffic is bursty but patchy (requiring small windows for fast reaction). Together with scale-to-zero-grace-period, it effectively determines how long the last pod lingers after traffic stops.
  • pod-autoscaler-class: the autoscaler class to use.
  • activator-capacity: the capacity of a single Activator task, in concurrent requests proxied by the Activator; must be at least 1. It is used to compute the Activator subset size; the algorithm is described at http://bit.ly/38XiCZ3.

--- The following settings do not tune the scaling algorithm itself, but they live in the same ConfigMap and control scale bounds and initial scale.

  • initial-scale: the cluster-wide default for the number of replicas a revision starts with when it is created. It can be overridden per revision with the "autoscaling.knative.dev/initialScale" annotation. It must be greater than 0 unless allow-zero-initial-scale is true.
  • allow-zero-initial-scale: controls whether initial-scale and the "autoscaling.knative.dev/initialScale" annotation may be set to 0.
  • max-scale: the cluster-wide default for a revision's maximum number of replicas. It can be overridden per revision with the "autoscaling.knative.dev/maxScale" annotation. 0 means no upper limit.
  • scale-down-delay: how long a scale-down decision must hold before it is actually applied. If more requests arrive during the delay, the cold-start cost of spinning up new pods is avoided. The default is 0, i.e. scale down immediately.
  • max-scale-limit: a global cap on any revision's maximum scale. If set to a value greater than 0, a revision's maxScale must be non-zero and no greater than this value.
Example:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: config-autoscaler
      namespace: knative-serving
    data:
      container-concurrency-target-percentage: "70"
      container-concurrency-target-default: "100"
      requests-per-second-target-default: "200"
      target-burst-capacity: "200"
      stable-window: "60s"
      panic-window-percentage: "10.0"
      panic-threshold-percentage: "200.0"
      max-scale-up-rate: "1000.0"
      max-scale-down-rate: "2.0"
      enable-scale-to-zero: "false"
      scale-to-zero-grace-period: "30s"
      scale-to-zero-pod-retention-period: "0s"
      pod-autoscaler-class: "kpa.autoscaling.knative.dev"
      activator-capacity: "100.0"
      initial-scale: "1"
      allow-zero-initial-scale: "false"
      max-scale: "0"
      scale-down-delay: "0s"
      max-scale-limit: "0"

Configuration on the Revision

Each revision can define its own scaling configuration through annotations. Most keys mirror the ConfigMap settings: take the ConfigMap key, convert it to lowerCamelCase, and prefix it with autoscaling.knative.dev/. A combined example follows the list below.

  • autoscaling.knative.dev/class: the autoscaler class, kpa.autoscaling.knative.dev or hpa.autoscaling.knative.dev; the default is KPA.
  • autoscaling.knative.dev/minScale: minimum number of replicas.
  • autoscaling.knative.dev/maxScale: maximum number of replicas.
  • autoscaling.knative.dev/initialScale: initial number of replicas.
  • autoscaling.knative.dev/scaleDownDelay: delay before a scale-down decision is applied.
  • autoscaling.knative.dev/metric: which metric to scale on: concurrency, rps (requests per second), cpu, or memory.
  • autoscaling.knative.dev/target: the target value for the scaling metric; for concurrency and rps it is the per-pod target, for cpu it is a percentage of a single core, and for memory it is in MiB.
  • autoscaling.knative.dev/scaleToZeroPodRetentionPeriod: how long the last pod is kept around after the decision to scale to zero.
  • autoscaling.knative.dev/metricAggregationAlgorithm: note: this is an alpha feature and may change or be removed. It selects the algorithm the autoscaler uses to average the metric: linear or weightedExponential (exponentially weighted decay). It only applies to services scaled by the KPA.
  • autoscaling.knative.dev/window: the time window over which the metric is averaged; larger values are smoother but react more slowly. KPA only. Minimum 6s, maximum 1h.
  • autoscaling.knative.dev/targetUtilizationPercentage: the target resource utilization for the revision, in the range [1, 100]. It applies to both concurrency and rps. To keep the service available, scale-out starts before the hard target is reached: with concurrency set to 10 and this set to 70 (percent), a pod can still take up to 10 concurrent requests, but scale-out begins once concurrency reaches 7.
  • autoscaling.knative.dev/targetBurstCapacity: the revision's burst capacity: -1 means unlimited, 0 disables it, and any value greater than 0 is used as-is; other values are rejected.
  • autoscaling.knative.dev/panicWindowPercentage: the panic window as a percentage of the stable window, in the range [1, 100]. Because the autoscaler evaluates every 2s (the tick-interval in config-autoscaler), a panic window shorter than 2s can miss data points; at 1% the metric window would have to be at least 3.4 minutes to keep the panic window above 2s. Values smaller than 1 have no effect.
  • autoscaling.knative.dev/panicThresholdPercentage: the percentage of the target that the panic-window metric must reach to trigger panic mode. Minimum 110, maximum 1000; smaller values are more sensitive.
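
A minimal sketch that pulls several of these annotations together on one Service (the specific values are illustrative, not recommendations):

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: helloworld-go
      namespace: default
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
            autoscaling.knative.dev/metric: "concurrency"   # scale on in-flight requests
            autoscaling.knative.dev/target: "10"            # soft per-pod concurrency target
            autoscaling.knative.dev/minScale: "1"           # never scale below one pod
            autoscaling.knative.dev/maxScale: "10"          # never scale above ten pods
            autoscaling.knative.dev/window: "60s"           # metric averaging window
        spec:
          containers:
            - image: gcr.io/knative-samples/helloworld-go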

Configuration details

target burst capacity (TBC)

Compared with the HPA, Knative takes more scenarios into account, and one of the more important ones is bursty traffic. Besides the Autoscaler entering panic mode to scale out faster, the Activator and queue-proxy mentioned earlier can also buffer requests: when a burst arrives faster than it can be handled, requests are queued and forwarded later. Outside of cold starts, however, buffering requests is a last-resort fallback and adds request latency. Knative exposes a number of settings for tuning burst scenarios; the most common ones are:

  • container concurrency: the per-container concurrency limit; 0 means unlimited.
  • target utilization: the utilization at which a container is considered to have reached its concurrency target, triggering scale-out.
  • target burst capacity (TBC): the amount of burst, in concurrent requests, that the system should be able to tolerate. This parameter is the most important and the least intuitive.

As an example, suppose container concurrency is set to 50, target utilization is 80%, so each pod targets 50 * 80% = 40 concurrent requests, and TBC is set to 100. When 180 concurrent requests arrive, it is easy to work out that the service ends up scaled to 5 replicas (the calculation uses the target of 40 per pod). Those 5 replicas can actually absorb 5 * 50 = 250 concurrent requests, so the remaining burst headroom is 250 - 180 = 70 concurrent requests.

When the remaining headroom is smaller than TBC, Knative routes traffic through the Activator rather than sending it straight to the backing pods. In this example 70 < TBC = 100, so Knative considers the spare capacity insufficient to absorb the tolerated burst (TBC), and all traffic flows through the Activator for load balancing, because the Activator knows which pods have already reached their limit and which can still accept more requests.

If, in the same scenario, another 100 requests arrive before scale-out can catch up, the Activator proxies 70 of them to the backend and buffers the remaining 30, forwarding them once the backend has more capacity. So the Activator does not buffer requests only during cold starts: it plays exactly the same role under bursty traffic, and a cold start is simply the special case where the backend has zero replicas.

This analysis shows why TBC matters so much: it decides when request traffic goes through the Activator and when it goes directly from the gateway to the backend. If traffic passes through the Activator outside of cold starts, it adds an extra hop, which may not be worth it for latency-sensitive services.

So what should TBC be set to? That question puzzles a lot of people. Here are some guidelines:

  1. For services that are not CPU-bound, such as ordinary web applications or static content services, set TBC to 0 and also set container concurrency to 0, i.e. do not limit per-container concurrency. With TBC = 0 the remaining burst headroom is always greater than TBC, which means that apart from cold starts traffic never goes through the Activator; I also recommend setting the Knative Service's minimum replica count to 1, so the Activator component is effectively never used and one hop is removed from the path. In practice most Serverless workloads are ordinary online services, and this configuration minimizes request latency and maximizes RPS. (A sketch of this setup follows this list.)
  2. For heavily CPU-bound or single-threaded services and other extreme cases where per-container concurrency must be strictly limited (typically container concurrency < 5), set TBC = -1, because in these scenarios scale-out usually cannot keep up with a traffic spike. With TBC = -1 every request goes through the Activator, which forwards requests or buffers the excess; this is where the Activator earns its keep outside of cold starts.
  3. For everything else that still needs some concurrency limiting (typically container concurrency > 5), set TBC to an expected value such as 100; the exact number has to be evaluated by the operator based on the actual workload.
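
As a sketch (the service names and the concurrency value of 4 are illustrative), the first two guidelines translate into the following annotations and fields:

    # Guideline 1: ordinary web service - keep the Activator out of the request path
    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: web-app                                           # illustrative name
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/targetBurstCapacity: "0"  # no burst handling: bypass the Activator except at scale zero
            autoscaling.knative.dev/minScale: "1"             # keep one warm replica, so cold starts never happen
        spec:
          containerConcurrency: 0                             # unlimited per-pod concurrency
          containers:
            - image: gcr.io/knative-samples/helloworld-go
    ---
    # Guideline 2: CPU-bound / single-threaded service - always go through the Activator
    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: cpu-heavy                                         # illustrative name
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/targetBurstCapacity: "-1" # Activator always in the request path
        spec:
          containerConcurrency: 4                             # strict per-pod limit (< 5)
          containers:
            - image: gcr.io/knative-samples/helloworld-go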

Appendix:

The explanatory comments shipped in the config-autoscaler ConfigMap:

    ################################
    #                              #
    #    EXAMPLE CONFIGURATION     #
    #                              #
    ################################

    # This block is not actually functional configuration,
    # but serves to illustrate the available configuration
    # options and document them in a way that is accessible
    # to users that `kubectl edit` this config map.
    #
    # These sample configuration options may be copied out of
    # this example block and unindented to be in the data block
    # to actually change the configuration.

    # The Revision ContainerConcurrency field specifies the maximum number
    # of requests the Container can handle at once. Container concurrency
    # target percentage is how much of that maximum to use in a stable
    # state. E.g. if a Revision specifies ContainerConcurrency of 10, then
    # the Autoscaler will try to maintain 7 concurrent connections per pod
    # on average.
    # Note: this limit will be applied to container concurrency set at every
    # level (ConfigMap, Revision Spec or Annotation).
    # For legacy and backwards compatibility reasons, this value also accepts
    # fractional values in (0, 1] interval (i.e. 0.7 ⇒ 70%).
    # Thus minimal percentage value must be greater than 1.0, or it will be
    # treated as a fraction.
    # NOTE: that this value does not affect actual number of concurrent requests
    #       the user container may receive, but only the average number of requests
    #       that the revision pods will receive.
    container-concurrency-target-percentage: "70"

    # The container concurrency target default is what the Autoscaler will
    # try to maintain when concurrency is used as the scaling metric for the
    # Revision and the Revision specifies unlimited concurrency.
    # When revision explicitly specifies container concurrency, that value
    # will be used as a scaling target for autoscaler.
    # When specifying unlimited concurrency, the autoscaler will
    # horizontally scale the application based on this target concurrency.
    # This is what we call "soft limit" in the documentation, i.e. it only
    # affects number of pods and does not affect the number of requests
    # individual pod processes.
    # The value must be a positive number such that the value multiplied
    # by container-concurrency-target-percentage is greater than 0.01.
    # NOTE: that this value will be adjusted by application of
    #       container-concurrency-target-percentage, i.e. by default
    #       the system will target on average 70 concurrent requests
    #       per revision pod.
    # NOTE: Only one metric can be used for autoscaling a Revision.
    container-concurrency-target-default: "100"

    # The requests per second (RPS) target default is what the Autoscaler will
    # try to maintain when RPS is used as the scaling metric for a Revision and
    # the Revision specifies unlimited RPS. Even when specifying unlimited RPS,
    # the autoscaler will horizontally scale the application based on this
    # target RPS.
    # Must be greater than 1.0.
    # NOTE: Only one metric can be used for autoscaling a Revision.
    requests-per-second-target-default: "200"

    # The target burst capacity specifies the size of burst in concurrent
    # requests that the system operator expects the system will receive.
    # Autoscaler will try to protect the system from queueing by introducing
    # Activator in the request path if the current spare capacity of the
    # service is less than this setting.
    # If this setting is 0, then Activator will be in the request path only
    # when the revision is scaled to 0.
    # If this setting is > 0 and container-concurrency-target-percentage is
    # 100% or 1.0, then activator will always be in the request path.
    # -1 denotes unlimited target-burst-capacity and activator will always
    # be in the request path.
    # Other negative values are invalid.
    target-burst-capacity: "200"

    # When operating in a stable mode, the autoscaler operates on the
    # average concurrency over the stable window.
    # Stable window must be in whole seconds.
    stable-window: "60s"

    # When observed average concurrency during the panic window reaches
    # panic-threshold-percentage the target concurrency, the autoscaler
    # enters panic mode. When operating in panic mode, the autoscaler
    # scales on the average concurrency over the panic window which is
    # panic-window-percentage of the stable-window.
    # Must be in the [1, 100] range.
    # When computing the panic window it will be rounded to the closest
    # whole second, at least 1s.
    panic-window-percentage: "10.0"

    # The percentage of the container concurrency target at which to
    # enter panic mode when reached within the panic window.
    panic-threshold-percentage: "200.0"

    # Max scale up rate limits the rate at which the autoscaler will
    # increase pod count. It is the maximum ratio of desired pods versus
    # observed pods.
    # Cannot be less or equal to 1.
    # I.e with value of 2.0 the number of pods can at most go N to 2N
    # over single Autoscaler period (2s), but at least N to
    # N+1, if Autoscaler needs to scale up.
    max-scale-up-rate: "1000.0"

    # Max scale down rate limits the rate at which the autoscaler will
    # decrease pod count. It is the maximum ratio of observed pods versus
    # desired pods.
    # Cannot be less or equal to 1.
    # I.e. with value of 2.0 the number of pods can at most go N to N/2
    # over single Autoscaler evaluation period (2s), but at
    # least N to N-1, if Autoscaler needs to scale down.
    max-scale-down-rate: "2.0"

    # Scale to zero feature flag.
    enable-scale-to-zero: "true"

    # Scale to zero grace period is the time an inactive revision is left
    # running before it is scaled to zero (must be positive, but recommended
    # at least a few seconds if running with mesh networking).
    # This is the upper limit and is provided not to enforce timeout after
    # the revision stopped receiving requests for stable window, but to
    # ensure network reprogramming to put activator in the path has completed.
    # If the system determines that a shorter period is satisfactory,
    # then the system will only wait that amount of time before scaling to 0.
    # NOTE: this period might actually be 0, if activator has been
    # in the request path sufficiently long.
    # If there is necessity for the last pod to linger longer use
    # scale-to-zero-pod-retention-period flag.
    scale-to-zero-grace-period: "30s"

    # Scale to zero pod retention period defines the minimum amount
    # of time the last pod will remain after Autoscaler has decided to
    # scale to zero.
    # This flag is for the situations where the pod startup is very expensive
    # and the traffic is bursty (requiring smaller windows for fast action),
    # but patchy.
    # The larger of this flag and `scale-to-zero-grace-period` will effectively
    # determine how the last pod will hang around.
    scale-to-zero-pod-retention-period: "0s"

    # pod-autoscaler-class specifies the default pod autoscaler class
    # that should be used if none is specified. If omitted, the Knative
    # Horizontal Pod Autoscaler (KPA) is used by default.
    pod-autoscaler-class: "kpa.autoscaling.knative.dev"

    # The capacity of a single activator task.
    # The `unit` is one concurrent request proxied by the activator.
    # activator-capacity must be at least 1.
    # This value is used for computation of the Activator subset size.
    # See the algorithm here: http://bit.ly/38XiCZ3.
    # TODO(vagababov): tune after actual benchmarking.
    activator-capacity: "100.0"

    # initial-scale is the cluster-wide default value for the initial target
    # scale of a revision after creation, unless overridden by the
    # "autoscaling.knative.dev/initialScale" annotation.
    # This value must be greater than 0 unless allow-zero-initial-scale is true.
    initial-scale: "1"

    # allow-zero-initial-scale controls whether either the cluster-wide initial-scale flag,
    # or the "autoscaling.knative.dev/initialScale" annotation, can be set to 0.
    allow-zero-initial-scale: "false"

    # max-scale is the cluster-wide default value for the max scale of a revision,
    # unless overridden by the "autoscaling.knative.dev/maxScale" annotation.
    # If set to 0, the revision has no maximum scale.
    max-scale: "0"

    # scale-down-delay is the amount of time that must pass at reduced
    # concurrency before a scale down decision is applied. This can be useful,
    # for example, to maintain replica count and avoid a cold start penalty if
    # more requests come in within the scale down delay period.
    # The default, 0s, imposes no delay at all.
    scale-down-delay: "0s"

    # max-scale-limit sets the maximum permitted value for the max scale of a revision.
    # When this is set to a positive value, a revision with a maxScale above that value
    # (including a maxScale of "0" = unlimited) is disallowed.
    # A value of zero (the default) allows any limit, including unlimited.
    max-scale-limit: "0"