vllm.v1.attention.ops.triton_reshape_and_cache_flash
triton_reshape_and_cache_flash_int4_per_token_head
triton_reshape_and_cache_flash_int4_per_token_head(
key: Tensor,
value: Tensor,
key_cache: Tensor,
value_cache: Tensor,
k_scale_cache: Tensor,
v_scale_cache: Tensor,
slot_mapping: Tensor,
)
Quantize key/value per (token, head) to packed int4 using Gaussian-friendly (non-uniform) quantization levels, and write to the paged cache.
Source code in vllm/v1/attention/ops/triton_reshape_and_cache_flash.py
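The Triton source is not reproduced here. As a rough illustration of the idea, below is a minimal PyTorch sketch of per-(token, head) int4 quantization against a non-uniform codebook. The NF4-style level table, the low-nibble-first packing order, and the name quantize_int4_per_token_head are assumptions for illustration, not the kernel's actual internals.

```python
import torch

# Hypothetical normalized codebook (assumption): 16 levels, symmetric around
# zero, spaced for roughly Gaussian-distributed inputs (NF4-style).
NF4_LEVELS = torch.tensor([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def quantize_int4_per_token_head(x: torch.Tensor):
    """x: [num_tokens, num_heads, head_size] float tensor (head_size even)."""
    # One scale per (token, head): absmax over the head_size dimension.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8)
    normed = x / scale  # values now in [-1, 1]
    # Nearest-level lookup into the codebook -> 4-bit codes in [0, 15].
    codes = (normed.unsqueeze(-1) - NF4_LEVELS).abs().argmin(dim=-1)
    # Pack two 4-bit codes per byte (low nibble first, an assumption).
    lo, hi = codes[..., 0::2], codes[..., 1::2]
    packed = (lo | (hi << 4)).to(torch.uint8)
    return packed, scale.squeeze(-1).float()
```

A non-uniform codebook spends its 16 levels where roughly Gaussian-distributed activations concentrate, which is presumably what "Gaussian-friendly levels" refers to.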
triton_reshape_and_cache_flash_per_token_head_quant
triton_reshape_and_cache_flash_per_token_head_quant(
key: Tensor,
value: Tensor,
key_cache: Tensor,
value_cache: Tensor,
k_scale_cache: Tensor,
v_scale_cache: Tensor,
slot_mapping: Tensor,
)
Quantize key/value per (token, head) and write to the paged cache.
Computes one scale = absmax / QUANT_MAX per (token, head), stores quantized data in key_cache/value_cache, and stores the float32 scale in k_scale_cache/v_scale_cache.
The quantization range (QUANT_MIN, QUANT_MAX) is derived from the cache tensor dtype, so the same code path works for both int8 and fp8 caches.
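A minimal PyTorch sketch of this scheme, assuming int8 and float8_e4m3fn as the supported cache dtypes; the helper names quant_max_for and quantize_per_token_head are illustrative, and the paged-cache write via slot_mapping is omitted:

```python
import torch

def quant_max_for(dtype: torch.dtype) -> float:
    # QUANT_MAX derived from the cache dtype (assumed dtype set).
    if dtype == torch.int8:
        return 127.0
    if dtype == torch.float8_e4m3fn:
        return torch.finfo(dtype).max  # 448.0 for e4m3fn
    raise ValueError(f"unsupported cache dtype: {dtype}")

def quantize_per_token_head(x: torch.Tensor, cache_dtype: torch.dtype):
    """x: [num_tokens, num_heads, head_size]."""
    qmax = quant_max_for(cache_dtype)
    # One float32 scale per (token, head): absmax / QUANT_MAX,
    # as stored in k_scale_cache / v_scale_cache.
    scale = (x.abs().amax(dim=-1, keepdim=True).float() / qmax).clamp(min=1e-8)
    q = (x / scale).clamp(-qmax, qmax)
    if cache_dtype == torch.int8:
        q = q.round()  # fp8 rounds during the cast below
    return q.to(cache_dtype), scale.squeeze(-1)
```

Storing a single float32 scale per (token, head) keeps dequantization in the attention kernel cheap: one multiply per head per token rather than one per element group.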