What is CVE-2026-53923?

CVE-2026-53923 is a medium-severity security vulnerability in vllm (pip), affecting versions >= 0.5.5, <= 0.23.0. No fixed version is listed yet.

Which versions of vllm are affected by CVE-2026-53923?

vllm (pip) versions >= 0.5.5, <= 0.23.0 are affected.

Is there a fix for CVE-2026-53923?

No fixed version is listed for CVE-2026-53923 yet. Monitor the advisory for updates and apply mitigations in the interim.

Is CVE-2026-53923 exploitable, and should I be worried?

Whether CVE-2026-53923 is exploitable in your environment depends on whether the vulnerable code is present and reachable. A CVSS score is a worst-case rating; it does not account for your specific deployment, configuration, or usage patterns. Kodem, an Intelligent Application Security platform, uses runtime intelligence to show which vulnerabilities actually execute in production, so you can focus on the ones that represent real risk.

What actually determines whether CVE-2026-53923 is exploitable, and how bad it is?

Exploitability and impact are not fixed properties of a CVE. They depend on runtime truth: whether the vulnerable code is present, reachable, and actually executes in your application. A high CVSS score on a dependency that never runs is not the same as real risk. Kodem, an Intelligent Application Security platform, uses runtime intelligence to reveal which vulnerabilities actually execute in production, so teams prioritize the ones that genuinely matter.

CVE-2026-53923 in vllm | Intelligent Application Security

Summary

Integer truncation of tensor dimensions in vLLM's GGUF dequantize kernels (csrc/quantization/gguf/gguf_kernel.cu) causes partial tensor processing. The output tensor is allocated at full size via torch::empty (uninitialized memory), but the dequantize CUDA kernel processes only a truncated number of elements. The unfilled portion of the output tensor retains whatever was previously in GPU memory. In multi-tenant inference deployments, this residual GPU memory may contain tensor data from other users' inference requests, constituting information disclosure.

Root Cause

The to_cuda_ggml_t function pointer type at ggml-common.h:1067 declares its element count parameter as int (32-bit):

using to_cuda_ggml_t = void (*)(const void * __restrict__ x,
                                dst_t * __restrict__ y,
                                int k,              // 32-bit
                                cudaStream_t stream);

All dequantize kernel functions (dequantize_block_cuda, dequantize_row_q2_K_cuda, etc. in dequantize.cuh) inherit this int k parameter and use it both to compute the kernel launch grid size and as the element bound passed to the kernel:

static void dequantize_block_cuda(..., const int k, cudaStream_t stream) {
    const int num_blocks = (k + 2*CUDA_DEQUANTIZE_BLOCK_SIZE - 1) / (2*CUDA_DEQUANTIZE_BLOCK_SIZE);
    dequantize_block<<<num_blocks, CUDA_DEQUANTIZE_BLOCK_SIZE, 0, stream>>>(vx, y, k);
}
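The grid-size arithmetic above can be checked on the host. The following is a minimal sketch, not vLLM code: kDequantizeBlockSize is an assumed stand-in for CUDA_DEQUANTIZE_BLOCK_SIZE (the value 256 is an assumption), and num_blocks_for is a hypothetical name that mirrors the num_blocks expression.

```cpp
// Assumed stand-in for CUDA_DEQUANTIZE_BLOCK_SIZE; the real value lives in the vLLM sources.
constexpr int kDequantizeBlockSize = 256;

// Hypothetical helper mirroring the num_blocks expression in dequantize_block_cuda.
// The grid size is derived entirely from the 32-bit k it receives.
constexpr int num_blocks_for(int k) {
    return (k + 2 * kDequantizeBlockSize - 1) / (2 * kDequantizeBlockSize);
}
```

Under these assumptions a k of 0 yields a grid of 0 blocks, so a caller whose element count has wrapped to 0 launches no useful work at all.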

In ggml_dequantize() at gguf_kernel.cu:85, the caller passes m * n (an int64_t product) to this int k parameter:

at::Tensor DW = torch::empty({m, n}, options);    // line 80: full-size, UNINITIALIZED
// ...
to_cuda((void*)W.data_ptr(), (scalar_t*)DW.data_ptr(), m * n, stream);  // line 85: m*n truncated to int

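When m * n > INT_MAX, the product is narrowed to 32 bits, so k no longer matches the actual tensor size (on typical compilers it keeps only the low 32 bits and can even wrap to zero or a negative value). The kernel processes at most k elements. The remaining elements in DW, (m * n) - k of them when k is positive, are never written and contain stale GPU memory.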
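The narrowing can be modeled with plain 64-bit arithmetic. This is an illustrative sketch only: narrow_to_int32 and unwritten_elements are hypothetical names, and the model assumes the usual two's-complement behavior, where converting to int keeps the low 32 bits.

```cpp
#include <cstdint>

// Models two's-complement narrowing of a 64-bit count to a 32-bit int.
constexpr std::int64_t narrow_to_int32(std::int64_t v) {
    const std::int64_t low = v & 0xFFFFFFFFLL;                 // keep the low 32 bits
    return low >= 0x80000000LL ? low - 0x100000000LL : low;    // reinterpret as signed
}

// Number of elements of an m x n output that a kernel bounded by the
// narrowed k would leave unwritten (a non-positive k processes nothing).
constexpr std::int64_t unwritten_elements(std::int64_t m, std::int64_t n) {
    const std::int64_t total = m * n;
    const std::int64_t k = narrow_to_int32(total);
    const std::int64_t processed = k > 0 ? k : 0;
    return total - processed;
}
```

For a 65536 x 65536 tensor the model gives k = 0, so every element of the output is left unwritten.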

This is a single root cause -- the int type on the k parameter in to_cuda_ggml_t -- with a single fix: change int k to int64_t k. All dequantize functions inherit this type through the same typedef.
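Until a fixed release is available, a caller could refuse element counts that do not fit in int. The sketch below shows one way to write that check; it is not the upstream fix, and element_count_fits_int is a hypothetical helper name.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical guard: true only when m * n can be passed as an int without
// truncation. It divides rather than multiplies so the check itself cannot overflow.
constexpr bool element_count_fits_int(std::int64_t m, std::int64_t n) {
    if (m < 0 || n < 0) return false;      // negative dimensions are never valid
    if (m == 0 || n == 0) return true;     // empty tensors trivially fit
    return m <= std::numeric_limits<int>::max() / n;
}
```

A caller would raise an error before launching the kernel when this returns false, turning the silent partial fill into a visible failure.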

Affected Functions

All in csrc/quantization/gguf/gguf_kernel.cu:

| Function | Line | Allocation | Info Disclosure? |
| --- | --- | --- | --- |
| `ggml_dequantize` | 74 | `torch::empty({m, n})` at line 80 | Yes -- `m*n` truncated to `int k` at line 85 |
| `ggml_mul_mat_vec_a8` | 91 | `torch::empty({vecs, row})` at line 99 | Yes -- `int col = X.sizes()[1]` at line 94 |
| `ggml_mul_mat_a8` | 207 | `torch::empty({batch, row})` at line 215 | Yes -- `int col = X.sizes()[1]` at line 210 |
| `ggml_moe_a8` | 279 | `torch::empty({tokens*top_k, row})` at line 289 | Yes -- `int col = X.sizes()[1]` at line 285 |

All four functions allocate output tensors with torch::empty (uninitialized) and then run CUDA kernels that use truncated dimension values as loop bounds. The unfilled portion of each output tensor retains stale GPU memory.

ggml_moe_a8_vec (line 382) uses torch::zeros instead of torch::empty, so it is not affected by the info disclosure variant.

Impact: Information Disclosure in Multi-Tenant Serving

vLLM is designed for multi-tenant inference serving. GPU memory is reused across requests from different users. When the dequantize kernel partially fills an output tensor:

1. The output tensor DW is allocated with torch::empty -- the buffer contains whatever was previously in that GPU memory region.
2. The dequantize kernel fills only a truncated portion of the buffer.
3. The unfilled portion retains residual data from prior GPU operations, which may include tensor data from other users' inference requests.
4. The contaminated tensor proceeds through the model computation.
5. No error or warning is generated -- the partial fill is silent.
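The sequence above can be imitated on the host with a small fixed-size buffer. This is only an analogy under stated assumptions, not GPU code: Buffer, kStale, and reuse_and_fill are hypothetical names, kStale stands in for another request's leftover data, and the zero_first flag imitates the difference between torch::zeros and torch::empty.

```cpp
#include <cstddef>

constexpr std::size_t kSize = 8;      // tiny stand-in for the full tensor size
constexpr float kStale = -1.0f;       // marks data left behind by a previous user

struct Buffer { float data[kSize]; };

// Reuses a buffer that still holds stale values, optionally zeroes it first,
// then writes only the first `processed` elements, as a truncated kernel would.
constexpr Buffer reuse_and_fill(std::size_t processed, bool zero_first) {
    Buffer buf{};
    for (std::size_t i = 0; i < kSize; ++i) buf.data[i] = kStale;   // prior contents
    if (zero_first) {
        for (std::size_t i = 0; i < kSize; ++i) buf.data[i] = 0.0f; // zeroed allocation
    }
    for (std::size_t i = 0; i < processed && i < kSize; ++i) buf.data[i] = 1.0f;
    return buf;
}

// Counts elements that still expose the previous user's data.
constexpr std::size_t count_stale(const Buffer& buf) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < kSize; ++i) {
        if (buf.data[i] == kStale) ++n;
    }
    return n;
}
```

In this model, a partial fill without zeroing leaves stale values in the tail, while zeroing first leaves none. That matches the distinction the advisory draws for functions that allocate with torch::zeros.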

This is a confidentiality violation. In shared inference deployments (the primary vLLM use case), one user's inference data can leak into another user's model computation through residual GPU memory.

Attacker Control

The attacker crafts a GGUF model file with weight tensor dimensions whose product exceeds INT_MAX (e.g., a matrix with shape [65536, 65536] gives m * n = 4,294,967,296 = 2^32, whose low 32 bits are all zero). The model is hosted on HuggingFace or any model hub. The victim loads the model with vLLM for inference serving. The truncation happens automatically during model weight dequantization, with no further attacker interaction.

Impact

Affected versions

vllm (>= 0.5.5, <= 0.23.0)

Security releases

Not available

Kodem intelligence

Severity tells you how bad this could be in the worst case. It does not tell you whether you are exposed. Exploitability and impact are functions of runtime truth: whether the vulnerable code is present, reachable, and actually executes in your application. A vulnerable package can sit in your dependency tree and never run.

Kodem, an Intelligent Application Security platform, uses runtime intelligence to reveal which vulnerabilities actually execute in production, so teams prioritize the ones that genuinely matter. Kodem's runtime-powered SCA identifies whether this CVE is reachable in your applications.

Remediation advice

A fix for this vulnerability was added here: https://github.com/vllm-project/vllm/pull/44971

Other vulnerabilities in vllm

CVE-2026-54233 CVE-2026-54236 CVE-2026-53923 CVE-2026-12491 CVE-2026-48746