Hugging Face Datasets and Tokenizers in JavaScript: Security Issues for AI Pipelines

This series shows how vulnerabilities propagate through the stack and provides a framework for defending AI applications in production.

Written by: Mahesh Babu
Published on: September 8, 2025
Topic: Application Security

Introduction

Hugging Face Datasets and Tokenizers.js are integral to many JavaScript and TypeScript AI pipelines. They handle ingestion, normalization, and preprocessing of text data. These libraries look like passive utilities, but they process untrusted input and can introduce critical security issues at the application layer.

Malicious Dataset Injection

Hugging Face Datasets allows datasets to be loaded directly from the Hub. Attackers have uploaded poisoned datasets containing adversarial samples and malicious metadata. In one case, metadata fields included embedded escape sequences that broke JSON parsers, and the resulting failures leaked system-level error messages. Applications that pull datasets into preprocessing pipelines without validation are exposed to denial of service and to data leakage through error output.
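One way to reduce this exposure is to treat every dataset row as untrusted input. The sketch below is a minimal illustration, not part of any Hugging Face API: the `DatasetRecord` shape, the `parseRecord` helper, and the length limit are all assumptions. It parses one JSON line, returns a fixed reason instead of the parser's error text, and rejects fields that contain control characters such as terminal escape sequences.

```typescript
// Hypothetical record shape; a real dataset needs its own schema.
interface DatasetRecord {
  text: string;
  label: string;
}

type ParseResult =
  | { ok: true; record: DatasetRecord }
  | { ok: false; reason: string };

// Assumed cap on line length (UTF-16 code units); tune per dataset.
const MAX_LINE_LENGTH = 64 * 1024;

// C0 control characters except tab, newline, and carriage return, plus DEL.
const CONTROL_CHARS = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/;

function parseRecord(line: string): ParseResult {
  if (line.length > MAX_LINE_LENGTH) {
    return { ok: false, reason: "line too long" };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    // Return a fixed message; never forward the parser's own error text.
    return { ok: false, reason: "invalid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, reason: "not an object" };
  }
  const { text, label } = parsed as Record<string, unknown>;
  if (typeof text !== "string" || typeof label !== "string") {
    return { ok: false, reason: "missing or non-string field" };
  }
  if (CONTROL_CHARS.test(text) || CONTROL_CHARS.test(label)) {
    return { ok: false, reason: "control characters in field" };
  }
  return { ok: true, record: { text, label } };
}
```

Rejected rows can be counted and dropped rather than thrown, so a single malformed record does not stop the whole ingestion job.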

Tokenizer Vulnerabilities

Tokenizers.js splits text into subword units. Improper handling of malformed Unicode sequences can trigger buffer overflows in WASM-based tokenizers. In 2022, a proof-of-concept showed how specially crafted Unicode payloads could crash applications and, in some cases, corrupt memory beyond the tokenizer sandbox. For production AI applications, this is both a reliability issue and a security issue.
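A defensive wrapper can screen input before it reaches the tokenizer. The following is a sketch under stated assumptions: `Encode` is a stand-in for whatever encode function the pipeline actually uses (it is not a real Tokenizers.js signature), and the length limit is arbitrary. It rejects oversized strings and strings with unpaired UTF-16 surrogates, normalizes to NFC, and turns any failure into a `null` result instead of an unhandled crash.

```typescript
// Assumed upper bound on input size, in UTF-16 code units.
const MAX_INPUT_LENGTH = 10000;

// Returns true if the string contains an unpaired UTF-16 surrogate.
function hasLoneSurrogate(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    if (code >= 0xd800 && code <= 0xdbff) {
      // High surrogate: the next code unit must be a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) return true;
      i++; // skip the valid low surrogate
    } else if (code >= 0xdc00 && code <= 0xdfff) {
      return true; // low surrogate without a preceding high surrogate
    }
  }
  return false;
}

function prepareForTokenizer(input: string): string {
  if (input.length > MAX_INPUT_LENGTH) {
    throw new RangeError("input exceeds maximum length");
  }
  if (hasLoneSurrogate(input)) {
    throw new TypeError("input contains malformed UTF-16");
  }
  return input.normalize("NFC");
}

// Hypothetical tokenizer signature, used only for illustration.
type Encode = (text: string) => number[];

// Returns token ids, or null if validation or tokenization fails.
function safeEncode(encode: Encode, input: string): number[] | null {
  try {
    return encode(prepareForTokenizer(input));
  } catch {
    return null;
  }
}
```

Newer JavaScript runtimes also provide `String.prototype.isWellFormed()`, which could replace the manual surrogate scan where it is available.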

Data Leakage in Preprocessing Pipelines

AI preprocessing pipelines often log intermediate outputs for debugging. When Hugging Face Datasets are used without sanitization, sensitive personally identifiable information (PII) and credentials can be written to logs in plain text. In one real incident, a pipeline ingesting customer support chat logs leaked user credentials into system logs during preprocessing.
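A common mitigation is to pass every sample through a redaction step before it is logged. The sketch below is illustrative only: the patterns in `REDACTION_RULES` are assumptions that cover a few obvious cases (email addresses, `key: value` style credentials, card-like digit runs) and are not a complete PII detector. The `logSample` helper is hypothetical.

```typescript
// Illustrative patterns only; real pipelines need a vetted PII detector.
const REDACTION_RULES: Array<[RegExp, string]> = [
  // Email addresses.
  [/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, "[EMAIL]"],
  // "password: value", "token=value", and similar; keeps the key, masks the value.
  [/\b(password|passwd|pwd|token|api[_-]?key|secret)\b(\s*[:=]\s*)\S+/gi, "$1$2[REDACTED]"],
  // Runs of 13 to 16 digits, optionally separated by spaces or hyphens.
  [/\b\d(?:[ -]?\d){12,15}\b/g, "[CARD]"],
];

function redact(text: string): string {
  return REDACTION_RULES.reduce(
    (acc, [pattern, replacement]) => acc.replace(pattern, replacement),
    text,
  );
}

// Logs a redacted, truncated preview of a sample and returns the logged line.
function logSample(stage: string, sample: string): string {
  const line = `[${stage}] ${redact(sample).slice(0, 200)}`;
  console.log(line);
  return line;
}
```

Truncating the preview limits how much of any one record reaches the log even when a pattern misses something.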

MITRE ATT&CK Mapping

| Threat Vector | MITRE Technique(s) | Example |
| --- | --- | --- |
| Poisoned datasets | T1565 – Data Manipulation | Malicious Hugging Face dataset metadata causing DoS and error leaks |
| Tokenizer buffer overflow | T1203 – Exploitation for Client Execution | Malformed Unicode payload crashing WASM tokenizer |
| Sensitive data leakage | T1530 – Data from Cloud Storage | Preprocessing logs exposing customer credentials |

Conclusion

Datasets and Tokenizers.js introduce underestimated risks in AI pipelines. Poisoned datasets, buffer overflows in WASM tokenizers, and uncontrolled data leakage all compromise application security. Product security teams must enforce dataset provenance, validate Unicode input handling, and prevent sensitive logging at preprocessing stages.
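As one possible way to enforce dataset provenance, a pipeline can refuse to fetch anything that is not on a reviewed allowlist and pinned to a full commit hash. The sketch below assumes the Hub's `resolve` URL layout for dataset files; the repository name, the commit hash, and the `pinnedDatasetUrl` helper are hypothetical placeholders.

```typescript
// Hypothetical allowlist: repo id -> full commit hash reviewed by the team.
const APPROVED_DATASETS = new Map<string, string>([
  ["example-org/support-chats", "0123456789abcdef0123456789abcdef01234567"],
]);

// A pinned revision must be a full 40-character commit hash, not a branch name.
const COMMIT_SHA = /^[0-9a-f]{40}$/;

function pinnedDatasetUrl(repoId: string, filePath: string): string {
  const revision = APPROVED_DATASETS.get(repoId);
  if (revision === undefined) {
    throw new Error(`dataset not approved: ${repoId}`);
  }
  if (!COMMIT_SHA.test(revision)) {
    throw new Error("revision must be a full commit hash");
  }
  const parts = filePath.split("/");
  if (parts.some((part) => part === "" || part === "." || part === "..")) {
    throw new Error("invalid file path");
  }
  const encodedPath = parts.map((part) => encodeURIComponent(part)).join("/");
  return `https://huggingface.co/datasets/${repoId}/resolve/${revision}/${encodedPath}`;
}
```

Pinning to a commit means a later change to the dataset does not reach production until someone reviews it and updates the allowlist.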

References

  • Hugging Face. (2024). Datasets security practices. Hugging Face Documentation. https://huggingface.co/docs/datasets
  • MITRE ATT&CK®. (2024). ATT&CK Techniques. MITRE. https://attack.mitre.org/
  • Wichers, D. (2022). Top 10 Web Application Security Risks. OWASP. https://owasp.org/Top10
  • SecurityWeek. (2022, July 5). New vulnerabilities found in WebAssembly runtimes. SecurityWeek. https://www.securityweek.com

Blog written by Mahesh Babu, Head of Marketing
