Hugging Face Datasets and Tokenizers in JavaScript: Security Issues for AI Pipelines


Introduction
Hugging Face Datasets and Tokenizers.js are integral to many JavaScript and TypeScript AI pipelines, where they handle ingestion, normalization, and preprocessing of text data. They are often treated as trusted plumbing, yet they process untrusted input and can introduce critical security issues at the application layer.
Malicious Dataset Injection
Hugging Face Datasets allows datasets to be loaded directly from the Hub. Attackers have uploaded poisoned datasets containing adversarial samples and malicious metadata. In one case, metadata fields included embedded escape sequences that broke JSON parsers and caused system-level error messages to leak. Applications that pulled datasets into preprocessing pipelines without validation became vulnerable to denial of service and data leakage.
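One way to reduce this exposure is to treat dataset metadata as untrusted input: cap its size, parse it inside a guarded block, return only generic errors, and copy out an allow-list of validated fields. The following is a minimal sketch under stated assumptions. The `DatasetMetadata` shape, the `parseDatasetMetadata` helper, and the size and length limits are hypothetical and are not part of any Hugging Face API.

```typescript
// Hypothetical shape of the metadata this pipeline expects; not a Hugging Face type.
interface DatasetMetadata {
  name: string;
  license: string;
}

type ParseResult =
  | { ok: true; value: DatasetMetadata }
  | { ok: false; error: string };

// Assumed cap on metadata length (UTF-16 code units); tune per pipeline.
const MAX_METADATA_CHARS = 64 * 1024;

// C0/C1 control characters, including ESC (0x1B), except tab, newline, and CR.
const CONTROL_CHARS = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/;

function isCleanString(value: unknown, maxLen: number): value is string {
  return (
    typeof value === "string" &&
    value.length > 0 &&
    value.length <= maxLen &&
    !CONTROL_CHARS.test(value)
  );
}

function parseDatasetMetadata(raw: string): ParseResult {
  if (raw.length > MAX_METADATA_CHARS) {
    return { ok: false, error: "metadata too large" };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    // Generic message only: parser details stay out of responses and logs.
    return { ok: false, error: "metadata is not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, error: "metadata must be a JSON object" };
  }
  const record = parsed as Record<string, unknown>;
  const name = record["name"];
  const license = record["license"];
  if (!isCleanString(name, 128) || !isCleanString(license, 64)) {
    return { ok: false, error: "metadata field failed validation" };
  }
  // Copy only the allow-listed fields; unknown keys are dropped.
  return { ok: true, value: { name, license } };
}
```

Returning a result object instead of throwing keeps parser internals out of stack traces that might reach logs or API responses, and rejecting control characters stops escape sequences from reaching terminals or downstream parsers.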
Tokenizer Vulnerabilities
Tokenizers.js splits text into subword units. Improper handling of malformed Unicode sequences can trigger buffer overflows in WASM-based tokenizers. In 2022, a proof-of-concept showed how specially crafted Unicode payloads could crash applications and, in some cases, corrupt memory beyond the tokenizer sandbox. For production AI applications, this is both a reliability issue and a security issue.
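A defensive step is to reject malformed or oversized input before it reaches the tokenizer at all. The sketch below checks for unpaired UTF-16 surrogates, applies a length cap, and normalizes to NFC. The `prepareForTokenizer` and `hasLoneSurrogate` helpers and the `MAX_INPUT_CODE_UNITS` limit are hypothetical names chosen for illustration; this is an input gate in front of a tokenizer, not a fix for the tokenizer itself.

```typescript
// Assumed cap on input length (UTF-16 code units); tune per workload.
const MAX_INPUT_CODE_UNITS = 100000;

// Returns true if the string contains a surrogate that is not part of a valid pair.
function hasLoneSurrogate(text: string): boolean {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // A high surrogate must be followed by a low surrogate.
      const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) {
        return true;
      }
      i++; // skip the low surrogate of a valid pair
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // A low surrogate with no preceding high surrogate.
      return true;
    }
  }
  return false;
}

// Validates and normalizes text before handing it to a tokenizer.
function prepareForTokenizer(text: string): string {
  if (text.length > MAX_INPUT_CODE_UNITS) {
    throw new RangeError("input exceeds maximum length");
  }
  if (hasLoneSurrogate(text)) {
    throw new TypeError("input is not well-formed Unicode");
  }
  return text.normalize("NFC");
}
```

When input arrives as raw bytes rather than a JavaScript string, decoding with `new TextDecoder("utf-8", { fatal: true })` serves a similar purpose, since it throws on malformed UTF-8 instead of silently substituting replacement characters.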
Data Leakage in Preprocessing Pipelines
AI preprocessing pipelines often log intermediate outputs. When records loaded through Hugging Face Datasets are logged without sanitization, sensitive personally identifiable information (PII) can end up in logs in plain text. In one real incident, a pipeline ingesting customer support chat logs leaked user credentials into system logs during preprocessing.
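A common mitigation is to pass every sample through a redaction step before it reaches a log sink. The sketch below is illustrative only: the `redact` and `logSample` helpers are hypothetical, and the two regular expressions (email addresses and `key=value` style credentials) are assumed examples that would miss many real PII formats. A production pipeline would need a broader, tested rule set.

```typescript
// Illustrative redaction rules; assumed patterns, not an exhaustive PII detector.
const REDACTION_RULES: ReadonlyArray<{ pattern: RegExp; replacement: string }> = [
  {
    // Email addresses.
    pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g,
    replacement: "[EMAIL]",
  },
  {
    // Credentials written as "password: value", "token=value", and similar.
    pattern: /\b(password|passwd|token|api[_-]?key)\b(\s*[:=]\s*)\S+/gi,
    replacement: "$1$2[REDACTED]",
  },
];

// Applies every redaction rule in order and returns the sanitized text.
function redact(text: string): string {
  return REDACTION_RULES.reduce(
    (acc, rule) => acc.replace(rule.pattern, rule.replacement),
    text,
  );
}

// Logs a short, redacted preview of a sample instead of the raw record.
// Redaction runs before truncation so a secret is never cut in half and leaked.
function logSample(
  stage: string,
  sample: string,
  sink: (line: string) => void,
): void {
  const preview = redact(sample).slice(0, 200);
  sink(`[${stage}] ${preview}`);
}
```

For example, `logSample("normalize", record.text, console.log)` would write only the sanitized preview. Passing the sink as a parameter keeps the helper testable and makes it harder for a preprocessing stage to bypass redaction by calling the logger directly.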
MITRE ATT&CK Mapping
Conclusion
Datasets and Tokenizers.js introduce underestimated risks in AI pipelines. Poisoned datasets, buffer overflows in WASM tokenizers, and uncontrolled data leakage all compromise application security. Product security teams must enforce dataset provenance, validate Unicode input handling, and prevent sensitive logging at preprocessing stages.
References
- Hugging Face. (2024). Datasets security practices. Hugging Face Documentation. https://huggingface.co/docs/datasets
- MITRE ATT&CK®. (2024). ATT&CK Techniques. MITRE. https://attack.mitre.org/
- Wichers, D. (2022). Top 10 Web Application Security Risks. OWASP. https://owasp.org/Top10
- SecurityWeek. (2022, July 5). New vulnerabilities found in WebAssembly runtimes. SecurityWeek. https://www.securityweek.com