Fix platform-dependent String.getBytes() calls to use explicit UTF-8 charset #10671

Open
saravadeo wants to merge 3 commits into DataDog:master from saravadeo:fix/explicit-charset-in-getbytes-calls

@saravadeo

Summary

Specify StandardCharsets.UTF_8 in String.getBytes() calls used with MessageDigest and other encoding-sensitive APIs. Without an explicit charset, getBytes() uses the JVM's default charset, which can vary across systems and JDK versions (e.g., UTF-8 on most Linux setups vs. Windows-1252 on Windows with JDKs before 18, where UTF-8 became the default) and produce inconsistent results.
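To make the divergence concrete, here is a minimal sketch that encodes the same non-ASCII string with two explicit charsets. The class and method names are hypothetical and not part of this PR; ISO-8859-1 is used as a stand-in for a legacy single-byte default such as Windows-1252, because it is guaranteed to be available on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: class and method names are hypothetical, not from the PR.
final class CharsetDivergence {

  // Encodes with an explicit charset, so the result does not depend on the JVM default.
  static byte[] encode(String text, Charset charset) {
    return text.getBytes(charset);
  }

  public static void main(String[] args) {
    // "jos" followed by U+00E9, written as an escape so the source file encoding does not matter.
    String userId = "jos\u00e9";

    // UTF-8 needs two bytes for U+00E9, giving 5 bytes in total.
    byte[] utf8 = encode(userId, StandardCharsets.UTF_8);
    // ISO-8859-1 (stand-in for a legacy default) needs one byte, giving 4 bytes in total.
    byte[] latin1 = encode(userId, StandardCharsets.ISO_8859_1);

    System.out.println("JVM default charset: " + Charset.defaultCharset());
    System.out.println("UTF-8 bytes:      " + Arrays.toString(utf8));
    System.out.println("ISO-8859-1 bytes: " + Arrays.toString(latin1));
    System.out.println("Same bytes? " + Arrays.equals(utf8, latin1));
  }
}
```

A no-argument getBytes() call would return whichever of these byte sequences matches the charset printed on the first line, which is why the output of downstream APIs can change from host to host.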

Changes

AppSecEventTracker.anonymize() (internal-api)

  • Bug fix: userId.getBytes() → userId.getBytes(StandardCharsets.UTF_8)
  • User ID anonymization hashes are now consistent across all platforms, even for non-ASCII user IDs
  • Resolved the TODO about MessageDigest caching with a clarifying comment referencing micro-benchmark data showing negligible overhead of getInstance()
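The effect on hashing can be sketched as follows. This is not the code in AppSecEventTracker: the class name, the helper, and the choice of SHA-256 with a full hex digest are assumptions made for illustration, since the description above does not state the algorithm or output format.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch; not the actual AppSecEventTracker.anonymize() implementation.
final class AnonymizeSketch {

  // Hashes the user ID after encoding it with the given charset.
  // SHA-256 is an assumption chosen for illustration.
  static String sha256Hex(String userId, Charset charset) {
    try {
      // A fresh instance per call; MessageDigest instances are not thread-safe.
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(userId.getBytes(charset));
      StringBuilder hex = new StringBuilder(hash.length * 2);
      for (byte b : hash) {
        hex.append(String.format("%02x", b & 0xff));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform is required to provide SHA-256.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String userId = "jos\u00e9"; // non-ASCII user ID

    // Same input string, different byte encodings, therefore different digests.
    System.out.println("UTF-8:      " + sha256Hex(userId, StandardCharsets.UTF_8));
    System.out.println("ISO-8859-1: " + sha256Hex(userId, StandardCharsets.ISO_8859_1));
  }
}
```

Pinning the charset means the same user ID maps to the same anonymized value regardless of which host computed it, which matters if hashes from different services are ever compared.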

Fingerprinter (agent-debugger)

  • getBytes() → getBytes(StandardCharsets.UTF_8) for exception fingerprint hashing

JsonStreamParser (dd-trace-core)

  • raw.getBytes() → raw.getBytes(StandardCharsets.UTF_8) (JSON is UTF-8 by specification)
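A small sketch of why this matters when the consumer of the bytes assumes UTF-8. The class and helper below are hypothetical and do not reflect how JsonStreamParser is structured; they only encode a JSON string with a chosen charset and decode the bytes as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the actual JsonStreamParser code.
final class JsonBytesSketch {

  // Encodes the JSON text with the given charset, then decodes it as UTF-8,
  // mimicking a producer and a consumer that may disagree on the encoding.
  static String roundTrip(String json, Charset encodeWith) {
    byte[] bytes = json.getBytes(encodeWith);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String raw = "{\"name\":\"Zo\u00eb\"}"; // contains U+00EB

    // Encoding and decoding both use UTF-8, so the text survives unchanged.
    System.out.println(raw.equals(roundTrip(raw, StandardCharsets.UTF_8)));

    // Encoding with a single-byte charset produces bytes that are not valid UTF-8
    // for the non-ASCII character, so the decoded text no longer matches.
    System.out.println(raw.equals(roundTrip(raw, StandardCharsets.ISO_8859_1)));
  }
}
```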

LLMObsSpanMapper (dd-trace-core)

  • getKey().getBytes() → getKey().getBytes(StandardCharsets.UTF_8) (the method is writeUTF8(), so the bytes should actually be UTF-8)

Testing

  • All existing tests pass (AppSecEventTrackerSpecification, Fingerprinter, JsonStreamParser, LLMObsSpanMapper)
  • For ASCII-only strings (which existing tests use), behavior is unchanged since UTF-8 and most default charsets encode ASCII identically
  • The fix matters for non-ASCII characters (e.g., Unicode user IDs) where platform charsets diverge
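The reason ASCII-only fixtures cannot catch this class of bug can be checked with a short sketch like the one below. The class and helper are hypothetical and not part of the PR or its test suite; the three charsets are simply the standard ones guaranteed on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the PR's tests.
final class CharsetAgreementCheck {

  private static final Charset[] CANDIDATES = {
    StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8
  };

  // Returns true when every candidate charset encodes the text to the same bytes.
  static boolean encodesIdentically(String text) {
    byte[] reference = text.getBytes(CANDIDATES[0]);
    for (Charset charset : CANDIDATES) {
      if (!Arrays.equals(reference, text.getBytes(charset))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // ASCII-only input: all three charsets agree, so a test using it cannot fail.
    System.out.println(encodesIdentically("user-1234"));
    // Non-ASCII input: the encodings diverge, which is the case the fix targets.
    System.out.println(encodesIdentically("us\u00e9r"));
  }
}
```

A regression test that should fail without the fix would need at least one non-ASCII input and an expected value computed from UTF-8 bytes.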

@saravadeo
Author

Hi maintainers 👋 Could you please add the appropriate labels? I'd suggest:

  • comp: core (dd-trace-core changes)
  • comp: appsec (AppSecEventTracker change)
  • comp: debugger (Fingerprinter change)
  • type: bugfix

Thank you!

…charset

Specify StandardCharsets.UTF_8 in String.getBytes() calls used with
MessageDigest and other encoding-sensitive APIs. Without an explicit
charset, getBytes() uses the platform's default charset, which can
vary across systems and produce inconsistent results.

Files changed:
- AppSecEventTracker: user ID anonymization hash now uses UTF-8,
  ensuring consistent hashing across all platforms. Also resolved
  the TODO about MessageDigest caching with a clarifying comment
  referencing micro-benchmark data showing negligible overhead.
- Fingerprinter: exception fingerprint hashes now use UTF-8.
- JsonStreamParser: JSON byte conversion now uses UTF-8 (JSON spec).
- LLMObsSpanMapper: writeUTF8() now receives actual UTF-8 bytes.
saravadeo force-pushed the fix/explicit-charset-in-getbytes-calls branch from 8afdff4 to 0c152d3 (February 24, 2026 17:07)
saravadeo marked this pull request as ready for review (February 25, 2026 03:17)
saravadeo requested review from a team as code owners (February 25, 2026 03:17)
saravadeo requested review from daniel-romano-DD, evanchooly and manuel-alvarez-alvarez and removed request for a team (February 25, 2026 03:17)