nicholashoule/demojify-sanitize

demojify-sanitize

A dependency-free Go module for auditing, detecting, removing, and substituting emoji clutter and redundant whitespace in text content before it reaches production. Use it as a post-processing step after AI agent output, as a content gate in your request pipeline, or as a CI quality gate. One call to Sanitize strips emoji and normalizes whitespace in a single pass; Replace maps emoji to meaningful text equivalents; ScanDir audits entire directory trees (calling ContainsEmoji internally per file); and ContainsEmoji is available directly for ad-hoc single-string detection.

Features

  • Emoji removal -- strips all emoji and pictographic codepoints using compiled Unicode range tables; ZWJ sequences, variation selectors, and tag characters handled correctly
  • Whitespace normalization -- collapses redundant inline spaces and blank lines while preserving leading indentation
  • Configurable pipeline -- Sanitize runs removal and normalization in one call; AllowedRanges and AllowedEmojis let callers preserve specific codepoints
  • Substitution -- Replace maps ~137 built-in emoji to readable text equivalents (e.g., [PASS], [FAIL]); custom maps supported
  • Metrics -- SanitizeReport returns emoji count removed and bytes saved alongside the cleaned text
  • Streaming -- SanitizeReader processes io.Reader line by line; supports lines up to 1 MiB
  • JSON-aware -- SanitizeJSON cleans string values only; preserves keys, numbers, booleans, null, and numeric precision
  • Directory scanner -- ScanDir / ScanDirContext walk an entire tree and return per-file findings; cancellation supported
  • Atomic writes -- SanitizeFile, ReplaceFile, WriteFinding, and FixDir write through a temp file and rename; partial writes cannot corrupt the original
  • CLI -- cmd/demojify supports audit, strip (-fix), substitute (-sub), normalize, quiet mode, extension filter, and directory skip
  • Zero external dependencies -- pure stdlib; no go.sum required
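
The whitespace normalization described in the feature list can be illustrated with the standard library alone. The following is a minimal sketch, not the library's implementation: normalizeSketch is a hypothetical name, and the exact rules (runs of spaces or tabs after the indentation collapse to one space, consecutive blank lines collapse to one, trailing whitespace is dropped) are assumptions about what "redundant" means here.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// inlineSpaces matches runs of two or more spaces or tabs.
var inlineSpaces = regexp.MustCompile(`[ \t]{2,}`)

// normalizeSketch is a hypothetical stand-in for Normalize. It keeps each
// line's leading indentation verbatim, collapses inline whitespace runs to a
// single space, and reduces consecutive blank lines to one.
func normalizeSketch(text string) string {
	var out []string
	prevBlank := false
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimRight(line, " \t")
		if line == "" {
			if prevBlank {
				continue // drop repeated blank lines
			}
			prevBlank = true
			out = append(out, "")
			continue
		}
		prevBlank = false
		// Split off the indentation so it is never collapsed.
		body := strings.TrimLeft(line, " \t")
		indent := line[:len(line)-len(body)]
		out = append(out, indent+inlineSpaces.ReplaceAllString(body, " "))
	}
	return strings.Join(out, "\n")
}

func main() {
	fmt.Printf("%q\n", normalizeSketch("    if x  ==  1 {\n\n\n\n}"))
}
```

For real use, call demojify.Normalize or demojify.Sanitize; the sketch only shows why indentation has to be split off before inline runs are collapsed.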

Installation

go get github.com/nicholashoule/demojify-sanitize

CLI

go install github.com/nicholashoule/demojify-sanitize/cmd/demojify@latest

Quick start

import demojify "github.com/nicholashoule/demojify-sanitize"

// Remove all emojis and normalize whitespace in one call.
clean := demojify.Sanitize(text, demojify.DefaultOptions())

A ready-to-run CLI example lives in cmd/demojify/main.go. It audits a directory tree for emoji, reports every occurrence with file, line, and column, and optionally rewrites affected files (-fix) or substitutes emoji with text tokens (-sub). Use -skip to exclude specific directories (e.g., dist, build) in addition to the defaults.

# Build once, then run the binary:
go build -o demojify ./cmd/demojify
./demojify -root . -sub -skip dist,build

Integration patterns

AI response post-processing

clean := demojify.Sanitize(aiResponse, demojify.DefaultOptions())

Content gate -- detect and clean

if demojify.ContainsEmoji(userInput) {
 userInput = demojify.Sanitize(userInput, demojify.DefaultOptions())
}

Directory scanner -- audit a repo in one call

cfg := demojify.DefaultScanConfig()
findings, err := demojify.ScanDir(cfg)
if err != nil {
 log.Fatal(err)
}
for _, f := range findings {
 fmt.Printf("%s: has_emoji=%v\n", f.Path, f.HasEmoji)
}

Batch fix -- scan and write back in one call

cfg := demojify.DefaultScanConfig()
fixed, _, err := demojify.FixDir(".", cfg)
if err != nil {
 log.Fatal(err)
}
fmt.Printf("fixed %d file(s)\n", fixed)

Substitution -- replace emoji with meaningful text

repl := demojify.DefaultReplacements()
clean := demojify.Replace("\u2705 tests passed, \u274c build failed", repl)
// "[PASS] tests passed, [FAIL] build failed"

Git pre-commit hook

Option A -- pre-built binary (CI, minimal setup):

go build -o .git/hooks/demojify ./cmd/demojify

Then save the following as .git/hooks/pre-commit and make it executable:

#!/bin/sh
# .git/hooks/pre-commit
root="$(git rev-parse --show-toplevel)"
"$root/.git/hooks/demojify" -root "$root" -exts .go,.md -quiet

Option B -- go run with repogov governance (recommended for in-repo hooks):

This is the pattern used in scripts/hooks/pre-commit in this repository. Repogov enforces line limits and layout rules; demojify blocks emoji. Both tools run from their published module versions -- no local clone required.

#!/bin/sh
# .git/hooks/pre-commit
root="$(git rev-parse --show-toplevel)"
cd "$root"

go run github.com/nicholashoule/repogov/cmd/repogov@v0.3.0 -root "$root" -agent copilot
repogov_exit=$?

go run github.com/nicholashoule/demojify-sanitize/cmd/demojify@v0.4.0 -root "$root"
demojify_exit=$?

exit $((repogov_exit | demojify_exit))

See docs/git-hooks.md for auto-fix, substitution, the full Go API variant, and cross-platform (macOS/Linux/Windows) examples.

Streaming sanitization

Process LLM token streams or HTTP chunked responses line by line without buffering the full input:

var out bytes.Buffer
err := demojify.SanitizeReader(llmStream, &out, demojify.DefaultOptions())

Lines up to 1 MiB are supported. Longer lines return bufio.ErrTooLong.
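
The 1 MiB limit and the bufio.ErrTooLong error match how a bufio.Scanner with an enlarged buffer behaves. The sketch below shows that pattern using only the standard library. It assumes, without confirmation from the source, that SanitizeReader works along these lines; processLines, runDemo, and the trailing-whitespace cleaner are hypothetical stand-ins.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// maxLine is the largest line buffer the scanner may grow to (1 MiB).
const maxLine = 1 << 20

// processLines copies r to w one line at a time, applying clean to each line.
// Every output line ends in "\n", even if the last input line did not.
func processLines(r io.Reader, w io.Writer, clean func(string) string) error {
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 64*1024), maxLine)
	for sc.Scan() {
		if _, err := io.WriteString(w, clean(sc.Text())+"\n"); err != nil {
			return err
		}
	}
	// sc.Err is bufio.ErrTooLong when a line does not fit in the buffer.
	return sc.Err()
}

// runDemo feeds a string through processLines with a stand-in cleaner that
// trims trailing spaces and tabs.
func runDemo(input string) (string, error) {
	var out strings.Builder
	err := processLines(strings.NewReader(input), &out, func(s string) string {
		return strings.TrimRight(s, " \t")
	})
	return out.String(), err
}

func main() {
	out, err := runDemo("first line   \nsecond line\n")
	fmt.Printf("%q %v\n", out, err)

	_, err = runDemo(strings.Repeat("x", 2*maxLine))
	fmt.Println(err == bufio.ErrTooLong)
}
```

Because only one line is held in memory at a time, memory use is bounded by the line limit rather than by the size of the stream.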

JSON value sanitization

Clean string values inside a JSON document while leaving keys, numbers, booleans, and null untouched:

clean, err := demojify.SanitizeJSON(jsonBytes, demojify.DefaultOptions())

Returns an error for invalid JSON and for input with trailing non-whitespace content after the first value (e.g., {"a":1} trailing).
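
The behavior described above (only string values change, numeric precision survives, trailing content is rejected) can be sketched with encoding/json. This is an illustration under stated assumptions, not the library's code: sanitizeJSONSketch, cleanValues, and demoJSON are hypothetical names, the cleaner is a stand-in, and unlike the real function this version re-marshals the document, so object keys come out sorted and formatting is not preserved.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// cleanValues walks a decoded JSON value and applies clean to string values
// only. Object keys, json.Number, bool, and nil are left untouched.
func cleanValues(v any, clean func(string) string) any {
	switch t := v.(type) {
	case string:
		return clean(t)
	case []any:
		for i := range t {
			t[i] = cleanValues(t[i], clean)
		}
		return t
	case map[string]any:
		for k, val := range t {
			t[k] = cleanValues(val, clean)
		}
		return t
	default:
		return v
	}
}

// sanitizeJSONSketch decodes one JSON value, cleans its strings, and encodes
// the result. UseNumber keeps numeric literals exactly as written.
func sanitizeJSONSketch(data []byte, clean func(string) string) ([]byte, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.UseNumber()
	var v any
	if err := dec.Decode(&v); err != nil {
		return nil, err
	}
	// Anything other than whitespace after the first value is an error.
	if _, err := dec.Token(); err != io.EOF {
		return nil, errors.New("trailing content after JSON value")
	}
	return json.Marshal(cleanValues(v, clean))
}

// demoJSON runs the sketch with a stand-in cleaner that drops U+2705.
func demoJSON(doc string) (string, error) {
	out, err := sanitizeJSONSketch([]byte(doc), func(s string) string {
		return strings.TrimSpace(strings.ReplaceAll(s, "\u2705", ""))
	})
	return string(out), err
}

func main() {
	fmt.Println(demoJSON(`{"msg":"ok \u2705","n":1.50}`))
	fmt.Println(demoJSON(`{"a":1} trailing`))
}
```

Decoding numbers as json.Number rather than float64 is what keeps a literal such as 1.50 from being rewritten as 1.5.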

See example_test.go for additional runnable patterns (HTTP handler, pre-commit/CI, file write-back, per-occurrence matching).

API

Full signatures and doc comments are on pkg.go.dev.

Core functions

| Function | Purpose |
| --- | --- |
| Sanitize(text, opts) string | Configurable pipeline: emoji removal then whitespace normalization |
| SanitizeFile(path, opts) (bool, error) | Sanitize a file atomically; no write when clean |
| Demojify(text) string | Strip all emoji / pictographic codepoints |
| ContainsEmoji(text) bool | Detect emoji presence |
| CountEmoji(text) int | Count emoji codepoint occurrences |
| BytesSaved(text) int | Bytes freed by emoji removal |
| Normalize(text) string | Collapse redundant whitespace (preserves leading indentation) |
| TechnicalSymbolRanges() []*unicode.RangeTable | Pre-built ranges for check marks, gears, etc. -- pass to AllowedRanges |

Reporting and streaming

| Function / Type | Purpose |
| --- | --- |
| SanitizeReport(text, opts) SanitizeResult | Sanitize with structured metrics (emoji count, bytes saved) |
| SanitizeResult | Cleaned text plus EmojiRemoved and BytesSaved fields |
| SanitizeReader(r, w, opts) error | Line-by-line streaming sanitization (LLM streams, MCP payloads) |
| SanitizeJSON(data, opts) ([]byte, error) | Sanitize JSON string values only; preserves structure and numeric precision |

Substitution

| Function | Purpose |
| --- | --- |
| Replace(text, repl) string | Map emoji to text equivalents; strip unmapped remainder |
| ReplaceFile(path, repl) (int, error) | Atomic in-place replacement; no write when clean |
| ReplaceCount(text, repl) (string, int) | Replace and return substitution count |
| FindAll(text) []string | Distinct emoji sequences in text |
| FindAllMapped(text, repl) []string | Mapped keys found in text |
| DefaultReplacements() map[string]string | Built-in ~137-entry emoji-to-text map (full list) |

Scanner

| Function / Type | Purpose |
| --- | --- |
| ScanDir(cfg) ([]Finding, error) | Walk directory tree, return findings |
| ScanDirContext(ctx, cfg) ([]Finding, error) | Context-aware scan with cancellation support |
| ScanFile(path, opts) (*Finding, error) | Check a single file |
| FindMatchesInFile(path, repl) ([]Match, error) | Per-occurrence match detail (line, column, context) |
| WriteFinding(path, f) (bool, error) | Atomic write-back without re-reading |
| FixDir(root, cfg) (fixed, clean int, err error) | Scan and fix an entire directory tree in one call |
| ScanConfig / DefaultScanConfig() | Scanner configuration (root, skip dirs, extensions, etc.) |
| Finding | Path, HasEmoji, Original, Cleaned, Matches |
| Match | Sequence, Replacement, Line, Column, Context |
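
The write-back functions are documented as writing through a temp file and a rename so that a partial write cannot corrupt the original. The following is a minimal standard-library sketch of that pattern, not the library's code: writeFileAtomic and roundTrip are hypothetical helpers, and creating the temp file in the target's own directory is an assumption made so that the rename stays on one filesystem.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temp file next to path, then renames it
// over path. Readers see either the old content or the new, never a mix.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	// Clean up on failure; after a successful rename this is a harmless no-op.
	defer os.Remove(tmpName)

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(perm); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmpName, path)
}

// roundTrip overwrites a scratch file atomically and reads it back.
func roundTrip(content string) (string, error) {
	dir, err := os.MkdirTemp("", "atomic-sketch")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "notes.md")
	if err := os.WriteFile(path, []byte("old content\n"), 0o644); err != nil {
		return "", err
	}
	if err := writeFileAtomic(path, []byte(content), 0o644); err != nil {
		return "", err
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	fmt.Println(roundTrip("clean content\n"))
}
```

If the process dies before the rename, the original file is untouched and only a stray temp file remains.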

Line limit configuration

| Symbol | Purpose |
| --- | --- |
| LimitConfig | Per-file line limit struct: Default int + Files map override |
| DefaultLimitConfig() LimitConfig | Returns a pre-populated config (500-line default; .claude/CLAUDE.md capped at 50) |
| DefaultLineLimit | Fallback constant (500) when LimitConfig.Default is zero |
| ResolveLimit(cfg LimitConfig, path string) int | Returns the effective line limit for path (file override → Default → DefaultLineLimit) |
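
The fallback order for ResolveLimit (file override, then Default, then DefaultLineLimit) can be sketched as follows. The lowercase names are hypothetical stand-ins for the exported symbols, and two details are assumptions: that Files is a map[string]int, and that paths are matched exactly as given.

```go
package main

import "fmt"

// defaultLineLimit mirrors the documented fallback constant (500).
const defaultLineLimit = 500

// limitConfig is a stand-in for LimitConfig; the map type is an assumption.
type limitConfig struct {
	Default int
	Files   map[string]int // path -> per-file limit
}

// resolveLimit applies the documented order: a per-file override wins, then
// a non-zero Default, then the package-level constant.
func resolveLimit(cfg limitConfig, path string) int {
	if n, ok := cfg.Files[path]; ok {
		return n
	}
	if cfg.Default != 0 {
		return cfg.Default
	}
	return defaultLineLimit
}

func main() {
	cfg := limitConfig{Files: map[string]int{".claude/CLAUDE.md": 50}}
	fmt.Println(resolveLimit(cfg, ".claude/CLAUDE.md")) // per-file override
	fmt.Println(resolveLimit(cfg, "README.md"))         // Default is zero, so the constant applies
}
```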

Options

type Options struct {
 RemoveEmojis        bool               // strip emoji / pictographic characters
 NormalizeWhitespace bool               // collapse redundant spaces and blank lines
 AllowedRanges       []*unicode.RangeTable // preserve emoji in these Unicode ranges
 AllowedEmojis       []string           // preserve specific emoji strings (exact match)
}

func DefaultOptions() Options // RemoveEmojis + NormalizeWhitespace = true

AllowedRanges and AllowedEmojis can be combined. Empty strings in AllowedEmojis and empty keys in replacement maps are silently skipped.

// Remove all emoji except rocket and thumbs-up.
clean := demojify.Sanitize(text, demojify.Options{
 RemoveEmojis:  true,
 AllowedEmojis: []string{"\U0001F680", "\U0001F44D"},
})

Unicode emoji coverage

Demojify strips U+2139, U+2600-U+27BF, U+1F000-U+1FAFF, ZWJ (U+200D), variation selectors (U+FE00-U+FE0F), tag characters (U+E0020-U+E007F), and related auxiliary ranges. Intentionally not removed: copyright, registered, trademark, and basic math/technical arrows.

Full range table: docs/unicode-coverage.md.
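
To show how range-based stripping works, the sketch below builds a unicode.RangeTable from only the ranges named above and filters a string against it. It is an illustration, not the library's table: emojiSketch and stripSketch are hypothetical names, and the "related auxiliary ranges" are omitted because this section does not list them.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// emojiSketch covers only the ranges named in this section. Entries in each
// slice must be sorted and non-overlapping.
var emojiSketch = &unicode.RangeTable{
	R16: []unicode.Range16{
		{Lo: 0x200D, Hi: 0x200D, Stride: 1}, // zero-width joiner
		{Lo: 0x2139, Hi: 0x2139, Stride: 1}, // information source
		{Lo: 0x2600, Hi: 0x27BF, Stride: 1}, // misc symbols and dingbats
		{Lo: 0xFE00, Hi: 0xFE0F, Stride: 1}, // variation selectors
	},
	R32: []unicode.Range32{
		{Lo: 0x1F000, Hi: 0x1FAFF, Stride: 1}, // pictographs and emoticons
		{Lo: 0xE0020, Hi: 0xE007F, Stride: 1}, // tag characters
	},
}

// stripSketch drops every rune that falls inside emojiSketch.
func stripSketch(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.Is(emojiSketch, r) {
			return -1 // a negative rune tells strings.Map to drop it
		}
		return r
	}, s)
}

func main() {
	fmt.Printf("%q\n", stripSketch("\u2705 done \U0001F680"))
	// Copyright (U+00A9) and a basic arrow (U+2192) fall outside the table.
	fmt.Printf("%q\n", stripSketch("\u00a9 2024 \u2192 next"))
}
```

Because the joiner and variation selectors are in the table, a ZWJ sequence is removed as a whole rather than leaving invisible codepoints behind.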

Design and documentation

| Document | Contents |
| --- | --- |
| docs/design.md | Architecture rationale: zero-dependency policy, pipeline order, error handling, atomic writes |
| docs/replacements.md | Full DefaultReplacements() reference: all ~137 entries organized by category |
| docs/unicode-coverage.md | emojiRE ranges, intentional exclusions (copyright, trademark, math arrows), substitution vs. stripping |
| docs/cli.md | cmd/demojify CLI reference: flags, exit codes, output format, examples |
| docs/git-hooks.md | Pre-commit hook integration: shell and Go examples, auto-fix, substitution |

License

See LICENSE.
