Drop any broken JSON or JSONL file. The engine repairs structure, normalizes encoding, deduplicates, and outputs a clean SFT-ready dataset.
Process 100 GB+ files with an O(1) memory footprint. JSONL is read line by line; JSON is read in 64 MB chunks. Never OOM again.
100 GB+ Files
Every object gets a quality grade: Gold (SFT-ready), Silver (pre-training), Bronze (review), Trash (inspect).
4-Grade System
Zero data retention. Uploads deleted immediately after processing. Outputs purged on a configurable timer.
No Data Retention
Auto-detects CP1251 (Russian), GB18030 (Chinese), CP1256 (Arabic), and 9+ encodings. BOM, chardet, fallback chain.
9+ Encodings
Auto-maps synonyms in 6 languages + Levenshtein fuzzy matching. "инструкция" → "instruction" automatically.
6 Languages
Content freezing prevents aggressive structural repair from corrupting your valuable text content. Values locked until fully safe.
Content Safe
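
As a rough illustration of two of the claims above (line-by-line JSONL processing and synonym plus Levenshtein key mapping), here is a minimal Python sketch. It is not the engine's actual code. It assumes the synonym mapping applies to field names, and the names SYNONYMS, CANONICAL, canonical_key, stream_records, and the max_distance threshold are all hypothetical, chosen only for this example.

```python
import io
import json
from typing import Dict, Iterator, TextIO

# Hypothetical tables: the engine's real synonym list and target schema are not published here.
SYNONYMS = {"инструкция": "instruction", "ответ": "response"}
CANONICAL = ("instruction", "input", "response")


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance, keeping only one previous row in memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def canonical_key(key: str, max_distance: int = 2) -> str:
    """Map a key via the synonym table, then via fuzzy match; otherwise keep it."""
    k = key.strip().lower()
    if k in SYNONYMS:
        return SYNONYMS[k]
    best = min(CANONICAL, key=lambda c: levenshtein(k, c))
    return best if levenshtein(k, best) <= max_distance else key


def stream_records(lines: TextIO) -> Iterator[Dict[str, object]]:
    """Yield one normalized object per JSONL line; only one line is held at a time."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # a real pipeline would attempt structural repair here
        if isinstance(obj, dict):
            yield {canonical_key(k): v for k, v in obj.items()}


sample = io.StringIO('{"инструкция": "hi", "respnse": "ok"}\n\n{broken\n{"input": "x"}\n')
records = list(stream_records(sample))
```

Because stream_records is a generator over a file-like object, memory use depends on the longest single line rather than on total file size, which is the property the 100 GB+ claim relies on for JSONL. Passing an open file handle instead of the StringIO sample works the same way.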