---
name: training-data-curation
description: Guidelines for creating high-quality datasets for LLM post-training (SFT/DPO/RLHF). Use when preparing data for fine-tuning, evaluating data quality, or designing data collection strategies.
---
# Training Data Curation Guidelines
Best practices for gathering and preparing training data for LLM fine-tuning.
## Data Quality Principles
**Quality over quantity.** Llama 2 used only 27,540 high-quality SFT examples and outperformed models trained on larger noisy datasets [[1]](#references). Focus on clean, diverse, well-formatted data.
**Garbage in, garbage out.** The model will learn patterns from your data—including errors, biases, and formatting issues. Inspect samples manually before training.
**Match the target distribution.** Training data should reflect the tasks and style you want the model to perform. If you want formal responses, don't train on casual chat data.
## Format Requirements
### Supervised Fine-Tuning (SFT)
Use the **messages format** (OpenAI/Anthropic/Tinker standard) [[5]](#references):
```
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```
- Each sample is a complete conversation
- Multi-turn: alternate user/assistant messages
- System prompts optional: `{"role": "system", "content": "..."}`
- JSONL format, one sample per line
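The constraints above (alternating roles, optional leading system prompt, non-empty content) can be checked mechanically. A minimal sketch of such a validator; this is a hypothetical helper, not part of any library:

```python
import json

def validate_sft_line(line: str) -> list[str]:
    """Return a list of problems found in one messages-format JSONL line."""
    try:
        sample = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    if messages[0].get("role") == "system":  # optional system prompt comes first
        messages = messages[1:]
    problems = []
    expected = "user"  # turns must alternate user/assistant
    for msg in messages:
        if msg.get("role") != expected:
            problems.append(f"expected role '{expected}', got {msg.get('role')!r}")
            break
        if not msg.get("content"):
            problems.append(f"empty content in a {expected} turn")
        expected = "assistant" if expected == "user" else "user"
    if not messages or messages[-1].get("role") != "assistant":
        problems.append("conversation should end with an assistant turn")
    return problems
```

Running this over every line before training catches most format mistakes cheaply; an empty return value means the sample is structurally fine.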
### Preference Learning (DPO/ORPO/KTO)
Requires **paired comparisons** [[2]](#references):
```
{"prompt": "...", "chosen": "...", "rejected": "..."}
```
- `chosen` and `rejected` must respond to the same prompt
- Quality difference should be clear and consistent
- Annotator agreement >70% indicates usable samples [[1]](#references)
KTO doesn't require pairs, only binary labels on individual completions [[7]](#references):
```
{"prompt": "...", "completion": "...", "label": true}
```
- `label` is a boolean: `true` marks a desirable completion, `false` an undesirable one
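Because each DPO pair already contains one good and one bad completion, a pair dataset can be split mechanically into KTO's binary format. A sketch, assuming the field names shown in the snippets above:

```python
def dpo_pair_to_kto(pair: dict) -> list[dict]:
    """Split one {prompt, chosen, rejected} pair into two binary-labeled KTO samples."""
    return [
        {"prompt": pair["prompt"], "completion": pair["chosen"], "label": True},
        {"prompt": pair["prompt"], "completion": pair["rejected"], "label": False},
    ]
```

The reverse direction is not possible in general: unpaired binary labels don't tell you which completions share a prompt.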
### Reward Modeling (RLHF)
Needs **ranked responses** [[1]](#references):
```
{"prompt": "...", "responses": ["best", "second", "worst"]}
```
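A ranking of n responses can be expanded into n(n-1)/2 chosen/rejected pairs, which is a common way to reuse ranked data for pairwise training. A sketch, assuming `responses` is ordered best to worst as in the snippet above:

```python
from itertools import combinations

def ranking_to_pairs(sample: dict) -> list[dict]:
    """Expand a best-to-worst ranking into {prompt, chosen, rejected} pairs."""
    responses = sample["responses"]  # assumed ordered best to worst
    return [
        # combinations preserves order, so `better` always outranks `worse`
        {"prompt": sample["prompt"], "chosen": better, "rejected": worse}
        for better, worse in combinations(responses, 2)
    ]
```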
## Quality Checklist
Before training, verify:
- [ ] **No duplicates** — exact and near-duplicate removal [[3]](#references)
- [ ] **No empty fields** — all required fields populated
- [ ] **Consistent format** — schema matches throughout
- [ ] **Appropriate length** — not too short (noise) or too long (truncation)
- [ ] **Clean text** — proper encoding, no HTML/boilerplate artifacts [[8]](#references)
- [ ] **Manual inspection** — reviewed random sample of 50-100 examples
- [ ] **No PII/sensitive data** — unless intentionally included
- [ ] **License verified** — legal to use for training
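Several of these checks can be automated in one pass. A rough sketch for messages-format data; the thresholds are illustrative defaults, not values from the cited papers:

```python
import hashlib
import json
import random

def run_quality_checks(lines, min_chars=20, max_chars=8000):
    """Apply a few checklist items to an iterable of messages-format JSONL lines."""
    seen, kept, dropped = set(), [], []
    for line in lines:
        sample = json.loads(line)
        text = " ".join(m.get("content", "") for m in sample.get("messages", []))
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:  # exact-duplicate removal
            dropped.append(("duplicate", sample))
            continue
        seen.add(digest)
        if not all(m.get("content") for m in sample.get("messages", [])):
            dropped.append(("empty field", sample))
            continue
        if not min_chars <= len(text) <= max_chars:  # too short (noise) or too long
            dropped.append(("bad length", sample))
            continue
        kept.append(sample)
    # Manual inspection: surface a random handful for human review
    inspect = random.sample(kept, min(5, len(kept)))
    return kept, dropped, inspect
```

Near-duplicate detection (MinHash) and PII/license checks still need dedicated tooling; this only covers the mechanical items.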
## Common Quality Issues
| Issue | Detection | Fix | Source |
|-------|-----------|-----|--------|
| Duplicates | Hash-based dedup | Remove exact matches, MinHash for near-dupes | [[3]](#references) |
| Boilerplate | Keyword filter | Remove "subscribe", "cookie policy", etc. | [[8]](#references) |
| Repetitive text | N-gram analysis | Flag if <30% unique trigrams | [[4]](#references) |
| Low-quality text | Alpha ratio | Remove if <50% alphabetic characters | [[8]](#references) |
| Wrong language | Language detection | fastText classifier, filter to target | [[3]](#references) |
| Too short | Length check | Minimum 3-5 sentences, 100+ words for documents | [[8]](#references) |
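The trigram and alpha-ratio detections from the table translate directly into code. A sketch using the thresholds above (<30% unique trigrams, <50% alphabetic characters):

```python
def unique_trigram_ratio(text: str) -> float:
    """Share of distinct word trigrams; low values indicate repetitive text."""
    words = text.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    return len(set(trigrams)) / len(trigrams) if trigrams else 1.0

def alpha_ratio(text: str) -> float:
    """Share of alphabetic characters; low values indicate markup or number soup."""
    return sum(c.isalpha() for c in text) / len(text) if text else 0.0

def passes_filters(text: str) -> bool:
    # Thresholds from the table: flag <30% unique trigrams, <50% alphabetic
    return unique_trigram_ratio(text) >= 0.3 and alpha_ratio(text) >= 0.5
```

Duplicate and language detection are better handled by dedicated tools (MinHash, fastText), but these two heuristics are cheap enough to run on every sample.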
## Data Sources
**High quality:**
- Curated human annotations [[1]](#references)
- Expert-written examples
- Filtered high-quality web data [[3]](#references)
**Medium quality:**
- Synthetic data from stronger models (distillation)
- Community Q&A with voting signals
- Filtered user-generated content
**Use with caution:**
- Raw web scrapes
- Unfiltered synthetic data
- Data without clear provenance [[6]](#references)
## Sizing Guidelines
| Dataset Size | Use Case | Source |
|--------------|----------|--------|
| 100-1K | Quick experiments, specific behaviors | — |
| 1K-10K | Production SFT, domain adaptation | — |
| 10K-100K | Comprehensive instruction tuning | [[1]](#references) |
| 1M+ preference pairs | Large-scale RLHF | [[1]](#references) |
Llama 2 used ~27K SFT examples and 1M+ preference comparisons [[1]](#references).
## File Format
- **JSONL** — one JSON object per line, human-readable
- **Parquet** — efficient for large datasets, built-in compression [[3]](#references)
- **Sharding** — split files >500MB into chunks
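Sharding JSONL needs only the standard library: splitting at line boundaries keeps every shard valid JSONL on its own. A sketch using the 500 MB guideline as the default:

```python
import os

def shard_jsonl(path: str, max_bytes: int = 500 * 1024 * 1024) -> list[str]:
    """Split a JSONL file into shards of at most max_bytes, at line boundaries."""
    shard_paths, shard, size, idx = [], None, 0, 0
    base, ext = os.path.splitext(path)
    with open(path, "rb") as src:
        for line in src:
            # Start a new shard when the current one would exceed the limit
            if shard is None or size + len(line) > max_bytes:
                if shard:
                    shard.close()
                shard_path = f"{base}-{idx:05d}{ext}"
                shard = open(shard_path, "wb")
                shard_paths.append(shard_path)
                size, idx = 0, idx + 1
            shard.write(line)
            size += len(line)
    if shard:
        shard.close()
    return shard_paths
```

Reading and writing in binary mode preserves bytes exactly, so the concatenation of the shards is identical to the original file.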
## References
1. [Llama 2 Paper](https://arxiv.org/abs/2307.09288) — Touvron et al. (2023). SFT/RLHF data quality practices, 27K SFT examples, >70% annotator agreement threshold
2. [TRL Library](https://huggingface.co/docs/trl/) — HuggingFace trainer implementations for SFT, DPO, KTO, ORPO
3. [FineWeb Paper](https://arxiv.org/abs/2406.17557) — Penedo et al. (2024). Large-scale filtering: MinHash dedup, language detection, quality classifiers
4. [Data-Juicer](https://github.com/alibaba/data-juicer) — Alibaba's quality filtering toolkit with repetition filters, n-gram analysis
5. [Tinker API](https://tinker-docs.thinkingmachines.ai/) — Training API using messages format for SFT, DPO/RLHF support
6. [Data Provenance Initiative](https://arxiv.org/abs/2310.16787) — Longpre et al. (2023). Dataset licensing and attribution audit
7. [KTO Paper](https://arxiv.org/abs/2402.01306) — Ethayarajh et al. (2024). Binary preference learning without pairs
8. [C4/T5 Paper](https://arxiv.org/abs/1910.10683) — Raffel et al. (2020). Foundational filtering: terminal punctuation, min sentences, alpha ratio, boilerplate removal