Ecosystem & R&D Skills

skill-creator (Anthropic Official)

skill-creator is Anthropic's official meta-skill for creating, editing, and benchmarking other skills. It scaffolds SKILL.md with proper frontmatter and runs evals to measure trigger accuracy: how often the skill fires when it should and stays silent when it shouldn't.

Memory anchor

skill-creator is a recipe test kitchen: develop the dish (the skill body), taste-test with critics (eval prompts), and tweak the menu description until customers actually order it (trigger accuracy).

Expected depth

Workflow: invoke /skill-creator → describe what the skill should do → it generates SKILL.md (name, description, optional allowed-tools) plus a body of instructions and examples → runs an eval harness against test prompts (positive cases that should trigger, negative cases that shouldn't) → reports precision/recall on the description match. Iterate on the description if accuracy is low — the description is what the model matches against user intent. Also handles editing existing skills and measuring trigger variance.
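As a sketch of what the scaffold step produces, a generated SKILL.md might open with frontmatter like the following. The field names (name, description, allowed-tools) are the ones named in the workflow above; the example values and skill name are hypothetical, not skill-creator's actual output:

```markdown
---
name: changelog-writer
description: Drafts changelog entries from recent commits. Use when the user
  asks to summarize changes, write release notes, or update the changelog.
allowed-tools: Read, Grep
---

# Changelog Writer

Instructions and worked examples for the model go here...
```

Note how the description encodes both the capability and the phrasings a user might actually use ("write release notes", "update the changelog"); that surface area is what the eval harness is measuring.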

Deep — senior internals

Trigger accuracy is the dominant quality metric for skills — vague descriptions cause inconsistent firing or silent failures. skill-creator's eval harness sends N prompts (typically 10–30), labels each as expected-trigger or expected-skip, and reports confusion-matrix metrics. Description optimization mode iteratively rewrites the description and re-runs evals until accuracy plateaus. Variance analysis re-runs the same prompt set multiple times to surface flaky triggers (the model picking the skill 7/10 times is a problem). Output: a skill directory with SKILL.md plus helpers, and an eval report you commit alongside it.
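The confusion-matrix bookkeeping and flaky-trigger detection described above can be sketched in a few lines of Python. This is an illustration of the metrics, not skill-creator's actual API — all names here (EvalCase, score, flaky_prompts) are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    expect_trigger: bool  # True = positive case (should fire), False = negative (should skip)

def score(cases: list[EvalCase], fired: list[bool]) -> dict[str, float]:
    """Confusion-matrix metrics for skill triggering.

    `fired` is parallel to `cases`: did the model invoke the skill on each prompt?
    """
    tp = sum(1 for c, f in zip(cases, fired) if c.expect_trigger and f)
    fp = sum(1 for c, f in zip(cases, fired) if not c.expect_trigger and f)
    fn = sum(1 for c, f in zip(cases, fired) if c.expect_trigger and not f)
    tn = sum(1 for c, f in zip(cases, fired) if not c.expect_trigger and not f)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # of firings, how many were wanted
        "recall": tp / (tp + fn) if tp + fn else 0.0,     # of wanted firings, how many happened
        "accuracy": (tp + tn) / len(cases),
    }

def flaky_prompts(runs: dict[str, list[bool]]) -> list[str]:
    """Variance analysis: prompts whose fire decision differs across repeated runs."""
    return [p for p, outcomes in runs.items() if 0 < sum(outcomes) < len(outcomes)]
```

A prompt that fires on 7 of 10 identical runs shows up in `flaky_prompts`, which is exactly the signal that the description needs tightening rather than the skill body.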

🎤Interview-ready answer

skill-creator is Anthropic's official meta-skill for building and tuning other skills. I use it to scaffold new skills with the right frontmatter and to benchmark the description field — the description is what the model uses to decide when to invoke the skill, and weak descriptions cause inconsistent triggering. The tool runs an eval harness across positive and negative test prompts, reports trigger accuracy, and iterates on the description if accuracy is low. It also measures variance to surface flaky triggers in existing skills.

Common trap

Skipping the eval pass and shipping a skill with a vague description like 'helps with code review.' The skill won't fire reliably; users will think the agent is broken when really the description doesn't match how they phrase their intent.