Posts

Featured

Show HN: Caliper – pass@k reliability testing for Claude Code and Codex skills https://ift.tt/gSUXTWR

Skills for Claude Code and Codex are hard to test, in the sense that there is no standard way to do it. You evaluate the skill once on something and it looks like it works, so you publish it. Then the next big model is released (GLM 5.2, anyone?), part of the skill quietly breaks, and you don't find out until your users complain.

I ran into the same problem, so I built something lightweight to stop doing that: Caliper. It's a local harness that runs a skill k times in isolated environments and gives you a pass@k score (how many of those k runs succeeded). Because the technology is non-deterministic, you can't just say "it worked once"; you need to know how often it passed in k runs.

You define success in a YAML spec. I picked YAML to keep a schema while staying readable for a human. You can use an LLM judge, a Python assertion, or both: Here...
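The score described above, read as the fraction of k runs that succeeded, can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not Caliper's implementation: `pass_rate`, `run_once`, and `flaky_skill` are hypothetical names, and the random stand-in takes the place of an actual isolated skill run.

```python
import random
from typing import Callable


def pass_rate(run_once: Callable[[], bool], k: int) -> float:
    """Run a trial k times and return the fraction of runs that succeeded."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    successes = sum(1 for _ in range(k) if run_once())
    return successes / k


def flaky_skill(rng: random.Random, p_success: float = 0.8) -> bool:
    # Hypothetical stand-in for one isolated run of a skill:
    # succeeds with probability p_success.
    return rng.random() < p_success


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the demo is repeatable
    k = 10
    score = pass_rate(lambda: flaky_skill(rng), k)
    print(f"passed {score:.0%} of {k} runs")
```

Note that this follows the post's description (how many of the k runs passed). Some code-generation benchmarks use "pass@k" to mean the probability that at least one of k samples passes, which is a different quantity.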

Latest Posts

Show HN: Starglyphs - A constellation puzzle game based on Euler paths https://ift.tt/z0e43xt

Show HN: Wind particles on Mapbox from a single EXIF JPEG https://ift.tt/nMINd4L

Show HN: A Living Neural Web in HTML5 Canvas https://ift.tt/x2Dvf6r

Show HN: Puzzle with Strangers. A free multiplayer jigsaw https://ift.tt/aGQkLNX

Show HN: Every Team Is Building the Same Cache https://ift.tt/wDubJt5

Show HN: No chair fixed my back, so we built one that won't let you sit still https://ift.tt/VsUv2zP

Show HN: OpenKnowledge – open source AI-first alternative to Obsidian/Notion https://ift.tt/aRpkNx7

Show HN: LookAway, a Mac break reminder that knows when not to interrupt https://ift.tt/1tsMmwN

Show HN: Follow the Thread – a calmer, typographic way to read Wikipedia https://ift.tt/UskeG5l

Show HN: The Cascade Graph – An interactive map of AI and energy constraints https://ift.tt/XQx9Czn