Tagged "evaluation"
- LLM Personalization Breaks Down in High-Stakes Finance
- Show HN: SkillCompass – Open-Source Quality Evaluator for Your AI Skills
- SkillCompass – Diagnose and Improve AI Agent Skills Across 6 Dimensions
- FretBench – Testing 14 LLMs on Reading Guitar Tabs Reveals Performance Gaps
- AI Agent Reliability Tracker
- No, Local LLMs Can't Replace ChatGPT or Gemini — I Tried
- How Do You Know Which SKILL.md Is Good?