hugobiais 8 minutes ago

Super interesting! At my company, our agent writes code to make API calls, and we were looking for a way to evaluate it on exactly that! The problem with doing that yourself against the Gmail, Linear, or Slack API is that you quickly hit rate limits, but if we have a copy of the service, problem solved.

Will definitely try this!

  • hubertmarek 2 minutes ago

    Were rate limits the main blocker for you?

hubertmarek 2 hours ago

Hi HN, I noticed it is almost impossible to run evals or train models on third-party integrations, so I built interactive environments for them. Feedback is more than welcome. Thanks!

Interesting fact: when I ran evals on 40 tasks for the Linear API, most frontier models scored surprisingly well:

- Claude Opus 4.5: 95% (38/40)
- GLM 4.6: 87.5% (35/40)
- Claude Sonnet 4.5: 85% (34/40)
- Claude Haiku 4.5: 82.5% (33/40)
- Kimi K2: 82.5% (33/40)
- Grok 4.1 Fast: 80% (32/40)
- GPT 5.1: 77.5% (31/40)

This makes me wonder whether we really need to reinvent the wheel and build special interfaces (MCPs) for agents to interact with services, when they can just use the APIs as they are.