Provide agents with automated feedback

(banay.me)

121 points | by ghuntley 1 day ago

19 comments

achou 3 hours ago
Y'all are sleeping on custom lint rules.
Every time you find a runtime bug, ask the LLM if a static lint rule could be turned on to prevent it, or have it write a custom rule for you. Very few of us have time to deep dive into esoteric custom rule configuration, but now it's easy. Bonus: the error message for the custom rule can be very specific about how to fix the error, including pointing to documentation that explains entire architectural principles, concurrency rules, etc. Stuff that is very tailored to your codebase and is far more precise than a generic compiler/lint error.
- harlanlewis 1 hour ago
  Clever! Sharing my lightning test of this approach.
  Context - I have a 200k+ LOC Python+React hobby project with a directory full of project-specific "guidelines for doing a good job" agent rules + skills.
  Of course, agent rules are often ignored in whole or in part. So in practice those rules are often triggered in a review step pre-commit as a failsafe, rather than pulled in as context when the agent initially drafts the work.
  I've only played for a few minutes, but converting some of these to custom lint rules looks quite promising!
  Things like using my project's wrappers instead of direct calls to libs, preferences for logging/observability/testing, indicators of failure to follow optimistic update patterns, double-checking that frontend interfaces to specific capabilities are correctly guarded by owner/SKU access control…
  Lots of use cases that aren't hard for an agent to accurately fix if pointed at directly, and now that pointing can happen inline to the agent work loop without intervention through normal lint cleanup, occurring earlier in the process (and faster) than is caught by tests. This doesn't replace testing or other best practices. It feels like an additive layer that speeds up agent iteration and improves implementation consistency.
  Thanks for the tip!
- tristandunn 1 hour ago
  I realized this recently and I've been creating a RuboCop plug-in[1] to automatically have the LLM code better match my personal style. I don't think it'll ever be perfect, but if it saves me from moving a few bits around or adding spacing I'd rather see, then it's worth it. The fun part is I'm vibe coding it, since as long as the tests verify the rules then it doesn't really matter much how they work. As a result, adding a new rule is pasting in LLM-generated code followed by what I'd prefer it look like and asking it to add a rule.
  [1]: https://github.com/tristandunn/rubocop-vibe/
- esafak 2 hours ago
  I just discovered https://megalinter.io/
- esperent 2 hours ago
  Ha, I just had the LLM create my first custom eslint rule yesterday and was thinking that I should make more.
- oofbey 2 hours ago
  I like this idea but I can’t think of a concrete example to ground it. Can anybody share a real example?
  - paradite 1 hour ago
    Max 200/300 LOC per file is pretty popular.
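For a second concrete example, here is a minimal sketch of the "use my project's wrappers instead of direct calls to libs" idea from upthread, written as a small Python AST check rather than a plugin for any particular linter. The banned module (`requests`), the wrapper name `app.http.fetch`, and the docs path are all made-up placeholders; the point is only that the message can tell the agent exactly what to do instead.

```python
import ast

# Hypothetical rule: ban direct `requests.<verb>(...)` calls in favor of a
# project wrapper. The wrapper name and doc path are placeholders.
BANNED_MODULE = "requests"
FIX_HINT = (
    "use the project HTTP wrapper (hypothetical `app.http.fetch`) instead of "
    "calling `requests` directly; see docs/http-guidelines.md (placeholder)"
)


def check_source(source: str, filename: str = "<string>") -> list:
    """Return one 'file:line: message' string per direct requests.<attr>() call."""
    findings = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == BANNED_MODULE
        ):
            findings.append(
                f"{filename}:{node.lineno}: direct call to "
                f"{BANNED_MODULE}.{node.func.attr}(); {FIX_HINT}"
            )
    return findings
```

This only catches the plain `requests.get(...)` spelling; aliased imports (`import requests as r`) would need extra handling.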
michalsustr 26 minutes ago
As someone said: Custom lints are super useful.
What we do at https://minfx.ai (a Neptune/Wandb replacement) is we use TONS of custom lints. Anytime we see some undesirable repeatable agent behavior, we add it as a prompt modification and a lint. This is relatively easy to do in Rust. The kinds of things I did are:
- Specify maximum number of lines / tabs, otherwise code must be refactored.
- Do not use unsafe or RefCells.
- Do custom formatting, where all code looks the same: order by mods, uses, constants, structs/enums, impls, etc. In particular, I added topological ordering (DAG-ordering) of structs, so when I review code, I build up understanding of what the LLM actually did, which is faster than to read the intermediate outputs.
- Make sure there are no "dependency cycles": internal code does not use public re-exports, so whenever you click on definitions, you only go DEEPER in the code base or same file, you can't loop back.
- And more :-)
Generally I find that focusing on the code structure is super helpful for dev and for the LLM as well, it can find the relevant code to modify much faster.
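The simplest item on that list, a maximum number of lines per file, can be sketched in a few lines. This is a Python illustration, not the Rust lint described above, and the threshold of 300 is an assumption taken from the "200/300 LOC" figure mentioned elsewhere in the thread.

```python
from typing import Optional

MAX_LINES = 300  # assumed threshold; use whatever your project agrees on


def check_length(name: str, text: str, max_lines: int = MAX_LINES) -> Optional[str]:
    """Return a lint message if the file content is too long, else None."""
    count = len(text.splitlines())
    if count > max_lines:
        return (
            f"{name}: {count} lines exceeds the limit of {max_lines}; "
            "refactor into smaller modules"
        )
    return None
```

A pre-commit hook could call `check_length(path, contents)` for each staged file and fail if any message comes back.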
- actionfromafar 16 minutes ago
  What is DAG ordering of structs?
  - michalsustr 13 minutes ago
    Each struct and its referenced fields can be thought of as a graph which can be sorted. Ideally, it is a DAG, but sometimes you can have recursive structures so it can be a cyclic graph. By DAG-ordering I meant a topological sorting such that you do it by layers of the graph.
    https://en.wikipedia.org/wiki/Topological_sorting
    https://en.wikipedia.org/wiki/Directed_acyclic_graph
  - tablatom 11 minutes ago
    DAG is directed acyclic graph. A bit like a tree where branches are allowed to merge but there are no cycles.
    - actionfromafar 8 minutes ago
      Yes, but I was wondering how you organize your code in a DAG.
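One way to picture the "topological sorting by layers" described above: treat each struct as a node, add an edge to every struct it references, and emit the nodes layer by layer. The sketch below uses Python's standard `graphlib`; the struct names are invented, and putting the referenced (leaf) structs first is an assumption, since the thread doesn't say which direction the real lint uses.

```python
from graphlib import TopologicalSorter


def layered_order(deps: dict) -> list:
    """Group nodes into layers; each layer only references earlier layers.

    `deps` maps a struct name to the set of struct names its fields reference.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError for recursive structures
    layers = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        layers.append(ready)
        sorter.done(*ready)
    return layers


# Hypothetical structs: Order has Customer and LineItem fields, and so on.
example = {
    "Order": {"Customer", "LineItem"},
    "LineItem": {"Product"},
    "Customer": set(),
    "Product": set(),
}
```

For `example`, the layers come out as `Customer`/`Product`, then `LineItem`, then `Order`, which is the order the definitions would appear in the file. Recursive structures make `prepare()` raise, so a real lint would need a fallback for those.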
jamesblonde 1 hour ago
I got turned off in the first paragraph with the misuse of the term "back pressure". "back pressure" is a term from data engineering to specifically indicate a feedback signal that indicates a service is overloaded and that clients should adapt their behavior.
Backpressure != feedback (the more general term). And in the agentic world, we use the term 'context' to describe information used to help LLMs make decisions, where the context data is not part of the LLM's training data. Then, we have verifiable tasks (what he is really talking about), where RL is used in post-training in a harness environment to use feedback signals to learn about type systems, programming language syntax/semantics, etc.
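For readers who haven't met the data-engineering sense of the term, here is a minimal sketch using a bounded queue from the Python standard library. The `submit` helper is made up for illustration: when the consumer falls behind and the queue is full, the producer gets an explicit signal and is expected to slow down, retry later, or shed load.

```python
import queue


def submit(q: queue.Queue, item) -> bool:
    """Try to enqueue without blocking; False is the backpressure signal."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


work = queue.Queue(maxsize=2)  # small bound so the effect is easy to see
accepted = [submit(work, n) for n in range(3)]
```

Here the third `submit` is refused because the queue already holds two items; once a consumer takes one off, the producer can submit again.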
- ghuntley 13 minutes ago
  the back pressure terminology comes from me. essentially it’s the wheel - you need to add backpressure to the agentic flywheel.
  see https://ghuntley.com/pressure
  i have the pleasure to work with moss and he came up with a way to explain what is in my head with ease.
- paradite 1 hour ago
  If you want to be pedantic:
  Context is also a misnomer, where in fact it's just a part of prompt.
  Prompt itself is also a misnomer, where in fact it's just part of model input.
  Model input is also a misnomer, in fact it's just first input token + prefill for model output to generate more output.
  Harness is also a misnomer, where it's just scaffold / tools around the model input/output.
- patates 1 hour ago
  We all live in our own various small circles, in which many terms get misused. Isomorphic in front end circle means something completely different than any other use, for example. This is how languages evolve.
  I'm not trying to discount any attempt to correct people, especially when it gets confusing (like here, I was also confused honestly), but we could formulate it nicer IMHO.
- SubiculumCode 1 hour ago
  It is perhaps more generally known in the plumbing sense of pressure causing resistance to the desired direction of flow, but yeah, a poor word choice...at least it isn't AI written though.
- oofbey 1 hour ago
  Ironic that you nitpick the author’s word choice of “back pressure” and then completely misuse the term RL in your complaint.
- atoav 1 hour ago
  Well it does sound technical.
qazxcvbnmlp 4 hours ago
My mental model is that AI coding tools are machines that can take a set of constraints and turn them into a piece of code. The better you get at having it give itself those constraints accurately, the higher level task you can focus on.
Eg compiler errors, unit tests, mcp, etc.
I've heard of these, but haven't tried them yet.
https://github.com/hmans/beans
https://github.com/steveyegge/gastown
Right now I spend a lot of “back pressure” on fitting the scope of the task into something that will fit in one context window (i.e. the useful computation, not the raw token count). I suspect we will see a large breakthrough when someone finally figures out a good system for having the LLM do this.
- AnonyX387 1 hour ago
  > Right now I spend a lot of “back pressure” on fitting the scope of the task into something that will fit in one context window (i.e. the useful computation, not the raw token count). I suspect we will see a large breakthrough when someone finally figures out a good system for having the LLM do this.
  I've found https://github.com/obra/superpowers very helpful for breaking the work up into logical chunks a subagent can handle.
  - nonethewiser 1 hour ago
    How would you compare it to Claude Code in planning mode?
    - AnonyX387 6 minutes ago
      I've only used Claude's planning mode when I just started using Claude Code, so it may be me using it wrong at the time, but the superpowers are way more helpful for picking up on you wanting to build/modify something and helping you brainstorm interactively to a solid spec, suggesting multiple options when applicable. This results in a design and implementation doc and then it can coordinate subagents to implement the different features, followed by spec review and code review. Really impressed with it, I use it for anything non-trivial.
zmmmmm 27 minutes ago
It's sort of a mini singularity event once you get sufficient test coverage (and other guardrails in place) that your app can "code itself" via agents. There's some minimum viable amount and a set of infra to provide structured feedback (your agent gets good text error messages, has access to error context, screenshots, etc.) where it really starts to take off. Once you get lift off it's pretty cool.
bob1029 1 hour ago
Appropriate feedback is critical for good long horizon performance. The direction of feedback doesn't necessarily have to be from autonomous tools back to the LLM. It can also flow from tools to humans who then iterate the prompt / tools accordingly.
I've recently discovered that if a model gets stuck in a loop on a tool call across many different runs, it's almost certainly because of a gap in expectations regarding what the available tools do in that context, not some random model failure mode.
For example, I had a tool called "GetSceneOverview" that was being called as expected and then devolved into looping. Once I counted how many times it was looping I realized it was internally trying to pass per-item arguments in a way I couldn't see from outside the OAI API black box. I had never provided a "GetSceneObjectDetails" method (or explanation for why it doesn't exist) so it tried the next best thing for each item returned in the overview.
I went one step further and asked the question "can the LLM just directly tell me what the tooling expectation gap is?" And sure enough it can. If you provide the model with a ReportToolIssue tool, you'll start to get these insights a lot more directly. Once I had cleared non-trivial reports of tool concerns, the looping issues all but vanished. It was catching things I simply couldn't see. The best insight was the fact that I hadn't provided parent ids for each scene object (I assumed not relevant for my test command), so it was banging its head on those tools trying to figure out the hierarchy. I didn't realize how big a problem this was until I saw it complaining about it every time I ran the experiment.
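A rough sketch of how the two pieces above could fit together: a `ReportToolIssue` tool the model can call, plus a counter that flags when one tool is being called suspiciously often. The tool description is a generic JSON-Schema-style dict, not the exact format of any particular API, and the field names, `ToolCallMonitor`, and the threshold of 5 are all assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical tool description; adapt the shape to whatever your model API expects.
REPORT_TOOL_ISSUE = {
    "name": "ReportToolIssue",
    "description": (
        "Call this when the available tools are missing something you need "
        "or behave differently from what you expected."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "tool_name": {"type": "string"},
            "problem": {"type": "string"},
        },
        "required": ["problem"],
    },
}


@dataclass
class ToolCallMonitor:
    loop_threshold: int = 5  # assumed cutoff for "this looks like a loop"
    calls: Counter = field(default_factory=Counter)
    issues: list = field(default_factory=list)

    def record(self, name: str, arguments: dict) -> bool:
        """Log one tool call; return True once a tool hits the loop threshold."""
        if name == REPORT_TOOL_ISSUE["name"]:
            self.issues.append(arguments)  # surface these to a human after the run
            return False
        self.calls[name] += 1
        return self.calls[name] >= self.loop_threshold
```

The harness would call `record` on every tool call, stop or intervene when it returns `True`, and dump `issues` at the end of each run for review.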
markbao 1 hour ago
Yeah, I think designing a system for the LLM to check its own work will replace prompt engineering in key LLM techniques (though, it itself is a form of prompt engineering, but more intentional.) Given that LLMs are doing this today already (with varying success), it might not be long until that’s automated too.
tern 18 minutes ago
I just started building something with Elixir and that ecosystem is stacked with "back-pressure" opportunities
skybrian 5 hours ago
This jumps to proof assistants and barely mentions fuzzing. I've found that with a bit of guidance, Claude is pretty good at suggesting interesting properties to test and writing property tests to verify that invariants hold.
- ekidd 4 hours ago
  If you give Claude examples of good and bad property tests, and explain why, it gets much better than it was out of the box.
- tungsten_metal 2 hours ago
  Proof assistants are the most extreme example of validation that leads to you being able to trust the output (so long as the problem you intended on solving was correctly described) but fuzzing and property based testing are definitely more approachable and appropriate in most cases.
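To make "property tests that verify invariants hold" concrete, here is a small hand-rolled sketch using only the standard library's `random` (a dedicated library such as Hypothesis would add shrinking and smarter input generation). The function under test, `dedupe_keep_order`, is a made-up example; the interesting part is that the checks are stated as invariants over random inputs rather than as fixed input/output pairs.

```python
import random


def dedupe_keep_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def check_properties(trials: int = 200, seed: int = 0) -> None:
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    for _ in range(trials):
        items = [rng.randint(-5, 5) for _ in range(rng.randint(0, 20))]
        result = dedupe_keep_order(items)
        assert len(result) == len(set(result))      # no duplicates remain
        assert set(result) == set(items)            # nothing lost or invented
        assert dedupe_keep_order(result) == result  # idempotent
        # first occurrences keep their relative order
        assert result == sorted(set(items), key=items.index)
```

An agent that breaks any of these invariants while "optimizing" the function gets an assertion failure on the next run, with a seed that reproduces it.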
bobjordan 1 hour ago
Linters...custom made pre-commit linters which are aligned with your code base needs. The agents are great at creating these linters and then forevermore it can help feedback and guide them. My key repo now has "audit_logging_linter, auth_response_linter, datetime_linter, fastapi_security_linter, fastapi_transaction_linter, logger_security_linter, org_scope_linter, service_guardrails_linter, sql_injection_linter, test_infrastructure_linter, token_security_checker..." basically every time you find an implementation gap vs your repo standards, make a linter! Of course, need to create some standards first. But if you know you need protected routes and things like this, then linters can auto-check the work and feedback to the agents, to keep them on track. Now, I even have scripts that can automatically fix the issues for the agents. This is the way to go.
thomasfromcdnjs 2 hours ago
I've been slowly working on https://blocksai.dev/ which is a framework for building feedback loops for agentic coding purposes. It just exposes a CLI that can run custom validators against anything with a spec in the middle. Its goal, like the blog post's, is to make sure there is always a feedback loop for the agent, be it programmatic tests, semantic linting, visual outputs, anything!
sh3rl0ck 5 hours ago
Beyond Linting and Shell Exec (gh, Playwright etc), what other additional tools did you find useful for your tasks, HN?!
Most of my feedback that can be automated is done either by this or by fuzzing. Would love to hear about other optimisations y'all have found.
- __MatrixMan__ 3 hours ago
  I like to generate clients with type hints based on an openapi spec so that if the spec changes, the clients get regenerated, and then the type checker squawks if any code is impacted by the spec change.
  There are also openapi spec validators to catch spec problems up front.
  And you can use contract testing (e.g. https://docs.pact.io/) to replay your client tests (with a mocked server) against the server (with mocked clients)--never having to actually spin up both at the same time.
  Together this creates a pretty widespread set of correctness checks that generate feedback at multiple points.
  It's maybe overkill for the project I'm using it on, but as a set of AI handcuffs I like it quite a bit.
- alphax314 4 hours ago
  Running all sorts of tests (e2e, API, unit) and for web apps using the Claude extension with Chrome to trigger web UI actions and observe the result. The last part helps a lot with frontend development.
- esafak 2 hours ago
  I've started incorporating checks into commit hooks, shifting testing left. https://hk.jdx.dev/
- sigseg1v 4 hours ago
  Teaching them skills for running API and e2e tests and how to filter those tests so it can check if what it did works quickly.
visarga 3 hours ago
Well said, I have been saying the same. Besides helping agents code, it helps us trust the outcome more. You can't trust code that isn't tested, and you can't read every line of code; it would be like walking a motorcycle. So tests (back pressure, deterministic feedback) become essential. You only know something works as well as its tests show.
What we often like to do in a PR - look over the code and say "LGTM" - I call this "vibe testing" and think it is the real bad pattern to use with AI. You can't commit your eyes to the git repo, and you are probably not doing as good a job as when you have actual test coverage. LGTM is just vibes. Automating tests removes manual work from you too, not just makes the agent more reliable.
But my metaphor for tests is "they are the skin of the agent", allow it to feel pain. And the docs/specs are the "bones", allow it to have structure. The agent itself is the muscle and cerebellum, and the human in the loop is the PFC.
- wcarss 3 hours ago
  For anyone else who briefly got very lost at PFC, probably "prefrontal cortex".
jillesvangurp 32 minutes ago
Tests, compilation, and other automated checks definitely help coding agents. In the same way that they help people catch their own mistakes. More importantly, as coding agents will be running these things a lot in limited resource & containerized environments, it's also important that these things run quickly and fail fast. At least, I've observed LLMs spend a lot of time running tools and picking apart their output with more tools.
For complicated things, it helps to impose a TDD workflow: define the test first. And of course you can get the LLM to write those as well. Cover enough edge cases that it can't take any short cuts with the implementation. Review tests before you let it proceed.
Finally skills help remove a lot of the guess work out of deciding which tools to run when. You can just tell it what to run, how to invoke it, etc. and it will do it. This can save a bit of time. Simple example, codex seems to like running python things a lot. I have uv installed so there is no python on the path; you need to call python3. Codex will happily call python first before figuring that out. Every time. It will just randomly call tools, fall back to some node.js alternative, etc. until it finds some combination of tools to do whatever it needs to do. You can save a lot of time by just making it document what it is doing in skill form (no need to write those manually, though you might want to review and clean them up).
I've been iterating on a Hugo based static website. After I made it generate a little test suite, productivity has gone up a lot. I'm able to do fairly complex changes on this thing now and I end up with a working website every time. It doesn't stop until tests pass. It doesn't always do the right thing in one go but I usually get there in a few attempts. It takes a few seconds to run the tests. They prove that the site still builds and runs, things don't 404, and my tailwind styling survives the build. I also have a few checks for links and assets not 404ing. So it doesn't hallucinate image links that don't exist. I made it generate all those tests too. I have a handful of skills in the repository outlining how/when to run stuff.
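A check of the "links and assets don't 404" kind can be quite small. The sketch below is an assumption about how such a test might look, not the actual suite: it parses one built HTML page with the standard library and reports site-relative `href`/`src` values that don't correspond to a known output path. The function name and the idea of passing in a set of existing paths are invented for illustration.

```python
from html.parser import HTMLParser


class _RefCollector(HTMLParser):
    """Collect every href/src attribute value seen in the document."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.refs.append(value)


def broken_local_refs(html: str, existing_paths: set) -> list:
    """Return site-relative refs (starting with '/') missing from existing_paths."""
    collector = _RefCollector()
    collector.feed(html)
    broken = []
    for ref in collector.refs:
        # This sketch skips external, protocol-relative, and page-relative links.
        if not ref.startswith("/") or ref.startswith("//"):
            continue
        path = ref.split("#", 1)[0].split("?", 1)[0]
        if path not in existing_paths:
            broken.append(ref)
    return broken
```

In a real test, `existing_paths` would be built by walking the generated output directory, and the test would fail if any page returns a non-empty list.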
I did some major surgery on this website. I made it do a migration from tailwind 3 to 4. I added a search feature using fuse.js and made it implement reciprocal rank fusion for that to get better ranking. Then I decided to consolidate all the javascript snippets and cdn links into a vite/typescript build. Each of these tasks was completed with pretty high level prompts. Basically, technical debt just melts away if you focus it on addressing that. It won't do any of this by itself unless you tell it to. A lot depends on your input and direction. But if you get structured, this stuff is super useful.
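For anyone unfamiliar with reciprocal rank fusion: each result list contributes 1 / (k + rank) to a document's score, and the fused ranking sorts by the summed score. Below is a minimal Python sketch of that formula (the site itself uses JavaScript with fuse.js, so this is not the actual code); k = 60 is the commonly used default, and the alphabetical tie-break is an arbitrary choice made here for determinism.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of document ids into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Two hypothetical result lists, e.g. one from title search and one from body search.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]])
```

Here `b` wins because it is ranked 2nd and 1st, ahead of `a` (1st and 3rd) and `c` (3rd and 2nd).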
anditherobot 4 hours ago
With Visual Studio and Copilot, I like the fact that it runs a command, reads the output back, and then automatically continues based on the error message. Say there's a compilation error or a failed test case: it reads it and feeds that back into the system automatically. Once the plan is satisfied, it marks it as completed.
jackblemming 1 hour ago
I think the standard terminology for these is harnesses. No reason to invent some new term.
dang 3 hours ago
People have been complaining about the title.* To avoid getting into a loop about that, I've picked a phrase from the article which I think better represents what it's saying. If there's a better title, we can change it again.
* (I've moved those comments to https://news.ycombinator.com/item?id=46675246. If you want to reply, please do so there so we can hopefully keep the main thread on topic.)
dang 3 hours ago
[stub for offtopicness]
- waterproof 5 hours ago
  "Back pressure" is already a term widely used in computing for something entirely different: https://schmidscience.com/what-does-back-pressure-in-compute...
  - jagged-chisel 4 hours ago
    I have the same argument with “crypto”
    - andai 4 hours ago
      And web 3? ;)
  - johnfn 3 hours ago
    I am not sure if I am missing something, since many people have made this comment, but isn't this in some ways similar to the shape of the traditional definition of back pressure, and not "entirely different"? A downstream consumer can't make its work through the queue of work to be done, so it pushes work back upstream - to you.
  - swader999 1 hour ago
    Yeah it's too bad the author chose that word. They are onto something though; it's a useful way to think about this game.
- jandrewrogers 5 hours ago
  This use of the term “back pressure” is pretty confusing in a computer science context.
  - cortesoft 4 hours ago
    Yeah, I spent way too long trying to think of how what the author was talking about was related to back pressure... I had a very stretched metaphor I was going with until I realized he wasn't talking about back pressure at all
- asmvolatile 4 hours ago
  Back pressure is not a good name for this. You already listed one that makes more sense - “feedback”
- refulgentis 4 hours ago
  Others have pointed out the incongruity of back pressure here, I would have loved “feedback”.
- didip 4 hours ago
  I thought you were talking about back pressure pipes in my housing complex.
  I’ve been wondering why I can’t use it to generate electricity.