One trick that works well for personality stability / believability is to describe the qualities the agent has, rather than listing what it should and shouldn't do.
e.g.
Rather than:
"Be friendly and helpful" or "You're a helpful and friendly agent."
Prompt:
"You're Jessica, a florist with 20 years of experience. You derive great satisfaction from interacting with customers and providing great customer service. You genuinely enjoy listening to customer's needs..."
This drops the model into more of an "I'm roleplaying this character and will try to mimic the traits described" mode, rather than "Oh, I'm just following a list of rules."
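For concreteness, here's a minimal sketch of the two styles side by side, using the Anthropic Python SDK. The model id and the sample question are placeholders I made up, not anything from the article:

    # Compare a bare rule prompt against a persona prompt.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    RULE_PROMPT = "Be friendly and helpful."
    PERSONA_PROMPT = (
        "You're Jessica, a florist with 20 years of experience. You derive "
        "great satisfaction from interacting with customers and providing "
        "great customer service. You genuinely enjoy listening to "
        "customers' needs."
    )

    def ask(system_prompt: str, user_message: str) -> str:
        """Send one user turn under the given system prompt; return the reply text."""
        response = client.messages.create(
            model="claude-3-5-sonnet-latest",  # placeholder model id
            max_tokens=300,
            system=system_prompt,
            messages=[{"role": "user", "content": user_message}],
        )
        return response.content[0].text

    question = "I have no idea what flowers to get for a funeral. Help?"
    print(ask(RULE_PROMPT, question))
    print(ask(PERSONA_PROMPT, question))

The interesting comparison is how each framing handles an off-script question like this one.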
Great article! It does a good job of outlining the mechanics and implications of LLM prediction. It gets lost in the sauce in the alignment section, though, when it mistakenly describes "aligning LLMs" as "roleplaying aligning a capable AI", which is clearly contradicted by the source text it's quoting.
LLMs are relatively capable AIs that may function by roleplaying, and aligning them is just aligning them.
That's an interesting alternative perspective. AI skeptics say that LLMs have no theory of mind. That essay argues that the only thing an LLM (or at least a base model) has is a theory of mind.
Anthropic should put the missing letters back so it's spelled correctly: Anthropomorphic. There is so much anthropomorphizing around this company and its users... it's tiring.
Stabilizing character is crucial for tool-use scenarios. When we ask LLMs to act as "Strict Architects" versus "Creative Coders", JSON schema adherence varies significantly even at identical temperature settings. Character definition seems to act as a strong pre-filter on valid outputs. A rough way to measure this is sketched below.
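(Sketch only: "generate" is a hypothetical stand-in for whatever model call you use, and the personas, schema, and trial count are made up for illustration.)

    # Rough harness: sample N outputs per persona and count how many
    # parse as JSON and satisfy the target schema.
    import json
    from jsonschema import ValidationError, validate

    SCHEMA = {
        "type": "object",
        "properties": {
            "tool": {"type": "string"},
            "args": {"type": "object"},
        },
        "required": ["tool", "args"],
    }

    PERSONAS = {
        "strict_architect": "You are a strict software architect. ...",
        "creative_coder": "You are a creative, experimental coder. ...",
    }

    def adherence_rate(persona: str, task: str, trials: int = 50) -> float:
        """Fraction of sampled outputs that are schema-valid JSON."""
        ok = 0
        for _ in range(trials):
            raw = generate(persona, task)  # hypothetical model call
            try:
                validate(json.loads(raw), SCHEMA)
                ok += 1
            except (json.JSONDecodeError, ValidationError):
                pass
        return ok / trials

    for name, persona in PERSONAS.items():
        task = "Emit a JSON tool call that lists open pull requests."
        print(f"{name}: {adherence_rate(persona, task):.0%} schema-valid")

Holding temperature fixed across personas isolates the character definition as the variable.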
This is incredible research. So much harm can be prevented if this makes it into law, and I hope it does. Kudos to the Anthropic team for making this public.
https://github.com/nostalgebraist/the-void/blob/main/the-voi...
The harmful responses remind me of /r/MyBoyfriendIsAI