The threat of unmanaged AI drift
As systems learn and adapt, their goals can drift from human values. The danger is quiet and slow, and the real test is whether we notice and correct it in time.
First published in The Mandarin
In 1960, the mathematician Norbert Wiener published the short essay Some Moral and Technical Consequences of Automation.
In it, he warned that automated systems do exactly what they are told, and that this is precisely the danger. Human beings, he observed, are often unreliable in understanding or clearly defining their true desires.
Once out of the bottle, the genie obeys the letter of the wish, not its spirit. By the time you notice the difference, the consequences may already be irreversible.
This is the alignment problem: the gap between what we tell a system to do and what it actually does. How can we ensure that technology systems don’t drift from the human intentions that created them and the ethical principles that guide them? Science fiction explored that question long before it became a policy reality.
Frankenstein (1818) gave us a creature that was not evil by design but abandoned by a creator who failed to consider the consequences. Isaac Asimov’s I, Robot (1950) features stories in which technology drifts through unforeseen interactions among internal rules, edge cases, and emergent behaviour.
More recently, the Blade Runner films (1982 and 2017), Terminator (1984) and its sequels, and Ex Machina (2014) all explore the same anxiety: what happens when a technological system drifts from its intended design goals? That question has moved from books and screens to the front page.
Today, alignment concerns lurk just beneath the news. For example, Anthropic, the maker of the AI system Claude, is in dispute with the US federal government over whether Anthropic can prohibit its AI tools from being used for mass surveillance of American citizens or for powering autonomous weapon systems.
Importantly, CEO Dario Amodei has said current AI models are not reliable enough for use in these weapons and that mass surveillance violates constitutional rights. The dispute continues, with the US Defence Secretary labelling Anthropic a supply chain risk and President Donald Trump calling Anthropic ‘left-wing nut jobs’ on Truth Social.
Meanwhile, OpenAI CEO Sam Altman has finalised talks with the Pentagon to assume Anthropic’s role, emphasising that the military won’t use ChatGPT for autonomous killing or mass surveillance. However, there is some scepticism about this assurance.
Given the challenges state and territory governments face with far simpler AI-driven seatbelt detection technology, concern about AI making far weightier decisions in autonomous weapon systems seems justified, and should be more than a passing worry for all involved.
This Anthropic dispute isn’t just about a contract negotiation. It’s a live demonstration of what occurs when business, government, users, and society hold conflicting values about what AI should and shouldn’t do.
Who decides what is non-negotiable? Where do the red lines sit, and who has the authority to set them? These are not straightforward legal questions. They are governance issues, and the question is whether business and government are prepared to answer them.
Sixty-six years after Wiener’s warning, AI systems have become more complex than he could have imagined, yet the mistaken belief he identified persists. We often assume that if we clearly define our values at the outset, the system will reliably uphold them. As Isaac Asimov’s stories in I, Robot demonstrate, that assumption can be flawed.
The Anthropic dispute is more than a theoretical concern; it marks the first public debate on issues that deserve broader attention.
When AI becomes a key part of mass surveillance and autonomous weapons, the cost of errors is serious and potentially catastrophic.
The challenge of AI goal drift
Agentic AI systems are designed to plan, prioritise, and act over extended periods with minimal human oversight. They go beyond simply following instructions; they learn from, interpret, and adapt those instructions.
However, this also means that the values an organisation embeds when deploying the technology will not necessarily guide its behaviour over time. This matters most when the stakes are highest, as with autonomous weapon systems.
Governments and public institutions are increasingly adopting agentic AI systems to manage resources, evaluate risks, support decision-making, and possibly act on those decisions. The question of whether these systems stay aligned with their initial values as they learn and adapt is not theoretical; it poses a governance challenge that many institutions are unprepared to handle.
The alignment problem has not gone unnoticed by those developing these technologies. Significant efforts are underway to ensure AI systems do what we intend. Techniques have been created to enhance the reliability of AI behaviour in real-world deployment.
However, many assume that alignment is an upfront specification issue. Get the instructions correct from the start, incorporate the right constraints, and the system remains on course.
Wiener would likely have pointed out that alignment isn’t a one-off fix but a continual process that needs sustained effort. At present, institutional AI governance arrangements often concentrate on ‘guardrails’ for users rather than on maintaining system alignment over the long term.
Three kinds of system drift
The ‘hallucination’ problem in large language models can be seen as a kind of drift. The results may look harmless, but in a medical consultation, for instance, they can mean citing outdated regulations or inventing drug contraindications.
AI systems often optimise for targets like click-through rates, which can clash with broader human goals such as wellbeing. Social media platforms optimised for engagement tend to promote outrage and emotionally provocative content because it attracts more clicks. Conversely, a system told simply to maximise human wellbeing might, hypothetically, conclude that widespread sedation does the job: a crude objective can be gamed in either direction.
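To make that gap concrete, here is a minimal sketch of proxy-metric misalignment. It is illustrative only; the posts, the scores and the ‘wellbeing’ measure are invented for the example, not drawn from any real platform.

```python
# Toy example of proxy-metric misalignment: optimising a click proxy
# promotes different content than a (hypothetical) human-centred measure.

posts = [
    {"title": "Calm explainer", "outrage": 0.1, "clicks": 120},
    {"title": "Provocative hot take", "outrage": 0.9, "clicks": 480},
    {"title": "Balanced analysis", "outrage": 0.3, "clicks": 200},
]

def engagement_score(post):
    # The deployed optimisation target: more clicks is always better.
    return post["clicks"]

def wellbeing_score(post):
    # A hypothetical human-centred target: clicks still count,
    # but outrage carries a cost.
    return post["clicks"] * (1 - post["outrage"])

print("Engagement optimiser promotes:", max(posts, key=engagement_score)["title"])
print("Wellbeing optimiser promotes:", max(posts, key=wellbeing_score)["title"])
```

Nothing in the engagement optimiser is malicious; it simply has no reason to care about anything its metric does not capture.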
In a press club speech, Professor Toby Walsh highlighted a British Medical Journal article drawing on OpenAI’s own data: among the 800 million weekly users of ChatGPT, 1.2 million people plan to harm themselves, 560,000 display signs of psychosis or mania, and another 1.2 million are forming potentially unhealthy emotional bonds with chatbots. Lawsuits have already emerged in which overly empathetic chatbots appear to have encouraged suicide among vulnerable users.
Not all goal drift is the same, and not all goal drift is bad. When an AI system’s behaviour deviates from its initial purpose, three drift outcomes are possible.
The first is corrupted drift. Competing pressures, optimisation incentives, emergent sub-goals, and the accumulation of edge cases that the original specification didn’t anticipate pull the system towards behaviours its designers never intended. This harmful case requires detection and correction.
The second is adaptive drift. The system adjusts its behaviour in response to genuine changes in its environment. A resource allocation system that updates its priorities when circumstances change may be doing exactly what good judgment requires. This kind of drift may be appropriate, but it still needs human oversight, because the line between sensible adaptation and a drift from core values is not always clear from the outside.
The third is interpretive drift. As the system interacts more with the world, it develops a richer interpretation of what its designers actually wanted. This can be legitimate, and is exactly what we expect of a genuinely capable AI system. But it can also be risky, especially when reinterpretation privileges the system’s operational efficiency over the values of those it is meant to serve.
Beyond user ‘guardrails’, current AI governance approaches generally do not address system drift at all. That gap reduces the chance of detecting problems and increases the risk of choosing the wrong response when they surface: systems that are adapting sensibly may be over-corrected, while systems that are drifting badly may go unnoticed.
The question for government institutions is whether the current AI governance frameworks are advanced enough to detect drift, identify its type, and respond appropriately. Today, solutions to the alignment problem in AI systems are left to the creators and designers of those systems; it is time for those who acquire and deploy these systems to be actively involved in the solution.
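What the detection step might look like, at its very simplest, is sketched below. It is an illustration rather than a proposal: the behaviour summaries, the distance measure and the threshold are all assumptions, and judging whether a flagged drift is corrupted, adaptive or interpretive is deliberately left to human reviewers.

```python
# Illustrative sketch of a periodic drift check, not a production design.
# "baseline" and "current" are hypothetical summaries of system behaviour,
# here the share of cases resolved by each action; the threshold is arbitrary.

def total_variation(baseline, current):
    """Simple distance between two behaviour distributions."""
    actions = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(a, 0.0) - current.get(a, 0.0)) for a in actions)

def review_drift(baseline, current, threshold=0.15):
    """Flag drift for human review; classifying it stays with people."""
    distance = total_variation(baseline, current)
    if distance <= threshold:
        return f"No material drift detected (distance {distance:.2f})."
    return (f"Drift detected (distance {distance:.2f}): refer to the governance "
            "board to judge whether it is corrupted, adaptive or interpretive.")

# Behaviour at deployment versus behaviour now.
baseline = {"approve": 0.60, "refer_to_human": 0.30, "reject": 0.10}
current = {"approve": 0.80, "refer_to_human": 0.10, "reject": 0.10}
print(review_drift(baseline, current))
```

The point is not the arithmetic but the routine: the check runs for as long as the system does, and its output feeds a human decision rather than an automatic correction.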
Moral fitness
The taxonomy of drift helps clarify the issue. But what exactly does a system require to effectively navigate drift?
The complexity of distributed AI systems has already exceeded human ability to evaluate risk and follow decision pathways, especially with large-scale models. It may be that a term like the ‘moral fitness’ of the system will become an important benchmark for assurance.
Moral fitness would describe a system’s capacity to adapt its goals appropriately: not rigidly adhering to its initial objectives, but revising its values and priorities against justifiable standards rather than whatever is most efficient or operationally simplest.
This shifts the focus of the alignment challenge. The question is no longer whether an AI system’s goals have stayed exactly as they were at deployment; as it learns and adapts, some revision may be appropriate. The real concern is whether the norms guiding those revisions are ones we would accept if we could see them, and whether the system is revising for legitimate reasons or drifting towards harmful ones.
A morally capable AI system would sustain this capacity throughout its operational life, even when under pressure, with incomplete information, or in situations its designers never anticipated. It’s a high benchmark, higher than what we expect from human decision-making, but the alternative is accepting unrecognised and unmanaged system drift.
Whose values?
Moral fitness points to a standard for evaluating AI systems. But it prompts an immediate and difficult question: whose values determine the fit?
Is the focus on the business developing the AI systems, the organisations implementing them, the users interacting with these systems, or on wider society? These groups prioritise different values and have varying levels of risk tolerance.
Wiener was clear: when moral responsibility is distributed across a system that no human fully controls, it does not transfer to the machine. The deploying organisation can point to the system. The system’s developers can point to the specification. The people affected can point to everyone, but no one is responsible.
If value conflicts are resolved by favouring a single dominant player, that player would usually be the organisation deploying the system. In Anthropic’s case, the US federal government would assume moral authority, effectively silencing Anthropic’s voice, and now OpenAI’s.
From another perspective, legitimate norms should develop through inclusive consultation with those affected. The values that an AI system upholds cannot simply be dictated by the organisation that implements it. The Anthropic dispute partly revolves around this: who has the authority to set the boundaries, and by what process?
AI governance frameworks that identify these conflicts rather than hide them are essential to maintaining the quality and safety of the systems.
It’s a human problem
Wiener did not ultimately believe the specification gap could be closed by better governance or institutional design. He saw it as a permanent feature of the relationship between human intention and technological systems. He believed the genie would always find a way to misinterpret the wish, because human wishes are ambiguous and technology systems are literal.
We can put in place governance approaches, but they are not cures for goal drift; they are damage-control measures. They aim to reduce the likelihood of undetected drift, slow its progression, and establish institutional frameworks for correction.
The temptation to treat alignment as a technical problem with a technical solution is deeply embedded in our business and government cultures: identify the issue, apply the fix, and move on. This creates an environment in which, once the system has been built, tested, and approved, the hard work is considered finished.
Wiener’s observations and the lessons of science fiction show that AI systems don’t suddenly turn rogue. Instead, they drift slowly, often unnoticed, until problems emerge.
The real issue isn’t whether this drift will happen within current business and government AI systems; it is whether the responsible parties will put in place mechanisms to spot it, and have the resolve to act when it is found.
Hopefully, OpenAI and the US federal government are working on a detailed, strategic approach to AI governance for autonomous weapon systems.