Ubiquity
Volume 2025, Number July (2025), Pages 1-19
Ubiquity Symposium: Artificial Intelligence: Generative AI¹
Peter J. Denning
DOI: 10.1145/3747356
Large language models (LLMs) are the first neural network machines capable of carrying on conversations with humans.² They are trained on billions of words of text scraped from the internet, and they generate text responses to text inputs. They have transformed public awareness of artificial intelligence, provoking reactions ranging from astonishment and awe to trepidation and horror. They have spurred massive investments in new tools for drafting texts, summarizing conversations and literature, generating images, coding simple programs, supporting education, and amusing humans. Experience has also shown that they are prone to responding with fabrications (called "hallucinations") that severely undermine their trustworthiness and make them unsafe for critical applications. Here, we examine the limitations imposed on LLMs by their design and function. These are not bugs but inherent limitations of the technology. The same limitations make it unlikely that LLM machines will ever be capable of performing all human tasks at the skill levels of humans.
In late 2022, large language models (LLMs) erupted into the public spotlight. For the first time in human history, we have machines that can engage in wide-ranging, uncannily good conversations with us. Pundits were quick to claim that LLMs were a giant step on the path to artificial general intelligence (AGI) and perhaps even "the singularity."³ These LLM machines have reshaped many expectations about machine intelligence and have created a host of new opportunities for innovation. Huge investments are being made to start new companies that offer tools and services that were not possible before. OpenAI's GPT Store gives a glimpse of what is coming. It offers an impressive array of LLM-based tools for writing support, image generation, productivity, research, summarizing research literature, mathematical manipulation, programming, education, and lifestyle resources.
LLMs have also created deep concerns about their safety and even the future of humankind. They are an inflection point in the computer revolution. They seem like newcomers in our neighborhood, and we are not sure what to do with them. For many people, this is profoundly unsettling. How shall we live with machines that talk like us? Will they achieve the power to outsmart us in every dimension of being human? Will they become sources of creativity and innovation? Will they take our jobs? Will they displace humanity?
Even worse, what if LLM machines are good enough to seem intelligent, but not good enough to respect human values? There are indications that this may be so. They cannot distinguish truth from falsehood or the possible from the impossible. They cannot care about anything. They cannot make or keep commitments. They cannot take responsibility. They are amoral. They have no ethics. We will explore these claims in this essay.
LLMs seem to be creative because they can compose short stories, write poems, code small programs, and draw images. But are they really creative? LLMs compute statistically probable responses to questions relative to their training data. The word "probable" is important here. Their responses reflect words and phrases appearing frequently in their training data. Jensen Huang, the CEO of NVIDIA, summarized this as "prior knowledge that enables us to know what is already known." Emily Bender, AI researcher at the University of Washington, called LLMs "stochastic parrots," meaning they repeat back composites of what has already been told them, given the context set by a prompt. LLM inferences can appear meaningful and valuable to human beholders because they resemble stories, poems, codes, and images that are already familiar. The "creations" of these machines are probable rather than improbable inferences from the training data.
Compare LLMs with a feat of true creativity: Einstein's theory of relativity. His special theory of relativity (1905) asserts that all observers measure the same speed of light (c), leading to the improbable equivalence of mass and energy expressed in his famous formula E = mc². His general theory of relativity (1915) gave an accurate prediction, verified during the solar eclipse of 1919, of the amount by which starlight is bent as it passes the Sun. It also explained and predicted the precession of the perihelion of Mercury's orbit, which was not possible within Newtonian mechanics.
The old idea that AI could eventually surpass human cognitive capabilities fell into the background by the early 2000s. It was rejuvenated after 2010 when it seemed supercomputing power would be its savior. Today, this dream is AGI.
AI Machines
The term "artificial intelligence" was coined in 1956 by a group of researchers who defined it as machines that could perform cognitive tasks that were normally considered uniquely human tasks. Over the years many AI technologies have been developed: speech recognition, computer vision, genetic algorithms, language translation, natural language processing, planning, board games such as Chess and Go, logic problem solvers, logic processing languages, expert systems, neural networks, medical diagnosticians, drug discoverers, and driverless cars.
Artificial neural networks (ANNs) were first envisioned in the early 1940s as electronic circuits modeling the neuronal structure of the brain. They evolved into mathematical models that can be simulated on a computer, then burst into public view after 2010 when sufficient computing power became available to train very large networks. ANNs can perform valuable tasks, including some of those noted earlier. This branch of AI has come to be known as deep learning. The current generation of LLMs can engage in surprisingly realistic conversations with us.
Where does this new LLM development fit in the AI landscape? In 2019, Ted Lewis and I proposed a hierarchy of AI machines ranked by learning power (see Table 1) [1]. A machine at one level is capable of learning tasks that machines of lower levels cannot. Here, "learning" means to acquire an input-output function in a reasonable amount of time through programming or presentation of many examples. Learning is not the same as computability; all the levels are Turing complete. With this hierarchy, we aimed to cut through the chronic hype of AI and show that AI can be discussed without ascribing human qualities to machines. At the time, there were easily identifiable machines up through Level 4 and little or nothing from Level 5 upward. Level 5 (then called creative AI) was for machines that could generate new works such as essays, speeches, computer code, solutions to math problems, poems, music, and art. Level 6 (human-machine interaction) was for designs that enabled humans and machines to "team up" on tasks neither could perform alone. Level 7 (aspirational AI) was for machines that might exist, such as those for AGI, but do not exist now and may never exist.
In 2022, the dramatic arrival of ChatGPT gave the public the first working machines at Level 5. These machines are called generative AI or LLMs, which are ANNs trained on huge amounts of natural-language text obtained from the internet. Many years of research have come together in these technologies. However, because the algorithms behind LLMs are not widely known, to many, the technology still looks like magic. Opinions about the good and bad of LLMs are all over the map.
LLM Basics
From the earliest days of electronic computing in the 1940s, people have wondered whether computers whose structures resemble brains might become intelligent. Researchers invented artificial neurons, circuits that simulate brain neurons. An artificial neuron can be in one of two states: excited (1) and quiescent (0). It receives signals from other neurons and enters state 1 if the sum of its inputs exceeds a threshold, remaining in state 0 otherwise. This idea did not take off for mainstream computing because it was too slow. Researchers continued to investigate this idea and evolved it into today's ANN, which consists of layers of artificial neurons. Each neuron of a layer has links connecting its output value (0 or 1) to every neuron of the next layer. Each link has a weight, 0 or larger, that determines what fraction of its input is delivered to its output. The weights are called parameters because they are adjusted during the training process. The training process works its way through a long series of examples (x, y), where y is the intended output when the network is presented with input x. (The x and y are usually encoded as bit vectors.) The training algorithm adjusts parameters to minimize the error between the actual and intended outputs. The most advanced model today, ChatGPT-4, has a core ANN with around 1 trillion parameters. The training process, which was conducted with a massively parallel supercomputer, took more than 90 days and cost more than $100 million. Once trained, an ANN is very fast, producing its output in milliseconds.
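To make the training idea concrete, the following Python sketch shows a single threshold neuron and an error-driven weight-adjustment loop. It is an illustration of the principle only, not how production ANNs are built: real networks have many layers, use smooth activation functions, and are trained by gradient methods on specialized hardware.

# Toy threshold neuron: output 1 if the weighted sum of inputs exceeds
# a threshold, 0 otherwise.
def neuron(inputs, weights, threshold):
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# Training examples (x, y): learn the logical OR of two inputs.
examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights = [0.0, 0.0]
threshold = 0.5
rate = 0.1  # size of each weight adjustment

# Repeatedly present the examples and nudge the weights to reduce the
# error between actual and intended outputs (the classic perceptron rule).
for _ in range(100):
    for x, y in examples:
        error = y - neuron(x, weights, threshold)   # +1, 0, or -1
        weights = [w + rate * error * xi for w, xi in zip(weights, x)]

print(weights)  # each weight ends up large enough that a single 1 input fires the neuron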
LLMs are ANNs created by a complex, two-phase process. The first phase sets up the core-ANN, whose job is to predict the next word that would appear after a given text (the "prompt"). To enable this, the prompt is encoded so that the nearness of words can be represented; words that are near those in the prompt are more likely to appear next. The core-ANN is trained on billions of words of text from the internet to respond to a prompt with a highly probable next word. The predicted word is appended to the prompt, which is cycled back as a new query of the ANN. This cycle is repeated to generate the sequence of words in the response. The second phase is called "fine-tuning" or "tweaking." It sets up a second ANN, the tweak-ANN, whose job is to respond to a (prompt, response) pair with a score that indicates human satisfaction with the response. The tweak-ANN is trained from a dataset of (prompt, response, score) elements obtained by generating a large number of random prompts, getting core-ANN responses to each, and having human readers score the responses. The tweak-ANN is then used in a reinforcement learning mode to adjust the weights in the core-ANN so that its responses tend to get higher scores according to the tweak-ANN. In some cases, user reactions to the tuned LLM's responses are fed back to further fine-tune the core-ANN's weights for even better results.
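The generation cycle of the first phase can be sketched in a few lines of Python. This is a schematic only; predict_next_word is a hypothetical stand-in for the trained core-ANN, which in real systems operates on sub-word tokens and probability distributions rather than whole words.

def generate(prompt, predict_next_word, max_words=100, stop_token="<end>"):
    # Start from the prompt, then repeatedly append the model's predicted
    # next word and feed the extended text back in as the new query.
    words = prompt.split()
    for _ in range(max_words):
        next_word = predict_next_word(words)   # a highly probable continuation
        if next_word == stop_token:            # model signals the response is done
            break
        words.append(next_word)
    return " ".join(words)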
When this process is completed, the basic fact remains: The LLM is an ANN that responds to prompts with highly likely statistical inferences from the training data. AI researchers use the term Bayesian learning, a derivative of Bayes' Rule in statistics, for machines that generate a set of most probable hypotheses given the data. Thus, the LLM neural network is a Bayesian inference engine. The consequence is subtle but important. A response is a string of words drawn from multiple text documents in the training set, but the string probably does not appear in any single document. That means responses can look novel, but there is no source document to consult for verification of truth. Indeed, LLMs have no means of verifying whether a response is truthful or bears any relation to reality. LLMs can, and do, cite nonexistent documents and quote from them.
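The point about novel-looking strings with no source document can be seen even in a toy statistical language model. The sketch below builds a word-level bigram model from two short "documents" and generates text by sampling probable next words; the output can easily be a sentence that appears in neither document, so there is nothing to consult to verify it. (An LLM is vastly more sophisticated, but the inferential character is the same.)

import random
from collections import defaultdict

documents = [
    "the cat sat on the mat",
    "the dog sat on the grass",
]

# Record which words follow which in the training texts.
following = defaultdict(list)
for doc in documents:
    words = doc.split()
    for a, b in zip(words, words[1:]):
        following[a].append(b)

def generate(start, length=5):
    # Repeatedly sample a statistically probable next word.
    word, output = start, [start]
    for _ in range(length):
        if word not in following:
            break
        word = random.choice(following[word])
        output.append(word)
    return " ".join(output)

print(generate("the"))  # can produce "the cat sat on the grass", which appears in neither document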
The Hallucination Problem
It has been known for many years that natural language text generators are prone to producing false and nonsensical answers [2]. Today, this is called the hallucination problem. Susceptible systems include abstractive summarization, dialog, general question-answer, data-to-text, and translation. In every case, researchers have sought ways to detect when a generator has gone awry. No detection method catches all the wrong answers. LLMs are particularly challenging because their outputs are composed of words drawn from many documents; it is next to impossible to find authoritative documents in the training corpus to validate an LLM's claim.
A large research effort is underway to develop LLM fabrication detectors. There has been some progress. In one example, researchers got an LLM to say "I don't know" when a query concerns events that occurred after the LLM was trained. A recent article in Nature discusses a new method that measures divergence among responses to alternative but equivalent questions about facts stated in the prompt. The less divergence, the lower the risk of fabrication. This new method detects more fabrications than other known methods [3]. Still, it fails to detect 20 percent or more of the fabrications.
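The underlying idea can be sketched simply: pose several equivalent phrasings of the same factual question, sample an answer for each, and treat disagreement among the answers as a warning sign. The Python below is a crude caricature of the method in [3], which clusters answers by meaning before computing entropy; here ask_model is a hypothetical function standing in for an LLM query, and exact string matching is a stand-in for semantic equivalence.

import math
from collections import Counter

def fabrication_risk(question_variants, ask_model, samples_per_variant=5):
    # Collect several sampled answers for each equivalent phrasing.
    answers = []
    for q in question_variants:
        for _ in range(samples_per_variant):
            answers.append(ask_model(q).strip().lower())
    # Entropy of the answer distribution: 0 when all answers agree,
    # larger when the answers diverge (higher risk of fabrication).
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())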
In 2023, a study by Brian Randell and Brian Coghlan seeking to use ChatGPT-3 to discover historical facts about computer pioneer Percy Ludgate showed astonishing fabrications [4]. Eighteen months later, my colleague Walter Tichy posed the same questions to Claude 3, a next-generation LLM emphasizing safety, and saw fewer fabrications. The newer generations of LLMs may be less susceptible to fabrication. More experimentation is needed.
The basic issue is that LLMs can make statistical inferences that do not correspond to reality. Nothing stops LLMs from making inferences from the training data that have no validating documentation in the training data. For example, I could ask, "What animals are a cross between a human and a horse?" and the LLM could respond, "A three-legged centaur." The three legs are a cross between two and four, and there are other well-known three-legged objects such as stools and tricycles. Unless it were paired with a search robot that seeks independent verification of claims, an LLM would not "know" that three-legged centaurs do not exist. When we are uncertain about the truthfulness of an LLM claim, we need to locate an independent, trusted, authoritative source to validate the claim. This is becoming increasingly difficult.
An Avalanche Arrives
LLMs are like an avalanche that has swept over society, upsetting many practices and identities. Two themes have shown up in the conversations that followed. One is optimistic and focuses on the good things the technology can bring in a plethora of narrowly defined applications. The other is pessimistic and focuses on the damage that can be wrought by improper use of the technology. You do not need to take sides, but you should be aware of both sides and act responsibly.
ANNs are generating much interest in many fields. Early experiments are promising. In medicine, the technology can identify polyps from colonoscopy images or cancer from mammogram images. In pharmaceuticals, it is accelerating the search for molecules that can become new drugs. In business, it creates new products and augmentations of existing products, it powers interactive voice customer service, and it inspires startups to try out new lines of business. In marketing, it enables new methods of collecting personal data and sending individualized ads. In journalism, it provides new ways to summarize documents and jump-start the writing process. In programming, it generates initial versions of code that can be rapidly edited and corrected by programmers. The technology has enabled all sorts of experimentation and new offers for productivity-improving software and services.
On the pessimistic side, as noted earlier, LLMs routinely "hallucinate," that is, make up answers to questions and present their fabrications as authoritative statements. This has raised serious questions about their trustworthiness. Access to training data may be inhibited by lawsuits from artists, musicians, writers, and newspapers over copyright infringement when LLMs are trained by incorporating their copyrighted works from the internet without permission. Law enforcement and political leaders are concerned about the ease with which troublemakers can commit crimes, such as cyberattacks, and create false documents, false images, false voice and video recordings, and other deepfakes. Political and civic leaders are deeply concerned about the use of LLMs to manipulate people in elections. Parents and social leaders are deeply concerned about their abuse in addicting young people to social media. Labor leaders are deeply concerned that LLM automation might extinguish many jobs. Healthcare leaders are concerned that LLM summaries of doctor-patient visits will contain life-threatening hallucinations. Some AI experts declare that LLMs will, without strong government regulation, spin out of control and inflict great damage on humanity.
This technology has changed what people perceive as possible in their lives and in the life of their community. It is reshaping the world and changing the realities for people, both positively and negatively. Prior to 2023, I seldom heard anyone say that a machine would pass the Turing test soon. By 2024, the mood had changed. I heard many who believed Ray Kurzweil's claim that LLMs would achieve this by 2029 [9]. In 1950, Alan Turing, who invented the test, thought it would take until the year 2000 before anyone would think it credible that a machine conversation could be indistinguishable from a human conversation. Even so, many experts believe detection tools will assist humans in distinguishing other humans from machines.
The Data Quality Problem
To alleviate some of the problems cited above, those who choose the data to train LLMs have been concerned with the quality of the data. OpenAI reported that the scraping process gathered over a trillion words. They discarded a huge number of those texts from untrustworthy sources, leaving behind a still-huge corpus of several billion words for training GPT-3. They are concerned about getting enough quality data to support the much larger LLMs in GPT-4 and later versions. Some researchers have stated that the linguistic abilities of LLMs taper off for larger corpora.
There is a huge concern that bias in the training data will cause LLMs to respond unfairly to prompts. For example, training data for facial recognition applications were dominated by white male faces, causing the recognizer to misidentify many persons who were not white males. Bias can arise from subtle sources. For example, most of the texts in LLM databases are in English, which means responses to prompts in non-English languages may not fit the asker's context. Most of the texts come from authors in developed countries; few authors come from countries where dissent is forbidden. Biases can even be introduced by the tweaking process because the human evaluators of LLM outputs have their own unseen biases.
What makes bias such a difficult issue? In human affairs, a bias is a prejudice, a disposition to favor a particular interpretation over others. It is very hard for us to see our own biases, and it is often unclear where we got them from. Biases serve a biological function by predisposing us toward interpretations that have proved useful or successful for us in the past. Our biases show up in our writings and assessments. They are absorbed into LLMs through the training and tweaking processes. Designers of machines that support a community may introduce biases favoring community norms. There is no good answer to this. Perhaps the best solution is to let LLMs awaken us to the reality that we all have biases, teach ourselves to become aware of our biases, and commit to listening to others despite our biases.
There is also a huge concern over threats from "synthetic data." These data are generated by LLMs (or other statistical models) but not directly from a human source. Synthetic data can be generated prolifically at low cost. LLMs are populating the internet with synthetic ads, news summaries, translations, blogs, and more. These documents are being used to train future generations of LLMs. Because synthetic data are noisy representations of the original sources, the quality of LLMs trained from them gradually decreases with each new generation. I have heard that some researchers have concluded that noticeable degradation appears in as few as five generations. With the big tech companies rushing to put LLMs into products, notably search, this will be a big hit to trust and verifiability. I heard a researcher describe the degradation process as "mad cow disease of the internet brain." We are facing a possible future in which no one will know what to trust, and there will be no way to independently verify claims.
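The degradation mechanism can be illustrated with a toy statistical analogy, not a simulation of LLM training: fit a simple model to data, generate synthetic data from the fitted model, refit on that output, and repeat. Sampling errors compound, and the model drifts further from the original data with each generation.

import random
import statistics

# Generation 0: the "original" data source, a standard normal distribution.
mean, stdev = 0.0, 1.0

for generation in range(1, 11):
    # Each generation trains only on a modest sample produced by the
    # previous generation's model.
    sample = [random.gauss(mean, stdev) for _ in range(20)]
    mean, stdev = statistics.mean(sample), statistics.stdev(sample)
    print(f"generation {generation}: mean = {mean:+.3f}, stdev = {stdev:.3f}")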
To deal with these issues, we will surely need technological help such as digital watermarks and signatures associated with trusted sources. Even with these tools, people will be selective about which sources they are willing to believe, giving credence to those matching their prejudices.
Entering Into Language
A main reason behind the perceptual shift around AI is that LLMs are the first machines that have finally "entered into language." People are starting to reassess possibilities and dangers based on the impressive linguistic displays of LLMs.
Language is not simply grammar and words; it is a milieu of expression, coordination, culture, customs, interpretation, and history that fundamentally fashions our way of being in the world. Despite their surprising capacity to participate in fluent, syntactically competent human-like conversations, LLMs do not share other human abilities conferred by language. It is more accurate, as some are saying, to see LLMs as manifesting an uncaring and unempathetic machine intelligence. LLMs cannot match the ways we humans shape and are shaped by language.
Noam Chomsky, a highly respected linguist, joined with two colleagues in a critique of LLMs [5]. In addition to the issues previously discussed, they noted that LLMs' method of learning language—scraping huge text databases from the internet—is completely different from the way children learn language. We acquire language by being immersed in the conversations and practices of human communities. This means there is ultimately no way LLMs can learn language in the way humans learn and live language—no way they can overcome the problems mentioned earlier. After iterating with ChatGPT on questions around morality and ethics, Chomsky concluded:
ChatGPT exhibits something like the banality of evil: plagiarism and apathy and obviation. It summarizes the standard arguments in the literature by a kind of super-autocomplete, refuses to take a stand on anything, pleads not merely ignorance but lack of intelligence and ultimately offers a "just following orders" defense, shifting responsibility to its creators.
Chomsky worried that with all the enthusiasm to incorporate LLMs into many computer applications, we will infuse this technology into many other technologies, a situation we might regret and be unable to turn off.
Being In Language⁴
Let's review the differences between humans and machines and then determine what conclusions we can draw about the capabilities of LLM machines to create new ideas and lead innovation [6].
• Care
Care is one of the most fundamental aspects of being human. Humans care about each other and about future possibilities. Care distinguishes between what matters and what does not. What we care about solicits our attention and action. We are not drawn to be in service of unimportant matters. We demonstrate our care by taking stands and sustaining them, in word and deed. Being in language with others enables us to commit our lives and coordinate with others around shared human concerns such as justice, progress, revolution, conservation, romantic love, artistic creativity, and much more. One important matter that humans cannot help caring about is the honesty and sincerity of others. We care about getting things right. We care about truth.
Machines do none of this. They do not care and cannot be programmed to care.
• Shared Spaces of Concerns
Language enables us to explicitly share matters of common concern and to coordinate our actions to take care of them. We call these shared spaces "worlds" and perceive them as realities. We have a sense of belonging to a larger whole, the "we" who share commitments and norms of proper human behavior. We co-create our worlds through our conversations and interactions. We pass on our beliefs, values, and norms to our children and others through our conversations with them in our worlds. We share convictions, commitments, assessments, and opinions. We imitate and influence each other, often without even realizing it. We change each other's minds. We develop life-long friendships and seek mentors to shape our lives. We clean up distrust, banish resentments, and open new futures together. We socialize and adapt in the shared space, shaping each other's ways of thinking, acting, and being.
Being incapable of having concerns, machines are incapable of forming social spaces of shared concerns. Machines have no conscience. They cannot discern the appropriateness of their actions or experience remorse for negative consequences of their decisions. They are utterly unaware of the broader context in which they are generating inferences or assertions. Their conclusions, therefore, correspond to reality only by accident or when constrained by a restrictive set of context-free rules, as in a game.
• Commitments
Language enables humans to make and deliver on commitments. Our commitments structure our worlds. Commitments are always social: we make them to other people. We hold each other responsible for how we live up to our commitments. Consistent success at fulfilling commitments generates a rapport of trust that circulates in our communities. And consistent failure generates distrust, anger, and resentment. Trust, in turn, shapes the interactions others are willing to have with us.
Philosophers of language have identified five kinds of commitments: requests, promises, declarations, assertions, and assessments. The language acts in which we generate these commitments are events in the relationships between people. They generate actions and expectations of the future and thereby shape the world. When expectations are not met, a breakdown in the relationship often happens, and further conversations are needed to repair it.
A statistical model of language (an LLM) is potentially able to track conversations, keeping records of agreements and conveying them to the responsible parties. But keeping records of commitments is not the same as enacting them. LLMs cannot make commitments at all or take responsibility for words that sound like commitments. If an LLM fails or causes damage, we do not blame the model or hold it responsible; we hold the designers, users, or ourselves responsible.
• Moods and Emotions
Moods and emotions are among the most important ways we experience the world together. Moods such as wonder, confusion, or overwhelm are embodied dispositions that shape the possibilities we can see. Emotions such as joy or anger are embodied reactions to events. Both are closely linked to our ability to make assessments in language. An emotion is a reactive assessment of a current event; a mood gives us assessments about the future and shapes what actions are possible for us. Assertions and assessments are our means to make sense of the world and to explore our own and other people's concerns.
Language is permeated by emotional resonance. Resonance is a feeling of attunement to another person. We can be insulted or flattered simply by someone's tone in talking with us. Resonances can support or impede actions. Communities resonate with collective moods such as anxieties about pandemics, joy when a sports team wins a match, or distrust of government institutions. Competent leaders read and flow with resonances, avoiding making requests when people are not in a receptive mood, and making requests when they are.
LLMs can generate text strings that signify emotions and moods, but these strings are statistical constructions. Having no concerns and no bodies, LLM machines have no emotions and no moods, and no means to develop sensibilities for them.
• The Background
Our language conveys the ripples of conversations passed down through the years and centuries from prior generations. Our beliefs, customs, mannerisms, practices, and values are inherited from the conversations of our forebears, combining with the conversations we share today. Most of the time, we think, speak, and act against this historical background of presuppositions and prejudices without being aware of it. This background is boundless, with no definite beginning or end, extending beyond every horizon.
We have the remarkable ability to sense and reveal what is in the background, to make what is tacit explicit. Paradoxically, we often react to a revelation of something hidden in the background with "that's obvious." What we call "common sense" is all that goes without saying in this tacit "background of obviousness" that nevertheless makes sense when revealed and brought into conversation.
In the 1980s, failures of expert systems were attributed to missing "common sense facts" that are obvious to us, but not to the machine. Expert system designers sought compendia of common-sense facts that the machine could use. Perhaps the most famous of these efforts was the Cyc project started by Douglas Lenat in 1984; it accumulated 25 million common-sense facts after 40 years. Yet even that treasury could not add up to a background of common sense and make expert systems smart enough to be experts. It remains to be seen whether combining Cyc with an LLM will improve the LLM.
It is reasonable to ask whether LLMs can infer background context statistically. Since all these machines can do is infer from already written texts, and since people are generally unaware of their tacit knowledge and cannot write about it, it seems unlikely that these machines can infer text that has not been written or recorded.
It is also reasonable to ask whether a robot powered by an LLM could learn tacit knowledge through practice. Since tacit knowledge depends on fine details of biology and structure, the most we can say now is that whatever tacit knowledge is developed by a learning robot is likely to be different from tacit knowledge developed by a human being. This question deserves more exploration.
Imagination is another human ability that flows from our tacit background. It is a capacity to conceive possibilities that do not exist and can become incorporated into our shared background once articulated. Although LLMs have generated some surprisingly imaginative poetry, it is more likely that these are unexpected statistical inferences rather than genuine creations relative to the background. This question deserves more exploration.
• Embodied Action
Our ability to act exceeds our linguistic powers. Much of what we "know" is in the form of embodied practices rather than descriptions and rules—knowing-how rather than knowing-that. Even if we can linguistically describe a practice, reading the description does not impart the skill of performing the practice. Michael Polanyi, a philosopher, captured the paradox in his famous saying, "We know more than we can tell."
Descriptions of actions can be represented as bits and stored in a machine database. However, performance skill can be demonstrated but not decomposed into bits. Performance knowledge, what psychologists call "procedural memory" (memory of how to do things), is deeply ingrained into our embodied brains, nervous systems, and muscles. This intuitive, embodied sense of relevance resists being objectively measured, recorded, or described.
In 1980, Stuart and Hubert Dreyfus defined a hierarchy of skill levels – beginner, advanced beginner, competent, proficient, expert, and master. In their hierarchy, the beginner has no embodied skill and can only act by explicitly following decontextualized rules. The expert has a fully embodied familiarity with typical situations and acts without following rules. A person increases in skill and embodiment through practice, often with the help of coaches and mentors who already have the skill.
Machines store knowledge given to them by rules, algorithms, and data. This applies both to traditional logic machines, which are programmed, and to modern neural networks, which are trained over given data. The statistical inferences performed by LLMs are computed by the algorithms defined by the neural network. In contrast, human bodies live and interact in their vast intangible interpretative structures, constantly shaped by tacit knowledge. Because tacit knowledge cannot be recorded, it seems unlikely that statistical inference from recorded data can reveal it.
In fact, this is the reason we design and build machines. They can muster calculation speeds or marshal kinetic forces well beyond human capabilities. Machines with an exogenous "body" of hardware can get their gears, levers, hydraulics, and circuits to do tasks on a scale that is impossible for embodied humans. That is what makes machines valuable to us.
The End Of Programming?
ANNs are still logic systems at root. Their deep-layered architecture is a network of nodes completely describable by a set of rules specifying link connections and weights. But with trillions of links in an LLM, specifying their rules in a programming language is intractably complex. Traditional programming does not work. Machine learning, which does work, is a form of automatic programming by which the ANN acquires an input-output function by being shown examples that enable it to give good approximations to the correct answers. But does this mean the demand for programmers who work with programming languages will disappear?
LLM enthusiasts claim that the end of programming is at hand. LLMs quickly write code once a human gives the specifications. In a few years, goes the claim, mature LLMs will be capable of doing almost all programming. The tedium of designing algorithms, proving them correct, testing code, and debugging will be gone. At least, that is the goal.
The current state of the art is nowhere near that goal. LLMs can produce small programs, perhaps a few hundred lines. Their code almost always contains errors and cannot be used until a human programmer locates and removes them. The main advantage seems to be that the LLM can rapidly produce a pretty good initial draft of a small program. This amounts to a small but noticeable gain in productivity for producing small codes.
LLMs can be very useful in areas where they do not need to scale up. One such case is robot interfaces that translate a human command into robot commands. Another might be as a front end to search engines, although it appears the jury is still out about whether that would yield trustworthy search results.
For many other coding tasks, LLMs cannot scale up. When the codes are a little longer (say 1,000 lines), the human checkers are already hard pressed to find and remove all errors. For perspective, the Windows 11 operating system is estimated to be 50 million lines of code. Unless something fundamental changes in the way LLMs generate code, large, trustworthy systems will be beyond their reach. Traditional software engineering will continue to be needed.
Professional programmers have coding and design skills that rely on extensive experience, intuition, familiarity with user domains and concerns, and subtleties of practice. Design is well beyond LLM capabilities because LLMs can deal only with symbolic knowledge, not tacit knowledge. In addition, professional programmers aspire to make programs that are dependable, reliable, usable, safe, and secure—all goals that rely on understanding the concerns and practices of human users.
Some further insight can be gained by considering how we assess software quality and trustworthiness [7]. There are three levels of quality: meets specifications, produces no negative consequences, and leaves users delighted. The current state of the art of LLM coding does not even attain the lowest level of quality.
Can LLMs Outsmart Us?
Enthusiasts claim that LLMs will soon encompass all human knowledge once they absorb every scrap of text, speech, and image on the internet.
This is nonsense. The documents archived online are overwhelmingly English and come from only a subset of all humans. Many people live in oppressive countries where they are not allowed to speak up and record their views. Enormous troves of recorded data are in inaccessible archives in nondigital form. For example, the Venetian State Archives record the wisdom of the Republic of Venice for a thousand years; much of what resides on their 80 kilometers of shelving has not been consulted recently and is not available digitally. Large quantities of personal data are locked on private servers inaccessible on the internet. In short, a massive chunk of human expression is not represented in any of the data used to train LLMs. Most LLM training data comes from a limited set of educated people in modern times.
There is an even more important reason that all human knowledge cannot be acquired by any LLM. The whole of human knowledge has at least three aspects. One is symbolic descriptions that can be represented and recorded, such as rules, procedures, algorithms, and facts. The second is performance skills, which are the tacit knowledge that can be demonstrated but cannot be represented or recorded—what we call expertise. It is possible that a robot could learn performance skills by imitating humans even though there is no way to communicate those skills in symbols. The third aspect is the background of shared context discussed earlier. The totality of descriptive knowledge is minuscule compared to performance knowledge. Thus, even if we could unlock and digitize every archive and unmute every silenced voice, LLMs would still have no access to the bulk of human knowledge.
Conclusions
LLMs reveal striking new statistical predictabilities in our use of language and have harnessed them in a deeply impressive way. They have exposed some serious gaps in our understanding of language. How much of the language we use every day can be modeled by statistical inference? Can the mathematics of inference (Bayesian analytics) make inferences that no one has ever considered? Are those inferences "creations" or just "revelations" of what was hidden in the data?
Inference is a third-person phenomenon susceptible to mathematical formalization. Human immersions in care, responsibility, communities, assessments, imaginations, futures, commitments, worlds, moods, emotions, backgrounds of obviousness, and embodied action are not formalizable and are out of reach for LLMs. Rather than chase a chimera of growing LLMs into superhuman intelligences, why not follow a strategy that acknowledges humans and machines each have powers that the other lacks? Then focus on the designs of machines and their interfaces that augment human powers with machine powers. Focus on apps that augment human work and relieve grunt work rather than replace human workers [8]. For many, this would be a better outcome than the singularity of AI machines merging with humans [9].
How shall we deal with the hallucination problem? The problem is formulated as a machine intelligence that occasionally fabricates falsehoods. It would help to reframe the problem by returning to the basics of LLM design and function. LLMs are statistical inference engines that compute output strings that are statistically probable given the training data and the prompt. In this formulation, every LLM output is a fabrication. The machine cannot tell which outputs are true or meaningful because the meanings of outputs are not in the machine or its neural network. They are in the interpretations by human users of the outputs.
This is why LLMs cannot distinguish truth from falsehood or the possible from the impossible. They cannot care about anything, take responsibility for anything they say, take a stand on anything, or make and keep commitments. They are amoral, they have no ethics, and they have no sense of the consequences of actions they might recommend.
LLMs are a new kind of machine. They aren't going away. We are in a time of caveat emptor with these machines. We need to understand their powers and limits. Hard science is one of our best tools to this end. We need to define our terms rigorously, formulate our claims of LLM capabilities as falsifiable hypotheses, carefully execute experiments to verify claims, and present clear summaries of our findings on the powers and limitations of the machines. This process is well under way with researchers seeking methods of validating claims made in LLM outputs, looking for design principles that reduce the incidence of outputs judged as fabrications, and reporting where the machines produce value or create unsafe risks. This puts a burden of responsibility on us. When we are users of LLMs, we can refrain from basing actions on unverified claims made in the outputs from these machines. When we are designers of LLM applications, we avoid unverifiable promises about the utility and safety of our constructions.
Given all we have said, we can conclude that LLMs are likely to be useful in a wide range of areas where they can do narrow tasks well and customize solutions to individual users. They are likely to be risky in another wide range of areas where human judgment is important, and the costs of automated mistakes are high. A benefit, if we are open to it, is that we can gain a much sharper understanding of the core question of computer science: what human tasks can be automated and performed well by machines?
Humans live in language. Machines are outside of language. If machines develop an intelligence, it will seem very alien to us, and we might regret our achievement.
Acknowledgements
I am grateful to my fellow Ubiquity editors for comments and insights on the drafts of this essay, especially Espen Andersen, Robert Akscyn, Kemal Delic, Jeff Johnson, Ted Lewis, Andrew Odlyzko, Walter Tichy, and Martin Walker. I am also grateful to Fernando Flores, Chauncey Bell, and B. Scot Rousse for conversations on the powers that language gives humans that are not machine-reproducible.
References
[1] Denning, P. and Lewis, T. Intelligence may not be computable. American Scientist (2019), 346–349.
[2] Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 55, 12, Article 248 (2023).
[3] Farquhar, S. et al. Detecting hallucinations in large language models using semantic entropy. Nature 630, 8017 (2024), 625–630.
[4] Randell, B. and Coghlan, B. ChatGPT's astonishing fabrications about Percy Ludgate. IEEE Annals of the History of Computing 45, 2 (2023), 71–72.
[5] Chomsky, N., Roberts, I., and Watumull, J. Noam Chomsky: The false promise of ChatGPT. The New York Times, March 8, 2023.
[6] Denning, P. and Rousse, B. S. Can machines be in language? Commun. ACM 67, 3 (2024), 32–35.
[7] Denning, P. Software quality. Commun. ACM 59, 9 (2016).
[8] Acemoglu, D. and Johnson, S. Power and Progress: Our Thousand-Year Struggle over Technology and Prosperity. PublicAffairs, 2023.
[9] Kurzweil, R. The Singularity is Nearer: When We Merge with AI. Viking, 2024.
Author
Peter J. Denning is a Distinguished Professor of computer science at the Naval Postgraduate School in Monterey, California. He is a past president of ACM (1980-82). He received the IEEE Computer Pioneer Award in 2021. His most recent books are Computational Thinking (with Matti Tedre, MIT Press, 2019) and Navigating a Restless Sea (with Todd Lyons, Waterside, 2024).
Footnotes
1. This essay includes excerpts from three Communications of the ACM articles written by the author: "Can Generative AI Bots be Trusted?," "The Smallness of Large Language Models," and "Can Machines Be in Language?"
2. In 1966, Joseph Weizenbaum published Eliza, a conversational program that used keyword substitution to simulate Rogerian psychotherapy sessions. It was not a neural network.
3. AGI is the idea that computers will eventually be better than humans at every human cognitive task. The singularity is the moment in history when computers develop superhuman intelligence, after which it is impossible to predict what will happen to humanity.
4. This section draws from "Can Machines Be in Language?" [6].
Tables
Table 1. Hierarchy of AI machines
2025 Copyright held by the Owner/Author.