Jacob Browning is a postdoc in NYU’s Computer Science Department working on the philosophy of AI.
For as long as people have fantasized about thinking machines, there have been critics assuring us of what machines can’t do. Central to many of these criticisms is the idea that machines don’t have “common sense”: the kind of lapse on display when an artificial intelligence system recommends you add “hard cooked apple mayonnaise” or “heavy water” to a cookie recipe.
In a seminal paper, “Representational Genera,” the late philosopher of AI John Haugeland argued that a unique feature of human understanding, one machines lack, is the ability to describe a picture or imagine a scene from a description. Understanding representations, Haugeland wrote, depends on “general background familiarity with the represented contents — that is, on worldly experience and skill.” It is our familiarity with representations, like the “logical representations” of words and the “iconic representations” of images, that allows us to ignore scribbles on paper or sounds and instead grasp what they are about — what they are representing in the world.
Which is why OpenAI’s recently released neural networks, CLIP and DALL-E, are such a surprise. CLIP can provide descriptions of what is in an image; DALL-E functions as a computational imagination, conjuring up objects or scenes from descriptions. Both are multimodal neural networks, artificial intelligence systems that discover statistical regularities in massive amounts of data from two different ways of accessing the same situation, such as vision and hearing.
CLIP and DALL-E are fed words and images and must discern correspondences between specific words and objects, phrases and events, names and places or people, and so on. Although the results — as with all contemporary AI — are a mix of jaw-dropping successes and embarrassing failures, their abilities offer some insight into how representations inform us about the world.
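To make this word-image pairing concrete, here is a minimal sketch of how CLIP’s publicly released weights can be queried through the Hugging Face transformers library to rank candidate captions against a picture; the image path and the captions are placeholder examples, not anything drawn from OpenAI’s own demos.

```python
# A minimal sketch: asking CLIP which of several captions best matches an image.
# Assumes the Hugging Face transformers and Pillow packages are installed;
# "photo.jpg" is a placeholder path standing in for any local image.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
captions = [
    "a football player evading a defender",
    "a politician speaking at a podium",
    "a bowl of cookie dough",
]

# Tokenize the captions and preprocess the image into the tensors CLIP expects.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds one similarity score per caption; softmax turns them
# into a probability-like ranking of which description best fits the picture.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```

The same learned correspondence, run in the generative direction, is what lets DALL-E conjure an image from a phrase.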
One problem with many criticisms of AI, a problem that CLIP and DALL-E expose, is that the meaning of “common sense” is ambiguous. Many observers seem to imagine common sense as a matter of words, such as a bunch of sentences in the head cataloging the beliefs a person holds. Another approach would be to base common sense around mental images, like a massive model of the world our brains can consult.
Haugeland opened up yet another approach to common sense — which he did not take — centered around neural networks, which are a kind of “distributed representation.” This way of representing the world is less familiar than logical and iconic, but arguably the most common. It treats common sense not as a matter of knowing things about the world, but a matter of doing things in the world.
The Distinction Between Logical And Iconic Representations
In his article “Representational Genera,” Haugeland noted that humans use many kinds of representations, like the pictures we frame and hang around the house or the descriptions that fill books. He argued that what distinguishes logical, iconic and distributed representations is what they can or cannot represent about the world. Each represents only a small portion of the world, and does so in its own peculiar way — capturing some features but ignoring many others.
Humans absorb these representations using background knowledge, “fleshing out” missing details based on common sense. Shorn of background knowledge, logical contents — a single word or phrase, a few notes on a music score, the markings in an equation or sentence — typically represent only what philosophers call “discrete facts”: objects and properties, musical phrases or the relation of numbers in an equation.
By contrast, iconic representations — images, maps, music recordings or videos — involve elements that only make sense in relation to each other: shapes in a picture, the location of a mountain range or the various positions and movements of actors in a movie. Iconic representations depend on the relationship between elements and their locations, like how a black-and-white photograph represents certain wavelengths of light at different locations. Both kinds of representation are expressive, but logical representations cannot capture relations between elements without adding more information, whereas iconic representations cannot depict elements non-relationally.
Neither of these forms of representation reflects how we experience them. Musicians looking at a familiar musical score — a logical representation — will instantly imagine their favorite recording of the piece: an iconic representation. But this is the work of our background familiarity with both kinds of representation.
Take an article about a recent New York mayoral debate. An image might show a series of human bodies standing awkwardly behind podiums with bright red, white and blue shapes and patterns behind them. By contrast, the article discusses policy ideas, personal attacks, one-liners and sharp rebukes about policing. At the skeletal level, these refer to entirely different things: a group of bodies on the one hand and a group of topics on the other. That we grasp the text and image as related is based on our background understanding of how news articles work, because we understand the bodies are people running for office who are talking to and about each other.
These are the kinds of skills needed for switching between representations, skills Haugeland took to be beyond the abilities of machines. And this is why the success of DALL-E and CLIP is so surprising. These systems don’t just recognize and reproduce skeletal content; they also flesh it out, contextualizing it with the tacit information implied by the logical modality about what should be depicted in the iconic modality.
Take a specific example: There is no generic image DALL-E can generate when faced with the phrase “football player evading a defender,” no one-to-one correspondence the machine can learn that would allow it to memorize the right answer. Instead, it needs to discern a many-to-many correspondence that captures all kinds of different features: two players, fully clothed, on a field, under lighting, with either a soccer ball at their feet or a football in their hand (but not both), close up or from a distance, surrounded by other players or maybe a referee but no eagles or bikes — and on and on.
This means DALL-E needs to represent the world — or, at least, the visible world made available in static images — in terms of what matters based on the kinds of descriptions people give of a scene. Distributed representations, with neural nets being the most common kind, provide their own distinct way of representing things, one capable of pulling from both logical and iconic representations in the effortless ways humans do.
Getting Distributed Representations Into View
We are familiar with logical and iconic representations because they are ubiquitous artifacts of our everyday lives. Distributed representations, on the other hand, have only recently become artifacts because of the success of deep learning, though they are both older and more common than anything artificial. Evolution stumbled onto this kind of solution for brains early on, since these networks provide an incredibly efficient means for representing the world in terms of what matters for the agent to act appropriately. Contemporary AI roughly mimics some of the architectural design and learning tactics present in all brains to approximate feats pulled off by nature.
Haugeland suggested we think of distributed representations as representing skills or know-how. It may seem strange to say a skill “represents” something, but skills depend on recognizing the relevant patterns in a task, grasping which nuances and differences matter and which responses are the most appropriate.
The skill for playing ping-pong, for example, needs to represent how a ball with spin looks coming off a peculiar swing of the paddle, as well as which responses will be effective. The speed of the game requires recognition and response to happen instantaneously, far faster than we can consciously understand spin and decide how to react. Neural networks, biological and artificial, condense recognition and response into the same act.
For a familiar example in AI, take highway driving. It is a relatively simple task: ensure the car is equidistant between lane markers, maintain a constant distance from the next car and, when lane changes are necessary, track the relative positions of nearby cars. This means the system can be precisely tuned to these patterns of visual data — lane markers, car shapes and relative distances — and ignore all the other stuff, like the colors of the cars or the chipped paint on the lane markers. There are only a few outputs available — maintain speed, go faster, go slower, stop, turn left, turn right — and the correct one is largely defined by the visual inputs: brake if too close, turn slightly to stay in lane and so on.
The skeletal content of a distributed representation of highway driving, then, is just the association between the relevant visual patterns in the input that will trigger one output rather than another. The result is a highly detailed representation of the situation, but one that is different from logical or iconic representations. The distributed representation has nothing in it that “looks like” a car or acts as a “description” of the road. It instead encodes how certain visual patterns fit together in a way that reliably tracks cars and, thus, should be handled in a certain way. When humans go on “autopilot” while driving, they plausibly resort to a similar representation, effortlessly and unconsciously responding to lanes, cars and potholes — largely without noticing much of anything.
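As a rough illustration (not a description of any production driving system), the association just described can be pictured as a small network that maps a handful of visual features straight onto a handful of driving responses; the feature names, action labels and network size below are hypothetical stand-ins.

```python
# Illustrative sketch of a distributed representation as a skill: a tiny, untrained
# network that maps a few extracted visual features directly onto driving actions.
# The features and actions are hypothetical stand-ins for what a real lane-keeping
# system would compute from camera input.
import torch
import torch.nn as nn

ACTIONS = ["maintain", "speed_up", "slow_down", "stop", "steer_left", "steer_right"]

policy = nn.Sequential(
    nn.Linear(4, 32),   # inputs: offset from lane center, heading error,
    nn.ReLU(),          #         gap to the lead car, closing speed
    nn.Linear(32, len(ACTIONS)),
)

# One moment of driving: slightly left of center, drifting right, safe gap, closing slowly.
features = torch.tensor([[-0.3, 0.05, 25.0, -1.2]])
action = ACTIONS[policy(features).argmax(dim=-1).item()]
print(action)  # arbitrary until the weights are trained on driving data

# Nothing inside `policy` looks like a car or reads like a description of the road;
# after training, the weights simply associate input patterns with the response
# they should trigger.
```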
The main challenge for these skills is the same one facing humans: preventing a “deer in headlights” moment. Many infrequent events will be represented in the model, like driving on a slippery road or under limited visibility. But really rare events will fail to be represented at all and will instead be treated as something else; there likely won’t be a representation of a deer in the road, so the system will (hopefully) lump it into the broad category of nondescript obstacles and respond by slamming on the brakes.
This indicates a limit of the representation, which is that many possible inputs simply won’t be sufficiently distinct because they aren’t statistically common enough to be relevant. These distributed representations, in this sense, have a kind of tunnel vision — they represent what elements are most essential for the task and leave out the rest. But this goes for both biological and artificial networks, as well as logical and iconic representations; no representation can represent everything.
With CLIP and DALL-E, what matters is capturing how things should look in relation to a particular phrase. This obviously requires insight into how words describe objects. But they also need to figure out what is tacitly indicated by the phrase — whether the object is in the foreground or background, posing or in action, looking at the camera or engaged in some task and so on.
Understanding what matters based on a phrase requires building up rough multimodal representations that, on the one hand, map the relationship of words with other words and, on the other hand, words with various kinds of images. A phrase with the word “democrat” needs to pull up not just Joe Biden but also blue flags, tacky bumper stickers and anthropomorphic donkeys in suits. The ability of CLIP and DALL-E to pull off these feats suggests they have something like common sense, since representing any particular element in a plausible way demands a tacit general understanding of many other elements and their interconnections — that is, all the other potential ways something could look or be described.
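One way to picture how such a shared word-and-image space gets built is the contrastive objective described in the CLIP paper: embed a batch of images and their captions, then train so that each caption’s embedding lines up with its own image rather than with any other. Here is a rough sketch of that objective, with the encoders and the temperature value left as assumptions:

```python
# Rough sketch of a CLIP-style contrastive objective: matched image-caption pairs
# are pulled together in a shared embedding space, mismatched pairs pushed apart.
# The encoders that produce the embeddings are assumed to exist elsewhere.
import torch
import torch.nn.functional as F

def clip_style_loss(image_embeds: torch.Tensor,  # shape (batch, dim)
                    text_embeds: torch.Tensor,   # shape (batch, dim)
                    temperature: float = 0.07) -> torch.Tensor:
    # Normalize so the dot product becomes cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Similarity of every image in the batch with every caption in the batch.
    logits = image_embeds @ text_embeds.T / temperature

    # The i-th image belongs with the i-th caption; every other pairing is a mismatch.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)    # each image picks its caption
    loss_texts = F.cross_entropy(logits.T, targets)   # each caption picks its image
    return (loss_images + loss_texts) / 2

# Example with random stand-in embeddings for a batch of eight image-caption pairs.
print(clip_style_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```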
But ascribing common sense to CLIP and DALL-E doesn’t feel quite right, since the task is so narrow. No living species would need to acquire a skill just for connecting captions and images. Both captions and images are social artifacts, governed by norms to keep them formulaic: short and sweet descriptions and crisp, focused images. They are useless at seemingly similar tasks, such as producing captions for videos or creating short movies. The whole activity is just too artificial — too specific and disconnected from the world. It seems like common sense, if it is anything, should capture more generality than this.
Rethinking Common Sense
An old philosophical tradition took common sense to be the place where all our modalities come together — where touch, taste and vision united in the mind to form a multimodal iconic model of the external world. For AI researchers operating in the 20th century, it was more common to think of a giant written encyclopedia, where our beliefs were written down in cross-referencing sentences — a database of logical representations.
But in either case, it required someone to consult these representations: a central reasoner who would pick out what matters in the model or the database (or both) to figure it all out. It is no surprise people struggled to create common-sense AI, since it seemed you’d need a system that could not only know everything but also know how to access all the relevant stuff when solving a common-sense puzzle.
But when normal people talk of common sense, it tends to be because someone lacks it — someone behaving awkwardly or saying stupid things. When we ascribe common sense, it is to people who behave normally — people who have the skills and know-how to navigate the world. This model of common sense is less like the logical and iconic versions, where common sense is expected to be some giant body of knowledge in the brain, and hews closer to what we see in distributed representations.
Neural networks often generate a distributed representation that captures the right way to understand and act given a specific task. Multimodal neural networks allow these distributed representations to become much more robust. In CLIP and DALL-E’s case, the rich connections between logical and iconic representations provide them with a background familiarity about the world — discerning not just how words hang together but also what they imply about what things look like.
This approach to understanding makes more sense from an evolutionary perspective: let each species come up with the appropriate representations relative to its body, modalities and skills. What is meaningful to each species is relative to the world it inhabits, and what isn’t meaningful just doesn’t need to be represented. The common sense of a dog is its ability to do lots of dog-like things well, but there certainly isn’t any central reasoner inside a dog, or any database of language-like sentences specifying its beliefs and desires. A species represents its world in terms of how it should respond, and leaves the rest unrepresented.
This more modest take on common sense has implications for supposed worries about superintelligent machines hoovering up vast amounts of data — perhaps the encyclopedia of beliefs or the model of everything — and thereby becoming an omnicompetent general reasoner. But CLIP and DALL-E demonstrate that this is backwards: doing precedes knowing, and what we need to do determines what we know. Any representation of the world — logical, iconic or distributed — involves an assumption about what does and does not matter; you don’t take a picture of a sound. Humans know a lot because they do a lot — not vice versa.
Machine understanding is not an all-or-nothing matter. Machines will continue to understand more through the piecemeal accumulation of skills that expand what they can do. This means artificial general intelligence won’t look like what we thought, but will likely be similar to us — a bundle of skills with rough-and-ready representations of what it needs to know to accomplish its various tasks. There is no more to general intelligence than that.