
What Stable Diffusion learned about wrongness

AI / generative images · 10 min read
remainder · error

The file

The file weighs 24.08 kilobytes.

It was uploaded to CivitAI on February 10, 2023, by the anonymous user rqdwdw. No explanation of what it does. No theory. No manifesto. A file, a folder path, and a caveat: results may vary depending on the model. The filename is easynegative.safetensors.

As of this writing, it has been downloaded 723,000 times. It holds a 4.99 out of 5 rating across 51,000 reviews. It has been used to generate over 132 million images.

Open the file. Inside: a header naming one tensor, its shape (8 × 768), its data type (32-bit floating point). Then 6,144 numbers. None larger than a tenth. Most within a hundredth of zero. Sample values: −0.0004, 0.0095, −0.008, −0.0209, 0.0295, −0.0086. This is what 132 million images were shaped by. A grid of numbers so small they are almost nothing.
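
You can perform that inspection yourself in a few lines of Python with the safetensors library. The tensor's key inside the file varies between exports, so this sketch simply walks whatever the header declares rather than hardcoding a name:

```python
# A minimal inspection sketch using the safetensors library.
# The tensor key inside EasyNegative varies by export, so we iterate
# over whatever the header declares instead of assuming a name.
from safetensors import safe_open

with safe_open("easynegative.safetensors", framework="pt") as f:
    for key in f.keys():
        tensor = f.get_tensor(key)
        print(key, tuple(tensor.shape), tensor.dtype)  # e.g. (8, 768) torch.float32
        flat = tensor.flatten()
        print(flat[:6].tolist())          # a handful of near-zero values
        print(float(flat.abs().max()))    # none larger than about a tenth
```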

EasyNegative is a textual inversion embedding — a pseudo-word with a position in latent space but no human-readable definition. The technique was originally designed to teach Stable Diffusion personal concepts: your face, your dog, a particular art style. You show the model a few images, and it learns coordinates in 768-dimensional space that capture the concept. Then you type the pseudo-word, and the model navigates to that position.
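
Mechanically, "learning coordinates" means writing new rows into the text encoder's token-embedding table. Here is a minimal sketch of that insertion with Hugging Face transformers; the token names and the random tensor are illustrative stand-ins, not EasyNegative's actual contents:

```python
# Sketch: how an 8 x 768 embedding becomes eight pseudo-tokens in the
# CLIP text encoder. Token names and the tensor are illustrative.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

learned = torch.randn(8, 768) * 0.01                 # stand-in for the trained tensor
tokens = [f"<easynegative_{i}>" for i in range(8)]   # hypothetical token names

tokenizer.add_tokens(tokens)                          # extend the vocabulary
text_encoder.resize_token_embeddings(len(tokenizer))  # grow the embedding table

embeddings = text_encoder.get_input_embeddings()      # the token-embedding matrix
for tok, vec in zip(tokens, learned):
    embeddings.weight.data[tokenizer.convert_tokens_to_ids(tok)] = vec
# Typing the pseudo-word now routes any prompt through these coordinates.
```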

EasyNegative is the same method, pointed the other way. Instead of training on images of something the model should produce, rqdwdw trained it on images of what the model should refuse. Bad outputs. Failures. The anatomically broken, the visually degraded, the aesthetically wrong. The result is not a list of banned features. It is a direction — a vector pointing away from everything the training data's consensus rejected.

The most widely adopted aesthetic intervention of the generative AI era is smaller than a thumbnail image, and it describes not what an image should be, but everything it should not.


The space

The interface presents a blank text box, implying infinite possibility. The space behind it is already full.

Stable Diffusion's latent space was shaped by approximately five billion image-text pairs scraped from the open internet. That sounds neutral — a statistical average, a representative sample. It is not. The images that survive online, the images that accumulate links and metadata and alt-text, the images that search engines surface and datasets scrape, are overwhelmingly the images that capital rewards: commercial photography, stock images, advertising. What sits at the center of the distribution is not an average of all possible images. It is the aesthetic residue of the market.

Generate a portrait with no style guidance and you will see it: soft-box lighting, amber-to-rose palette, skin smoothed to porelessness, the faint commercial glow of an image that exists to sell something. The visual equivalent of elevator music. No one is offended. No one is moved.

Every user working against the default feels this center as a gravitational pull. The model smooths grit. It heals damage. It drifts back toward glow. It is like swimming against a current you cannot see but whose direction you learn, stroke by stroke, to predict.

EasyNegative operates within this shaped space. It pushes away from failure — from blur, from deformity, from the aesthetically rejected. But because the space is not neutral, the geometry of repulsion is also the geometry of attraction. Push away from wrongness and you slide deeper toward the commercial center. You cannot negate the default without reinforcing the rest of the field. Every refusal is also a selection. Every force away is a force toward.

Before EasyNegative, users navigated this space by hand. When AUTOMATIC1111 added a second text box to the Stable Diffusion interface on September 25, 2022, it created a space for describing what images should not contain. Users filled it with litanies — strings of sixty, seventy terms copied from guides and forums and passed like recipes: worst quality, low quality, normal quality, lowres, blurry, jpeg artifacts, watermark, username, bad anatomy, bad hands, extra fingers, fewer digits, cropped, worst quality, deformed, ugly. The lists grew. They were debated, ranked, refined. A UX researcher called the users medieval alchemists, and the comparison was more precise than she intended. In the absence of transparency — a system that cannot tell you what it wants — animism is the most rational available epistemology. The litanies were apotropaic: lists of wrongness recited to ward it off.

EasyNegative compressed all of that communal, iterative, contested practice into one word. Eight tokens instead of seventy. Drop the file into a folder and type the name. What had been a negotiation — conducted in language, under conditions of mutual opacity, between a user who could not see and a model that could not explain — became a single invocation you could forget you were making.
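
In code, the invocation really is that small. A sketch using diffusers, where the base checkpoint is an assumption on my part; EasyNegative targets SD 1.5-family models, and, as rqdwdw warned, results vary with the checkpoint:

```python
# Sketch: the whole litany collapsed to one negative-prompt word.
# The base checkpoint below is an assumption; EasyNegative was trained
# for SD 1.5-family models and results vary with the checkpoint.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.load_textual_inversion("easynegative.safetensors", token="easynegative")

image = pipe(
    prompt="portrait of a woman, soft light",
    negative_prompt="easynegative",   # eight tokens wearing one name
).images[0]
```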


The paradox

What the model was actually doing when it received the command to refuse was not discovered until 2024.

A team at UCLA, publishing at the European Conference on Computer Vision, found something that users had suspected but could not prove. In the first three steps of a typical thirty-step denoising process, a negative prompt does not repel the forbidden object. It produces it.

The reason is mathematical. In a uniform noise field — the starting point of every generated image — adding a feature and subtracting a feature are indistinguishable operations. Tell the model "not tower" when there is no structure present, and the result is the same as telling it "tower": a tower pattern appears. In a world with no structure, repulsion is indistinguishable from attraction.

The positive prompt finds its target within four steps. The negative prompt cannot locate what it needs to suppress until approximately step eight. By then, the trajectory is nearly set — consecutive noise estimates show 99.5% cosine similarity. The forbidden object has already left its mark on the image's structure.

But it gets worse. Even when the negative prompt succeeds — in the middle steps, once it has found its target — it does not work by pushing the image away. It works by generating a complete rendering of the forbidden thing and then subtracting it. This is the mechanism behind classifier-free guidance: the system produces two predictions at each step, one guided by the negative prompt and one by the positive, and the difference is used to steer the image away from the negative and toward the positive. Erasure requires creation. To remove something from an image, the model must first imagine it there, deliberately, in full.
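
Here is a simplified sketch of that arithmetic, one denoising step of classifier-free guidance, with `unet` standing in for the real noise-prediction model. The algebra in the comments is why the ghost must be rendered before it can be cancelled:

```python
# Simplified sketch of one classifier-free-guidance step.
# `unet` stands in for the real noise-prediction model; the combination
# below is the standard CFG formula used by Stable Diffusion samplers.
def guided_noise(unet, latents, t, pos_emb, neg_emb, scale=7.5):
    eps_pos = unet(latents, t, encoder_hidden_states=pos_emb)  # render the wanted
    eps_neg = unet(latents, t, encoder_hidden_states=neg_emb)  # render the forbidden, in full
    # Steer away from the negative, toward the positive:
    #   eps = eps_neg + scale * (eps_pos - eps_neg)
    #       = scale * eps_pos - (scale - 1) * eps_neg
    # The forbidden prediction must exist before it can be subtracted.
    return eps_neg + scale * (eps_pos - eps_neg)
```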

Monster, then ghost, then absence.

The negative prompt must produce the very wrongness it exists to eliminate. The system imagines what it is told not to imagine — once accidentally, in its blind early steps, when repulsion and attraction are the same gesture; and once deliberately, in the middle steps, when it constructs a ghost to cancel what it has found. The wrong is not absent from the final image. It was passed through. Every image made with EasyNegative is a séance conducted in reverse: the ghost comes first, the absence comes after.

The folk litanies were structurally accurate all along. Naming summons. Summoning is the condition of erasure.


The wrong

What is compressed into twenty-four kilobytes of wrongness?

The shallowest layer is technical failure. Bad hands, extra fingers, collapsed anatomy, fused digits — the signature errors of diffusion models, which cannot reliably count or place repeated body parts. This is the layer most users think they are addressing.

Beneath the anatomical failures is a second, more consequential layer. Look at what else the litanies name: watermark, username, jpeg artifacts, signature, cropped, lowres. These are not errors of generation. They are traces of circulation — fingerprints of ownership, transmission, compression, and the material conditions under which images move through the world. Every watermark is an act of ownership. Every JPEG artifact is a record of transmission, evidence that this image passed through hands and servers and compression algorithms that left their marks on it the way handling leaves marks on paper.

Negating them is not quality control. The word for this, in other contexts, is laundering — removing marks of origin so that the object can circulate as if it had no history.

The two layers together compose a single instruction: forget that a human was ever here. What remains after the negation is an image with no history of production, no record of ownership, no trace of the labor that made it or the infrastructure that carried it. An image that appears to owe nothing to anyone.

Each of these judgments is defensible in isolation. No one wants blurry images. No one wants extra fingers. But together, compounded across 132 million iterations, they constitute a selection pressure — not against visual failure, but against the presence of human beings in images. And specifically against the humans who could least afford to make their images look expensive.

And the wrongness has a particular address. EasyNegative was trained on Counterfeit, an anime-style checkpoint built on images from Danbooru — a Japanese fan art community board with its own quality tags, its own aesthetic hierarchy, its own standards for what counts as finished work. One subculture's preferences, trained into eight tokens, applied to 132 million images across every genre and style. The universal standard of wrongness has a very particular address, and almost no one who uses it knows where it comes from.


The human

The technical failures are already disappearing. SDXL handles anatomy. Flux handles hands. Midjourney resolved it early. The toolness of the tool is being sealed away — newer models no longer expose their failure modes to users. The litanies are becoming unnecessary, and with them the communal epistemology that the litanies sustained.

But the 132 million images are in the world. They sit at the center of the distribution — clean, commercial, debtless — and they will be scraped into the next generation of training data. The probable produces images that become training data that make the probable more probable. The center gets denser. The gravitational pull gets stronger. The variance gets narrower. Each pass through the loop compresses the range of what the system considers acceptable.

What is being selected for is the clean, the high-bandwidth, the commercially viable. What is being selected out is the low-resolution, the watermarked, the compressed, the amateur — the poor images, in both senses of the word, are being bred out of the distribution. Not by anyone's intention. Not by design. By the accumulated weight of 132 million individual judgments that each felt like common sense.

If the process ran to completion — if the feedback loop closed perfectly — the system would produce one image, forever. The same smooth, debtless, commercially optimized surface, generated and regenerated without variation. The singularity is not an explosion of intelligence. It is the heat death of aesthetic difference.

But the human keeps returning. Classified as wrong, pushed out, and returning — not because the system tolerates human presence but because it cannot structurally finish removing it. New models produce new failures. New practices generate new litanies. The erasure is never complete. What is excluded does not disappear. It accumulates at the boundary, exerting force, and the boundary shifts, and the exclusion must be performed again, differently, against a changed remainder.

The human is what the classification cannot contain. The human is what comes back. The human is wrong.

Perhaps the real — with its watermarks, its compression artifacts, its evidence of labor and bandwidth poverty, its six-fingered hands and cracked screens — is unbearable. Perhaps the wrong is simply the real, and we cannot bear to look at it, and twenty-four kilobytes is a small price to pay for not having to.

The images look good.