DALL-E


DALL-E (stylized DALL·E) is an artificial intelligence program, revealed by OpenAI on January 5, 2021, that creates images from textual descriptions.[1] It uses a 12-billion-parameter[2] version of the GPT-3 Transformer model to interpret natural language inputs (such as "a green leather purse shaped like a pentagon" or "an isometric view of a sad capybara") and generate corresponding images.[3] It can create images of realistic objects ("a stained glass window with an image of a blue strawberry") as well as objects that do not exist in reality ("a cube with the texture of a porcupine").[4][5][6] Its name is a portmanteau of WALL-E and Salvador Dalí.[2][3]

Many neural nets from the 2000s onward have been able to generate realistic images.[3] DALL-E, however, is able to generate them from natural language prompts, which it "understands [...] and rarely fails in any serious way".[3]

DALL-E was developed and announced to the public in conjunction with CLIP (Contrastive Language-Image Pre-training),[1] a separate model whose role is to "understand and rank" DALL-E's output.[3] The images that DALL-E generates are curated by CLIP, which ranks the candidates and presents the highest-quality images for any given prompt.[1] OpenAI has refused to release source code for either model; a "controlled demo" of DALL-E is available on OpenAI's website, where output from a limited selection of sample prompts can be viewed.[2] Open-source alternatives trained on smaller amounts of data, such as DALL-E Mini, have been released by the community.[7]
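The curation step can be illustrated with a short, hedged sketch. The generator is a hypothetical placeholder function, generate_candidates (standing in for any text-to-image model, such as an open-source DALL-E Mini checkpoint); only the reranking step uses a real, publicly released CLIP checkpoint via the Hugging Face transformers library. This is an illustration of the generate-then-rank idea described above, not OpenAI's pipeline.

```python
# Illustrative sketch of "generate candidates, then rank them with CLIP".
# NOT OpenAI's code; generate_candidates() below is a hypothetical placeholder.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_with_clip(prompt: str, images: list) -> list:
    """Return candidate images sorted from best to worst CLIP match for the prompt."""
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_text[0]: similarity of the single prompt to each candidate image
        scores = model(**inputs).logits_per_text[0]
    order = scores.argsort(descending=True).tolist()
    return [images[i] for i in order]

# Usage (generate_candidates is hypothetical and not defined here):
# candidates = generate_candidates("an armchair in the shape of an avocado")
# best_first = rank_with_clip("an armchair in the shape of an avocado", candidates)
```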

According to MIT Technology Review, one of OpenAI's objectives was to "give language models a better grasp of the everyday concepts that humans use to make sense of things".[1]

The Generative Pre-trained Transformer (GPT) model was initially developed by OpenAI in 2018,[8] using the Transformer architecture. The first iteration, GPT, was scaled up to produce GPT-2 in 2019;[9] in 2020 it was scaled up again to produce GPT-3.[10][2][11]

DALL-E's model is a multimodal implementation of GPT-3[12] with 12 billion parameters[2] (scaled down from GPT-3's 175 billion),[10] which "swaps text for pixels" and is trained on text-image pairs from the Internet.[1] It uses zero-shot learning to generate output from a description and cue without further training.[13]
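The "swaps text for pixels" idea, in which text tokens and image tokens are concatenated into a single sequence and modeled autoregressively by one Transformer, can be sketched as follows. This is a toy illustration, not OpenAI's implementation: the class name, layer counts, and model width are invented for the example, and only the sequence lengths and vocabulary sizes loosely follow values reported for DALL-E (256 text tokens plus 32x32 image tokens).

```python
# Toy sketch of a text-to-image autoregressive Transformer over a joint token
# sequence (text tokens first, then image tokens). All sizes are illustrative.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192   # assumed vocabulary sizes
TEXT_LEN, IMAGE_LEN = 256, 1024         # 256 text tokens + 32x32 image tokens
D_MODEL = 512                           # toy model width

class TextToImageTransformer(nn.Module):   # hypothetical class for illustration
    def __init__(self):
        super().__init__()
        # one shared embedding table over the combined text+image vocabulary
        self.tok_emb = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, D_MODEL)
        self.pos_emb = nn.Embedding(TEXT_LEN + IMAGE_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        # encoder layers plus a causal mask emulate a decoder-only model
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens):              # tokens: (batch, seq)
        seq = tokens.size(1)
        pos = torch.arange(seq, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # additive causal mask: -inf above the diagonal blocks attention to the future
        mask = torch.full((seq, seq), float("-inf"), device=tokens.device).triu(1)
        return self.head(self.blocks(x, mask=mask))

# Toy forward pass: next-token logits over the joint vocabulary for one sequence
tokens = torch.randint(0, TEXT_VOCAB + IMAGE_VOCAB, (1, TEXT_LEN + IMAGE_LEN))
logits = TextToImageTransformer()(tokens)    # shape (1, 1280, 24576)
```

At generation time such a model would be given only the text tokens and sampled token by token to fill in the image positions; the sketch shows just the training-style forward pass over a full text-plus-image sequence.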