
On Monday, Microsoft researchers presented Kosmos-1, a multimodal model that can analyze images for content, solve visual puzzles, perform visual text recognition, pass visual IQ tests, and understand natural language instructions. The researchers believe that multimodal AI, which integrates different types of input such as text, audio, images, and video, is an essential step toward building artificial general intelligence (AGI) that can perform general tasks at a human level.
“Being a basic part of intelligence, multimodal perception is a necessity to achieve artificial general intelligence, in terms of knowledge acquisition and grounding to the real world,” the researchers wrote in their academic paper, “Language Is Not All You Need: Aligning Perception with Language Models.”
Visual examples from the Kosmos-1 paper show the model analyzing images and answering questions about them, reading text from an image, writing captions for images, and taking a visual IQ test with an accuracy of between 22 and 26 percent (more on that below).
An example, provided by Microsoft, of Kosmos-1 answering questions about images and websites.
An example, provided by Microsoft, of “multimodal chain-of-thought prompting” with Kosmos-1.
An example of Kosmos-1 answering visual questions, provided by Microsoft.
While the media buzzes with news about large language models (LLMs), some AI experts point to multimodal AI as a potential path toward artificial general intelligence, a hypothetical technology that could ostensibly replace humans at any intellectual task (and any intellectual job). AGI is the stated goal of OpenAI, Microsoft’s main business partner in artificial intelligence.
In this case, Kosmos-1 appears to be a pure Microsoft project without OpenAI’s involvement. The researchers call their creation a “multimodal large language model” (MLLM) because its roots lie in natural language processing, like a text-only LLM such as ChatGPT. And it shows: for Kosmos-1 to accept image input, the researchers must first translate the image into a special series of tokens (essentially text) that the LLM can understand. The Kosmos-1 paper describes this in more detail:
For input format, we flatten input as a sequence decorated with special tokens. Specifically, we use <s> and </s> to denote start- and end-of-sequence. The special tokens <image> and </image> indicate the beginning and end of encoded image embeddings. For example, “<s> document </s>” is a text input, and “<s> paragraph <image> Image Embedding </image> paragraph </s>” is an interleaved image-text input. … An embedding module is used to encode both text tokens and other input modalities into vectors. Then the embeddings are fed into the decoder. For input tokens, we use a lookup table to map them into embeddings. For the modalities of continuous signals (such as image and audio), it is also feasible to represent inputs as discrete code and then regard them as “foreign languages.”
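To make that “foreign languages” idea concrete, here is a minimal Python sketch of how an interleaved image-text input might be flattened into one embedding sequence for a decoder-only model. The helper names (encode_text, encode_image, flatten_interleaved) and the specific marker token IDs are illustrative assumptions, not Microsoft’s released code.

# Minimal sketch (not Microsoft's code) of flattening an interleaved
# image-text input into a single embedding sequence for an MLLM decoder.
import numpy as np

VOCAB_SIZE, EMBED_DIM = 32000, 64
rng = np.random.default_rng(0)

# Lookup table that maps discrete text tokens to embedding vectors.
text_embedding_table = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))

def encode_text(token_ids):
    # Map text token IDs to embeddings via the lookup table.
    return text_embedding_table[np.asarray(token_ids)]

def encode_image(image):
    # Stand-in for a vision encoder that turns an image into a short
    # sequence of continuous embeddings (e.g., one vector per patch).
    num_patches = 4
    return rng.normal(size=(num_patches, EMBED_DIM))

def flatten_interleaved(segments):
    # Flatten text and image segments into one embedding sequence,
    # mirroring the "<s> paragraph <image> ... </image> paragraph </s>" layout.
    parts = [encode_text([1])]                 # <s>: start-of-sequence marker
    for kind, payload in segments:
        if kind == "text":
            parts.append(encode_text(payload))
        elif kind == "image":
            parts.append(encode_text([3]))     # <image> marker
            parts.append(encode_image(payload))  # continuous image embeddings
            parts.append(encode_text([4]))     # </image> marker
    parts.append(encode_text([2]))             # </s>: end-of-sequence marker
    return np.concatenate(parts, axis=0)       # fed into the language-model decoder

sequence = flatten_interleaved([
    ("text", [101, 102, 103]),   # e.g., "a dog sitting on"
    ("image", "photo.jpg"),      # placeholder image input
    ("text", [104, 105]),        # e.g., "the grass"
])
print(sequence.shape)  # (markers + text tokens + image patches, EMBED_DIM)

The key design point the paper describes is that once images are wrapped in markers and embedded, the decoder treats them like any other stretch of the sequence.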
Microsoft trained Kosmos-1 using data from the web, including excerpts from The Pile (an 800GB English-language text resource) and Common Crawl. After training, the researchers evaluated Kosmos-1’s abilities on several tests, including language understanding, language generation, OCR-free text classification, image captioning, visual question answering, web page question answering, and zero-shot image classification. In many of these tests, Kosmos-1 outperformed current state-of-the-art models, according to Microsoft.

Of particular interest is Kosmos-1’s performance on Raven’s Progressive Matrices, which measures visual IQ by presenting a sequence of shapes and asking the test taker to complete the sequence. To test Kosmos-1, the researchers fed in the puzzle one candidate at a time, with each answer option completing the sequence, and asked whether the completed answer was correct. Kosmos-1 could correctly answer a question on the Raven test only 22 percent of the time (26 percent with fine-tuning). This is by no means a slam dunk, and errors in methodology could have affected the results, but Kosmos-1 beat random chance (17 percent) on the Raven IQ test.
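As a rough illustration of that evaluation protocol, the Python sketch below scores each candidate completion and picks the highest-scoring one, which reduces the task to a multiple-choice accuracy measurement where chance over six options is about 17 percent. The function model_yes_probability is a hypothetical stand-in for querying the model, not Kosmos-1 itself.

# Minimal sketch (not the paper's code) of a Raven-style evaluation loop:
# fill each candidate into the puzzle, ask the model whether the completed
# matrix is correct, and choose the option with the highest "yes" score.
import random

def model_yes_probability(puzzle, candidate):
    # Placeholder for the model's probability that `candidate` correctly
    # completes `puzzle`; a real evaluation would query the MLLM here.
    return random.random()

def answer_raven_item(puzzle, candidates):
    # Score every candidate completion and return the best-scoring index.
    scores = [model_yes_probability(puzzle, c) for c in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])

def accuracy(items):
    # Fraction of items answered correctly; random guessing over six
    # candidates lands near 1/6, roughly the 17 percent baseline quoted above.
    correct = sum(
        answer_raven_item(puzzle, candidates) == answer_index
        for puzzle, candidates, answer_index in items
    )
    return correct / len(items)

# Toy run with dummy puzzles: six candidate completions per item.
random.seed(0)
dummy_items = [("puzzle", list("ABCDEF"), 2) for _ in range(1000)]
print(f"chance-level accuracy: {accuracy(dummy_items):.2%}")  # roughly 17%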
Still, while Kosmos-1 represents early steps in the multimodal field (an approach that others are also pursuing), it’s easy to imagine that future optimizations could bring even more significant results, allowing AI models to perceive and act upon any form of media, which would greatly enhance the abilities of AI assistants. In the future, the researchers say they’d like to scale up Kosmos-1 in model size and incorporate speech capability as well.
Microsoft says it plans to make Kosmos-1 available to developers, though the GitHub page cited by the paper had no clear Kosmos-specific code when this story was published.