Chameleon, a mixed-modal early-fusion foundation model

In a new paper, Meta announces Chameleon, a mixed-modal early-fusion foundation model. Unlike earlier multimodal models, which model the different modalities (text, image, audio, etc.) separately, mixed-modal early-fusion models like Chameleon are end-to-end: they ingest all modalities from the start and project them into one shared representational space. This permits integrating information across modalities and generating multimodal documents. Indeed, the paper contains some nice examples of interleaved image and text generation (see below), which appears to be Chameleon's forte.
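To make the early-fusion idea concrete, here is a minimal Python sketch of how text and image content can be flattened into one shared token sequence for a single transformer. The tokenizers, vocabulary sizes, and sentinel tokens below are illustrative placeholders, not Chameleon's actual components.

```python
# Minimal sketch of early fusion: text and images are mapped into ONE
# discrete token sequence that a single end-to-end model consumes.
# All IDs and sizes here are hypothetical, for illustration only.

TEXT_VOCAB_SIZE = 65_536               # hypothetical BPE vocabulary
IMAGE_CODEBOOK_SIZE = 8_192            # hypothetical VQ image codebook
IMAGE_TOKEN_OFFSET = TEXT_VOCAB_SIZE   # image codes live above text IDs

BOI, EOI = 1, 2  # hypothetical "begin/end of image" sentinel tokens

def tokenize_text(text: str) -> list[int]:
    """Stand-in for a real BPE tokenizer: one ID per character."""
    return [3 + (ord(c) % (TEXT_VOCAB_SIZE - 3)) for c in text]

def tokenize_image(vq_codes: list[int]) -> list[int]:
    """Shift discrete image codes into the shared vocabulary range."""
    return [BOI] + [IMAGE_TOKEN_OFFSET + c for c in vq_codes] + [EOI]

def fuse(segments: list[tuple[str, object]]) -> list[int]:
    """Interleave text and image segments into one flat token sequence."""
    tokens: list[int] = []
    for kind, payload in segments:
        if kind == "text":
            tokens += tokenize_text(payload)
        else:
            tokens += tokenize_image(payload)
    return tokens

seq = fuse([("text", "A photo of a chameleon:"), ("image", [17, 4242, 913])])
```

Because everything lives in one vocabulary, the model can freely alternate between emitting text tokens and image tokens, which is exactly what interleaved generation requires.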

Interleaved query and response (from the paper).

Despite Meta's different practices in the social media department, the company is remarkably transparent in its GenAI business. The paper contains many interesting insights, and the Chameleon model will hopefully become open source, as its predecessors already are.

The paper describes the datasets used (not including Meta user data) and gives detailed insights into techniques for stable and scalable model training. Also interesting is the section about inference, which identifies three specific mixed-modal challenges: generated tokens must be copied from the GPU to the CPU to inspect their nature (i.e., text or image) and route them to the correct decoder; tokens that do not belong to the currently generated modality need to be masked; and finally, text's variable length must be seamlessly integrated with images' fixed-size blocks of tokens.
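The routing and masking challenges can be sketched in a few lines: the sampled token's ID range tells us which decoder should receive it, and logits outside the active modality are masked before sampling. The vocabulary layout and sizes below are illustrative assumptions, not Chameleon's actual configuration.

```python
# Sketch of the inference-time challenges described above. Assumes a
# hypothetical shared vocabulary where text IDs come first and image
# codebook IDs follow; all sizes are illustrative.
import math

TEXT_VOCAB_SIZE = 65_536
VOCAB_SIZE = TEXT_VOCAB_SIZE + 8_192
IMAGE_TOKENS_PER_BLOCK = 1_024   # images occupy fixed-size token blocks

def mask_logits(logits: list[float], mode: str) -> list[float]:
    """Disallow tokens that do not belong to the active modality."""
    if mode == "text":
        lo, hi = 0, TEXT_VOCAB_SIZE
    else:
        lo, hi = TEXT_VOCAB_SIZE, VOCAB_SIZE
    return [x if lo <= i < hi else -math.inf for i, x in enumerate(logits)]

def route_token(token_id: int) -> str:
    """CPU-side inspection: decide which decoder receives this token."""
    return "text_decoder" if token_id < TEXT_VOCAB_SIZE else "image_decoder"
```

In image mode, a real implementation would keep the mask active for exactly IMAGE_TOKENS_PER_BLOCK steps, since an image is a fixed-size block of tokens, whereas text segments can end at any length.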

As for evaluation, Chameleon excels at interleaved image and text generation while remaining very competitive on text-only and image-only tasks. Missing is a comparison against GPT-4o (to be fair, it launched only three days before this paper was published, indicative of the speed of innovation). Unfortunately, not much is known about GPT-4o's architecture. GPT-4o is likely much larger than the 34-billion-parameter Chameleon (which also comes in a 7B version) and trained on more data. If Chameleon holds up against GPT-4o, it might point us towards a future of smaller models, which is desirable in many ways. Note, however, that GPT-4o has audio capabilities that are currently absent from Chameleon.

Next to benchmarking (for the text-only and image-to-text tasks), the paper contains a large section about human evaluation, which serves as a great introduction to the topic for the uninitiated reader.
