Chameleon, a mixed-modal early-fusion foundation model

In a new paper, Meta announces Chameleon, a mixed-modal early-fusion foundation model. Unlike earlier multimodal models, which model the different modalities (text, image, audio, etc.) separately, mixed-modal early-fusion models like Chameleon are end-to-end: they ingest all modalities from the start and project them into one shared representational space. This permits integrating information across modalities and generating multimodal documents. Indeed, the paper contains some nice examples of interleaved image and text generation (see below), which appears to be Chameleon's forte.
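To make the early-fusion idea concrete, here is a minimal Python sketch of how text and image content can be flattened into one shared token sequence for a single transformer. The tokenizers, vocabulary sizes, and sentinel tokens below are illustrative placeholders, not Chameleon's actual components.

```python
# Minimal sketch of early fusion: text and images are mapped into ONE
# discrete token sequence that a single end-to-end model consumes.
# All IDs and sizes here are hypothetical, for illustration only.

TEXT_VOCAB_SIZE = 65_536               # hypothetical BPE vocabulary
IMAGE_CODEBOOK_SIZE = 8_192            # hypothetical VQ image codebook
IMAGE_TOKEN_OFFSET = TEXT_VOCAB_SIZE   # image codes live above text IDs

BOI, EOI = 1, 2  # hypothetical "begin/end of image" sentinel tokens

def tokenize_text(text: str) -> list[int]:
    """Stand-in for a real BPE tokenizer: one ID per character."""
    return [3 + (ord(c) % (TEXT_VOCAB_SIZE - 3)) for c in text]

def tokenize_image(vq_codes: list[int]) -> list[int]:
    """Shift discrete image codes into the shared vocabulary range."""
    return [BOI] + [IMAGE_TOKEN_OFFSET + c for c in vq_codes] + [EOI]

def fuse(segments: list[tuple[str, object]]) -> list[int]:
    """Interleave text and image segments into one flat token sequence."""
    tokens: list[int] = []
    for kind, payload in segments:
        if kind == "text":
            tokens += tokenize_text(payload)
        else:
            tokens += tokenize_image(payload)
    return tokens

seq = fuse([("text", "A photo of a chameleon:"), ("image", [17, 4242, 913])])
```

Because everything lives in one vocabulary, the model can freely alternate between emitting text tokens and image tokens, which is exactly what interleaved generation requires.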

Interleaved query and response (from the paper).

Despite Meta's different practices in the social media department, the company is remarkably transparent in its GenAI business. The paper contains many interesting insights, and the Chameleon model will hopefully become open source, as its predecessors already are.

The paper describes the datasets used (not including Meta user data) and gives detailed insights into techniques for stable and scalable model training. Also interesting is the section about inference, which identifies three specific mixed-modal challenges: generated tokens must be copied from the GPU to the CPU to inspect their nature (i.e., text or image) and route them to the correct decoder; tokens that do not belong to the currently generated modality need to be masked; and finally, text's variable length must be seamlessly integrated with images' fixed-size blocks of tokens.
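The routing and masking challenges can be sketched in a few lines: the sampled token's ID range tells us which decoder should receive it, and logits outside the active modality are masked before sampling. The vocabulary layout and sizes below are illustrative assumptions, not Chameleon's actual configuration.

```python
# Sketch of the inference-time challenges described above. Assumes a
# hypothetical shared vocabulary where text IDs come first and image
# codebook IDs follow; all sizes are illustrative.
import math

TEXT_VOCAB_SIZE = 65_536
VOCAB_SIZE = TEXT_VOCAB_SIZE + 8_192
IMAGE_TOKENS_PER_BLOCK = 1_024   # images occupy fixed-size token blocks

def mask_logits(logits: list[float], mode: str) -> list[float]:
    """Disallow tokens that do not belong to the active modality."""
    if mode == "text":
        lo, hi = 0, TEXT_VOCAB_SIZE
    else:
        lo, hi = TEXT_VOCAB_SIZE, VOCAB_SIZE
    return [x if lo <= i < hi else -math.inf for i, x in enumerate(logits)]

def route_token(token_id: int) -> str:
    """CPU-side inspection: decide which decoder receives this token."""
    return "text_decoder" if token_id < TEXT_VOCAB_SIZE else "image_decoder"
```

In image mode, a real implementation would keep the mask active for exactly IMAGE_TOKENS_PER_BLOCK steps, since an image is a fixed-size block of tokens, whereas text segments can end at any length.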

As for evaluation, Chameleon excels at interleaved image and text generation while remaining very competitive on text-only and image-only tasks. Missing is a comparison against GPT-4o (to be fair, it launched only three days before this paper was published, indicative of the speed of innovation). Unfortunately, not much is known about GPT-4o's architecture. GPT-4o is likely much larger than the 34-billion-parameter Chameleon (which also comes in a 7B version) and trained on more data. If Chameleon holds up against GPT-4o, it might point us towards a future of smaller models, which is desirable in many ways. Note, however, that GPT-4o has audio capabilities that are currently absent from Chameleon.

Next to benchmarking (for the text-only and image-to-text tasks), the paper contains a large section about human evaluation, which serves as a great introduction to the topic for the uninitiated reader.
