Multimodal Foundation Models

  • Previously, each foundation model was designed for a specific output modality, for example text-only like GPT-3, image-only like DALL-E, or speech-only like early versions of Whisper.

  • The current approach is to build multimodal foundation models, which can process and understand multiple types of input (and, ideally, generate multiple types of output).

  • Within multimodal foundation models, some are multimodal-input only: they accept multiple types of input but generally generate only text, e.g. Amazon Nova Pro, Claude 3, GPT-4 Vision.

  • On the other hand, there are multimodal foundation models with specialized output, which generate one specific type of output: Amazon Nova Canvas, DALL-E 3, Stable Diffusion, and Midjourney specialize in generating images; Amazon Nova Reel, Runway Gen-2, and Pika Labs specialize in generating video; MusicGen and AudioCraft generate audio/music; Amazon Nova Sonic and ElevenLabs generate speech.

  • The end goal of this trend is truly multimodal foundation models that can both understand and generate multiple types of data, for instance Gemini, which can generate text and also images (with limited capabilities).
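The three categories above can be sketched as a small classifier over a model's input and output modalities. This is a minimal illustrative sketch, not a real API; the `classify` function and the modality sets for each example model are simplified assumptions.

```python
def classify(inputs: set, outputs: set) -> str:
    """Classify a model by the modalities it accepts and produces
    (hypothetical helper mirroring the taxonomy in these notes)."""
    if len(inputs) > 1 and outputs == {"text"}:
        return "multimodal input"       # e.g. GPT-4 Vision: image+text in, text out
    if len(inputs) > 1 and len(outputs) > 1:
        return "truly multimodal"       # e.g. Gemini: multiple in, multiple out
    if len(outputs) == 1 and "text" not in outputs:
        return "specialized output"     # e.g. DALL-E 3: text in, images out
    return "single-modality"            # e.g. GPT-3: text in, text out

# Illustrative examples (modality sets are simplified assumptions):
print(classify({"text", "image"}, {"text"}))                     # multimodal input
print(classify({"text"}, {"image"}))                             # specialized output
print(classify({"text", "image", "audio"}, {"text", "image"}))   # truly multimodal
```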

  • Generally speaking, "AI model" is a general term covering models specialized in a specific task or domain. We call one a foundation model when referring to large, general-purpose models trained on massive and diverse datasets (billions of examples or more), at a high computational cost.

  • Foundation models like GPT-4, BLOOM, or Stable Diffusion are versatile and can perform tasks across different domains, such as:

    • Writing
    • Generating images
    • Solving math problems
    • Engaging in dialogue
    • Answering questions based on documents
    • Writing code