Salesforce Introduces XGen-7B, A Large Language Model With Longer Context Support

News Room

The race to release open source generative AI models is heating up. Salesforce has joined the fray by launching XGen-7B, a large language model that supports longer context windows than most currently available open source LLMs.

The 7B in XGen-7B represents 7 billion parameters. The larger the number of parameters, the bigger the model. Models with more parameters, such as 13-billion-parameter models, require high-end CPUs, GPUs, RAM, and storage. But a larger model tends to produce more accurate responses, since it is typically trained on a larger data corpus. So, it’s a tradeoff between size and accuracy.

One of the key differentiators of XGen-7B is its 8K context window. A larger context window allows a longer prompt and longer output from the model. This means it’s possible to send prompts with additional context to the model and get longer responses. The 8K context window is the cumulative size of the input and output text, measured in tokens.

Let’s understand what a token is. Since machine learning models work with numbers rather than characters, each word, or part of a word, is converted into a token. A token is a way of encoding text as numbers, conceptually similar to character encodings like ASCII or Unicode. To turn words into tokens, XGen-7B uses OpenAI’s tokenization system, the same one used with its popular models such as GPT-3 and GPT-4.
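
As a rough, illustrative sketch, the snippet below uses OpenAI’s open source tiktoken library to count the tokens in a prompt and work out how much of an 8K context window remains for the response. The cl100k_base encoding and the 8,192-token budget are assumptions made for illustration, not confirmed details of XGen-7B’s tokenizer.

```python
# pip install tiktoken
import tiktoken

# Assumption: cl100k_base is used purely for illustration; XGen-7B's exact
# encoding may differ, but the token-counting idea is the same.
enc = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the quarterly sales report in three bullet points."
prompt_tokens = enc.encode(prompt)
print(prompt_tokens)              # a list of integer token IDs
print(enc.decode(prompt_tokens))  # round-trips back to the original text

# The 8K context window is shared between input and output, so the longer
# the prompt, the fewer tokens remain for the model's response.
CONTEXT_WINDOW = 8192
remaining_for_output = CONTEXT_WINDOW - len(prompt_tokens)
print(f"{len(prompt_tokens)} prompt tokens, {remaining_for_output} left for the reply")
```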

XGen-7B positions itself as an alternative to open source LLMs such as MPT, Falcon, and LLaMA. Salesforce claims that its LLM achieves comparable or better results than the current state-of-the-art language models of similar size.

Salesforce has released three variants of XGen-7B. The first one, XGen-7B-4K-base, supports a 4K context window, while the second variant, XGen-7B-8K-base, is trained with additional data and supports an 8K context length. Both of these variants are released under the Apache 2.0 open source license, which allows commercial usage.
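
For readers who want to try the base models, here is a minimal loading sketch using the Hugging Face transformers library. It assumes the 8K checkpoint is published on the Hugging Face Hub as Salesforce/xgen-7b-8k-base and that its custom tokenizer requires the trust_remote_code flag; adjust the repository id if the hosted checkpoints change.

```python
# pip install transformers torch
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumption: the 8K base checkpoint is hosted under the Salesforce
# organization on the Hugging Face Hub with this repository id.
model_id = "Salesforce/xgen-7b-8k-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Encode a short prompt, generate a continuation, and decode it back to text.
inputs = tokenizer("Salesforce XGen-7B is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```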

The third variant, XGen-7B-{4K,8K}-inst, is fine-tuned on instructional data, including databricks-dolly-15k, oasst1, Baize, and GPT-related datasets, and is released for research purposes only. The inst keyword in the name indicates that the model can understand instructions, having been fine-tuned on instruction-following data in the spirit of the reinforcement learning from human feedback (RLHF) techniques used for chat assistants. An instruction-tuned language model can be used to build chatbots similar to ChatGPT.

Salesforce used multiple datasets, such as RedPajama and Wikipedia, along with the Starcoder code dataset, to train the XGen-7B LLM. Based on Google Cloud pricing for TPU-v4, training the model on 1T tokens cost about $150K. The training data spans 22 languages, making the model multilingual.

Salesforce evaluated XGen-7B on the Massive Multitask Language Understanding (MMLU) benchmark, which tests the ability to answer multiple-choice questions from various branches of knowledge such as the humanities, STEM, social sciences, and other domains. XGen-7B scores better than comparable models in this category.
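
To make the benchmark concrete, the following simplified sketch scores an MMLU-style multiple-choice question by formatting the question and options as a prompt and comparing the model’s next-token scores for the answer letters. It reuses the model and tokenizer from the loading example above; the question itself is an invented sample, not taken from the benchmark.

```python
# A simplified MMLU-style evaluation step, reusing `model` and `tokenizer`
# from the loading example above. The question below is a made-up sample.
import torch

question = "Which planet in the solar system has the largest mass?"
options = {"A": "Earth", "B": "Jupiter", "C": "Mars", "D": "Venus"}

# Format the question and options the way multiple-choice benchmarks do.
prompt = question + "\n"
for letter, text in options.items():
    prompt += f"{letter}. {text}\n"
prompt += "Answer:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the next token

# Compare the model's score for each answer letter and pick the highest.
# " A" (with a leading space) is used because many tokenizers encode a word
# after a space differently from one at the start of a string.
scores = {}
for letter in options:
    token_id = tokenizer(" " + letter, add_special_tokens=False).input_ids[0]
    scores[letter] = logits[token_id].item()

prediction = max(scores, key=scores.get)
print(prediction, options[prediction])
```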

The XGen-7B model also does well in other categories, such as conversations, long-form Q&A and summarization.

Salesforce also added a disclaimer stating that its LLM is subject to the same limitations as other LLMs, such as bias, toxicity, and hallucinations.

With a larger context window and a comprehensive set of datasets used for training, the XGen-7B LLM from Salesforce looks promising.
