Compute-intensive models
Model
Domain
Task
Organization
Authors
Publication date
Reference
Link
Citations
Parameters
Parameters notes
Training compute (FLOP)
Training compute notes
Training dataset
Training dataset notes
Training dataset size (datapoints)
Dataset size notes
Training hardware
Confidence
Abstract
Model accessibility
Country (from Organization)
Base model
Finetune compute (FLOP)
Finetune compute notes
Hardware quantity
Training code accessibility
Accessibility notes
Organization categorization (from Organization)
Grok-3
Language
Vision
Multimodal
Chat
Language modeling/generation
Question answering
Code generation
Visual question answering
xAI
2025-02-17
Grok 3 Beta — The Age of Reasoning Agents
https://x.ai/blog/grok-3
4.64e+26
Estimate based on training time for a cluster of 100,000 H100s, and xAI's statement that Grok 2 was trained on more compute than GPT-4 (2.1e25) and that Grok 3 was trained on around 15 times more compute than Grok 2. Full estimate here: https://docs.google.com/document/d/1C_dABuZrAqYE_ui4_GZ4bRLtq3TBjIGoBSktaPElhEU/edit?usp=sharing
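A minimal sketch of the cluster-based part of this estimate, assuming an illustrative 90-day training window and 30% utilization (neither is stated in this note). The 4.64e+26 figure above also folds in xAI's ~15x Grok-2 comparison, so this sketch is not expected to reproduce it exactly.

```python
# Hedged sketch of a hardware-based training-compute estimate.
# Cluster size and H100 peak throughput come from the note above;
# the training duration and utilization are illustrative assumptions,
# not figures confirmed by xAI or the linked estimate document.
H100_PEAK_FLOP_PER_S = 989.5e12  # dense BF16 peak for H100 SXM5
num_gpus = 100_000               # cluster size stated in the note
utilization = 0.3                # assumed model FLOP utilization
training_days = 90               # assumed training duration

training_compute = num_gpus * H100_PEAK_FLOP_PER_S * utilization * training_days * 24 * 3600
print(f"Estimated training compute: {training_compute:.2e} FLOP")  # ~2.3e+26 with these placeholders
```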
Unspecified unreleased
NVIDIA H100 SXM5 80GB
Confident
Hosted access (no API)
United States of America
100000
Unreleased
Industry
Gemini 1.0 Ultra
Multimodal
Language
Vision
Language modeling
Visual question answering
Chat
Translation
Google DeepMind
Gemini Team
2023-12-06
Gemini: A Family of Highly Capable Multimodal Models
https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf
633.00
5e+25
This number is an estimate based on limited evidence. In particular, we combine information about the performance of Gemini Ultra on various benchmarks compared to other models, and guesstimates about the hardware setup used for training to arrive at our estimate. Our reasoning and calculations are detailed in this Colab notebook. https://colab.research.google.com/drive/1sfG91UfiYpEYnj_xB5YRy07T5dv-9O_c
Unspecified unreleased
"Gemini models are trained on a dataset that is both multimodal and multilingual. Our pretraining dataset uses data from web documents, books, and code, and includes image, audio, and video data... We find that data quality is critical to a highlyperforming model, and believe that many interesting questions remain around finding the optimal dataset distribution for pretraining."
Google TPU v4
Speculative
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks — notably being the first mode
API access
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
57000
Unreleased
API access: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models
Industry
GPT-4o
Multimodal
Language
Audio
Speech
Vision
Chat
Image generation
Audio generation
Vision-language generation
Table tasks
Language modeling/generation
Question answering
Speech recognition
OpenAI
Aidan Clark, Alex Paino, Jacob Menick, Liam Fedus, Luke Metz, Clemens Winter, Lia Guy, Sam Schoenholz, Daniel Levy, Nitish Keskar, Alex Carney, Alex Paino, Ian Sohl, Qiming Yuan, Reimar Leike, Arka Dhar, Brydon Eastman, Mia Glaese, Ben Sokolowsky, Andrew Kondrich, Felipe Petroski Such, Henrique Ponde de Oliveira Pinto, Jiayi Weng, Randall Lin, Youlong Cheng, Nick Ryder, Lauren Itow, Barret Zoph, John Schulman, Mianna Chen, Adam Lerer, Adam P. Goucher, Adam Perelman, Akila Welihinda, Alec Radford
2024-05-13
Hello GPT-4o
https://openai.com/index/hello-gpt-4o/ https://openai.com/index/gpt-4o-system-card/
Not known. Inference costs in the API are half those of GPT-4 Turbo
3.810001e+25
Training compute estimated from benchmark scores.
Unspecified unreleased
"With GPT-4o, we trained a single new model end-to-end across text, vision, and audio."
Speculative
We’re announcing GPT-4o, our new flagship model that can reason across audio, vision, and text in real time. GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time(opens in a new window) in a
API access
United States of America
Definitely a new model, not a GPT-4 finetune
Unreleased
Industry
Llama 3.1-405B
Language
Language modeling/generation
Meta AI
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu,
2024-07-23
The Llama 3 Herd of Models
https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
405000000000.00
405B
3.8e+25
Stated in paper. Also, 6 * 405B * 15.6T training tokens = 3.8e25
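A one-line check of the 6ND approximation used above (training compute ≈ 6 × parameters × training tokens), with the values stated in the note:

```python
# 6ND approximation: roughly 6 FLOP per parameter per training token.
params = 405e9        # 405B parameters
tokens = 15.6e12      # 15.6T training tokens
compute = 6 * params * tokens
print(f"{compute:.2e} FLOP")  # ~3.79e+25, matching the stated 3.8e25
```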
Llama 3 dataset
15600000000000
15.6T tokens
NVIDIA H100 SXM5 80GB
Confident
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models s
Open weights (restricted use)
United States of America
16384
Open (restricted use)
Llama 3.1 model license: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE must seek separate license if over 700m monthly users, acceptable use restrictions training code here: https://github.com/meta-llama/llama-recipes/blob/main/src/llama_recipes/utils/train_utils.py#L70
Industry
Claude 3.5 Sonnet
Multimodal
Language
Vision
Chat
Image captioning
Code generation
Language modeling/generation
Anthropic
2024-06-20
Claude 3.5 Sonnet
https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf
3.650001e+25
Training compute estimated from benchmark scores. Blog post by Dario Amodei includes some info on 3.5 Sonnet compute: https://darioamodei.com/on-deepseek-and-export-controls "Claude 3.5 Sonnet is a mid-sized model that cost a few $10M's to train (I won't give an exact number). Also, 3.5 Sonnet was not trained in any way that involved a larger or more expensive model (contrary to some rumors)."
Unspecified unreleased
Training data cutoff Apr 2024
Speculative
API access
United States of America
Unreleased
Industry
GLM-4-Plus
Language
Language modeling
Zhipu AI
Zhipu AI
2024-08-29
GLM-4-Plus
https://bigmodel.cn/dev/howuse/glm-4
3.6e+25
Estimated using benchmark imputation
Unknown
At the KDD International Conference on Data Mining and Knowledge Discovery, the Zhipu GLM team unveiled the new generation of base large model—GLM-4-Plus. As the latest version of Zhipu’s fully self-developed GLM large model, GLM-4-Plus signifies Zhipu AI’s continuous dedication in the field of general artificial intelligence, advancing the independent and autonomous innovation of large model technology.
API access
China
Industry
Claude 3.7 Sonnet
Language
Vision
Multimodal
Language modeling/generation
Question answering
Code generation
Quantitative reasoning
Translation
Instruction interpretation
Visual question answering
Anthropic
2025-02-24
Claude 3.7 Sonnet
https://www.anthropic.com/news/claude-3-7-sonnet
3.35e+25
https://docs.google.com/spreadsheets/d/10bhwdVrfHI8tysVIz62ZxtvQ30L-HojYvmU18_b-WIM/edit?gid=0#gid=0
Unspecified unreleased
"Claude 3.7 Sonnet is trained on a proprietary mix of publicly available information on the Internet, as well as non-public data from third parties, data provided by data labeling services and paid contractors, and data we generate internally. While trained on publicly available information on the internet through November 2024, Claude 3.7 Sonnet’s knowledge cut-off date is the end of October 2024. This means the model’s knowledge base is most extensive and reliable on information and events up
Likely
Today, we’re announcing Claude 3.7 Sonnet1, our most intelligent model to date and the first hybrid reasoning model on the market. Claude 3.7 Sonnet can produce near-instant responses or extended, step-by-step thinking that is made visible to the user. API users also have fine-grained control over how long the model can think for. Claude 3.7 Sonnet shows particularly strong improvements in coding and front-end web development.
API access
United States of America
Unreleased
Industry
Grok-2
Language
Vision
Multimodal
Chat
Language modeling/generation
Question answering
Code generation
Visual question answering
xAI
2024-08-13
Grok-2 Beta Release
https://x.ai/blog/grok-2
2.96e+25
Estimate based on xAI statements comparing Grok-2 compute to GPT-4 and Grok-3. Full estimate here: https://docs.google.com/document/d/1C_dABuZrAqYE_ui4_GZ4bRLtq3TBjIGoBSktaPElhEU/edit?usp=sharing
Unspecified unreleased
NVIDIA H100 SXM5 80GB
Confident
Grok-2 is our frontier language model with state-of-the-art reasoning capabilities. This release includes two members of the Grok family: Grok-2 and Grok-2 mini. Both models are now being released to Grok users on the 𝕏 platform.
Hosted access (no API)
United States of America
Unreleased
Industry
Doubao-pro
Language
Language modeling/generation
Question answering
Text summarization
Text classification
ByteDance
2024-10-28
Doubao General Model Pro (Doubao-pro)
https://www.volcengine.com/docs/6360/1264663
500000000000.00
[Speculative] Doubao's large language model has scaled up from 35 billion parameters to 800 billion, with 500 billion and 800 billion parameter models currently under training. https://xueqiu.com/9637001584/309910396?md5__1038=7qmx2DyDuie4cDBqDTQEWqDtMvO4iTphD
2.505e+25
6ND = 6 * 500*10^9 * 8350*10^9 = 2.505e+25
Unspecified unreleased
Doubao's data sources primarily rely on proprietary business data, accounting for 50-60%; externally sourced data comprises 15-20%; and synthetic data has been used since June of this year, although Doubao is cautious in feeding synthetic data due to its uncertain quality.
8350000000000
[Speculative] Doubao's pre-training data volume is approximately 500TB, with only about 10% of this data actually used for training. The current version employs a non-Mixture-of-Experts (MoE) architecture. In the future, MoE architecture may be introduced to increase parameter count and performance, while also integrating multimodal data solutions. So this model is dense, and the training data is probably all text tokens, not multimodal. 50TB * 167M tokens/GB ~= 8.35 trillion tokens
Speculative
A professional-grade, self-developed LLM supporting up to 128k tokens, enabling fine-tuning across the entire series.
API access
China
Unreleased
Industry
GPT-4 Turbo
Multimodal
Vision
Language
Image generation
Chat
Language modeling/generation
Image generation
Speech synthesis
Table tasks
Visual question answering
Image captioning
OpenAI
2023-11-06
New models and developer products announced at DevDay
https://openai.com/blog/new-models-and-developer-products-announced-at-devday
Not known. Maybe smaller/sparser than GPT-4.
2.2e+25
Estimated using benchmark imputation
Unspecified unreleased
Unknown
Today, we shared dozens of new additions and improvements, and reduced pricing across many parts of our platform. These include: New GPT-4 Turbo model that is more capable, cheaper and supports a 128K context window
API access
United States of America
Unreleased
Industry
Mistral Large 2
Language
Language modeling/generation
Translation
Code generation
Mistral AI
Albert Jiang, Alexandre Sablayrolles, Alexis Tacnet, Alok Kothari, Antoine Roux, Arthur Mensch, Audrey Herblin-Stoop, Augustin Garreau, Austin Birky, Bam4d, Baptiste Bout, Baudouin de Monicault, Blanche Savary, Carole Rambaud, Caroline Feldman, Devendra Singh Chaplot, Diego de las Casas, Diogo Costa, Eleonore Arcelin, Emma Bou Hanna, Etienne Metzger, Gaspard Blanchet, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Harizo Rajaona, Henri Roussez, Hichem Sattouf, Ian Mack, Jean-Malo Delignon, Je
2024-07-24
Top-tier reasoning for high-complexity tasks, for your most sophisticated needs.
https://mistral.ai/news/mistral-large-2407/
123000000000.00
2.13e+25
Details are sparse, but we can hazard a guess based on evidence about the training cluster they may have used, the scale up in compute they likely would have used relative to Mistral Large 1, and from the model's MMLU score. Extended reasoning given here: https://docs.google.com/document/d/1I2ZWBLFMpRZYcdMMUfKAGZFJrOJpduNDS9ZeVFIHnd8/edit?usp=sharing
Unspecified unreleased
Likely
Today, we are announcing Mistral Large 2, the new generation of our flagship model. Compared to its predecessor, Mistral Large 2 is significantly more capable in code generation, mathematics, and reasoning. It also provides a much stronger multilingual support, and advanced function calling capabilities.
Open weights (non-commercial)
France
Unreleased
"We are releasing Mistral Large 2 under the Mistral Research License, that allows usage and modification for research and non-commercial usages. For commercial usage of Mistral Large 2 requiring self-deployment, a Mistral Commercial License must be acquired by contacting us."
Industry
GPT-4
Multimodal
Language
Vision
Image generation
Language modeling
OpenAI
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie C
2023-03-15
GPT-4 Technical Report
https://arxiv.org/abs/2303.08774
8281.00
2.1e+25
90% CI: 8.2E+24 to 4.4E+25 NOTE: this is a rough estimate based on public information, much less information than most other systems in the database. Calculation and confidence intervals here: https://colab.research.google.com/drive/1O99z9b1I5O66bT78r9ScslE_nOj5irN9?usp=sharing
Unspecified unreleased
4900000000000
Speculative. Reported secondhand by online sources such as Semianalysis, but not verified by OpenAI. If total number of tokens seen was 13T, text was repeated for 2 epochs, and text was the majority of tokens, then dataset size roughly is 13T*0.75/2 = 4.9T words. Note this examines only the text dataset, since GPT-4 was first and foremost a language model. However, the vision component had its own vision dataset, which we believe accounted for a much smaller part of the compute budget.
NVIDIA A100 SXM4 40 GB
Speculative
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results
API access
United States of America
25000
Unreleased
Industry
Nemotron-4 340B
Language
Language modeling/generation
Chat
Question answering
NVIDIA
Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, Robert Hero, Jining Huang, Vibhu Jawa, Joseph Jennings, Aastha Jhunjhunwala, John Kamalu, Sadaf Khan, Oleksii Kuchaiev, Patrick LeGresley, Hui Li, Jiwe
2024-06-14
NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models
https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/
340000000000.00
340B
1.8e+25
9 trillion tokens for training 6 * 340B * 9T = 1.8E25 alternatively, can do a hardware estimate with a few extra steps: According to the technical report, Nemotron-4 340B was trained using up to 6144 H100 GPUs. Helpfully, they also report the model FLOP utilization (MFU), which was 41-42% (Table 2). This is the ratio of the actual output of their GPUs, in FLOP used for training, relative to their theoretical max of 989 teraFLOP/s per GPU. Unfortunately, the report omits the last ingredient, w
Unspecified unreleased
The technical report for the 340B model cites the report for the 15B version (https://arxiv.org/pdf/2402.16819 ) from that paper: "We train Nemotron-4 15B on a pre-training dataset consisting of 8 trillion tokens. At a high-level, the data blend is split into three different types of data: English natural language data (70%), multilingual natural language data (15%), and source-code data (15%). The English corpus consists of curated documents from a variety of sources and domains including web
6750000000000
9T training tokens. They first train on an 8T token dataset and then an additional 1T tokens, it's slightly unclear if that's more data or a partial second epoch 6.75T words using 1 token = 0.75 words
NVIDIA H100 SXM5 80GB
Confident
We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4- 340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and its outputs. These models perform competitively to open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We
Open weights (unrestricted)
United States of America
6144
Unreleased
Permissive commercial license: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
Industry
Claude 3 Opus
Multimodal
Language
Vision
Chat
Image captioning
Code generation
Language modeling/generation
Anthropic
2024-03-04
The Claude 3 Model Family: Opus, Sonnet, Haiku
https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
1.640001e+25
Training compute estimated from benchmark scores.
Unspecified unreleased
Claude 3 models are trained on a proprietary mix of publicly available information on the Internet as of August 2023, as well as non-public data from third parties, data provided by data labeling services and paid contractors, and data we generate internally. We employ several data cleaning and filtering methods, including deduplication and classification. The Claude 3 suite of models have not been trained on any user prompt or output data submitted to us by users or customers, including free us
Speculative
We introduce Claude 3, a new family of large multimodal models – Claude 3 Opus, our most capable offering, Claude 3 Sonnet, which provides a combination of skills and speed, and Claude 3 Haiku, our fastest and least expensive model. All new models have vision capabilities that enable them to process and analyze image data. The Claude 3 family demonstrates strong performance across benchmark evaluations and sets a new standard on measures of reasoning, math, and coding. Claude 3 Opus achieves sta
API access
United States of America
Unreleased
Industry
Gemini 1.5 Pro
Language
Multimodal
Language modeling
Visual question answering
Google DeepMind
Gemini Team
2024-02-15
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf
MoE architecture
1.580001e+25
Training compute imputed from benchmark scores.
Unspecified unreleased
Google TPU v4
Speculative
API access
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
Unreleased
API access: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models
Industry
GLM-4 (0116)
Language
Language modeling/generation
Question answering
Code generation
Quantitative reasoning
Translation
Zhipu AI
Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zh
2024-01-17
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
https://arxiv.org/abs/2406.12793 https://zhipuai.cn/en/devday
ChatGLM was 130B parameters, and the paper implies GLM-4 was scaled larger than previous models.
1.2e+25
0116 has slightly worse performance than 0520. “The GLM-4 models are pre-trained on ten trillions of tokens.” No information about parameters or compute was found; one speculative outside estimate puts GLM-4 at 200B parameters (which seems plausible), though no source is provided. “GLM-4 gets close to the state-of-the-art models (GPT-4-Turbo, Gemini 1.5 Pro, and Claude 3 Opus)”, but none of these models has disclosed parameters or a compute estimate. 6ND = 6 * 10000000000000 tokens * 200000000000 parameters = 1.2e+25
"To date, the GLM-4 models are pre-trained on ten trillions of tokens mostly in Chinese and English, along with a small set of corpus from 24 languages, and aligned primarily for Chinese and English usage"
10000000000000
Likely
We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models that are trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained on ten trillions of tokens mostly in Chinese and English, along with a small set of corpus from 24
API access
China
GLM-4 (0116) has been made available through the GLM-4 API at https://bigmodel.cn
Industry
Mistral Large
Language
Chat
Mistral AI
2024-02-26
Mistral Large, our new flagship model
https://mistral.ai/news/mistral-large/
1.12e+25
Mistral spent <20 million euro (meaning approximately 20 million?) to train Mistral Large: https://www.wsj.com/tech/ai/the-9-month-old-ai-startup-challenging-silicon-valleys-giants-ee2e4c48. "Assuming this is on H100s with @Scaleway who are €1.9/hour => 10m H100 hours (c 30m A100 hrs), 3 months at 4k H100s :timer_clock:" - Emad Mostaque, https://x.com/EMostaque/status/1762152740938031484?s=20. Assuming bf16 or fp16, H100 SXM performance is 989 TFLOPS. At 1.9 euro per H100-hour and 30% utilization,
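A rough sketch of the cost-based estimate described in this note, using the note's own figures (≈€20M total, €1.9 per H100-hour, 989 TFLOP/s peak, 30% utilization). The rental price and utilization are the note's assumptions, not disclosed values.

```python
# Convert an approximate training budget into GPU-hours, then into FLOP.
total_cost_eur = 20e6            # "<20 million euro", treated as ~20M
price_per_h100_hour = 1.9        # Scaleway price quoted in the note
h100_peak_flop_per_s = 989e12    # H100 SXM bf16/fp16 peak
utilization = 0.3                # assumed utilization

gpu_hours = total_cost_eur / price_per_h100_hour           # ~10.5M H100-hours
compute = gpu_hours * 3600 * h100_peak_flop_per_s * utilization
print(f"{gpu_hours:.2e} H100-hours -> {compute:.2e} FLOP")  # ~1.1e+25 FLOP
```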
NVIDIA H100 SXM5 80GB
Likely
API access
France
Unreleased
Industry
Aramco Metabrain AI
Language
Language modeling/generation
Saudi Aramco
Saudi Aramco
2024-03-04
Saudi Aramco unveils industry’s first generative AI model
https://www.offshore-technology.com/news/saudi-aramco-unveils-industry-first-generative-ai-model/
250000000000.00
"It has 250 billion parameters that are adjustable during training to generate outputs or make predictions."
1.05e+25
6*250B*7T=1.05e+25
"The AI was trained using seven trillion data points, collecting more than 90 years of company history."
7000000000000
Likely
Unreleased
Saudi Arabia
Industry
Government
Inflection-2
Language
Language modeling
Language modeling/generation
Chat
Question answering
Inflection AI
2023-11-22
Inflection-2: The Next Step Up
https://inflection.ai/inflection-2
1e+25
"Inflection-2 was trained on 5,000 NVIDIA H100 GPUs in fp8 mixed precision for ~10²⁵ FLOPs"
Unspecified unreleased
NVIDIA H100 SXM5 80GB
Confident
Today we are proud to announce that we have completed training of Inflection-2, the best model in the world for its compute class and the second most capable LLM in the world today. Our mission at Inflection is to create a personal AI for everyone. Just a few months ago, we announced Inflection-1 — a best-in-class language model that currently powers Pi. Our new model, Inflection-2, is substantially more capable than Inflection-1, demonstrating much improved factual knowledge, better stylistic c
Hosted access (no API)
United States of America
5000
Unreleased
via Pi, no API
Industry
Inflection-2.5
Language
Chat
Inflection AI
2024-03-07
Inflection-2.5: meet the world's best personal AI
https://inflection.ai/inflection-2-5
1.0001e+25
"Inflection-1 used approximately 4% the training FLOPs of GPT-4 and, on average, performed at approximately 72% GPT-4 level on a diverse range of IQ-oriented tasks. Inflection-2.5, now powering Pi, achieves more than 94% the average performance of GPT-4 despite using only 40% the training FLOPs." This is a weird one - we estimated GPT-4 at 2.1e25 FLOP (which could be off somewhat, or Inflection could believe a different number). 40% of that is ~8e24. But Inflection 2, the previous model, was tr
NVIDIA H100 SXM5 80GB
Speculative
At Inflection, our mission is to create a personal AI for everyone. Last May, we released Pi—a personal AI, designed to be empathetic, helpful, and safe. In November we announced a new major foundation model, Inflection-2, the second best LLM in the world at the time. Now we are adding IQ to Pi’s exceptional EQ. We are launching Inflection-2.5, our upgraded in-house model that is competitive with all the world's leading LLMs like GPT-4 and Gemini. It couples raw capability with our signature p
Hosted access (no API)
United States of America
Unreleased
Industry
Grok-1.5
Language
Language modeling
Chat
xAI
2024-03-28
Introducing Grok-1.5, our latest model capable of long context understanding and advanced reasoning. Grok-1.5 will be available to our early testers and existing Grok users on the 𝕏 platform in the coming days.
https://x.ai/blog/grok-1.5
9.26e+24
Lower bound is taken from the Grok-1 estimate; upper bound is taken from the Grok-2 estimate. Geometric mean: sqrt(2.9 * 29.6) * 10^24 = 9.26e+24
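A sketch of the interpolation described above, using the Grok-1 and Grok-2 compute estimates from this database as the bounds:

```python
import math

grok1_compute = 2.9e24   # lower bound (Grok-1 estimate)
grok2_compute = 2.96e25  # upper bound (Grok-2 estimate)

# The geometric mean is the midpoint of the two bounds on a log scale.
grok15_compute = math.sqrt(grok1_compute * grok2_compute)
print(f"{grok15_compute:.2e} FLOP")  # ~9.26e+24
```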
Unspecified unreleased
Speculative
Hosted access (no API)
United States of America
Unreleased
Musk noted that Grok-1.5 will power xAI’s ChatGPT-challenging chatbot on the X platform, while Grok-2, the successor of the new model, is still in the training phase
Industry
Reka Core
Multimodal
Language
Vision
Chat
Language modeling/generation
Image captioning
Code generation
Code autocompletion
Reka AI
Aitor Ormazabal, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, Zhihui Xie
2024-04-15
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
https://publications.reka.ai/reka-core-tech-report.pdf
67000000000.00
8.40001e+24
No direct information about Reka Core model ("Reka Core has not finished training and is still improving.") The smaller dense model Reka Flash has 21B parameters and was trained on 5 trillion language tokens. There is information about compute: "Our setup comprises of clusters from a mixture of vendors with our peak compute being approximately 2.5K H100s and 2.5K A100s." If we assume 2 months of training with 2.5k H100s and 2.5k A100s at utilization 0.5 we get 8.4e24 FLOP (2500*9.9e14+2500*3.
Wikipedia
Unspecified unreleased
The training data comprises a mixture of publicly available and proprietary/licensed datasets with a dataset knowledge cutoff of November 2023. The dataset ingested by our model comprises of text, images, videos, and audio clips. Reka Flash and Reka Edge were trained on approximately 5 trillion and 4.5 trillion extensively deduplicated and filtered language tokens, respectively. While the classification of corpora is not strictly defined to one class or category, approximately 25% of our pretrai
NVIDIA A100
NVIDIA H100 SXM5 80GB
Speculative
API access
United States of America
Unreleased
Industry
SEA-LION V3 Llama3.1 70B
Language
Language modeling/generation
AI Singapore
2024-12-19
SEA-LION V3
https://huggingface.co/aisingapore/llama3.1-8b-cpt-sea-lionv3-base
70000000000.00
8.0103891e+24
Llama 3.1-70B base model: 7.929e+24. Additional pretraining compute: Stage 1: 200*60*60*989500000000000*64*0.3 = 1.36788e+22. Stage 2: 495*60*60*989500000000000*128*0.3 = 6.77103e+22. Total: 7.929e+24 + 1.36788e+22 + 6.77103e+22 = 8.0103891e+24
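A sketch reproducing this note's arithmetic: base-model compute plus two continued-pretraining stages, each estimated as hours × peak FLOP/s × GPU count × utilization. The stage durations and GPU counts come from the note; the 30% utilization is the note's assumption.

```python
H100_PEAK = 989.5e12   # FLOP/s, H100 SXM5 dense BF16 peak
UTIL = 0.3             # assumed utilization from the note

def stage_flop(hours, num_gpus):
    """Compute for one continued-pretraining stage."""
    return hours * 3600 * H100_PEAK * num_gpus * UTIL

base = 7.929e24                              # Llama 3.1-70B base model estimate
stage1 = stage_flop(hours=200, num_gpus=64)  # ~1.37e+22
stage2 = stage_flop(hours=495, num_gpus=128) # ~6.77e+22
total = base + stage1 + stage2
print(f"stage1={stage1:.3e}, stage2={stage2:.3e}, total={total:.4e}")  # total ~8.0104e+24
```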
The Stack v2
Dolma
Trained on a mix of datasets including StackV2 and Dolma (see https://huggingface.co/aisingapore/llama3.1-70b-cpt-sea-lionv3-base#data)
200000000000
"pre-trained on 200B tokens"
NVIDIA H200 SXM
NVIDIA H100 SXM5 80GB
Unverified
Our SEA-LION v3 Llama3.1 8B and 70B base models have been continued pre-trained on top of the Llama3.1 8B and 70B models respectively. Both have a context length of 128K, making them the SEA-LION models with the longest context length to date.
Open weights (unrestricted)
Llama 3.1-70B
192
Llama 3.1-70B
Language
Language modeling/generation
Meta AI
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu,
2024-07-23
The Llama 3 Herd of Models
https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
70000000000.00
70B
7.929e+24
Huggingface page says 3.1-70B used 7.0M H100 hours and trained over 15T tokens: https://huggingface.co/meta-llama/Llama-3.1-70B. The paper also says that 3.1-405B got MFU of between 38-43%; presumably 70B was around the same or a bit higher, so I'll assume utilization of 40%. 6ND: 6 * 15T * 70B = 6.3e24 FLOP. Hardware: 7M * 9.9e14 * 3600 * 0.4 = 9.98e24 FLOP. Geometric mean: sqrt(6.3e24 * 9.98e24) = 7.929e24. Note that Llama 3-70B also said it used 15T tokens, but only 6.4M H100 hours. This sugges
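A sketch of the two-method estimate in this note: a 6ND estimate and a GPU-hours estimate, combined with a geometric mean. The 40% utilization is the note's own assumption.

```python
import math

# Method 1: 6ND approximation from parameters and training tokens.
flop_6nd = 6 * 70e9 * 15e12              # 6 * params * tokens = 6.3e24

# Method 2: reported GPU-hours * peak throughput * assumed utilization.
flop_hw = 7e6 * 3600 * 9.9e14 * 0.4      # 7.0M H100-hours at 40% MFU ~= 1.0e25

# Combine the two independent estimates with a geometric mean.
estimate = math.sqrt(flop_6nd * flop_hw)
print(f"6ND={flop_6nd:.2e}, hardware={flop_hw:.2e}, combined={estimate:.3e}")  # combined ~7.93e+24
```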
Llama 3 dataset
Confident
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models s
Open weights (restricted use)
United States of America
Open (restricted use)
Llama 3.1 license: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE must seek separate license if over 700m monthly users, acceptable use restrictions code here: https://github.com/meta-llama/llama3/tree/main
Industry
Llama-3.1-Nemotron-70B-Instruct
Language
Language modeling
NVIDIA
Meta AI
Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy Zhang, Makesh Narsimhan Sreedhar, Oleksii Kuchaiev
2024-06-12
HelpSteer2: Open-source dataset for training top-performing reward models
https://www.semanticscholar.org/paper/HelpSteer2%3A-Open-source-dataset-for-training-reward-Wang-Dong/f590d8926dd12345a3bd22253461850f5ca4b3ed
200,000 monthly HF downloads as of Nov 2024
7.929e+24
Taken from Llama 3.1 70B as the finetuning compute is multiple orders of magnitude lower
Llama 3 dataset
Overall, we have 7,118 preference pairs with 6,766 pairs in the training set and 352 pairs in the validation set.
Unverified
High-quality preference datasets are essential for training reward models that can effectively guide large language models (LLMs) in generating high-quality responses aligned with human preferences. As LLMs become stronger and better aligned, permissively licensed preference datasets, such as Open Assistant, HH-RLHF, and HelpSteer need to be updated to remain effective for reward modeling. Methods that distil preference data from proprietary LLMs such as GPT-4 have restrictions on commercial usa
Open weights (restricted use)
United States of America
United States of America
Llama 3.1-70B
1.0259136e+20
Llama 3.1 70B: 7.929e+24. FT (see Appendix F): 32+64 = 96 hours on a single H100. Compute: 96*60*60*989500000000000*0.3 = 1.0259136e+20 ≈ 1e20
Industry
Industry
Llama 3-70B
Language
Chat
Language modeling/generation
Code generation
Meta AI
Aaditya Singh; Aaron Grattafiori; Abhimanyu Dubey; Abhinav Jauhri; Abhinav Pandey; Abhishek Kadian; Adam Kelsey; Adi Gangidi; Ahmad Al-Dahle; Amit Sangani; Ahuva Goldstand; Aiesha Letman; Ajay Menon; Akhil Mathur; Alan Schelten; Alex Vaughan; Amy Yang; Andrei Lupu; Andres Alvarado; Andrew Gallagher; Andrew Gu; Andrew Ho; Andrew Poulton; Andrew Ryan; Angela Fan; Ankit Ramchandani; Anthony Hartshorn; Archi Mitra; Archie Sravankumar; Artem Korenev; Arun Rao; Ashley Gabriel; Ashwin Bharambe; Assaf E
2024-04-18
Introducing Meta Llama 3: The most capable openly available LLM to date
https://ai.meta.com/blog/meta-llama-3/
70000000000.00
7.861e+24
Arithmetic calculation: 6 * 15T tokens * 70B parameters = 6.3e24. GPU calculation: https://huggingface.co/meta-llama/Meta-Llama-3-70B indicates training took 6.4M GPU-hours. We also know their larger-scale training runs for 405B were getting between 0.38-0.41 MFU. Presumably the 70B model gets at least 0.43 utilization (405B has to be split across two nodes, while 70B should fit on one). 990 TFLOPS per GPU * 6.4 million GPU-hours * 3600s * 0.43 = 9.808e24. Geometric mean: sqrt(6.3e24 * 9.808e24) = 7.861e24
Llama 3 dataset
"Llama 3 is pretrained on over 15T tokens that were all collected from publicly available sources. Our training dataset is seven times larger than that used for Llama 2, and it includes four times more code. To prepare for upcoming multilingual use cases, over 5% of the Llama 3 pretraining dataset consists of high-quality non-English data that covers over 30 languages."
15000000000000
NVIDIA H100 SXM5 80GB
Confident
Open weights (restricted use)
United States of America
Unreleased
https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md License A custom commercial license is available at: https://llama.meta.com/llama3/license
Industry
Qwen2.5 Instruct (72B)
Language
Code generation
Code autocompletion
Quantitative reasoning
Question answering
Language modeling/generation
Alibaba
Qwen Team
2024-09-19
Qwen2.5: A Party of Foundation Models!
https://qwenlm.github.io/blog/qwen2.5/
72700000000.00
Number of Parameters: 72.7B. Number of Parameters (Non-Embedding): 70.0B
7.8516e+24
6ND = 6*72700000000 parameters *18000000000000 tokens = 7.8516e+24
Unspecified unreleased
18000000000000
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens."
Confident
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2: Significantly more knowledge and has greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains. Significant improvements in instruction following, generating long texts (over 8K tokens), unde
Open weights (restricted use)
China
Qwen2.5-72B
requires permission to use in applications with 100K+ users https://huggingface.co/Qwen/Qwen2.5-72B-Instruct
Industry
Qwen2.5-72B
Language
Language modeling/generation
Question answering
Quantitative reasoning
Alibaba
Qwen Team
2024-09-19
Qwen2.5: A Party of Foundation Models!
https://qwenlm.github.io/blog/qwen2.5/
72700000000.00
72.7B
7.8e+24
Training dataset size was 18 trillion 6ND = 6 * 72.7 billion parameters * 18 trillion tokens = 7.8e24
Unspecified unreleased
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens."
18000000000000
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens"
Confident
In the past three months since Qwen2’s release, numerous developers have built new models on the Qwen2 language models, providing us with valuable feedback. During this period, we have focused on creating smarter and more knowledgeable language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5. We are announcing what might be the largest opensource release in history! Let’s get the party started!
Open weights (unrestricted)
China
Unreleased
license: allows commercial. weights only https://huggingface.co/Qwen/Qwen2.5-72B/blob/main/LICENSE
Industry
GPT-4o mini
Language
Multimodal
Vision
Chat
Language modeling/generation
Code generation
Visual question answering
OpenAI
Pre-training leads: Aidan Clark, Alex Paino, Jacob Menick. Post-training leads: Liam Fedus, Luke Metz. Architecture leads: Clemens Winter, Lia Guy. Optimization leads: Sam Schoenholz, Daniel Levy. Long-context lead: Nitish Keskar. Pre-training data leads: Alex Carney, Alex Paino, Ian Sohl, Qiming Yuan. Tokenizer lead: Reimar Leike. Human data leads: Arka Dhar, Brydon Eastman, Mia Glaese. Eval lead: Ben Sokolowsky. Data flywheel lead: Andrew Kondrich. Inference lead: Felipe Petroski Such. Inference Producti
2024-07-18
GPT-4o mini: advancing cost-efficient intelligence
https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/
7.36001e+24
Training compute estimated from benchmark scores. 90% CI [3.23e+24, 2.05e+25]
Unspecified unreleased
Speculative
OpenAI is committed to making intelligence as broadly accessible as possible. Today, we're announcing GPT-4o mini, our most cost-efficient small model. We expect GPT-4o mini will significantly expand the range of applications built with AI by making intelligence much more affordable. GPT-4o mini scores 82% on MMLU and currently outperforms GPT-41 on chat preferences in LMSYS leaderboard(opens in a new window). It is priced at 15 cents per million input tokens and 60 cents per million output toke
API access
United States of America
Unreleased
Industry
PaLM 2
Language
Language modeling
Language modeling/generation
Google
Andrew M. Dai, David R. So, Dmitry Lepikhin, Jonathan H. Clark, Maxim Krikun, Melvin Johnson, Nan Du, Rohan Anil, Siamak Shakeri, Xavier Garcia, Yanping Huang, Yi Tay, Yong Cheng, Yonghui Wu, Yuanzhong Xu, Yujing Zhang, Zachary Nado, Bryan Richter, Alex Polozov, Andrew Nystrom, Fangxiaoyu Feng, Hanzhao Lin, Jacob Austin, Jacob Devlin, Kefan Xiao, Orhan Firat, Parker Riley, Steven Zheng, Yuhuai Wu, Zhongtao Liu, Jiahui Yu, Guy Gur-Ari, Weikang Zhou, Sneha Kudugunta, Sunipa Dev, Frederick Liu, Gus
2023-05-10
PaLM 2 Technical Report
https://arxiv.org/abs/2305.10403
950.00
340000000000.00
Model Architecture: "PaLM-2 is a new state-of-the-art language model. We have small, medium, and large variants that use stacked layers based on the Transformer architecture, with varying parameters depending on model size. Further details of model size and architecture are withheld from external publication." However, the parameter count was leaked to CNBC: https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html
7.34e+24
Compute Requirements "Not reported." Paper suggests heuristic of C=6ND. Based on 340B parameters and 3.6T tokens, training compute would be around 7.3*10^24 FLOP.
"The PaLM 2 pre-training corpus is composed of a diverse set of sources: web documents, books, code, mathematics, and conversational data. The pre-training corpus is significantly larger than the corpus used to train PaLM (Chowdhery et al., 2022). PaLM 2 is trained on a dataset that includes a higher percentage of non-English data than previous large language models, which is beneficial for multilingual tasks" (page 9)
2700000000000
"The pre-training corpus is significantly larger than the corpus used to train PaLM" so greater than 6e+11. According to the leaked documents viewed by CNBC, the corpus was 3.6 trillion tokens or around 2.7*10^12 words. https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html
Google TPU v4
Likely
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM (Chowdhery et al., 2022). PaLM 2 is a Transformer-based model trained using a mixture of objectives similar to UL2 (Tay et al., 2023). Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model
API access
United States of America
Unreleased
Industry
Telechat2-115B
Language
Language modeling
China Telecom
Zihan Wang and Xinzhang Liu and Shixuan Liu and Yitong Yao and Yuyao Huang and Zhongjiang He and Xuelong Li and Yongxiang Li and Zhonghao Che and Zhaoxi Zhang and Yan Wang and Xin Wang and Luwen Pu and Huihan Xu and Ruiyu Fang and Yu Zhao and Jie Zhang and Xiaomeng Huang and Zhilong Lu and Jiaxin Peng and Wenjun Zheng and Shiquan Wang and Bingkai Yang and Xuewei he and Zhuoru Jiang and Qiyi Xie and Yanhan Zhang and Zhongqiu Li and Lingling Shi and Weiwei Fu and Yin Zhang and Zilu Huang and Sishi
2024-09-20
TeleChat Technical Report
https://huggingface.co/Tele-AI/TeleChat2-115B
115000000000.00
6.9e+24
6ND: 6 * 115B * 10T = 6.9e24
10000000000000
The open source TeleChat2-115B model is trained using 10 trillion tokens of high-quality Chinese and English corpus
Unverified
Open weights (restricted use)
China
Industry
Llama 3.3
Language
Language modeling/generation
Question answering
Translation
Code generation
Meta AI
2024-12-06
Meta Llama 3.3 multilingual large language model (LLM)
https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/
70000000000.00
70B
6.8649768e+24
6ND = 6 * 70*10^9 * 15*10^12 = 6.3e+24. Hardware: 7000000 * 3600 * 989500000000000 * 0.3 = 7.48062e+24. Geometric mean: sqrt(7.48062e+24 * 6.3e+24) = 6.8649768e+24
Unspecified unreleased
"A new mix of publicly available online data."
15000000000000
"Overview: Llama 3.3 was pretrained on ~15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 25M synthetically generated examples. Data Freshness: The pretraining data has a cutoff of December 2023."
NVIDIA H100 SXM5 80GB
Confident
The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction tuned generative model in 70B (text in/text out). The Llama 3.3 instruction tuned text only model is optimized for multilingual dialogue use cases and outperforms many of the available open source and closed chat models on common industry benchmarks. Model developer: Meta Model Architecture: Llama 3.3 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions
Open weights (restricted use)
United States of America
Unreleased
License A custom commercial license, the Llama 3.3 Community License Agreement, is available at: https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/LICENSE "Llama 3.3 is intended for commercial and research use in multiple languages."
Industry
Cosmos-1.0-Diffusion-14B Video2World
Robotics
Vision
Video
Robotic manipulation
Self-driving car
Video generation
NVIDIA
NVIDIA: Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Huan Ling, Min
2025-01-07
Cosmos World Foundation Model Platform for Physical AI
https://arxiv.org/abs/2501.03575
14000000000.00
14B
6.1554816e+24
989500000000000 * 0.4 * 10000 * 3600 * 3 * 30 * 24 = 3.0777408e+25 (total training compute). Assuming this model is 1/5 of it: 3.0777408e+25 / 5 = 6.1554816e+24 (Likely confidence)
Unspecified unreleased
9000000000000000
"Suite of first-generation video models trained on 9,000 trillion tokens, including 20 million hours of robotics and driving data - generating high-quality videos from multimodal inputs like images, text, or video."
NVIDIA H100 SXM5 80GB
Unverified
Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pr
Open weights (restricted use)
United States of America
10000
NVIDIA Open Model License Agreement Under the NVIDIA Open Model License, NVIDIA confirms: Models are commercially usable. You are free to create and distribute Derivative Models. NVIDIA does not claim ownership to any outputs generated using the Models or Derivative Models. Important Note: If you bypass, disable, reduce the efficacy of, or circumvent any technical limitation, safety guardrail or associated safety guardrail hyperparameter, encryption, security, digital rights management, or aut
Industry
Amazon Nova Pro
Multimodal
Language
Video
Language modeling/generation
Retrieval-augmented generation
Video generation
Amazon
2024-12-03
Introducing Amazon Nova foundation models: Frontier intelligence and industry leading price performance
https://aws.amazon.com/es/blogs/aws/introducing-amazon-nova-frontier-intelligence-and-industry-leading-price-performance/
6.00001e+24
"probably just below 1e25 stemming from the Llama 70B serving speed. If Llama 70B is trained proportionally to 405B, then it's at ~ 6.6e24. Nova Pro is served at 100tk/s, while Llama 70B is served at 70tk/s on average, and 100tk/s by together.ai at FP8. So Nova Pro would be >1e25 if they roughly 2x the amount of training compared to Llama 70B which [seems unlikely]"
Speculative
API access
United States of America
Industry
DeepSeek-R1
Language
Language modeling/generation
Code generation
Quantitative reasoning
Question answering
DeepSeek
2025-01-20
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
https://api-docs.deepseek.com/news/news250120
671000000000.00
671B total 37B activated https://github.com/deepseek-ai/DeepSeek-R1/tree/main
5.17e+24
4.56e+24 FLOP (estimated base model Deepseek V3 training compute) + 6.1e23 FLOP = 5.17e+24 FLOP
Unspecified unreleased
RL + SFT When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round. Unlike the initial cold-start data, which primarily focuses on reasoning, this stage incorporates data from other domains to enhance the model’s capabilities in writing, role-playing, and other general-purpose tasks.
Confident
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhan
Open weights (unrestricted)
China
DeepSeek-V3
6.1e+23
6.1e23 FLOP from these estimations: https://epoch.ai/gradient-updates/what-went-into-training-deepseek-r1
Unreleased
MIT licensed
Industry
Amazon Titan
Language
Image generation
Semantic search
Image generation
Language modeling/generation
Code generation
Chat
Text-to-image
Translation
Amazon
2023-09-28
https://aws.amazon.com/bedrock/titan/
200000000000.00
200B dense model https://importai.substack.com/p/import-ai-365-wmd-benchmark-amazon
4.8e+24
Trained using NVIDIA NeMo: https://blogs.nvidia.com/blog/nemo-amazon-titan/. 13,760 NVIDIA A100 chips (using 1,720 P4d nodes); it took 48 days to train, from https://importai.substack.com/p/import-ai-365-wmd-benchmark-amazon. Counting operations: 6*200000000000*4000000000000 = 4.8e+24. GPU usage: 312000000000000 (FLOP/s) * 0.3 * 13760 * 1152 * 3600 = 5.3413281792e+24
4000000000000
4T tokens of data, based on comments from amazon engineer James Hamilton at a 2024 talk: https://perspectives.mvdirona.com/2024/01/cidr-2024/ Also cited here: https://lifearchitect.ai/titan/
NVIDIA A100
Likely
API access
United States of America
13760
Unreleased
Industry
DeepSeek-V3
Language
Language modeling/generation
Code generation
Quantitative reasoning
Question answering
DeepSeek
2024-12-24
DeepSeek-V3 Technical Report
https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
671000000000.00
Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token.
4.56e+24
6*37B*14.8T = 3.2856e+24 Alternatively, they say: "DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training." and we know they trained in FP8. H800s get 1.513e15 FLOP/s in FP8: 2.788M * 3600 * 1.513e15 * 0.3 = 4.56e24 Utilization may be somewhat lower for FP8. Upper bound estimate: 50% utilization would mean 7.59e24
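A sketch of the hardware-based figure in this note, using the reported 2.788M H800 GPU-hours and H800 FP8 peak throughput. The 30% utilization is the note's assumption, not a DeepSeek-reported number.

```python
gpu_hours = 2.788e6          # H800 GPU-hours reported for full training
h800_fp8_peak = 1.513e15     # FLOP/s, H800 dense FP8 peak
utilization = 0.3            # assumed; 50% would give ~7.6e24 instead

compute = gpu_hours * 3600 * h800_fp8_peak * utilization
print(f"{compute:.2e} FLOP")  # ~4.6e+24, vs. the 6ND estimate of ~3.3e+24
```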
14800000000000
"We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities"
NVIDIA H800 SXM5
Confident
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-tr
Open weights (restricted use)
China
2048
MIT and deepseek license https://github.com/deepseek-ai/DeepSeek-V3?tab=readme-ov-file
Industry
AFM-server
Language
Language modeling/generation
Apple
Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chong Wang, Chung-Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek, Ruoming Pang, Sam Wiseman, Syd Evans, Tao Lei, Tom Gunter, Vivek Rathod, Xiang Kong, Xianzhi Du, Yanghao Li, Yongqiang Wang, Yuan Gao, Zaid Ahmed, Zhaoyang Xu, Zhiyun Lu, Zirui Wang, Al Rashid, Albin Madappally Jose, Ale
2024-07-29
Apple Intelligence Foundation Language Models
https://machinelearning.apple.com/research/apple-intelligence-foundation-language-models
4.3e+24
"The AFM base models are dense decoder-only models that build on the Transformer architecture" "We train AFM-server from scratch for 6.3T tokens on 8192 TPUv4 chips, using a sequence length of 4096 and a batch-size of 4096 sequences." "For both models we perform continued pre-training at a sequence length of 8192, with another 1T tokens from a mixture that upweights math and code, and down-weights the bulk web-crawl." "The sustained model-flop-utilization (MFU) for this training run was appro
6.3T tokens of web text, code, and math, plus another 1T in the second stage and 100B in the third. See section 3.1 for details.
7400000000000
Not explicitly mentioned, but I assume the 7.4T tokens do not involve multiple epochs.
Google TPU v4
Likely
Hosted access (no API)
United States of America
8192
Unreleased
Industry
MegaScale (Production)
Language
Language modeling/generation
ByteDance
Peking University
Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu
2024-02-23
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
https://arxiv.org/abs/2402.15627
40.00
530000000000.00
Production run is stated to have "hundreds of billions of parameters". Since the authors also do a number of experiments with a 530B model, I speculate they've used 530B for the production model.
3.9e+24
Speculative. The model is stated to have trained for "several weeks". Assuming 530B parameters and "several" = 3, compute can be estimated from the 175B model's stated PFLOP/sec: 2166.3 aggregate PFlops/sec * 3 weeks * 7 days/week * 24 hours/day * 3600 seconds/hour = 3.9e+24. As an upper bound, say 8e+24.
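A sketch of the speculative calculation above: aggregate cluster throughput multiplied by an assumed wall-clock duration ("several weeks" read as 3 weeks, as in the note).

```python
aggregate_pflops = 2166.3        # aggregate PFLOP/s stated for the 175B-model run
weeks = 3                        # "several weeks" assumed to mean 3

seconds = weeks * 7 * 24 * 3600
compute = aggregate_pflops * 1e15 * seconds
print(f"{compute:.2e} FLOP")      # ~3.9e+24
```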
Speculative. Authors note production system was trained on "multi-trillions of tokens". This could refer to training for multiple epochs on the same 300B tokens used to train the 175B and 530B models outlined in more detail in the paper. Alternatively, it could refer to a larger dataset of perhaps 3-9 trillion tokens.
NVIDIA A100
Speculative
We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pip
Unreleased
China
China
12288
Unreleased
Code for MegaScale (also called veScale) training system are released under Apache Licence: https://github.com/volcengine/vescale The model itself is unreleased.
Industry
Academia
SenseChat
Language
Chat
SenseTime
2023-04-10
SenseTime Launches “SenseNova” Foundation Model Sets and AI Computing Systems, Advancing AGI Development
https://www.sensetime.com/en/news-detail/51166397?categoryId=1072
180000000000.00
https://www.thepaper.cn/newsDetail_forward_22639611 Translation: "SenseTime launched the "SenseNova" large model system, which includes natural language generation, image generation services, pre-labeling for perception models, and model development. The "SenseChat" application platform, powered by a 180-billion parameter Chinese language model, supports ultra-long text comprehension and offers capabilities such as question answering, understanding, and generation in Chinese." Link says "hundre
3.8899999999999995e+24
“Over the course of five years, SenseTime has built SenseCore, a leading AI infrastructure with 27,000 GPUs, capable of delivering a total computational power of 5,000 petaflops”. Assuming they used this entire cluster for 30 days of training (a rough average of frontier-model training times since 2016) at a 30% utilization rate: 5000e15 * 0.3 * 30 * 24 * 60 * 60 = 3.89e24 FLOP. Assuming instead that the model is dense and trained Chinchilla-optimally: 20 tokens/parameter * (180e9 parameters)**2 * 6 = 3.89e24 FLOP.
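A minimal sketch of the two estimates above (Python; the 30-day duration, 30% utilization, and Chinchilla-optimal training are our assumptions, as noted):
# Method 1: cluster throughput x assumed training time x assumed utilization
compute_cluster = 5000e15 * 0.3 * 30 * 24 * 3600     # SenseCore at 5,000 PFLOP/s for 30 days
# Method 2: dense model trained Chinchilla-optimally (20 tokens per parameter)
params = 180e9
compute_chinchilla = 6 * params * (20 * params)
print(f"{compute_cluster:.2e} {compute_chinchilla:.2e}")   # both ~3.89e24 FLOP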
Speculative
SenseTime hosted a Tech Day event, sharing their strategic plan for advancing AGI (Artificial General Intelligence) development through the combination of “foundation models + large-scale computing” systems. Under this strategy, SenseTime unveiled the “SenseNova” foundation model set, introducing a variety of foundation models and capabilities in natural language processing, content generation, automated data annotation, and custom model training. At the event, SenseTime not only showcased their
Hong Kong
China
Industry
Claude 2
Language
Language modeling
Chat
Language modeling/generation
Question answering
Anthropic
2023-07-11
https://www.anthropic.com/index/claude-2, https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf
3.866e+24
https://colab.research.google.com/drive/1MdPuhS4Emaf23VXYZ-ooExDW-5GXZkw0#scrollTo=Ds0Q5X8aMnOY
Unspecified unreleased
From model card: "Claude models are trained on a proprietary mix of publicly available information from the Internet, datasets that we license from third party businesses, and data that our users affirmatively share or that crowd workers provide. Some of the human feedback data used to finetune Claude was made public [12] alongside our RLHF [2] and red-teaming [4] research. Claude 2’s training data cuts off in early 2023, and roughly 10 percent of the data included was non-English."
Speculative
API access
United States of America
Unreleased
Industry
Falcon-180B
Language
Language modeling
Technology Innovation Institute
Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, Guilherme Penedo
2023-09-06
The Falcon Series of Open Language Models
https://falconllm.tii.ae/falcon-180b.html; https://arxiv.org/abs/2311.16867
261.00
180000000000.00
"Falcon 180B is a super-powerful language model with 180 billion parameters"
3.76e+24
43,500 petaflop-days per Table 1 of the paper: 43500 * 1e15 * 24 * 3600 = 3.76e24 FLOP. Cross-check: C = 6ND = 6 FLOP/token/parameter * 3.5 trillion tokens * 180 billion parameters = 3.78e24 FLOP.
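A minimal sketch of this cross-check (Python; 43,500 petaflop-days from Table 1, 180B parameters and 3.5T tokens from the entry above):
# Reported budget in petaflop-days, converted to FLOP
compute_reported = 43500 * 1e15 * 24 * 3600    # ~3.76e24
# Cross-check with C = 6ND
compute_6nd = 6 * 180e9 * 3.5e12               # ~3.78e24
print(f"{compute_reported:.2e} {compute_6nd:.2e}")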
RefinedWeb
"The Falcon series is made of three causal decoder-only models trained on up to 4,096 A100. We assembled a pretraining dataset of 3,500 billion tokens, predominantly sourced from our work on RefinedWeb (Penedo et al., 2023)–a massive filtered and deduplicated web dataset" Training dataset composition is described in Table 3. Falcon was trained for 1 epoch.
2625000000000
3.5 trillion tokens * (~3 words per 4 tokens) ~= 2.625 trillion words
NVIDIA A100 SXM4 40 GB
Confident
Falcon 180B is a super-powerful language model with 180 billion parameters, trained on 3.5 trillion tokens. It's currently at the top of the Hugging Face Leaderboard for pre-trained Open Large Language Models and is available for both research and commercial use. This model performs exceptionally well in various tasks like reasoning, coding, proficiency, and knowledge tests, even beating competitors like Meta's LLaMA 2. Among closed source models, it ranks just behind OpenAI's GPT 4, and perfo
Open weights (restricted use)
United Arab Emirates
4096
Unreleased
"Falcon 180b can be commercially used but under very restrictive conditions, excluding any "hosting use"." https://huggingface.co/blog/falcon-180b
Government
QwQ-32B
Language
Language modeling/generation
Question answering
Quantitative reasoning
Code generation
Alibaba
Qwen Team
2025-03-06
QwQ-32B: Embracing the Power of Reinforcement Learning
https://qwenlm.github.io/blog/qwq-32b/
32500000000.00
Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias. Number of Parameters: 32.5B. Number of Parameters (Non-Embedding): 31.0B. Number of Layers: 64. Number of Attention Heads (GQA): 40 for Q and 8 for KV.
3.51e+24
Assuming the same dataset size as for Qwen2.5 training (18T tokens): 6ND = 6 * 32.5e9 parameters * 18e12 tokens = 3.51e24 FLOP. 'Speculative' confidence.
Unspecified unreleased
Speculatively, might be similar to the Qwen2.5 models (18T tokens).
Speculative
QwQ is the reasoning model of the Qwen series. Compared with conventional instruction-tuned models, QwQ, which is capable of thinking and reasoning, can achieve significantly enhanced performance in downstream tasks, especially hard problems. QwQ-32B is the medium-sized reasoning model, which is capable of achieving competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1, o1-mini.
Open weights (unrestricted)
China
Qwen2.5-Coder (32B)
Unreleased
https://huggingface.co/Qwen/QwQ-32B Apache 2
Industry
Qwen2.5-32B
Language
Language modeling/generation
Question answering
Quantitative reasoning
Alibaba
Qwen Team
2024-09-17
Qwen2.5: A Party of Foundation Models!
https://qwenlm.github.io/blog/qwen2.5/
32500000000.00
32.5B
3.51e+24
6 * 32.5B parameters * 18 trillion tokens = 3.51 × 10^24
Unspecified unreleased
18000000000000
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens"
Confident
In the past three months since Qwen2’s release, numerous developers have built new models on the Qwen2 language models, providing us with valuable feedback. During this period, we have focused on creating smarter and more knowledgeable language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5. We are announcing what might be the largest opensource release in history! Let’s get the party started! The Qwen2.5-7B model surpasses its predecessors and counte
Open weights (unrestricted)
China
Unreleased
Apache 2.0 https://huggingface.co/Qwen/Qwen2.5-32B
Industry
Hunyuan-Large
Language
Language modeling/generation
Question answering
Code generation
Translation
Tencent
Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Xuemeng Huang, Fengzong Lian, Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Jun Xia, Tao Yang, Suncong Zheng, Kan Wu, Dian Jiao, Jinbao Xue, Xipeng Zhang, Decheng Wu, Kai Liu, Dengpeng Wu, Guanghui Xu, Shaohua Chen, Shuang Chen, Xiao Feng, Yigeng Hong, Junqiang Zheng, Chengcheng Xu, Zongwei Li, Xiong Kuang, Jianglu
2024-11-06
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
https://arxiv.org/abs/2411.02265
389000000000.00
"a total of 389 billion parameters and 52 billion activation parameters"
3.49237e+24
52B activated parameters. 6ND = 6 * 52e9 * 7e12 = 2.184e24 FLOP. The authors also suggest a more precise formula for the MoE compute budget: 9.59ND + 2.3e8 * D = 9.59 * 52e9 * 7e12 + 2.3e8 * 7e12 = 3.49237e24 FLOP, which seems closer to the projected compute in Figure 3.
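A minimal sketch comparing the two formulas in the note above (Python; N = activated parameters, D = training tokens):
N, D = 52e9, 7e12                       # activated parameters, training tokens
compute_6nd = 6 * N * D                 # ~2.18e24 FLOP
compute_moe = 9.59 * N * D + 2.3e8 * D  # paper's MoE budget formula, ~3.49e24 FLOP
print(f"{compute_6nd:.3e} {compute_moe:.3e}")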
Unspecified unreleased
7000000000000
"# Trained Tokens 7T" Table 1
Confident
In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outp
Open weights (restricted use)
China
Open (restricted use)
the license doesn't regulate usage in the EU also requires additional licensing in case of massive commercial use
Industry
NVLM-X 72B
Vision
Language
Language modeling/generation
Vision-language generation
Question answering
Code generation
Translation
Quantitative reasoning
NVIDIA
Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
2024-10-22
NVLM: Open Frontier-Class Multimodal LLMs
https://arxiv.org/abs/2409.11402
72000000000.00
72B
3.0398181e+24
3.02e24 FLOP (Qwen2-72B compute) + 1.9818086e22 FLOP (multimodal training, see finetune compute notes) = 3.0398181e24 FLOP
COCO
Conceptual Captions (CC3M)
SBU
VQAv2
VisualGenome
TextVQA
OCR-VQA
Captioning COCO [72], CC3M [127], SBU [114], LAION-115M (sanitized) [123; 66] VQA (natural image) VQAv2 [38], Visual Genome [59] Chart DVQA [51] Document Docmatix [90] OCR / Scene-Text OCR-VQA [98], COCO-Text [144], TextOCR [132], ReCTs [170], RRC-ArT [22], RRC-LSVT [134] RCTW [128], synthdog-en [57], pdfa-eng-wds [117] Math CLEVR-Math [73]
45875200000
Pre-training: global batch size 2,048; sequence length in the LLM decoder 512; downsampling of visual tokens 1024->256; 256 visual tokens per tile; 1 tile; 20K training steps. 2048 * (512 + 256 * 1) * 20000 = 31,457,280,000. SFT: global batch size 256; sequence length in the LLM decoder 1,024; 256 visual tokens per tile; 6+1 tiles; 20K training steps. 256 * (1,024 + 256 * 7) * 20000 = 14,417,920,000. Total: 31,457,280,000 + 14,417,920,000 = 45,875,200,000.
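A minimal sketch of the token-count arithmetic above (Python; tokens seen = batch size x (text tokens + visual tokens per tile x tiles) x steps):
def tokens_seen(batch, text_len, vis_per_tile, tiles, steps):
    # total image/text tokens processed during a training stage
    return batch * (text_len + vis_per_tile * tiles) * steps

pretrain = tokens_seen(2048, 512, 256, 1, 20_000)   # 31,457,280,000
sft = tokens_seen(256, 1024, 256, 7, 20_000)        # 14,417,920,000
print(pretrain + sft)                               # 45,875,200,000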
NVIDIA H100 SXM5 80GB
Likely
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cro
Open weights (non-commercial)
United States of America
Qwen2-72B
InternViT-6B
19818086000000B
6*72B*45875200000 = 1.9818086e+22
128
Industry
Qwen2-72B
Language
Chat
Language modeling/generation
Alibaba
Qwen Team
2024-06-07
Hello Qwen2
https://qwenlm.github.io/blog/qwen2/ https://arxiv.org/abs/2407.10671
72710000000.00
72.71B parameters in total, of which 70.21B are non-embedding parameters
3.02e+24
72 billion params, 7 trillion tokens 6 * 72 billion * 7 trillion ~= 3.02e24
Unspecified unreleased
"All models were pre-trained on a high-quality, large-scale dataset comprising over 7 trillion tokens, covering a wide range of domains and languages. Compared to previous editions of Qwen, Qwen2 includes a broader spectrum of linguistic data, enhancing the quantity and quality of code and mathematics content. "
7000000000000
"All models were pre-trained on a high-quality, large-scale dataset comprising over 7 trillion tokens, covering a wide range of domains and languages."
Confident
After months of efforts, we are pleased to announce the evolution from Qwen1.5 to Qwen2. This time, we bring to you: - Pretrained and instruction-tuned models of 5 sizes, including Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B; - Having been trained on data in 27 additional languages besides English and Chinese; - State-of-the-art performance in a large number of benchmark evaluations; - Significantly improved performance in coding and mathematics; - Extended context length sup
Open weights (unrestricted)
China
Unreleased
Apache 2.0
Industry
NVLM-D 72B
Vision
Language
Language modeling/generation
Vision-language generation
Question answering
Code generation
Translation
Quantitative reasoning
NVIDIA
Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
2024-10-22
NVLM: Open Frontier-Class Multimodal LLMs
https://arxiv.org/abs/2409.11402
72000000000.00
72B
3.02e+24
Uses Qwen2-72B as a backbone, which was trained with 3.02e24 FLOP, as well as InternViT-6B. It's unclear how many FLOP were spent training InternViT-6B, but probably negligible; e.g. PaLI trained ViT-e with ~4B parameters using 1.07e23 FLOP. Fine-tuning FLOP: 57,016,320,000 image/text tokens over all stages; 6 * 72B * 57,016,320,000 = 2.463e22.
COCO
Conceptual Captions (CC3M)
SBU
VQAv2
VisualGenome
TextVQA
OCR-VQA
Captioning COCO [72], CC3M [127], SBU [114], LAION-115M (sanitized) [123; 66] VQA (natural image) VQAv2 [38], Visual Genome [59] Chart DVQA [51] Document Docmatix [90] OCR / Scene-Text OCR-VQA [98], COCO-Text [144], TextOCR [132], ReCTs [170], RRC-ArT [22], RRC-LSVT [134] RCTW [128], synthdog-en [57], pdfa-eng-wds [117] Math CLEVR-Math [73]
57016320000
Pre-training: global batch size 2,048; sequence length in the LLM decoder 512; downsampling of visual tokens 1024->256; 256 visual tokens per tile; 1 tile; 20K training steps. 2048 * (512 + 256 * 1) * 20000 = 31,457,280,000. SFT: global batch size 128; sequence length in the LLM decoder 3,200; 256 visual tokens per tile; 6+1 tiles; 40K training steps. 128 * (3200 + 256 * 7) * 40000 = 25,559,040,000. Total: 31,457,280,000 + 25,559,040,000 = 57,016,320,000.
NVIDIA H100 SXM5 80GB
Confident
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cro
Open weights (non-commercial)
United States of America
Qwen2-72B
InternViT-6B
24630000000000B
Fine-tuning FLOPs: 57,016,320,000 image/text tokens over all stages 6 * 72B * 57,016,320,000 = 2.463e22
128
Open (non-commercial)
https://huggingface.co/nvidia/NVLM-D-72B Creative Commons Attribution: Non-Commercial 4.0 International *training code "coming soon"
Industry
NVLM-H 72B
Vision
Language
Language modeling/generation
Vision-language generation
Question answering
Code generation
Translation
Quantitative reasoning
NVIDIA
Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
2024-10-22
NVLM: Open Frontier-Class Multimodal LLMs
https://arxiv.org/abs/2409.11402
72000000000.00
72B
3.02e+24
Additional compute in this paper is negligible relative to the compute used to train the language model backbone (Qwen2-72B at 3.02e24 FLOP)
COCO
Conceptual Captions (CC3M)
SBU
VQAv2
VisualGenome
TextVQA
OCR-VQA
Captioning COCO [72], CC3M [127], SBU [114], LAION-115M (sanitized) [123; 66] VQA (natural image) VQAv2 [38], Visual Genome [59] Chart DVQA [51] Document Docmatix [90] OCR / Scene-Text OCR-VQA [98], COCO-Text [144], TextOCR [132], ReCTs [170], RRC-ArT [22], RRC-LSVT [134] RCTW [128], synthdog-en [57], pdfa-eng-wds [117] Math CLEVR-Math [73]
125829120000
Pre-training: global batch size 2,048; sequence length in the LLM decoder 512; downsampling of visual tokens 1024->256; 256 visual tokens per tile; 6+1 tiles; 20K training steps. 2048 * (512 + 256 * 7) * 20000 = 94,371,840,000. SFT: global batch size 256; sequence length in the LLM decoder 1,280; 256 visual tokens per tile; 6+1 tiles; 40K training steps. 256 * (1280 + 256 * 7) * 40000 = 31,457,280,000. Total: 94,371,840,000 + 31,457,280,000 = 125,829,120,000.
NVIDIA H100 SXM5 80GB
Likely
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cro
Open weights (non-commercial)
United States of America
Qwen2-72B
InternViT-6B
54360000000000B
6ND = 6*125,829,120,000*72000000000.00 = 5.436e22
128
Industry
Grok-1
Language
Language modeling
Chat
xAI
2023-11-04
Announcing Grok
https://x.ai/model-card/, https://x.ai/blog/grok-os
314000000000.00
"314B parameter Mixture-of-Experts model with 25% of the weights active on a given token". So effectively 78B parameters Mixture of 8 experts: https://github.com/xai-org/grok-1
2.90000000001e+24
"On these benchmarks, Grok-1 displayed strong results, surpassing all other models in its compute class, including ChatGPT-3.5 and Inflection-1. It is only surpassed by models that were trained with a significantly larger amount of training data and compute resources like GPT-4" Per table, Grok-1 is surpassed by Palm 2, Claude 2, GPT-4, so it required less compute than these three models. Palm 2 was trained on 7e24 FLOP. GPT-3.5 is ~2.6e24. Inflection-1's compute is not public/known by us but
Unspecified unreleased
"Base model trained on a large amount of text data, not fine-tuned for any particular task." "The training data used for the release version of Grok-1 comes from both the Internet up to Q3 2023 and the data provided by our AI Tutors."
6200000000000
(Speculative confidence, see compute notes)
Likely
Grok is an AI modeled after the Hitchhiker’s Guide to the Galaxy, so intended to answer almost anything and, far harder, even suggest what questions to ask! Grok is designed to answer questions with a bit of wit and has a rebellious streak, so please don’t use it if you hate humor! A unique and fundamental advantage of Grok is that it has real-time knowledge of the world via the 𝕏 platform. It will also answer spicy questions that are rejected by most other AI systems. Grok is still a very e
Open weights (unrestricted)
United States of America
Unreleased
apache 2.0
Industry
Minerva (540B)
Language
Quantitative reasoning
Google
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, Vedant Misra
2022-06-29
Solving Quantitative Reasoning Problems with Language Models
https://arxiv.org/abs/2206.14858
585.00
540350000000.00
"To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM)." Our approach is to start with the PaLM pretrained decoder-only transformer language models Chowdhery et al. (2022), and further train (finetune) them on our mathematical dataset using an autoregressive objective. Table 2 contains the main model and training hyperparameters. See Table 2
2.7415e+24
Minerva was fine-tuned from PaLM using the same hardware. Assume the same model FLOPs utilization rate for pre-training and fine-tuning. PaLM pretraining time: 6144 TPUs for 1200 hours + 3072 TPUs for 336 hours = 8,404,992 TPU-hours. Minerva finetuning time: 1024 TPUs for 696 hours = 712,704 TPU-hours. So fine-tuning added ~8.5% more compute. Minerva total compute = PaLM pretraining compute * (712704 + 8404992) / 8404992 = 2.7415e24 FLOP. https://www.wolframalpha.com/input?i=%28712704%2B8404992%29%2F%
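A minimal sketch of the scaling argument above (Python; assumes the same MFU for PaLM pre-training and Minerva fine-tuning):
palm_compute = 2.5272e24                       # PaLM pretraining FLOP (see PaLM entry)
palm_tpu_hours = 6144 * 1200 + 3072 * 336      # 8,404,992 TPU-hours
minerva_tpu_hours = 1024 * 696                 # 712,704 TPU-hours
total = palm_compute * (palm_tpu_hours + minerva_tpu_hours) / palm_tpu_hours
print(f"{total:.4e} FLOP")                     # ~2.7415e24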
arXiv
PaLM, finetuned on arxiv
26000000000
"Our models were trained on a dataset of 38.5B tokens" + PaLM upd 38.5B tokens - sie of the dataset, the model saw 26B tokens in 399k steps (see Table 2)
Google TPU v4
Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning, such as solving mathematics, science, and engineering problems at the college level. To help close this gap, we introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content. The model achieves state-o
Unreleased
United States of America
PaLM (540B)
214290000000000B
1024
Unreleased
Industry
DBRX
Language
Chat
Code generation
Databricks
Mosaic Research Team
2024-03-27
Introducing DBRX: A New State-of-the-Art Open LLM
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
132000000000.00
132B mixture of experts. 36B parameters active per inference
2.6e+24
Mixture of Experts (MoE): 36 billion active params * 12 trillion tokens * 6 ~= 2.6e24 https://www.wolframalpha.com/input?i=6+FLOP+*+36+billion+*+12+trillion Also, it was trained on 3072 NVIDIA H100s, but with an unclear timeframe (the end-to-end process was three months, including evals and red-teaming).
12T tokens, text and code "It was pre-trained on 12T tokens of text and code data... DBRX was pretrained on 12T tokens of carefully curated data and a maximum context length of 32k tokens. We estimate that this data is at least 2x better token-for-token than the data we used to pretrain the MPT family of models" from HF: https://huggingface.co/databricks/dbrx-base The training mix used for DBRX contains both natural-language and code examples. The vast majority of our training data is in the
9000000000000
12T tokens is equivalent to 9T words. Though it includes code data, so not very literally 9T words
NVIDIA H100 SXM5 80GB
Confident
Today, we are excited to introduce DBRX, an open, general-purpose LLM created by Databricks. Across a range of standard benchmarks, DBRX sets a new state-of-the-art for established open LLMs. Moreover, it provides the open community and enterprises building their own LLMs with capabilities that were previously limited to closed model APIs; according to our measurements, it surpasses GPT-3.5, and it is competitive with Gemini 1.0 Pro. It is an especially capable code model, surpassing specialized
Open weights (restricted use)
United States of America
Unreleased
license: https://www.databricks.com/legal/open-model-license conditions based on monthly users
Industry
GPT-3.5
Language
Language modeling
OpenAI
2022-11-28
https://platform.openai.com/docs/models/gpt-3-5
Parameter count may be 175B based on OpenAI's statements that text-davinci-003 is in the GPT-3.5 series of models. It was also stated to be 175B in the Microsoft CODEFUSION paper, but the paper was reportedly retracted because the authors did not know the parameter count.
2.578e+24
https://colab.research.google.com/drive/1QSxa8YCWjEBQU7mrXLhw6TP1VX5oqgdW#scrollTo=Gt6Z6oZ26clI
NVIDIA A100 SXM4 40 GB
Speculative
API access
United States of America
Unreleased
Industry
U-PaLM (540B)
Language
Language generation
Google
Yi Tay, Jason Wei, Hyung Won Chung, Vinh Q. Tran, David R. So, Siamak Shakeri, Xavier Garcia, Huaixiu Steven Zheng, Jinfeng Rao, Aakanksha Chowdhery, Denny Zhou, Donald Metzler, Slav Petrov, Neil Houlsby, Quoc V. Le, Mostafa Dehghani
2022-10-20
Transcending Scaling Laws with 0.1% Extra Compute
https://arxiv.org/abs/2210.11399
61.00
540000000000.00
2.53e+24
"The total number of extra tokens we train on for the 540B model is approximately 1.3 Billion which constitutes 0.16% extra computation... Training an U-PaLM 540B model only consumes 512 TPUv4 chips and finishes in about 5 days which is considered to be lightweight." original PaLM was 2.527e+24. adding 0.16% is ~2.53e24
"To keep things consistent, we train this model with the same data mixture as PaLM and do not rely on additional sources of data (labeled or unlabeled)."
Google TPU v4
Confident
Scaling language models improves performance but comes with significant computational costs. This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute. The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) on a few more steps with UL2's mixture-of-denoiser objective. We show that, with almost negligible extra computational costs and no new sources of data, we
Unreleased
United States of America
PaLM (540B)
4000000000000B
"The total number of extra tokens we train on for the 540B model is approximately 1.3 Billion which constitutes 0.16% extra computation... Training an U-PaLM 540B model only consumes 512 TPUv4 chips and finishes in about 5 days which is considered to be lightweight." PaLM was 2.5e24 0.16% of that is 4e21
512
Unreleased
Industry
PaLM (540B)
Language
Language modeling
Code generation
Translation
Google Research
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev,, Henryk Michalewski, Xav
2022-04-04
PaLM: Scaling Language Modeling with Pathways
https://arxiv.org/abs/2204.02311
5064.00
540350000000.00
"To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM)."
2.5272e+24
See Table 20. 6144 TPUv4 for 1200 hours + 3072 TPUv4 for 336 hours. Equivalent to 6144 TPUv4 for 1368 hours. 46.2% model FLOPs utilization "The 540B-parameter PaLM model sustained a remarkable 57.8% of the peak hardware floating point performance over 50 days while training on TPU v4 supercomputers. " https://cloud.google.com/blog/topics/systems/tpu-v4-enables-performance-energy-and-co2e-efficiency-gains
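For reference, the listed figure also matches a simple 6ND cross-check using the entry's parameter and token counts; a minimal sketch (Python):
# 6ND cross-check for PaLM 540B
params = 540e9      # ~540.35B densely activated parameters, rounded
tokens = 780e9      # pretraining tokens
print(f"{6 * params * tokens:.4e} FLOP")   # 2.5272e24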
Wikipedia
GLaM dataset
LaMBDA dataset
GitHub
585000000000
"The PaLM pretraining dataset consists of a high-quality corpus of 780 billion tokens that represent a wide range of natural language use cases." 1 token ~ 0.75 words
Google TPU v4
Confident
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 c
Unreleased
Multinational
United States of America
6144
Unreleased
Industry
Flan-PaLM 540B
Language
Language modeling/generation
Google
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei
2022-10-20
Scaling Instruction-Finetuned Language Models
https://arxiv.org/abs/2210.11416
2506.00
540000000000.00
540B
2.5e+24
0.2% greater than PaLM 540B, which used 2.5e24 FLOP
Flan
Various instruction examples for many tasks: "Our final set of finetuning tasks is sourced from a combination of tasks from FLAN, T0, Natural Instructions, along with some dialog, program synthesis, and chain-of-thought reasoning tasks, as described in Figure 2. We provide specific pointers and citations in Table 24. All data sources are publicly available. We also remove all MMLU tasks from Natural Instructions to preserve its role as a broad benchmark of 57 held-out tasks for evaluation. In t
Google TPU v4
Confident
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups
Unreleased
United States of America
PaLM (540B)
5600000000000B
5.6e21 per Table 2. "we only use 0.2% of the pre-training compute to instruction-finetune Flan-PaLM 540B (approximately 512 v4 TPU chips for 37 hours)". 512 * 37 * 3600 * 275 teraflops * 0.3 = 5.6e21 (so the 30% utilization assumption was correct).
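A minimal sketch of the chip-hours estimate above (Python; 275 TFLOP/s is TPU v4 peak per chip, 0.3 the assumed utilization):
chips, hours = 512, 37
peak = 275e12                # TPU v4 peak FLOP/s per chip
utilization = 0.3
print(f"{chips * hours * 3600 * peak * utilization:.1e} FLOP")   # ~5.6e21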
512
Unreleased
Industry
Gemma 3 27B
Language
Vision
Multimodal
Language modeling/generation
Question answering
Translation
Chat
Quantitative reasoning
Visual question answering
Code generation
Google DeepMind
Core contributors: Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai Zhai, Anton Tsitsulin, Robert Busa-Fekete, Alex Feng, Noveen Sachdeva, Benjamin Coleman, Yi Gao, Basil Mustafa, Iain Barr, E
2025-03-12
Gemma 3 Technical Report
https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf
27000000000.00
Vision Encoder: 417M Embedding Parameters: 1,416M Non-embedding Parameters: 25,600M
2.268e+24
6ND = 6 * 27B parameters * 14T training tokens = 2.268 × 10^24 FLOP
Unspecified unreleased
14000000000000
14T
Google TPU v5p
Confident
We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context – at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local atten
Open weights (restricted use)
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
SigLIP 400M
6144
Unreleased
https://huggingface.co/google/gemma-3-27b-it Gemma License
Industry
Evo 2 40B
Biology
Protein or nucleotide language model (pLM/nLM)
Arc Institute
Stanford University
NVIDIA
Liquid
University of California (UC) Berkeley
Goodfire
Columbia University
University of California San Francisco
Garyk Brixi, Matthew G. Durrant, Jerome Ku, Michael Poli, Greg Brockman, Daniel Chang, Gabriel A. Gonzalez, Samuel H. King, David B. Li, Aditi T. Merchant, Mohsen Naghipourfar, Eric Nguyen, Chiara Ricci-Tam, David W. Romero, Gwanggyu Sun, Ali Taghibakshi, Anton Vorontsov, Brandon Yang, Myra Deng, Liv Gorton, Nam Nguyen, Nicholas K. Wang, Etowah Adams, Stephen A. Baccus, Steven Dillmann, Stefano Ermon, Daniel Guo, Rajesh Ilango, Ken Janik, Amy X. Lu, Reshma Mehta, Mohammad R.K. Mofrad, Madelena Y
2025-02-19
Genome modeling and design across all domains of life with Evo 2
https://arcinstitute.org/manuscripts/Evo2
40300000000.00
Table 1 lists 40.3B parameters as the model size.
2.2500000000000004e+24
40.3e9 parameters * 9.3e12 training datapoints * 6 = 2.25e24. The same FLOP estimate is given by the authors in Table 1.
OpenGenome 2
9300000000000
"We trained two versions of Evo 2: a smaller version at 7B parameters trained on 2.4 trillion tokens and a full version at 40B parameters trained on 9.3 trillion tokens."
Unverified
All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes. We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedente
Open weights (unrestricted)
United States of America
United States of America
United States of America
United States of America
United States of America
United States of America
Academia
Industry
Industry
Academia
Academia
Academia
Gemma 2 27B
Language
Language modeling/generation
Chat
Code generation
Question answering
Quantitative reasoning
Google DeepMind
Gemma Team, Google DeepMind
2024-06-24
Gemma 2 offers best-in-class performance, runs at incredible speed across different hardware and easily integrates with other AI tools.
https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf
27000000000.00
2.106e+24
"For the 27B model, we train on an 8x24x32 configuration of TPUv5p, totaling 6144 chips" trained on 13T tokens 6ND = 6*27000000000*13000000000000=2.106e+24
Unspecified unreleased
Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. Primarily English-language content. Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code or understand code-related questions. Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical quer
13000000000000
"We train Gemma 2 27B on 13 trillion tokens of primarily-English data"
Google TPU v5p
Confident
Now we’re officially releasing Gemma 2 to researchers and developers globally. Available in both 9 billion (9B) and 27 billion (27B) parameter sizes, Gemma 2 is higher-performing and more efficient at inference than the first generation, with significant safety advancements built in. In fact, at 27B, it offers competitive alternatives to models more than twice its size, delivering the kind of performance that was only possible with proprietary models as recently as December. And that’s now achie
Open weights (restricted use)
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
6144
Unreleased
Gemma 2 is available under our commercially-friendly Gemma license, giving developers and researchers the ability to share and commercialize their innovations.
Industry
FLAN 137B
Language
Language modeling
Question answering
Language modeling/generation
Google Research
Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le
2021-09-03
Finetuned Language Models Are Zero-Shot Learners
https://arxiv.org/abs/2109.01652
2994.00
137000000000.00
Abstract: "We take a 137B parameter pretrained language model and instruction tune it on over 60 NLP datasets verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types." Many models seem to be using the same 137B base transformer model?
2.047e+24
From section 2.4: pretraining was done over 2.49T tokens. 6 * 2.49T * 137B = 2.047e24 FLOP. Also, "instruction tuning takes around 60 hours on a TPUv3 with 128 cores"; 128 TPUv3 cores = 64 TPUv3 chips. The environmental considerations section claims this took less than 2% of the total: 1.23e14 * 64 * 60 * 3600 * 0.3 = 5.10e20 FLOP.
Wikipedia
Unspecified unreleased
Abstract: "We take a 137B parameter pretrained language model and instruction tune it on over 60 NLP datasets"
2490000000000
"Model architecture and pretraining. In our experiments, we use LaMDA-PT, a dense left-to-right, decoder-only transformer language model of 137B parameters (Thoppilan et al., 2022). This model is pretrained on a collection of web documents (including those with computer code), dialog data, and Wikipedia, tokenized into 2.49T BPE tokens with a 32k vocabulary using the SentencePiece library (Kudo & Richardson, 2018). Around 10% of the pretraining data was non-English. Note that LaMDA-PT only has
Google TPU v3
Confident
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning—finetuning language models on a collection of datasets described via instructions—substantially improves zeroshot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction tune it on over 60 NLP datasets verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on uns
Unreleased
Multinational
United States of America
LaMDA
"In our experiments, we use LaMDA-PT, a dense left-to-right, decoder-only transformer language model of 137B parameters (Thoppilan et al., 2022) [...] Note that LaMDA-PT only has language model pretraining (c.f. LaMDA, which was finetuned for dialog)." In our entry for LaMDA we only measured pre-training compute, so we just specify LaMDA as the base model of FLAN 137B.
Unreleased
Industry
Gemini 1.0 Pro
Multimodal
Language
Vision
Language modeling
Visual question answering
Chat
Translation
Google DeepMind
Gemini Team
2023-12-06
Gemini: A Family of Highly Capable Multimodal Models
https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf
633.00
1.8300010000000002e+24
Training compute estimated from benchmark scores. Our reasoning and calculations for Gemini 1 Ultra are detailed in this Colab notebook. https://colab.research.google.com/drive/1sfG91UfiYpEYnj_xB5YRy07T5dv-9O_c
Unspecified unreleased
"Gemini models are trained on a dataset that is both multimodal and multilingual. Our pretraining dataset uses data from web documents, books, and code, and includes image, audio, and video data... We find that data quality is critical to a highlyperforming model, and believe that many interesting questions remain around finding the optimal dataset distribution for pretraining."
Google TPU v4
Speculative
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks — notably being the first mode
API access
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
Unreleased
API access: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models
Industry
Yi-Large
Language
Chat
Language modeling/generation
01.AI
Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai
2024-05-13
100000000000.00
"Yi-Large is a software over-the-air-driven closed-source large model with a parameter of over 100 billion tokens." from https://www.chinadaily.com.cn/a/202405/13/WS6641abd1a31082fc043c6ccd.html
1.7999999999999996e+24
6ND = 6*100000000000*3000000000000=1.8e+24 (speculative confidence because training dataset size is very uncertain)
3000000000000
3T tokens for previous Yi models: "Targeted as a bilingual language model and trained on 3T multilingual corpus, the Yi series models become one of the strongest LLM worldwide, showing promise in language understanding, commonsense reasoning, reading comprehension, and more."
Speculative
API access
China
Unreleased
Industry
DeepSeek-V2.5
Language
Language modeling/generation
Chat
Code generation
DeepSeek
DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J.L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jin Chen, Jingyang Yuan, Junjie Qiu, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Lean Wang, Lecong Z
2024-09-06
DeepSeek-V2.5
https://huggingface.co/deepseek-ai/DeepSeek-V2.5
236000000000.00
21B active params, 236B total
1.7892000000000004e+24
V2.5 is a merge of V2-Coder and V2-Chat. V2-Coder was trained for 6T additional tokens from an intermediate checkpoint of V2, which had been trained for 4.2T tokens (total: 10.2T). V2-Chat is fine-tuned from V2, which saw 8.2T tokens in pre-training. Unique tokens: 8.2T + 6T = 14.2T. FLOP: 6 * 21B * 14.2T = 1.7892e24.
GitHub
Common Crawl
The original V2 had a dataset of 8.1T unique tokens, and coder-V2 added an additional 1.391T unique tokens of code and math. But it appears no additional training was done to combine them into this model.
Confident
Open weights (restricted use)
China
Unreleased
Industry
EXAONE 1.0
Multimodal
Language
Vision
Translation
Language modeling/generation
Visual question answering
LG
2021-12-14
https://www.lgcorp.com/media/release/27387#:~:text=LG%20AI%20Research%20proposes%20EXAONE,performance%20while%20learning%20fewer%20parameters.
300000000000.00
1.6955999999999996e+24
No indication of how images are processed. Supposing they used something like ViT-H/14, and training images were 512x512 (they state "EXAONE shows remarkable performance such as [...] offering 1024x1024 sized image output", but typically this size of image training would only be done during a relatively short, final stage of pre-training), there would be 37x37 = 1,369 patches per image; 1,369 * 250 million images = around 342 billion image patch embeddings. 6 * 300B parameters * (342 billion + 600 billion) datapoints ≈ 1.6956e24 FLOP.
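A minimal sketch of this speculative estimate (Python; the ViT-H/14-style patching, 512x512 training images, and the 250M-image / 600B-token counts are the assumptions described above):
import math

patches_per_image = math.ceil(512 / 14) ** 2      # 37 x 37 = 1,369 patches
image_tokens = patches_per_image * 250e6          # ~3.42e11 image patch embeddings
text_tokens = 600e9                               # "600 billion corpora"
print(f"{6 * 300e9 * (image_tokens + text_tokens):.3e} FLOP")   # ~1.70e24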
Unspecified unreleased
"To create multi-modal AI, LG AI Research Institute learned from 600 billion corpora, the world's largest, and more than 250 million high-resolution images combining language and images. It is also differentiated in that it is a bilingual AI that understands and speaks Korean and English at the level of a native speaker." 600000000000+250000000=600250000000
600250000000
Speculative
[Dec 2021] EXAONE is a bilingual artificial intelligence that has learned the characteristics of both Korean and English languages at the same time. Since the initial development last June, it has completed learning of 1.3 billion, 13 billion, 39 billion, and 175 billion parameter models, and it is currently learning 300 billion parametric models. EXAONE shows remarkable performance such as obtaining the highest FID score, offering 1024x1024 sized image output, and achieving purpose conversatio
Unreleased
Korea (Republic of)
Industry
Movie Gen Video
Video
Video generation
Meta AI
Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Mitesh Kumar Singh, Peizhao Zhang, Peter Vajda, Quentin Duval, Rohit Girdhar, Roshan Sumbaly, Sai Saketh Rambhatla, Sam Tsai, Samaneh Azadi, Samyak Datta, Sanyuan Chen, Sea
2024-10-04
Movie Gen: A Cast of Media Foundation Models
https://ai.meta.com/static-resource/movie-gen-research-paper
30000000000.00
30B
1.65e+24
Model size = 30B. Broken down by training stage (Table 3): 256px T2I: samples seen = 1.94e9; sample token length = 256; FLOP = 6ND = 8.94e22. 256px T2I/V: samples seen = 3.95e8; sample token length = 8,192; FLOP = 6ND = 5.82e23. 768px T2I/V: samples seen = 7.38e7; sample token length = 73,728; FLOP = 6ND = 9.79e23. Total FLOP = 1.65e24.
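A minimal sketch of the per-stage 6ND sum above (Python; D for each stage = samples seen x tokens per sample, from Table 3):
params = 30e9
stages = [(1.94e9, 256),      # 256px T2I
          (3.95e8, 8192),     # 256px T2I/V
          (7.38e7, 73_728)]   # 768px T2I/V
total = sum(6 * params * samples * length for samples, length in stages)
print(f"{total:.2e} FLOP")    # ~1.65e24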
26600000000
O(1B) images; O(100M) videos, each with 256 frames ≈ 25.6B frames; ≈ 26.6B datapoints in total
NVIDIA H100 SXM5 80GB
Confident
We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user’s image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generatio
Unreleased
United States of America
6144
Industry
Chameleon-34B
Multimodal
Image generation
Language
Vision
Language modeling/generation
Vision-language generation
Visual question answering
Text-to-image
Facebook AI Research
Srinivasan Iyer, Bernie Huang, Lili Yu, Arun Babu, Chunting Zhou, Kushal Tirumala, Xi Victoria Lin, Hu Xu, Xian Li, Akshat Shrivastava, Omer Levy, Armen Aghajanyan, Ram Pasunuru, Andrew Cohen, Aram H. Markosyan, Koustuv Sinha, Xiaoqing Ellen Tan, Ivan Evtimov, Ping Yu, Tianlu Wang, Olga Golovneva, Asli Celikyilmaz, Pedro Rodriguez, Leonid Shamis, Vasu Sharma, Christine Jou, Karthik Padthe, Ching-Feng Yeh, Mingda Chen, Bapi Akula, Jacob Kahn, Daniel Li, Scott Yih, Barlas Oguz, Morteza Behrooz, Be
2024-05-16
Chameleon: Mixed-Modal Early-Fusion Foundation Models
https://arxiv.org/abs/2405.09818v1
34000000000.00
1.6453571041e+24
GPU method: Table 2 shows that 34B model pre-training uses 4,282,407 GPU-hours, trained across 3072 A100s. 3.12e14 * 4282407 * 3600 * 0.3 = 1.44e24. Parameter-token method: pre-training goes over 9.2T tokens; post-training only goes over 1.1B tokens (sum of tokens column in Table 3). 6 * 34B * 9.2T = 1.88e24. Geometric mean: sqrt(1.44e24 * 1.88e24) = 1.65e24.
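A minimal sketch of the two methods and their geometric mean (Python; 3.12e14 is A100 peak FLOP/s, 0.3 the assumed utilization):
import math

compute_gpu = 3.12e14 * 4_282_407 * 3600 * 0.3   # GPU-hours method, ~1.44e24
compute_6nd = 6 * 34e9 * 9.2e12                  # parameter-token method, ~1.88e24
print(f"{math.sqrt(compute_gpu * compute_6nd):.2e} FLOP")   # ~1.65e24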
Unspecified unreleased
Pre-training: - 2.9 trillion tokens of pure text - 1.4 billion text-image pairs, which produces 1.5 trillion text-image tokens - Since each image is 1024 tokens, implies 1.43 trillion image tokens and 0.07 trillion text tokens - 400 billion tokens of image-text interleaved documents - Difficult to estimate image-to-text ratio, but references OBELIKS paper which had 141 million web pages, 353 million associated images, and 115 billion text tokens. - 353 million * 1024 = 361.5 billion image tok
4400000000000
Slightly conflicting info. Pre-training data details describe different types of data that sum to 4.8 trillion tokens, but Table 1 indicates 4.4T. Using table values as this agrees with other statements about epochs and total tokens seen.
NVIDIA A100 SXM4 80 GB
Confident
We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-fo
Open weights (non-commercial)
United States of America
Not enough info to estimate. GPU time given for pretraining, and while we know # of fine-tuning tokens we don't know # of epochs.
3072
Unreleased
https://ai.meta.com/resources/models-and-libraries/chameleon-downloads/?gk_enable=chameleon_web_flow_is_live "The models we’re releasing today were safety tuned and support mixed-modal inputs and text-only output to be used for research purposes. While we’ve taken steps to develop these models responsibly, we recognize that risks remain. At this time, we are not releasing the Chameleon image generation model."
Industry
Yi-Lightning
Language
Language modeling/generation
01.AI
2024-10-18
Yi-Lightning
https://www.lingyiwanwu.com/en https://platform.lingyiwanwu.com/
1.5e+24
The CEO of 01.AI tweeted that Yi-Lightning was trained for 1 month on 2000 H100s: https://x.com/kaifulee/status/1846310645849047524 Assuming this is accurate: (9.9e14 * 2000) FLOP/s * 1 month * 30.5 days/month * 24hr/day * 3600 s/hr * 0.3 utilization assumption = 1.565e24
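A minimal sketch of this estimate (Python; 9.9e14 FLOP/s per H100 and 0.3 utilization are the assumptions used above):
gpus, days = 2000, 30.5
peak = 9.9e14                                    # assumed per-H100 peak, FLOP/s
print(f"{gpus * peak * days * 24 * 3600 * 0.3:.3e} FLOP")   # ~1.565e24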
Unspecified unreleased
NVIDIA H100 SXM5 80GB
Confident
API access
China
2000
Unreleased
https://platform.lingyiwanwu.com/
Industry
Qwen-72B
Language
Chat
Code generation
Alibaba
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zha
2023-11-30
https://huggingface.co/Qwen/Qwen-72B
72000000000.00
72B
1.3e+24
72 billion params, 3 trillion tokens: 72B * 3T * 6 ≈ 1.3e24
"It is pretrained on over 3 trillion tokens, including Chinese, English, multilingual texts, code, and mathematics, covering general and professional fields"
3000000000000
Assuming not trained for multiple epochs.
Confident
Qwen-72B is the 72B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-72B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. Additionally, based on the pretrained Qwen-72B, we release Qwen-72B-Chat, a large-model-based AI assistant, which is trained with alignment techniques.
Open weights (restricted use)
China
Unreleased
up to 100m active users: https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT
Industry
Qwen1.5-72B
Language
Chat
Language modeling/generation
Quantitative reasoning
Code generation
Translation
Alibaba
Qwen Team
2024-02-04
Introducing Qwen1.5
https://qwenlm.github.io/blog/qwen1.5/
72000000000.00
72B
1.3e+24
3T training tokens: https://github.com/QwenLM/Qwen2/issues/97 6 * 72 billion * 3 trillion = ~1.3e24
Unspecified unreleased
"We pretrained the models with a large amount of data, and we post-trained the models with both supervised finetuning and direct preference optimization."
3000000000000
3 trillion tokens from this response https://github.com/QwenLM/Qwen2/issues/97
Confident
In recent months, our focus has been on developing a “good” model while optimizing the developer experience. As we progress towards Qwen1.5, the next iteration in our Qwen series, this update arrives just before the Chinese New Year. With Qwen1.5, we are open-sourcing base and chat models across six sizes: 0.5B, 1.8B, 4B, 7B, 14B, and 72B. In line with tradition, we’re also providing quantized models, including Int4 and Int8 GPTQ models, as well as AWQ and GGUF quantized models. To enhance the d
Open weights (restricted use)
China
Unreleased
restriction on >100m monthly users: https://huggingface.co/Qwen/Qwen1.5-72B/blob/main/LICENSE
Industry
DeepSeek-Coder-V2 236B
Language
Code generation
Code autocompletion
DeepSeek
Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi Du, Wenjun Gao, Xuan Lu, Qinyu Chen, Yaohui Wang, Chengqi Deng, Jiashi Li, Chenggang Zhao, Chong Ruan, Fuli Luo, Wenfeng Liang
2024-06-17
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
https://github.com/deepseek-ai/DeepSeek-Coder-V2
236000000000.00
Mixture of experts model. 21B parameters activated per token.
1.2852e+24
Trained on a total of 10.2T tokens. 6ND: 6 * 10.2T tokens * 21B active parameters = 1.285e24 FLOP
GitHub
Common Crawl
See Section 2. "In the pre-training phase, the dataset of DeepSeek-Coder-V2 is created with a composition of 60% source code, 10% math corpus, and 30% natural language corpus"
3191000000000
"In the pre-training phase, the dataset of DeepSeek-Coder-V2 is created with a composition of 60% source code, 10% math corpus, and 30% natural language corpus ... The source code consists of 1,170B code-related tokens sourced from GitHub and CommonCrawl... For the math corpus, we collect 221B math-related tokens sourced from CommonCrawl... In total, DeepSeek-Coder-V2 has been exposed to 10.2T training tokens, where 4.2 trillion tokens originate from the DeepSeek V2 dataset, while the remaining
Confident
We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general l
Open weights (restricted use)
China
DeepSeek-V2 (MoE-236B)
Unreleased
license has some harmful use restrictions: https://github.com/deepseek-ai/DeepSeek-Coder-V2/blob/main/LICENSE-MODEL no training code
Industry
EXAONE 3.5-R 32B
Language
Language modeling/generation
Question answering
Translation
LG AI Research
2025-03-14
32000000000.00
32B
1.2692e+24
1.25 × 10^24 (base model reported training compute) + 1.92 × 10^22 (finetune compute) = 1.2692e+24 FLOP
Confident
Unreleased
Korea (Republic of)
EXAONE 3.5 32B
19200000000000B
1.92e22
Unreleased
Industry
Code Llama-70B
Language
Code generation
Meta AI
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Ellen Tan, Yossef (Yossi) Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Defossez, Jade Copet, Faisal Azhar, Hugo Touvron, Gabriel Synnaeve, Louis Martin, Nicolas Usunier, Thomas Scialom
2024-01-29
Code Llama: Open Foundation Models for Code
https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/ https://arxiv.org/abs/2308.12950
1297.00
70000000000.00
70B
1.26e+24
The base model saw 2T tokens; Code Llama-70B was trained on an additional 1T. 6ND: 6 * 3T tokens * 70B = 1.26e24 FLOP
Unspecified unreleased
We are releasing four sizes of Code Llama with 7B, 13B, 34B, and 70B parameters respectively. Each of these models is trained with 500B tokens of code and code-related data, apart from 70B, which is trained on 1T tokens.
1000000000000
Llama 70B training dataset was 2 trillion tokens. Code Llama finetuning dataset was 1 trillion tokens of code.
NVIDIA A100 SXM4 80 GB
Confident
We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters
Open weights (restricted use)
United States of America
Llama 2-70B
420000000000000B
Fine tuning from base model uses 1T tokens. 70B * 1T * 6 = 4.2E23
400
Unreleased
Llama 2 license. can't use outputs to train models. https://github.com/meta-llama/llama/blob/main/LICENSE
Industry
EXAONE Deep 32B
Language
Language modeling/generation
Question answering
Quantitative reasoning
Code generation
LG AI Research
LG AI Research, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun
2025-03-16
EXAONE Deep: LLMs with Enhanced Reasoning Performance
https://arxiv.org/abs/2503.12524
32000000000.00
32B
1.26e+24
1.25 × 10^24 (base model reported training compute) + 7.04 × 10^21 (finetune compute) = 1.26 × 10^24 FLOP (Table 1)
Unspecified unreleased
12000000000
"To enhance the reasoning capabilities of language models, we have utilized 1.6M instances for SFT and 20K instances of preference data for DPO. The SFT dataset contains approximately 12B tokens"
NVIDIA H100 SXM5 80GB
Confident
We present EXAONE Deep series, which exhibits superior capabilities in various reasoning tasks, including math and coding benchmarks. We train our models mainly on the reasoning-specialized dataset that incorporates long streams of thought processes. Evaluation results show that our smaller models, EXAONE Deep 2.4B and 7.8B, outperform other models of comparable size, while the largest model, EXAONE Deep 32B, demonstrates competitive performance against leading open-weight models. All EXAONE Dee
Open weights (non-commercial)
Korea (Republic of)
EXAONE 3.5 32B
7040000000000B
Table 1 (reported): 7.04 × 10^21 FLOP. 6ND = 6 * 32B parameters * 12B tokens = 2.304e+21 FLOP.
512
Unreleased
https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-32B Exaone License
Industry
EXAONE 3.5 32B
Language
Language modeling/generation
Question answering
Translation
LG AI Research
Soyoung An, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Yountae Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Woohyung Lim, Sangha Park, Sooyoun Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Hyeongu Yun
2024-12-09
EXAONE 3.5: Series of Large Language Models for Real-world Use Cases
https://arxiv.org/abs/2412.04862
32000000000.00
32B
1.25e+24
1.25 × 10^24 (Table 2)
Unspecified unreleased
6500000000000
6.5T tokens (Table 2)
Confident
This technical report introduces the EXAONE 3.5 instruction-tuned language models, developed and released by LG AI Research. The EXAONE 3.5 language models are offered in three configurations: 32B, 7.8B, and 2.4B. These models feature several standout capabilities: 1) exceptional instruction following capabilities in real-world scenarios, achieving the highest scores across seven benchmarks, 2) outstanding long-context comprehension, attaining the top performance in four benchmarks, and 3) compe
Open weights (non-commercial)
Korea (Republic of)
Unreleased
Exaone license (allows only non-commercial usage)
Industry
XVERSE-65B-2
Language
Chat
Language modeling/generation
XVERSE Technology
Shenzhen Yuanxiang Technology
2023-12-08
https://github.com/xverse-ai/XVERSE-65B/blob/main/README_EN.md
65000000000.00
Based on the name. Exact count unknown but may be listed on Hugging Face.
1.248e+24
C = 6ND = 6 * 3.2T tokens * 65B params = 1.248e24 FLOP
[2023/12/08] Released the XVERSE-65B-2 base model. This model builds upon its predecessor through Continual Pre-Training, reaching a total training volume of 3.2 trillion tokens.
3200000000000
Training Data: The model has been thoroughly trained on a diversified and high-quality dataset consisting of 2.6 trillion of tokens, including more than 40 languages such as Chinese, English, Russian, and Spanish. The sampling ratio of different types of data is finely set, which makes the performance of Chinese and English excellent, and also takes into account the effect of other languages. Assume 0.85 words per token on average for the mix of languages.
Confident
Open weights (restricted use)
China
China
Open source
https://github.com/xverse-ai/XVERSE-65B/blob/main/README_EN.md license info: "The use of the source code in this repository must follow the Apache-2.0 open-source license, while the use of the model weights of XVERSE-65B needs to adhere to the Model License Agreement. The XVERSE-65B model weights are fully open to academic research and support free commercial use. To apply for a commercial license, please fill in the application form. For other questions or collaborations, please contact opens
Industry
Industry
SEA-LION V3 Llama3.1 8B
Language
Language modeling/generation
AI Singapore
2024-12-19
SEA-LION V3
https://huggingface.co/aisingapore/llama3.1-8b-cpt-sea-lionv3-base
8000000000.00
1.23330162e+24
Llama 3.1-8B base model: 1.224e+24 FLOP. Additional pretraining compute: 136 hours * 3600 s/hour * 9.895e14 FLOP/s * 64 GPUs * 0.3 utilization = 9.30162e+21 FLOP. Total: 9.30162e+21 + 1.224e+24 = 1.23330162e+24 FLOP (see the worked example after this record).
The Stack v2
Dolma
Trained on a mix of datasets including StackV2 and Dolma (see https://huggingface.co/aisingapore/llama3.1-8b-cpt-sea-lionv3-base#data)
200000000000
"pre-trained on 200B tokens"
NVIDIA H200 SXM
Unverified
Our SEA-LION v3 Llama3.1 8B and 70B base models have been continued pre-trained on top of the Llama3.1 8B and 70B models respectively. Both have a context length of 128K, making them the SEA-LION models with the longest context length to date.
Open weights (unrestricted)
Llama 3.1-8B
64
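A minimal sketch (not from the source) of the SEA-LION V3 estimate above: a hardware-time term (peak throughput × GPU count × wall-clock time × assumed utilization) added on top of the Llama 3.1-8B base-model compute. The constant and function names are ours; the numbers are those in the note.

```python
# Continued-pretraining compute added to the base model's compute.
H100_PEAK_BF16 = 989.5e12   # FLOP/s, dense BF16 peak used in the note

def hardware_flop(gpus: int, hours: float, peak: float, utilization: float) -> float:
    return gpus * hours * 3600 * peak * utilization

base_model_flop = 1.224e24   # Llama 3.1-8B estimate (separate record)
cpt_flop = hardware_flop(gpus=64, hours=136, peak=H100_PEAK_BF16, utilization=0.3)

print(f"continued pretraining: {cpt_flop:.4e} FLOP")                   # ~9.302e+21
print(f"total:                 {base_model_flop + cpt_flop:.4e} FLOP")  # ~1.233e+24
```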
Llama 3.1-8B
Language
Language modeling/generation
Meta AI
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu,
2024-07-23
The Llama 3 Herd of Models
https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
8000000000.00
8B
1.224e+24
Huggingface page says 3.1-8B used 1.46M H100 hours and trained over 15T tokens. https://huggingface.co/meta-llama/Llama-3.1-70B The paper also says that 3.1-405B got MFU of between 38-43%; presumably 8B was around the same or a bit higher, so assume utilization of 40%. 6ND: 6 * 15T * 8B = 7.2e23 FLOP. Hardware: 1.46M * 9.9e14 * 3600 * 0.4 = 2.08e24 FLOP. Geometric mean: sqrt(7.2e23 * 2.08e24) = 1.224e24. Note that Llama 3-8B also said it used 15T tokens, but only 1.3M H100 hours. This sugges
Llama 3 dataset
Unverified
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models s
Open weights (restricted use)
United States of America
Open (restricted use)
Llama 3.1 license: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE must seek separate license if over 700m monthly users, acceptable use restrictions code here: https://github.com/meta-llama/llama3/tree/main
Industry
Mi:dm 200B
Language
Language modeling/generation
KT
2023-10-31
https://genielabs.ai/midm/about
200000000000.00
200B
1.2e+24
6ND = 6 * 200B parameters * 1T tokens = 1.2e+24 FLOP
1000000000000
Mi:dm is the first Korean LLM trained on over 1 trillion tokens.
Confident
TL;DR: KT Corp introduces Mi:dm, a massive AI model aimed at diverse sectors. Mi:dm is the first Korean LLM trained on over 1 trillion tokens. It offers four models, from basic to large, with up to 200 billion parameters. KT plans to share Mi:dm’s foundational model with other companies. Three advanced technologies reduce AI hallucinations by up to 70%. Collaborations with AI startups, including Upstage, aim to conquer the global generative AI market.
API access
Korea (Republic of)
Unreleased
KT said it will open up the foundation model of Mi:dm to other companies, providing a full AI development package, including KT Cloud's hyperscale AI computing service and AI chip startup Rebellions Inc.'s neural processing unit infrastructure, fostering the development of various AI services.
Industry
Hunyuan
Language
Image generation
Multimodal
Language modeling/generation
Image generation
Question answering
Tencent
2023-09-07
Tencent Unveils Hunyuan, its Proprietary Large Foundation Model on Tencent Cloud
https://www.tencent.com/en-us/articles/2201685.html
100000000000.00
"Presently, the Hunyuan model has over 100 billion parameters, with more than two trillion tokens in pre-training data."
1.2e+24
6ND = 6*100*10^9*2*10^12 = 1.2*10^24
Unspecified unreleased
2000000000000
"Presently, the Hunyuan model has over 100 billion parameters, with more than two trillion tokens in pre-training data."
Confident
Enterprises in China may now access Hunyuan via Tencent’s public cloud platform and finetune it to their specific needs. The platform features strong Chinese language processing abilities, advanced logical reasoning, and comes with reliable task execution abilities. Tencent’s foundation model supports a wide array of functions spanning the creation of images, copywriting, text recognition, and customer service, to name a few. These will be instrumental in key industries like finance, public ser
API access
China
Unreleased
Industry
PLaMo-100B
Language
Language modeling/generation
Preferred Networks Inc
Preferred Elements (PFE)
2024-06-14
Pre-training of the proprietary LLM "PLaMo-100B" with 100 billion parameters
https://tech.preferred.jp/ja/blog/plamo-100b/
100000000000.00
1.2e+24
6*100B*2T=1.2e24
"The pre-trained model of PLaMo-100B developed this time was trained on a total of 2T tokens of both Japanese and English text data."
Unverified
Preferred Elements (PFE), a subsidiary of Preferred Networks (PFN), has been developing a 100 billion (100B) parameter LLM called "PLaMo-100B" since February. The pre-training part of the development of PLaMo-100B was completed in May, so in this article we will introduce the pre-training part of this model.
API access
Japan
Industry
Luca 2.0
Mianbi Intelligence
2023-08-29
https://www.163.com/dy/article/IDBGA8840511FQO9.html
100000000000.00
https://www.leiphone.com/category/ai/23kbzQXj60xZgUgO.html English translation: "Li Dahai: From a technical point of view, the CPM2 (Chinese Pretrained Model) 100 billion model we launched at that time was a sparse model of MoE, which is different from the 100 billion model we are promoting now." This suggests it is a dense model.
1.2e+24
Assume a Chinchilla-optimal dataset size of 20 tokens per parameter: 6 * 100B params * (20 * 100B) tokens = 1.2e24 FLOP (see the worked example after this record).
Unverified
China
Industry
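A minimal sketch (not from the source) of the assumption used for Luca 2.0 above: with no reported dataset size, the note assumes a Chinchilla-optimal ratio of roughly 20 training tokens per parameter before applying 6ND. The constant name is ours.

```python
# Chinchilla-optimal assumption: ~20 tokens per parameter, then 6ND.
TOKENS_PER_PARAM = 20   # assumed ratio

params = 100e9
tokens = TOKENS_PER_PARAM * params   # 2e12 assumed training tokens
compute = 6 * params * tokens

print(f"assumed tokens: {tokens:.1e}")         # 2.0e+12
print(f"compute:        {compute:.2e} FLOP")   # 1.20e+24, as in the note
```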
Megatron-Turing NLG 530B
Language
Language modeling
Microsoft
NVIDIA
Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro
2021-10-11
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
https://arxiv.org/abs/2201.11990
657.00
530000000000.00
1.17e+24
https://www.lesswrong.com/posts/bGuMrzhJdENCo8BxX/nvidia-and-microsoft-releases-530b-parameter-transformer?commentId=HSJSNspKp94tFcSCx source: https://lair.lighton.ai/akronomicon/ 9938 PF-days * 3600 * 24 * 10^15 = 8.586432e+23
Common Crawl
The Pile
CC-Stories
Realnews
In addition to Common Crawl data, we leveraged a number of other previously generated datasets. From The Pile, we selected Books3, OpenWebText2, Stack Exchange, PubMed Abstracts, Wikipedia, Gutenberg (PG-19), BookCorpus2, NIH ExPorter, and Pile-CC datasets. We also included the CC-Stories and RealNews datasets used to train Megatron
270000000000
"Our training dataset consists of 339 billion tokens and we trained MT-NLG on 270 billions tokens by blending the 15 training datasets as described above. We also set aside 2% of our data for validation." 1 token ~ 0.75 words
NVIDIA A100 SXM4 80 GB
Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of
Unreleased
United States of America
United States of America
4480
Unreleased
Industry
Industry
Mistral Small 3
Language
Language modeling/generation
Question answering
Quantitative reasoning
Code generation
Translation
Mistral AI
2025-01-30
Mistral Small 3, a latency-optimized 24B-parameter model released under the Apache 2.0 license.
https://mistral.ai/news/mistral-small-3/
24000000000.00
24B
1.152e+24
6ND = 6*8T tokens * 24B parameters = 1.152e+24 FLOP
Unspecified unreleased
"Notably, Mistral Small 3 was developed without reinforcement learning or synthetic training data, techniques commonly used by competitors. Lample said this “raw” approach helps avoid embedding unwanted biases that could be difficult to detect later."
8000000000000
8 trillion tokens Source: https://venturebeat.com/ai/mistral-small-3-brings-open-source-ai-to-the-masses-smaller-faster-and-cheaper/
Confident
Mistral Small 3 is competitive with larger models such as Llama 3.3 70B or Qwen 32B, and is an excellent open replacement for opaque proprietary models like GPT4o-mini. Mistral Small 3 is on par with Llama 3.3 70B instruct, while being more than 3x faster on the same hardware. Mistral Small 3 is a pre-trained and instructed model catered to the ‘80%’ of generative AI tasks—those that require robust language and instruction following performance, with very low latency. We designed this new mode
Open weights (unrestricted)
France
Unreleased
Apache 2.0 https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501
Industry
Qwen2.5-Coder (32B)
Language
Language modeling/generation
Code generation
Alibaba
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, Junyang Lin
2024-11-12
Qwen2.5-Coder Technical Report
https://arxiv.org/abs/2409.12186
32500000000.00
32.5B (31B - non emb)
1.0725e+24
Assuming 1 epoch: 6ND = 6 * 32.5e9 parameters * 5.5e12 tokens = 1.0725e+24 FLOP
GitHub
Common Crawl
Unspecified unreleased
"We collected public repositories from GitHub created before February 2024" "We curated a large-scale and high-quality text-code mixed dataset from Common Crawl, which includes code-related documentation, tutorials, blogs, and more" "We used CodeQwen1.5, the predecessor of Qwen2.5-Coder, to generate large-scale synthetic datasets."
5500000000000
"As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens."
Confident
In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes six models: Qwen2.5-Coder-(0.5B/1.5B/3B/7B/14B/32B). As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities w
Open weights (unrestricted)
China
Unreleased
Apache 2.0 https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct though they have apache 2.0 github repository it seems to be inference code rather than training code
Industry
ESM3 (98B)
Biology
Protein generation
EvolutionaryScale
University of California (UC) Berkeley
Thomas Hayes, Roshan Rao, Halil Akin, Nicholas James Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Quy Tran, Jonathan Deaton, Marius Wiggert, Rohil Badkundri, Irhum Shafkat, Jun Gong, Alexander Derry, Raul Santiago Molina, Neil Thomas, Yousuf Khan, Chetan Mishra, Carolyn Kim, Liam J Bartie, Patrick D Hsu, Tom Sercu, Salvatore Candido, Alexander Rives
2024-06-25
ESM3: Simulating 500 million years of evolution with a language model
https://www.evolutionaryscale.ai/blog/esm3-release
98500000000.00
98.5 billion (Table S1)
1.07e+24
"ESM3 at its largest scale was trained with 1.07×10^24 FLOPs on 2.78 billion proteins and 771 billion unique tokens, and has 98 billion parameters." per Table 1, trained 98B model on 1.8T training tokens. 98 billion * 1800 billion * 6 = 1.06e24. Likely some rounding, so will go with developer's reported count.
ESM3 Dataset
771000000000
771 billion tokens
Confident
More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained on tokens generated by evolution can act as evolutionary simulators to generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. ESM3 can follow complex prompts combining its modalities and is
Unreleased
United States of America
United States of America
Unreleased
only small version released
Industry
Academia
BlueLM 175B
Language
Chat
Language modeling/generation
Question answering
vivo AI lab
2023-11-02
https://baijiahao.baidu.com/s?id=1781445143383237948&wfr=spider&for=pc
175000000000.00
1.05e+24
6ND = 6*175B*1000B=1.05e+24
Unspecified unreleased
1000B text data, 10B image data, 100M video data, 100M knowledge graph (from the conference handout)
Confident
Unreleased
China
Unreleased
information about the model is from their paper catalogue and not found on the internet
Industry
ERNIE 3.0 Titan
Language
Language modeling
Language modeling/generation
Baidu
Peng Cheng Laboratory
Shuohuan Wang, Yu Sun, Yang Xiang, Zhihua Wu, Siyu Ding, Weibao Gong, Shikun Feng
2021-12-23
ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
https://arxiv.org/abs/2112.12731
70.00
260000000000.00
"[We] developed... distributed training technology, including fine-grained parallelism, heterogeneous hardware-aware training, and fault tolerance mechanism to train the 260B model on both Nvidia V100 GPU and Ascend 910 NPU clusters." See also: https://twitter.com/BaiduResearch/status/1468633977242243078?t=6q4zuLNdTSc4GUBe9OM5Aw&s=19
1.0421e+24
The paper suggests that ERNIE 3.0 Titan uses more compute than GPT-3. This is consistent with the 6ND approximation. C = 6ND = 6 (FLOP/param/token) * (260B params) * (668B tokens) = 1.0421*10^24 FLOP
ERNIE 3.0 Corpus
668000000000
"To ensure the success of the pre-training of ERNIE 3.0 Titan, we utilize the ERNIE 3.0 Corpus [ 2 ], a large-scale, wide-variety, and high-quality Chinese text corpora amounting to 4TB" Assuming 167M words/tokens per GB
NVIDIA Tesla V100 DGXS 32 GB
Huawei Ascend 910
Confident
Pre-trained language models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. GPT-3 has shown that scaling up pre-trained language models can further exploit their enormous potential. A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge enhanced models and trained a model with 10 billion parameters. ERNIE 3.0 outperformed the state-of-the-art models on various NLP tasks. In order to explore the performance of sc
Hosted access (no API)
China
China
1920
Unreleased
The Ernie 3.0 Titan model was used in Ernie bot. Today, ERNIE has been widely deployed across finance, healthcare, insurance, equity, Internet, logistics, and other fields. http://research.baidu.com/Blog/index-view?id=165
Industry
Academia
TigerBot-70B
Language
Chat
Language generation
Language modeling/generation
Question answering
Tigerobo
Ye Chen, Wei Cai, Liangmin Wu, Xiaowei Li, Zhanxuan Xin, Cong Fu
2023-09-06
TigerBot: An Open Multilingual Multitask LLM
https://github.com/TigerResearch/TigerBot/blob/main/README_en.md, https://arxiv.org/abs/2312.08688
70000000000.00
70B
1.02e+24
~1.02e24. Tigerobo did ~2.1e23 FLOP of additional pre-training on top of Llama 2-70B, which we estimated was trained on 8.1e23 FLOP.
"Tigerbot-70b is further pre-trained on the foundation of Llama-2-70b using high-quality multi-language data of 300 billion tokens. " "We collected data from Chinese books, the internet, and encyclopedia-type data based on the distribution of GPT3 pretraining data, and filtered the data through source quality control and tf-idf soft deduplication. From 20TB of data, we filtered down to 2TB, maintaining the proportion of language and categories. On this basis, we randomly sampled 100G of data an
300000000000
NVIDIA A100
Confident
(translated from https://github.com/TigerResearch/TigerBot/wiki/TigerBot%E2%80%9070B%E5%8F%91%E5%B8%83%EF%BC%81) We are pleased to release Tigerbot-70b, which continues to be open source and free for commercial use, including: Tigerbot-70b-base: Continuing pre-training on the basis of Llama-2-70b, the model's comprehensive capabilities are better than Llama-2-70b in 10 mainstream benchmark tests such as mmlu, reaching SOTA in the industry. a. Using high-quality multi-lingual data of 300 billi
Open weights (restricted use)
China
Llama 2-70B
126000000000000B
6ND = 6 * 70B params * 300B tokens = 1.26e+23 FLOP
512
Open source
Apache 2.0 https://github.com/TigerResearch/TigerBot/blob/main/README_en.md but it's also a Llama 2 finetune. training code here: https://github.com/TigerResearch/TigerBot/tree/main/train They released a 5% sample of training data: " On this basis, we randomly sampled 100G of data and released it open source."
Industry
DeepSeek-V2 (MoE-236B)
Language
Language modeling/generation
Chat
Code generation
DeepSeek
DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J.L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jin Chen, Jingyang Yuan, Junjie Qiu, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Lean Wang, Lecong Z
2024-05-07
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
https://arxiv.org/abs/2405.04434 https://github.com/deepseek-ai/DeepSeek-V2
236000000000.00
21B active params, 236B total
1.02e+24
21b active params * 8.1 trillion * 6 = 1.02e24
Unspecified unreleased
"We construct a high-quality and multi-source pre-training corpus consisting of 8.1T tokens. Compared with the corpus used in DeepSeek 67B (our previous release) (DeepSeek-AI, 2024), this corpus features an extended amount of data, especially Chinese data, and higher data quality. We first pretrain DeepSeek-V2 on the full pre-training corpus"
8100000000000
8.1 Trillion
NVIDIA H800 SXM5
Confident
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE e
Open weights (restricted use)
China
Unreleased
open weights with harmful use restrictions: https://github.com/deepseek-ai/DeepSeek-V2/blob/main/LICENSE-MODEL
Industry
Inflection-1
Language
Language modeling
Inflection AI
2023-06-23
Inflection-1 technical memo
https://inflection.ai/assets/Inflection-1.pdf
1.0001e+24
<= 2.5e24. They define two "compute classes": one for models with more compute than PaLM 540B (i.e., GPT-4 and PaLM 2), and one for models with as much compute or less (i.e., GPT-3.5, Chinchilla, LLaMA, and Inflection-1). PaLM 540B required 2.5e24 FLOP to train (confirmed by Google).
"Inflection-1 was trained using thousands of NVIDIA H100 GPUs on a very large dataset."
NVIDIA H100 SXM5 80GB
Speculative
Large language models (LLMs) based on the Transformer architecture have been shown to possess a range of advanced capabilities in language generation and understanding. These capabilities have paved the way for deployment of LLMs in products like OpenAI’s Chat-GPT and Google’s Bard. At Inflection AI, our mission is to create personal AIs for everyone, and in May 2023 we released Pi (pi.ai) – an LLM-based personal AI which is designed to be empathetic, useful, and safe. In this work we introduce
Hosted access (no API)
United States of America
Unreleased
Industry
MegaScale (530B)
Language
Language modeling/generation
ByteDance
Peking University
Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu
2024-02-23
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
https://arxiv.org/abs/2402.15627
40.00
530000000000.00
Two models were trained to evaluate the MegaScale training system: one with 175B and another with 530B parameters. This entry reports the 530B model. A third production model is also mentioned, with fewer details.
9.691e+23
The 175B model uses 3.2e23 FLOP (Table 2, bottom row). With constant dataset size and utilization, FLOP should scale linearly with the number of parameters, so: 3.2e23 * (530/175) = 9.7e23 (see the worked example after this record).
The 175B and 530B models trained for the paper use 300B tokens each.
300000000000
300B tokens
NVIDIA A100
Confident
We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pip
Unreleased
China
China
11200
Unreleased
they open-sourced their framework but don't see training code for their big model. https://github.com/volcengine/vescale Model weights are unreleased
Industry
Academia
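A minimal sketch (not from the source) of the scaling argument in the MegaScale compute note above: with the 300B-token dataset and utilization held fixed, training FLOP scales linearly with parameter count, so the 530B figure is extrapolated from the reported 175B figure. The 6ND cross-check at the end is ours.

```python
# Linear scaling of training FLOP with parameter count at fixed tokens.
flop_175b = 3.2e23                    # reported for the 175B model (Table 2)
flop_530b = flop_175b * (530 / 175)   # scale to 530B parameters

print(f"530B estimate: {flop_530b:.3e} FLOP")   # ~9.69e+23

# Rough cross-check with 6ND at 300B tokens (same order of magnitude):
print(f"6ND check:     {6 * 530e9 * 300e9:.3e} FLOP")   # ~9.54e+23
```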
Phi-4
Language
Language modeling/generation
Question answering
Code generation
Quantitative reasoning
Microsoft Research
Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu, Cyril Zhang, Yi Zhang
2024-12-12
Phi-4 Technical Report
https://arxiv.org/abs/2412.08905
14000000000.00
14B
9.3202015e+23
6ND = 6 * 14*10^9 parameters * 10*10^12 tokens = 8.4e+23 FLOP. Hardware: 989500000000000 FLOP/s [assumed bf16 precision] * 1920 GPUs * 504 hours * 3600 sec/hour * 0.3 [assumed utilization] = 1.0341209e+24 FLOP. Geometric mean: sqrt(8.4e+23 * 1.0341209e+24) = 9.3202015e+23 (see the worked example after this record).
Unspecified unreleased
"The pretraining phase of phi-4 relies heavily on synthetic datasets generated through a variety of techniques. " "We collected a wide variety of high-quality organic data sources for phi-4, prioritizing reasoning-dense and nuanced material (e.g., academic papers, educational forums, and programming tutorials)." "Our post-training data is composed of: • Supervised Fine-Tuning (SFT) Datasets • Direct Preference Optimization (DPO)
10000000000000
"The model was pretrained for approximately 10T tokens using linear warm-up and decay schedules with peak learning rate of 0.0003, constant weight decay of 0.1, and global batch size of 5760. " Table 5: Web 15% 1.3T unique tokens 1.2 epochs Web rewrites 15% 290B unique tokens 5.2 epochs Synthetic 40% 290B unique tokens 13.8 epochs Code data 20% 820B unique tokens 2.4 epochs Acquired sources 10% 580B unique tokens 1.7 epochs
NVIDIA H100 SXM5 80GB
Confident
We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on ST
Open weights (unrestricted)
United States of America
United Kingdom of Great Britain and Northern Ireland
1920
Unreleased
"Phi-4 is currently available on Azure AI Foundry under a Microsoft Research License Agreement (MSRLA) and will be available on Hugging Face next week. " Hugging Face: MIT license https://huggingface.co/microsoft/phi-4
Industry
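A minimal sketch (not from the source) of the phi-4 estimate above: a 6ND count and a hardware-time count disagree somewhat, so the note takes their geometric mean. The BF16 peak and 30% utilization are the assumptions already flagged in the note.

```python
# Geometric mean of a 6ND estimate and a hardware-time estimate.
import math

nd_flop = 6 * 14e9 * 10e12                     # 6ND: 8.4e+23
hw_flop = 989.5e12 * 1920 * 504 * 3600 * 0.3   # peak * GPUs * hours * s/hour * util ~ 1.034e+24

estimate = math.sqrt(nd_flop * hw_flop)
print(f"6ND:       {nd_flop:.3e} FLOP")
print(f"hardware:  {hw_flop:.4e} FLOP")
print(f"geo. mean: {estimate:.4e} FLOP")        # ~9.32e+23, as recorded
```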
Gemma 3 12B
Language
Vision
Multimodal
Language modeling/generation
Question answering
Translation
Chat
Quantitative reasoning
Visual question answering
Code generation
Google DeepMind
Core contributors: Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai Zhai, Anton Tsitsulin, Robert Busa-Fekete, Alex Feng, Noveen Sachdeva, Benjamin Coleman, Yi Gao, Basil Mustafa, Iain Barr, E
2025-03-12
Gemma 3 Technical Report
https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf
12000000000.00
Vision encoder: 417M; embedding parameters: 1,012M; non-embedding parameters: 10,759M
8.64e+23
6ND = 6 * 12B parameters * 12T training tokens = 8.64 × 10^23 FLOP
Unspecified unreleased
12000000000000
12T
Google TPU v4
Confident
We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context – at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local atten
Open weights (restricted use)
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
SigLIP 400M
6144
Unreleased
https://huggingface.co/google/gemma-3-12b-it Gemma License
Industry
Baichuan2-53B
Language
Language modeling/generation
Chat
Baichuan
2023-08-09
Chinese AI startup Baichuan rolls out third LLM in four months
https://technode.com/2023/08/09/chinese-ai-startup-baichuan-rolls-out-third-llm-in-four-months/
53000000000.00
8.268e+23
Given that it was announced at a similar time to the other Baichuan2 models, this assumes that the dataset size is the same at 2.6T tokens while the parameter count was scaled up. This would be consistent with many other model releases, such as Meta's Llama models. 53b * 2.6t * 6 = 8.268e23
Likely
On Tuesday, four-month-old AI startup Baichuan Intelligent Technology unveiled its first closed-source model equipped with 53 billion parameters. Following the Chinese company’s rapid release of two open-source large language models since its founding in April, the new model demonstrates the firm’s fast pace in delivering pre-trained models for larger parameters. The freshly introduced model, Baichuan-53B, is mainly for corporate clients and focused on text generation. A ChatGPT-like chat servic
China
Industry
Qwen2.5 Instruct (7B)
Language
Code generation
Code autocompletion
Quantitative reasoning
Question answering
Language modeling/generation
Alibaba
Qwen Team
2024-09-19
Qwen2.5: A Party of Foundation Models!
https://qwenlm.github.io/blog/qwen2.5/
7610000000.00
8.2188e+23
6ND = 6 * 7.61e9 parameters * 18e12 tokens = 8.2188e+23 FLOP (might be less if the entire training dataset was not used)
Unspecified unreleased
18000000000000
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens."
Likely
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2: Significantly more knowledge and has greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains. Significant improvements in instruction following, generating long texts (over 8K tokens), unde
Open weights (restricted use)
China
Qwen2.5-7B
requires permission to use in applications with 100K+ users https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
Industry
Qwen2.5-7B
Language
Language modeling/generation
Question answering
Quantitative reasoning
Alibaba
Qwen Team
2024-09-19
Qwen2.5: A Party of Foundation Models!
https://qwenlm.github.io/blog/qwen2.5/
7610000000.00
7.61B
8.2188e+23
Training dataset size was 18 trillion tokens. 6ND = 6 * 7.61 billion parameters * 18 trillion tokens = 8.2188e+23 FLOP.
Unspecified unreleased
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens."
18000000000000
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens"
Confident
In the past three months since Qwen2’s release, numerous developers have built new models on the Qwen2 language models, providing us with valuable feedback. During this period, we have focused on creating smarter and more knowledgeable language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5. We are announcing what might be the largest opensource release in history! Let’s get the party started! The Qwen2.5-7B model surpasses its predecessors and counte
Open weights (unrestricted)
China
Unreleased
Apache 2.0
Industry
Llama 2-70B
Language
Language modeling
Meta AI
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Mar
2023-07-18
Llama 2: Open Foundation and Fine-Tuned Chat Models
https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/ https://arxiv.org/abs/2307.09288
8056.00
70000000000.00
Llama has been released in 7B, 13B, 34B, and 70B variants.
8.1e+23
"Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB" of which 1720320 GPU hours were used to train the 70B model. 311.84 BF16 TFLOP/s * 1720320 hours * 0.40 utilization = 7.725e+23 FLOP. Alternatively: the model was trained for 1 epoch on 2 trillion tokens and has 70B parameters. C = 6ND = 6*70B*2T = 8.4e+23 FLOP.
Llama 2 dataset
2 trillion tokens of publicly available text, with no text from Meta's products. "Our training corpus includes a new mix of data from publicly available sources, which does not include data from Meta’s products or services. We made an effort to remove data from certain sites known to contain a high volume of personal information about private individuals. We trained on 2 trillion tokens of data as this provides a good performance–cost trade-off, up-sampling the most factual sources in an effort
1500000000000
2 trillion tokens ~= 1.5 trillion words
NVIDIA A100 SXM4 80 GB
Confident
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closedsource models. We provide a detailed description of our approach to
Open weights (restricted use)
United States of America
1000
Unreleased
Llama 2 license. can't use outputs to train models. https://github.com/meta-llama/llama/blob/main/LICENSE
Industry
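A minimal sketch (not from the source) of the two Llama 2-70B estimates in the note above: one from reported A100-80GB GPU hours at the assumed 40% utilization, one from 6ND. The constant name is ours.

```python
# Hardware-hours estimate vs. 6ND estimate for Llama 2-70B.
A100_PEAK_BF16 = 311.84e12   # FLOP/s, value used in the note

gpu_hours = 1_720_320        # reported GPU hours for the 70B model
hw_flop = A100_PEAK_BF16 * gpu_hours * 3600 * 0.40   # ~7.7e+23
nd_flop = 6 * 70e9 * 2e12                            # 8.4e+23

print(f"hardware: {hw_flop:.3e} FLOP")
print(f"6ND:      {nd_flop:.3e} FLOP")
```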
DeepSeek LLM 67B
Language
Chat
Language modeling/generation
Question answering
DeepSeek
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y.K. Li, Wenfeng Liang, Fangyun Lin, A.X. Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, T
2024-01-05
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
https://arxiv.org/abs/2401.02954, https://github.com/deepseek-ai/DeepSeek-LLM
67000000000.00
67B
8.04e+23
67B * 2T * 6 = 8.04e23
Unspecified unreleased
"We collect 2 trillion tokens for pre-training, primarily in Chinese and English." "We have gained valuable insights from reputable sources such as (Computer, 2023; Gao et al., 2020; Penedo et al., 2023; Touvron et al., 2023a)... We adopted an aggressive deduplication strategy, expanding the deduplication scope. Our analysis revealed that deduplicating the entire Common Crawl corpus results in higher removal of duplicate instances compared to deduplicating within a single dump"
2000000000000
"We collect 2 trillion tokens for pre-training, primarily in Chinese and English"
Confident
The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing ope
Open weights (restricted use)
China
Unreleased
repo with inference code and details, but no training code: https://github.com/deepseek-ai/deepseek-LLM/blob/main/LICENSE-MODEL
Industry
BlueLM 130B
Language
Chat
Language modeling/generation
Question answering
vivo AI lab
2023-11-02
https://baijiahao.baidu.com/s?id=1781445143383237948&wfr=spider&for=pc
130000000000.00
7.8e+23
6ND = 6*130B*1000B=7.8e+23
Unspecified unreleased
1000B text data, 10B image data, 100M video data, 100M knowledge graph (from the conference handout)
Confident
Unreleased
China
Unreleased
information about the model is from their paper catalogue and not found on the internet
Industry
Nemotron-4 15B
Language
Language modeling/generation
Code generation
Question answering
Translation
Quantitative reasoning
NVIDIA
Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, Annika Brundyn, James Maki, Miguel Martinez, Jiaxuan You, John Kamalu, Patrick LeGresley, Denys Fridman, Jared Casper, Ashwath Aithal, Oleksii Kuchaiev, Mohammad Shoeybi, Jonathan Cohen, Bryan Catanzaro
2024-02-27
Nemotron-4 15B Technical Report
https://arxiv.org/abs/2402.16819
15000000000.00
15b
7.5005116e+23
6ND = 6 FLOP/token/parameter * 15*10^9 parameters * 8*10^12 tokens = 7.2e+23 FLOP "Nemotron-4 was trained using 384 DGX H100 nodes; each node contains 8 H100 80GB SXM5 GPUs based on the NVIDIA Hopper architecture (NVIDIA, 2022). Each H100 GPU has a peak throughput of 989 teraFLOP/s when doing 16-bit floating point (bfloat16) arithmetic without sparsity. Table 2 reports more detailed training schedule: 989*10^12 FLOP/sec * 3600 sec/hour * 24 hours * (768 gpus * 0.343 [reported utilization] * 0
Unspecified unreleased
"At a high-level, the data blend is split into three different types of data: English natural language data (70%), multilingual natural language data (15%), and source-code data (15%)."
8000000000000
"15-billion-parameter large multilingual language model trained on 8 trillion text tokens"
NVIDIA H100 SXM5 80GB
Confident
We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and achieves competitive performance to the leading open models in the remaining ones. Specifically, Nemotron-4 15B exhibits the best multilingual capabilities of all similarly
Unreleased
United States of America
3072
Unreleased
Industry
Yi-1.5-34B
Language
Chat
Language modeling/generation
Translation
Code generation
01.AI
2024-05-13
Yi-1.5 is an upgraded version of Yi, delivering stronger performance in coding, math, reasoning, and instruction-following capability.
https://huggingface.co/01-ai/Yi-1.5-34B
34000000000.00
34b
7.344e+23
6*34*10^9*3.6*10^12 = 7.344e+23
Unspecified unreleased
assuming same as Yi 34 - Chinese and English dataset
3600000000000
500b "Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples." 3.6T total pre-trained tokens
Confident
Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples. Compared with Yi, Yi-1.5 delivers stronger performance in coding, math, reasoning, and instruction-following capability, while still maintaining excellent capabilities in language understanding, commonsense reasoning, and reading comprehension. Yi-1.5 comes in 3 model sizes: 34B, 9B, and 6B. For model details and benchmarks, see M
Open weights (restricted use)
China
Unreleased
no training code the model https://huggingface.co/01-ai/Yi-1.5-34B Apache 2.0 "If you create derivative works based on this model, please include the following attribution in your derivative works:"
Industry
SEA-LION V2 8B
Language
Language modeling/generation
AI Singapore
2024-07-30
SEA-LION V2
https://sea-lion.ai/our-models/
8000000000.00
"pretrained on top of the Llama3 base model that is 8 billion parameters"
7.23e+23
Llama 3 base model: 7.2e+23 FLOP. SEA-LION extended pre-training: 2 days * 24 * 60 * 60 s * 64 GPUs * 9.895e14 FLOP/s * 0.3 utilization = 3.28292352e+21 FLOP. Total: 7.2328292e+23 FLOP.
Dolma
Mix of datasets, mainly Dolma (see https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2-base#data)
48000000000
Llama3 8B CPT SEA-LIONv2 base model was continued pre-trained on 48B tokens
NVIDIA H100 SXM5 80GB
Unverified
SEA-LION version 2 has been continued-pretrained on top of the Llama3 base model that is 8 billion parameters in size
Open weights (unrestricted)
Llama 3-8B
64
Llama 3-8B
Language
Chat
Language modeling/generation
Code generation
Meta AI
Aaditya Singh; Aaron Grattafiori; Abhimanyu Dubey; Abhinav Jauhri; Abhinav Pandey; Abhishek Kadian; Adam Kelsey; Adi Gangidi; Ahmad Al-Dahle; Amit Sangani; Ahuva Goldstand; Aiesha Letman; Ajay Menon; Akhil Mathur; Alan Schelten; Alex Vaughan; Amy Yang; Andrei Lupu; Andres Alvarado; Andrew Gallagher; Andrew Gu; Andrew Ho; Andrew Poulton; Andrew Ryan; Angela Fan; Ankit Ramchandani; Anthony Hartshorn; Archi Mitra; Archie Sravankumar; Artem Korenev; Arun Rao; Ashley Gabriel; Ashwin Bharambe; Assaf E
2024-04-18
Introducing Meta Llama 3: The most capable openly available LLM to date
https://ai.meta.com/blog/meta-llama-3/
8000000000.00
7.2e+23
Counting operations: 15,000,000,000,000 tokens * 8,000,000,000 parameters * 6 = 7.2e+23. GPU calculation: 400 TFLOP/s per GPU * 1.3M GPU hours * 3600 s/hour = 1.872e+24 (it is not certain that 400 TFLOP/s applies to the Llama 3-8B training run).
Llama 3 dataset
15000000000000
NVIDIA H100 SXM5 80GB
Confident
Open weights (restricted use)
United States of America
Unreleased
https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md License A custom commercial license is available at: https://llama.meta.com/llama3/license
Industry
Gopher (280B)
Language
Language modeling
Question answering
DeepMind
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buch
2021-12-08
"Scaling Language Models: Methods, Analysis & Insights from Training Gopher"
https://arxiv.org/abs/2112.11446
1122.00
280000000000.00
Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across
6.31e+23
Table A26: 6.31E+08 train PFLOP = 6.31e+23 FLOP
MassiveText
300000000000
"We train all models for 300 billion tokens with a 2048 token context window, using the Adam (Kingma and Ba, 2014) optimiser." 1 token ~ 0.75 words
Google TPU v3
Confident
We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen Bert retriever, a dif
Unreleased
United Kingdom of Great Britain and Northern Ireland
4096
Unreleased
Industry
Reka Flash
Multimodal
Language
Vision
Chat
Language modeling/generation
Image captioning
Code generation
Code autocompletion
Reka AI
Aitor Ormazabal Che Zheng Cyprien de Masson d’Autume Dani Yogatama Deyu Fu Donovan Ong Eric Chen Eugenie Lamprecht Hai Pham Isaac Ong Kaloyan Aleksiev Lei Li Matthew Henderson Max Bain Mikel Artetxe Nishant Relan Piotr Padlewski Qi Liu Ren Chen Samuel Phua Yazheng Yang Yi Tay Yuqi Wang Zhongkai Zhu Zhihui Xie
2024-04-15
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
https://publications.reka.ai/reka-core-tech-report.pdf
21000000000.00
6.3e+23
Reka Flash has 21B parameters and was trained on 5 trillion language tokens: 6 * 21B * 5 trillion = 6.3e+23. This agrees with the GPU details: "Reka Flash and Edge were trained on several hundreds of H100s across a period of several weeks." 3 weeks * 7 days/week * 24 hours/day * 3600 s/hour * 300 H100s * 9.9e14 FLOP/GPU-s = 5.4e23. Not enough info to estimate SFT and RLHF post-training FLOP.
Unspecified unreleased
The training data comprises a mixture of publicly available and proprietary/licensed datasets with a dataset knowledge cutoff of November 2023. The dataset ingested by our model comprises of text, images, videos, and audio clips. Reka Flash and Reka Edge were trained on approximately 5 trillion and 4.5 trillion extensively deduplicated and filtered language tokens, respectively. While the classification of corpora is not strictly defined to one class or category, approximately 25% of our pretrai
5000000000000
NVIDIA A100
NVIDIA H100 SXM5 80GB
Likely
API access
United States of America
Unreleased
Industry
xTrimoPGLM -100B
Biology
Proteins
Protein or nucleotide language model (pLM/nLM)
Tsinghua University
BioMap Research
Bo Chen, Xingyi Cheng, Yangli-ao Geng, Shen Li, Xin Zeng, Boyan Wang, Jing Gong, Chiming Liu, Aohan Zeng, Yuxiao Dong, Jie Tang, Le Song
2023-07-06
xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
https://www.biorxiv.org/content/10.1101/2023.07.05.547496v4
65.00
100000000000.00
Abstract: "training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens"
6.2e+23
"xTrimoPGLM-100B is trained on a cluster of 96 DGX-A100 GPU (8×40G) servers in FP16 precision from January 18 to June 30, 2023. During this time, xTrimoPGLM-100B has consumed 1 trillion tokens from the dataset consisting of Uniref90 and ColAbFoldDB. As of the current date, xTrimoPGLM-100B continues its pre-training process to pass through as many tokens as possible" 6 * 100 billion params * 1T tokens = 6e23 8*96 * 312 trillion * 163 days * 24 * 3600 * 0.3 ~= 1e24 directly given in the paper (
UniRef50
~24M protein sequences
NVIDIA A100 SXM4 40 GB
Confident
Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. This paper proposes a unified protein language model, xTrimoPGLM, to address these two types of tasks simultaneously through an innovative pre-training framework. Our key technical contr
Unreleased
China
China
768
Unreleased
Academia
Industry
Yi-34B
Language
Chat
Language modeling/generation
Translation
Code generation
01.AI
Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai
2023-11-02
Yi: Open Foundation Models by 01.AI
https://arxiv.org/abs/2403.04652
34000000000.00
34b
6.1e+23
"The dataset we use contains Chinese & English only. We used approximately 3T tokens" sounds like this means it was trained on 3T tokens, not necessarily that the dataset contains 3T tokens? If so, 34b * 3T * 6 = 6.1e23
Unspecified unreleased
Chinese and English dataset "For pretraining, we construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline. For finetuning, we polish a small scale (less than 10K) instruction dataset over multiple iterations such that every single instance has been verified directly by our machine learning engineers. For vision-language, we combine the chat language model with a vision transformer encoder and train the model to align visual
3100000000000
"language models pretrained from scratch on 3.1T highly-engineered large amount of data, and finetuned on a small but meticulously polished alignment data."
NVIDIA A100
Confident
The Yi series models are large language models trained from scratch by developers at 01.AI.
Open weights (restricted use)
China
128
Unreleased
apply for commercial license: no training code https://github.com/01-ai/Yi/blob/main/MODEL_LICENSE_AGREEMENT.txt the model https://huggingface.co/01-ai/Yi-34B-Chat Apache 2.0 "If you create derivative works based on this model, please include the following attribution in your derivative works: ...."
Industry
Qwen2-VL-72B
Language
Vision
Multimodal
Visual question answering
Video description
Language modeling/generation
Translation
Question answering
Character recognition
Quantitative reasoning
Alibaba
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
2024-09-18
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
https://arxiv.org/abs/2409.12191
72000000000.00
72 billion (language model) and 675M (vision encoder)
6.048e+23
6ND = 6×1.4×10^12×7.2×10^10 = 6.048e+23
Unspecified unreleased
"The model is pre-trained on a diverse dataset that includes image-text pairs, optical character recognition (OCR) data, interleaved image-text articles, visual question answering datasets, video dialogues, and image knowledge datasets. Our data sources primarily comprise cleaned web pages, open-source datasets, and synthetic data. The cutoff date for our data knowledge is June 2023."
1400000000000
"Throughout the pre-training stages, Qwen2-VL processes a cumulative total of 1.4 trillion tokens. Specifically, these tokens encompass not only text tokens but also image tokens"
Likely
We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The mo
Open weights (unrestricted)
China
https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct
Industry
InternLM
Language
Language modeling
Shanghai AI Lab
SenseTime
2023-07-06
https://internlm.org/
100000000000.00
Pre-training a bilingual 100B Foundation model on data with over a trillion tokens
6.000001e+23
6 * 100b * 1t = 6e23
1000000000000
"Pre-training a bilingual 100B Foundation model on data with over a trillion tokens"
NVIDIA A100 SXM4 80 GB
Confident
Pre-training a bilingual 100B Foundation model on data with over a trillion tokens, the model exhibits excellent performance in scenarios such as Chinese, English, and coding due to the appropriate data ratio. Based on the foundation model, the application of high-quality human annotated dialogue data combined with RLHF technology enables the InternLM large language model to respond to complex commands during human interaction, while also demonstrating responses in line with human morality and v
China
Hong Kong
China
Academia
Industry
Granite 3.0 8B
Language
Language modeling/generation
Question answering
Translation
Text summarization
Text classification
Code generation
IBM
Granite Team IBM
2024-10-21
Granite 3.0 Language Models
https://github.com/ibm-granite/granite-3.0-language-models/tree/main
8100000000.00
8.1B
5.832e+23
6ND = 6 * 8.1*10^9 * 12*10^12 = 5.832e+23. "All our Granite 3.0 models are trained using a compute budget of 8.35 × 10^23 FLOPS." Apportioning by the model's share of reported power consumption: 8.35e+23 * 757.0 / (174.6 + 757.0 + 64.5 + 121.2) = 5.6573436e+23. Hardware estimation: 832,102 GPU hours * 3600 s/hour * 9.895e14 FLOP/s * 0.3 utilization = 8.8923412e+23 (see the worked example after this record).
Unspecified unreleased
Granite 3.0 language models are trained using data from various sources such as unstructured natural language text and code data from the Web curated by IBM, a collection of synthetic datasets generated by IBM, and publicly available high-quality datasets with permissible licenses.
12000000000000
12T tokens
NVIDIA H100 SXM5 80GB
Confident
This report presents Granite 3.0, a new set of lightweight, state-of-the-art, open foundation models ranging in scale from 400 million to 8 billion active parameters. Equipped with native support of multilingual, coding, function calling, and strong safety performance, these models target enterprise use cases, including on-premise and on-device settings. Evaluations on a comprehensive set of tasks demonstrate that our models consistently reach state-of-the-art performance for their size (as sho
Open weights (unrestricted)
United States of America
256
Unreleased
Apache 2.0 license https://huggingface.co/ibm-granite/granite-3.0-8b-instruct
Industry
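A minimal sketch (not from the source) of the Granite 3.0 8B apportionment above: IBM reports one compute budget for the whole family, which the note splits across the four models in proportion to their reported power consumption, alongside a 6ND estimate and a GPU-hour estimate (assumed 30% utilization). Variable names are ours.

```python
# Three estimates for Granite 3.0 8B: budget share, 6ND, and hardware hours.
family_budget = 8.35e23                    # FLOP, reported for all Granite 3.0 models
power_8b = 757.0                           # power figure attributed to the 8B model in the note
power_all = [174.6, 757.0, 64.5, 121.2]    # figures for the four models, per the note

share_8b = family_budget * power_8b / sum(power_all)   # ~5.66e+23
nd_8b = 6 * 8.1e9 * 12e12                              # 5.832e+23
hw_8b = 832_102 * 3600 * 989.5e12 * 0.3                # ~8.89e+23

for name, value in [("budget share", share_8b), ("6ND", nd_8b), ("hardware", hw_8b)]:
    print(f"{name:>12}: {value:.3e} FLOP")
```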
Chinchilla
Language
Language modeling
DeepMind
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre
2022-03-29
Training Compute-Optimal Large Language Models
https://arxiv.org/abs/2203.15556
1486.00
70000000000.00
"We test this hypothesis by training a predicted compute-optimal model, \chinchilla, that uses the same compute budget as \gopher but with 70B parameters and 4× more more data. \chinchilla uniformly and significantly outperforms \Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks."
5.76e+23
"Both Chinchilla and Gopher have been trained for the same number of FLOPs but differ in the size of the model and the number of training tokens." We see the number of flops in table 3
MassiveWeb
C4
MassiveWeb, Books, C4, News, Github, Wikipedia (Table A1)
1050000000000
Table 1 shows Chinchilla was trained on 1.4 trillion tokens; 1 token ≈ 0.75 words.
Google TPU v4
Google TPU v3
Confident
We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model s
Unreleased
United Kingdom of Great Britain and Northern Ireland
Unreleased
Industry
BIG-G 137B
Language
Language modeling/generation
Google
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmü
2022-06-09
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
https://arxiv.org/abs/2206.04615
1394.00
137000000000.00
137B. Table App.1
5.6e+23
"BIG-G models were trained at Google. We use 13 dense decoder-only Transformer models (Vaswani et al., 2017) with gated activation layers (Dauphin et al., 2017) and GELU activations based on the LaMDA architectures (Thoppilan et al., 2022). These models were trained on a dataset consisting of a mixture of web documents, code, dialog, and Wikipedia data, with approximately three billion documents tokenized to 2.8 trillion BPE tokens using a 32k-token SentencePiece vocabulary" Appendix: "We use
GLaM dataset
"These models were trained on a dataset consisting of a mixture of web documents, code, dialog, and Wikipedia data, with approximately three billion documents tokenized to 2.8 trillion BPE tokens using a 32k-token SentencePiece vocabulary"
681000000000
The full dataset comprises 2.8 trillion tokens, but a calculation based on batch size and steps suggests the model was trained on only 681 billion tokens.
Confident
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyon
Unreleased
United States of America
Unreleased
Industry
LLaMA-65B
Language
Language modeling
Code generation
Meta AI
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample
2023-02-24
LLaMA: Open and Efficient Foundation Language Models
https://arxiv.org/abs/2302.13971
8872.00
65200000000.00
Model card, table 1: https://github.com/facebookresearch/llama/blob/53011c3d7946dadb8274a4c5c7586ab54edf792d/MODEL_CARD.md
5.5e+23
1.4e12 tokens × 6.52e10 parameters × 6 FLOP/token/parameter = 5.5e23 FLOP. Compared to 2048 A100 GPUs, each with 311.84 TFLOP/s peak performance, running for 21 days, this implies 47% utilization (sketched in code after this record). https://www.wolframalpha.com/input?i=5.5*10%5E23+FLOP+%2F+%282048+*+311.84+teraFLOPS+*+21+days%29
CCNet
GitHub
Wikipedia
books
arXiv
Stack Exchange
The model was trained using the following source of data: CCNet [67%], C4 [15%], GitHub [4.5%], Wikipedia [4.5%], Books [4.5%], ArXiv [2.5%], Stack Exchange[2%]. The Wikipedia and Books domains include data in the following languages: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. See the paper for more details about the training set and corresponding preprocessing.
1340000000000
Table 1 indicates that the 1.4T training tokens involved sampling sub-datasets at more or less than one epoch. Correcting for this: (1.1 epochs × 3.3 TB) + (1.06 epochs × 0.783 TB) + ... = 5.24 epoch-TB for the 1.4T training tokens; 5.24 epoch-TB × 1000 GB/TB × 200M tokens/GB ≈ 1.05T epoch-tokens for the 1.4T training tokens; so 1 epoch ≈ 1.34T tokens.
NVIDIA A100
Confident
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the resea
Open weights (non-commercial)
United States of America
2048
Unreleased
"we are releasing our model under a noncommercial license focused on research use cases" https://ai.meta.com/blog/large-language-model-llama-meta-ai/
Industry
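A minimal Python sketch of the LLaMA-65B estimate above: the 6ND approximation and the implied hardware utilization, using only the figures quoted in the training compute notes (2048 A100s at 311.84 TFLOP/s peak for 21 days).

```python
# Illustrative: LLaMA-65B 6ND compute estimate and implied GPU utilization,
# using the figures quoted in the notes above.

tokens, params = 1.4e12, 6.52e10
flop = 6 * tokens * params                      # ~5.5e23 FLOP

a100_peak = 311.84e12                           # FLOP/s per A100 (quoted)
available = 2048 * a100_peak * 21 * 24 * 3600   # FLOP available in 21 days
print(f"{flop:.2e} FLOP, implied utilization {flop / available:.0%}")  # ~47%
```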
Guanaco-65B
Language
Chat
University of Washington
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
2023-05-23
QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314; https://github.com/artidoro/qlora
1578.00
65000000000.00
from Llama-65B (also 33B, 13B, 7B variants)
5.5e+23
Fine-tune of LLaMA-65B. The finetuning appears to have been run on a "professional grade GPU" with 48GB VRAM (likely an A6000) for 24 hours. Fine-tune compute is negligible compared to pretraining (5.5e23 FLOP for LLaMA-65B).
Fine-tuned on instruction datasets such as GLUE and Super-NaturalInstructions
Confident
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters~(LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only
Open weights (non-commercial)
United States of America
LLaMA-65B
8000000000B
"using a single professional GPU over 24 hours we achieve 99.3% with our largest model" no model specified, but if it's an A100, 312 tflop/s * 24 * 3600 * 0.3 utilization = 8e18
Open source
LLaMA license, non-commercial for weights. code is MIT code: https://github.com/artidoro/qlora/blob/main/scripts/finetune_guanaco_65b.sh
Academia
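A minimal Python sketch of the Guanaco-65B finetune-compute ballpark from the notes above, under that note's assumptions (an A100-class GPU at 312 TFLOP/s peak, 24 hours, 30% utilization); the actual GPU model is not specified in the paper.

```python
# Illustrative: QLoRA finetuning compute for Guanaco-65B under the
# assumptions stated in the notes above (GPU model not specified).

peak = 312e12              # FLOP/s, A100-class peak (assumed)
seconds = 24 * 3600        # 24 hours
utilization = 0.3
finetune_flop = peak * seconds * utilization    # ~8e18 FLOP
print(f"{finetune_flop:.1e} FLOP (negligible vs. ~5.5e23 for LLaMA-65B pretraining)")
```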
Code Llama-34B
Language
Code generation
Meta AI
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Ellen Tan, Yossef (Yossi) Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Defossez, Jade Copet, Faisal Azhar, Hugo Touvron, Gabriel Synnaeve, Louis Martin, Nicolas Usunier, Thomas Scialom
2023-08-14
Code Llama: Open Foundation Models for Code
https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/ https://arxiv.org/abs/2308.12950
1297.00
34000000000.00
34B
5.3e+23
1.22e23 finetune compute, or ~5.3e23 including Llama-2 34B base compute. See finetune compute notes for calculation.
Unspecified unreleased
"As shown in Table 1, Code Llama is trained predominantly on a near-deduplicated dataset of publicly available code. We also source 8% of our samples data from natural language datasets related to code. This dataset contains many discussions about code and code snippets included in natural language questions or answers. To help the model retain natural language understanding skills, we also sample a small proportion of our batches from a natural language dataset"
600000000000
Llama 2 used 2T tokens, and "We train Code Llama on 500B additional tokens and Code Llama - Python further on 100B tokens". 2T + 500B + 100B = 2.6T tokens in total.
NVIDIA A100 SXM4 80 GB
Confident
We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters
Open weights (restricted use)
United States of America
Llama 2-34B
122000000000000B
Training the nine Code Llama models took 400k A100-hours across all the models, per model card. It's nine models because there are three base models at 7B, 13B, 34B, and then Instruct and Python models across all three sizes. I'll calculate for Code Llama Python-34B since it's the most trained. Code Llama-base is trained from Llama 2 with 500B tokens: "We train Code Llama on 500B tokens during the initial phase, starting from the 7B, 13B, and 34B versions of Llama 2" Code Llama-Python required
Unreleased
Llama 2 license. can't use outputs to train models. https://github.com/meta-llama/llama/blob/main/LICENSE
Industry
MM1-30B
Multimodal
Language
Vision
Chat
Image captioning
Apple
Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang
2024-03-14
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
https://arxiv.org/abs/2403.09611
122.00
30000000000.00
30B
4.86e+23
Pre-trained on ~2B image-text pairs and 2T tokens (Table 2). Each image is 144 tokens, so the images are ~300B tokens. Then additional multimodal training for 400B tokens, for a total of ~2.7T tokens. This is the final training recipe: "We initialize both the image encoder and the underlying LLM decoder weights for MM1 from in-house pre-trained models. We then perform multimodal pre-training on the above data mix for 200k steps (approx. 400B tokens)." Compute = 6ND = 6 × 2.7 trillion tokens × 30 billion parameters ≈ 4.86e23 FLOP.
Conceptual Captions (CC3M)
Conceptual Captions 12M (CC12M)
COYO-700M
Unspecified unreleased
OBELICS
Text, captioned images. See Table 2
1500000000000
at least 2T tokens
Likely
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and
Unreleased
United States of America
Unreleased
Industry
PanGu-Σ
Language
Code generation
Language modeling
Translation
Question answering
Huawei Noah's Ark Lab
Xiaozhe Ren, Pingyi Zhou, Xinfan Meng, Xinjing Huang, Yadao Wang, Weichao Wang, Pengfei Li, Xiaoda Zhang, Alexander Podolskiy, Grigory Arshinov, Andrey Bout, Irina Piontkovskaya, Jiansheng Wei, Xin Jiang, Teng Su, Qun Liu, Jun Yao
2023-03-20
PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing
https://arxiv.org/abs/2303.10845
48.00
1085000000000.00
"In this work, we present PanGu-Σ , a large language model with sparse architecture containing 1.085 trillion parameters."
4.669999999999999e+23
It has a sparse architecture, so we can't use C = 6ND. "We develop PanGu-Σ model under the framework of MindSpore and train it on a cluster with only 512 Ascend 910 AI Accelerators with 329 billion tokens over 100 days." 100 days × 512 processors × 320 TFLOP/s per processor × 33% utilization = 4.67e+23 FLOP (sketched in code after this record). https://www.wolframalpha.com/input?i=100+days+*+512+*+320+terahertz+*+0.33
"329B tokens in more than 40 natural and programming languages"
246750000000
329B tokens ~= 247B words
Huawei Ascend 910
Confident
The scaling of large language models has greatly improved natural language understanding, generation, and reasoning. In this work, we develop a system that trained a trillion-parameter language model on a cluster of Ascend 910 AI processors and MindSpore framework, and present the language model with 1.085T parameters named PanGu-{\Sigma}. With parameter inherent from PanGu-{\alpha}, we extend the dense Transformer model to sparse one with Random Routed Experts (RRE), and efficiently train the m
Unreleased
China
512
Unreleased
Industry
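A minimal Python sketch of the hardware-based PanGu-Σ estimate above (6ND is not used because the architecture is sparse); the 320 TFLOP/s Ascend 910 peak and 33% utilization are the assumptions carried over from the note.

```python
# Illustrative: PanGu-Σ hardware-time compute estimate from the notes above.

chips = 512                 # Ascend 910 accelerators
peak = 320e12               # FLOP/s per chip (assumed peak)
days = 100
utilization = 0.33
flop = chips * peak * days * 24 * 3600 * utilization   # ~4.67e23 FLOP
print(f"{flop:.2e}")
```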
OLMo 2 Furious 13B
Language
Language modeling/generation
Question answering
Allen Institute for AI
University of Washington
New York University (NYU)
Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg,
2024-12-31
2 OLMo 2 Furious
https://arxiv.org/abs/2501.00656
13000000000.00
13B
4.600000000000001e+23
4.6e23 FLOP (Table 6; the developers calculated this using the 6ND formula).
OLMo-Mix-1124
Dolmino-Mix-1124
Tulu 3
4000000000000
Pretraining Stage 1 (OLMo-Mix-1124): 5 trillion tokens (= 1.2 epochs). Pretraining Stage 2 (Dolmino-Mix-1124): 100B tokens (3 runs) and 300B tokens (1 run), merged. Post-training (Tulu 3 SFT OLMo mix): SFT + DPO + PPO (preference mix).
NVIDIA H100 SXM5 80GB
Confident
We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes dense autoregressive models with improved architecture and training recipe, pretraining data mixtures, and instruction tuning recipes. Our modified model architecture and training recipe achieve both better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities
Open weights (unrestricted)
United States of America
United States of America
United States of America
Open source
apache 2 https://huggingface.co/allenai/OLMo-2-1124-13B https://github.com/allenai/OLMo
Research collective
Academia
Academia
AFM-on-device
Language
Language modeling/generation
Apple
Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chong Wang, Chung-Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek, Ruoming Pang, Sam Wiseman, Syd Evans, Tao Lei, Tom Gunter, Vivek Rathod, Xiang Kong, Xianzhi Du, Yanghao Li, Yongqiang Wang, Yuan Gao, Zaid Ahmed, Zhaoyang Xu, Zhiyun Lu, Zirui Wang, Al Rashid, Albin Madappally Jose, Ale
2024-07-29
Apple Intelligence Foundation Language Models
https://machinelearning.apple.com/research/apple-intelligence-foundation-language-models
2730000000.00
Table 1, sum of non-embedding and embedding parameters
4.5126e+23
Model was initialized from a pruned version of a 6.4B parameter model trained using the same recipe as AFM-server. Assuming "same recipe" involves training for the full 6.3T tokens, this implies 6 * 6.3T * 6.4B = 2.42e23 FLOP. The pruning masks are learned by training over 188B tokens, which suggests 6 * 188B * 6.4B = 7.22e21 FLOPs. Pretraining is then run over 6.3T tokens; however, labels are a convex combination of true labels and the predicted labels from the unpruned 6.4B model. Since thi
188B of tokens are used to train a pruning mask to reduce a 6.4B model to the 2.73B used for AFM-on-device. Main pre-training data is 6.3T tokens of web text, code, and math, plus another 1T in the second pre-training stage and 100B in the third. See section 3.1 for details. Post-training details do not give details on dataset size.
7588000000000
Not explicitly mentioned, but I assume the 7.588T tokens do not involve multiple epochs.
Google TPU v5p
Confident
We present foundation language models developed to power Apple Intelligence features, including a ∼3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute [Apple, 2024b]. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference
Hosted access (no API)
United States of America
2048
Unreleased
Industry
SEA-LION V3 Gemma2 9B
Language
Language modeling/generation
AI Singapore
2024-12-19
SEA-LION V3
https://huggingface.co/aisingapore/gemma2-9b-cpt-sea-lionv3-base
9000000000.00
4.484146e+23
Gemma 2 9B base model: 4.32e+23 FLOP. Additional continued pretraining: 10 days × 24 × 60 × 60 s × 989.5e12 FLOP/s × 64 GPUs × 0.3 utilization = 1.64146e+22 FLOP. Total: 1.64146e+22 + 4.32e+23 = 4.484146e+23 FLOP (sketched in code after this record).
The Stack v2
Dolma
Trained on a mix of datasets including StackV2 and Dolma (see https://huggingface.co/aisingapore/gemma2-9b-cpt-sea-lionv3-base#data)
200000000000
"pre-trained on 200B tokens"
NVIDIA H100 SXM5 80GB
Unverified
Our SEA-LIONv3 Gemma2-9B base model has been continued pre-trained on top of the Gemma2 base model that is 9 billion parameters in size, and has a context length of 8192.
Open weights (unrestricted)
Gemma 2 9B
64
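A minimal Python sketch of the SEA-LION V3 Gemma2-9B estimate above: base-model compute plus continued pretraining, with the note's assumed 10 days on 64 H100s at 989.5 TFLOP/s peak and 30% utilization.

```python
# Illustrative: SEA-LION V3 Gemma2-9B total compute = Gemma 2 9B base compute
# plus continued pretraining, per the assumptions in the notes above.

base_flop = 4.32e23                                   # Gemma 2 9B pretraining
cpt_flop = 10 * 24 * 3600 * 989.5e12 * 64 * 0.3       # ~1.64e22 FLOP
print(f"{cpt_flop:.3e} continued pretraining, {base_flop + cpt_flop:.3e} total")
```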
Pharia-1-LLM-7B
Language
Language modeling/generation
Translation
Question answering
Aleph Alpha
2024-08-26
Introducing Pharia-1-LLM: transparent and compliant
https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control
7041544704.00
4.4299999999999995e+23
reported by the authors: 2.75*10^23 + 1.68*10^23 = 4.43*10^23 FLOP https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control#compute--training-efficiency
Common Crawl
The training data of our models comprises two components: web-crawled data and structured datasets with a total size of 7.7T, with a cutoff date 04/2023. We performed some additional web scraping to augment these datasets. Web-crawled data was obtained by filtering and deduplicating data available in public datasets, derived from Common Crawl, in the following languages: English, French, German, Italian, Spanish, Dutch, Portuguese. To deduplicate the data, we applied a Bloomfilter for exact do
7700000000000
4.7T + 3T = 7.7T tokens
NVIDIA A100 SXM4 80 GB
NVIDIA H100 SXM5 80GB
Confident
We are pleased to announce our new foundation model family that includes Pharia-1-LLM-7B-control and Pharia-1-LLM-7B-control-aligned, now publicly available under the Open Aleph License, which explicitly allows for non-commercial research and educational use. Pharia-1-LLM-7B-control is engineered to deliver concise, length-controlled responses that match the performance of leading open-source models in the 7B to 8B parameter range and is culturally and linguistically optimized for German, French
Open weights (non-commercial)
Germany
256
Industry
Gemma 2 9B
Language
Language modeling/generation
Chat
Code generation
Question answering
Quantitative reasoning
Google DeepMind
Gemma Team, Google DeepMind
2024-06-24
Gemma 2 offers best-in-class performance, runs at incredible speed across different hardware and easily integrates with other AI tools.
https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf
9000000000.00
4.32e+23
"For the 9B model, we train on an 8x16x32 configuration of TPUv4, totaling 4096 chips" 6ND = 6*9000000000*8000000000000=4.32e+23
Unspecified unreleased
Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. Primarily English-language content. Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code or understand code-related questions. Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical quer
8000000000000
"the 9B model on 8 trillion tokens"
Google TPU v4
Confident
Now we’re officially releasing Gemma 2 to researchers and developers globally. Available in both 9 billion (9B) and 27 billion (27B) parameter sizes, Gemma 2 is higher-performing and more efficient at inference than the first generation, with significant safety advancements built in. In fact, at 27B, it offers competitive alternatives to models more than twice its size, delivering the kind of performance that was only possible with proprietary models as recently as December. And that’s now achie
Open weights (restricted use)
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
4096
Unreleased
Gemma 2 is available under our commercially-friendly Gemma license, giving developers and researchers the ability to share and commercialize their innovations.
Industry
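A minimal Python sketch of the 6ND approximation used for Gemma 2 9B above (9B parameters, 8T tokens, as quoted in the notes).

```python
# Illustrative: Gemma 2 9B compute via the 6ND approximation.

params, tokens = 9e9, 8e12
print(f"{6 * params * tokens:.2e} FLOP")   # 4.32e+23
```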
OPT-175B
Language
Language modeling
Chat
Language modeling/generation
Question answering
Meta AI
Susan Zhang∗ , Stephen Roller∗ , Naman Goyal∗ , Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott† , Sam Shleifer† , Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer
2022-05-02
OPT: Open Pre-trained Transformer Language Models
https://arxiv.org/abs/2205.01068
2932.00
175000000000.00
"In line with Meta AI’s commitment to open science, we are sharing Open Pretrained Transformer (OPT-175B), a language model with 175 billion parameters trained on publicly available data sets"
4.3e+23
https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/final_update.md "As of yesterday, at 12:46pm PST on January 6, our 175B model finally completed its training run on 300B tokens. This required ~4.30E+23 FLOPs of compute"
The Pile
BookCorpus (BooksCorpus, Toronto Book Corpus)
CC-Stories
Pushshift Reddit
"The pre-training corpus contains a concatenation of datasets used in RoBERTa (Liu et al., 2019b), the Pile (Gao et al., 2021a), and PushShift.io Reddit (Baumgartner et al., 2020; Roller et al., 2021)" ... "RoBERTa We included the BookCorpus (Zhu et al., 2015) and Stories (Trinh and Le, 2018) subsets of the RoBERTa corpus and utilized an updated version of CCNews, containing news stories crawled through September 28, 2021. This CCNews v2 corpus was preprocessed the same way as the original RoBER
180000000000
"The training data contains 180B tokens corresponding to 800 GB of data" 1 token ~ 0.75 words
NVIDIA A100 SXM4 80 GB
Confident
Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M t
Open weights (non-commercial)
United States of America
1024
Open source
non-commercial for weights: https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/MODEL_LICENSE.md training code (MIT) https://github.com/facebookresearch/metaseq/blob/main/docs/training.md
Industry
OPT-IML (175B)
Language
Language modeling
Meta AI
Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, Xian Li, Brian O'Horo, Gabriel Pereyra, Jeff Wang, Christopher Dewan, Asli Celikyilmaz, Luke Zettlemoyer, Ves Stoyanov
2022-12-22
OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
https://arxiv.org/abs/2212.12017
236.00
175000000000.00
4.3e+23
Fine-tuned from OPT-175B (4.3e23 FLOP), with an estimated 2.1e21 FLOP for fine-tuning. "During fine-tuning, our models saw approximately 2 billion tokens, which is only 0.6% of the pre-training budget of OPT"
OPT-IML Bench
(fine-tuning dataset) "To this end, we create OPT-IML Bench: a large benchmark for Instruction MetaLearning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks"
2000000000
"During fine-tuning, our models saw approximately 2 billion tokens, which is only 0.6% of the pre-training budget of OPT"
NVIDIA A100
Likely
Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and w
Open weights (non-commercial)
United States of America
OPT-175B
2100000000000B
"We fine-tune all 30B models on 64 40GB A100s, and 175B models on 128 40GB A100s", no timeframe specified fine-tuned on 2B tokens. 2B * 175B * 6 = 2.1e21
128
Unreleased
unclear license https://huggingface.co/facebook/opt-iml-30b
Industry
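A minimal Python sketch of the OPT-IML (175B) finetune estimate above: 6ND over the ~2B finetuning tokens, added to the OPT-175B pretraining compute.

```python
# Illustrative: OPT-IML 175B finetuning compute per the notes above.

finetune_flop = 6 * 2e9 * 175e9        # ~2.1e21 FLOP
total_flop = 4.3e23 + finetune_flop    # dominated by OPT-175B pretraining
print(f"{finetune_flop:.1e} finetune FLOP, {total_flop:.2e} total FLOP")
```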
BlenderBot 3
Language
Chat
McGill University
Meta AI
Mila - Quebec AI (originally Montreal Institute for Learning Algorithms)
Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, Morteza Behrooz, William Ngan, Spencer Poff, Naman Goyal, Arthur Szlam, Y-Lan Boureau, Melanie Kambadur, Jason Weston
2022-08-10
BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage
https://arxiv.org/abs/2208.03188, https://github.com/facebookresearch/ParlAI/blob/main/parlai/zoo/bb3/model_card.md training code: https://parl.ai/projects/bb3/
218.00
175000000000.00
4.3e+23
(taken from OPT-175 base)
BlenderBot 3 Data
Fine-tuned from OPT-175B. "The fine-tuning data for BB3 comprises roughly 4 million source/target examples spread across the various training modules. This corresponds to around 1.13B training tokens. When fine-tuning the OPT-based BB3 models, we additionally included 600k examples (~170m tokens) of pre-training data to help with training stability. Table 16 and Table 17 enumerate the breakdown by module."
1300000000
NVIDIA A100 SXM4 40 GB
Likely
We present BlenderBot 3, a 175B parameter dialogue model capable of open-domain conversation with access to the internet and a long-term memory, and having been trained on a large number of user defined tasks. We release both the model weights and code, and have also deployed the model on a public web page to interact with organic users. This technical report describes how the model was built (architecture, model and training scheme), and details of its deployment, including safety mechanisms. H
Open weights (non-commercial)
Canada
United States of America
Canada
OPT-175B
1500000000000B
"The 30B and 175B parameter BlenderBot 3 models were each trained for one epoch of the training data on 64 (30B) or 128 (175B) x 40gb A100 GPUs; we found that the model (especially the 175B version) overfit significantly when seeing the training data more than once. The 175B model was trained with a batch size of 2^18 and the 30B model was trained with a batch size of 2^19, resulting in roughly 5600 updates and 2800 updates respectively." 175b params * 5600 * 2^18 * 6 = 1.5e21
128
Open source
weights have a non-commercial license, must go through request form: https://docs.google.com/forms/d/e/1FAIpQLSfRzw8xVzxaxgRyuodTZtkcYADAjzYjN5gcxx6DMa4XaGwwhQ/viewform meanwhile training code is here. repo is MIT-licensed https://github.com/facebookresearch/ParlAI/blob/main/parlai/scripts/train_model.py
Academia
Industry
Academia
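A minimal Python sketch of the BlenderBot 3 (175B) finetune estimate above, recovering the token count from the reported ~5600 updates at a batch size of 2^18 tokens.

```python
# Illustrative: BlenderBot 3 (175B) finetuning compute from updates x batch size.

params = 175e9
tokens = 5600 * 2**18                  # ~1.47e9 finetuning tokens
flop = 6 * params * tokens             # ~1.5e21 FLOP
print(f"{tokens:.2e} tokens, {flop:.1e} FLOP")
```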
EXAONE 3.5-R 7.8B
Language
Language modeling/generation
Question answering
Translation
LG AI Research
2025-03-14
7800000000.00
7.8B
4.2568e+23
4.21 × 10^23 (base model reported training compute) + 4.68 × 10^21 (finetune compute) = 4.2568e+23 FLOP
Confident
Unreleased
Korea (Republic of)
EXAONE 3.5 7.8B
4680000000000B
4.68e21
Unreleased
Industry
EXAONE Deep 7.8B
Language
Language modeling/generation
Question answering
Quantitative reasoning
Code generation
LG AI Research
LG AI Research, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun
2025-03-16
EXAONE Deep: LLMs with Enhanced Reasoning Performance
https://arxiv.org/abs/2503.12524
7800000000.00
7.8B
4.23e+23
4.21 × 10^23 (base model reported training compute) + 1.71 × 10^21 (finetune compute) = 4.23 × 10^23 FLOP Table 1
Unspecified unreleased
12000000000
"To enhance the reasoning capabilities of language models, we have utilized 1.6M instances for SFT and 20K instances of preference data for DPO. The SFT dataset contains approximately 12B tokens"
NVIDIA H100 SXM5 80GB
Confident
We present EXAONE Deep series, which exhibits superior capabilities in various reasoning tasks, including math and coding benchmarks. We train our models mainly on the reasoning-specialized dataset that incorporates long streams of thought processes. Evaluation results show that our smaller models, EXAONE Deep 2.4B and 7.8B, outperform other models of comparable size, while the largest model, EXAONE Deep 32B, demonstrates competitive performance against leading open-weight models. All EXAONE Dee
Open weights (non-commercial)
Korea (Republic of)
EXAONE 3.5 7.8B
1710000000000B
Table 1 (reported): 1.71 × 10^21 FLOP 6ND = 6*7.8B parameters * 12B tokens = 5.616e+20 FLOP
Unreleased
https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-7.8B Exaone License
Industry
EXAONE 3.5 7.8B
Language
Language modeling/generation
Question answering
Translation
LG AI Research
Soyoung An, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Yountae Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Woohyung Lim, Sangha Park, Sooyoun Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Hyeongu Yun
2024-12-09
EXAONE 3.5: Series of Large Language Models for Real-world Use Cases
https://arxiv.org/abs/2412.04862
7800000000.00
7.8B
4.209999999999999e+23
4.21 × 10^23 (Table 2)
Unspecified unreleased
9000000000000
9T tokens (Table 2)
Confident
This technical report introduces the EXAONE 3.5 instruction-tuned language models, developed and released by LG AI Research. The EXAONE 3.5 language models are offered in three configurations: 32B, 7.8B, and 2.4B. These models feature several standout capabilities: 1) exceptional instruction following capabilities in real-world scenarios, achieving the highest scores across seven benchmarks, 2) outstanding long-context comprehension, attaining the top performance in four benchmarks, and 3) compe
Open weights (non-commercial)
Korea (Republic of)
Unreleased
Exaone license (allows only non-commercial usage)
Industry
BlueLM 70B
Language
Chat
Language modeling/generation
Question answering
vivo AI lab
2023-11-02
https://baijiahao.baidu.com/s?id=1781445143383237948&wfr=spider&for=pc
70000000000.00
4.1999999999999996e+23
6ND = 6*70B*1000B=4.2e+23
Unspecified unreleased
1000B text data; 10B image data; 100M video data; 100M knowledge graph (from the conference handout).
Confident
Unreleased
China
Unreleased
Information about the model is from their paper catalogue and is not found on the internet.
Industry
Llama 2-34B
Language
Language modeling
Meta AI
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Mar
2023-07-18
Llama 2: Open Foundation and Fine-Tuned Chat Models
https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/ https://arxiv.org/abs/2307.09288
8056.00
34000000000.00
Llama 2 was trained in 7B, 13B, 34B, and 70B variants; the 34B variant was not publicly released.
4.08e+23
All model sizes were trained on 2.0T tokens, per Table 1: 2e12 tokens × 34e9 parameters × 6 = 4.08e23 FLOP. The model was also trained on 1,038,336 A100-hours, which corresponds to 3.5e23 FLOP at 30% utilization, so the utilization was probably around 35% (sketched in code after this record).
Llama 2 dataset
2 trillion tokens of publicly available text, with no text from Meta's products. "Our training corpus includes a new mix of data from publicly available sources, which does not include data from Meta’s products or services. We made an effort to remove data from certain sites known to contain a high volume of personal information about private individuals. We trained on 2 trillion tokens of data as this provides a good performance–cost trade-off, up-sampling the most factual sources in an effort
NVIDIA A100 SXM4 80 GB
Confident
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closedsource models. We provide a detailed description of our approach to
Unreleased
United States of America
Unreleased
Industry
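A minimal Python sketch of the Llama 2-34B estimate above: 6ND over 2T tokens plus the utilization cross-check against the reported 1,038,336 A100-hours (312 TFLOP/s peak assumed).

```python
# Illustrative: Llama 2-34B 6ND estimate and implied utilization from the
# reported A100-hours, per the notes above.

flop_6nd = 6 * 2e12 * 34e9                    # ~4.08e23 FLOP
available = 1_038_336 * 3600 * 312e12         # FLOP available at peak
print(f"{flop_6nd:.2e} FLOP, implied utilization {flop_6nd / available:.0%}")  # ~35%
```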
phi-3-medium 14B
Language
Chat
Language modeling/generation
Microsoft
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Dan Iter, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin
2024-04-23
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
https://arxiv.org/abs/2404.14219
14000000000.00
14B
4.032e+23
Counting operations: 6 × 4.8e12 tokens × 14e9 parameters ≈ 4.032e23 FLOP.
Phi-3 Dataset
we also trained phi-3-medium, a model with 14B parameters using the same tokenizer and architecture of phi-3-mini, and trained on the same data for slightly more epochs (4.8T tokens total as for phi-3-small)
4800000000000
Likely
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data a
Open weights (unrestricted)
United States of America
Unreleased
MIT license for weights: https://huggingface.co/microsoft/Phi-3-medium-128k-instruct
Industry
EXAONE 3.0
Language
Language modeling/generation
Code generation
Question answering
LG AI Research
LG AI Research: Soyoung An, Kyunghoon Bae, Eunbi Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Yeonjung Hong, Junwon Hwang, Hyojin Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Yountae Jung, Euisoon Kim, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Moontae Lee, Seungjun Lee, Woohyung Lim, Sangha Park, Sooyoun Park, Yongmin Park, Boseong Seo, Sihoon Yang, Heuiyeen Y
2024-08-07
EXAONE 3.0 7.8B Instruction Tuned Language Model
https://arxiv.org/abs/2408.03541
7820000000.00
7.8B
4.0000000000000003e+23
6ND = 6 × 7.8B parameters × 8T tokens = 3.744e+23 FLOP. "EXAONE language models were trained using Google Cloud Platform and a cluster powered by NVIDIA H100 GPUs and NVIDIA NeMo Framework. Then, they were optimized by NVIDIA TensorRT-LLM. The total amount of computation used for model training was about 4 × 10^23 FLOPS"
Unspecified unreleased
8T training data (tokens); the token-per-word ratio is 2.46, as given in the paper.
8000000000000
NVIDIA H100 SXM5 80GB
Confident
We introduce EXAONE 3.0 instruction-tuned language model, the first open model in the family of Large Language Models (LLMs) developed by LG AI Research. Among different model sizes, we publicly release the 7.8B instruction-tuned model to promote open research and innovations. Through extensive evaluations across a wide range of public and in-house benchmarks, EXAONE 3.0 demonstrates highly competitive real-world performance with instruction-following capability against other state-of-the-art op
Open weights (non-commercial)
Korea (Republic of)
https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct Exaone license (allows only non-commercial usage)
Industry
Parti
Image generation
Text-to-image
Image generation
Google Research
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu
2022-06-22
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
https://arxiv.org/abs/2206.10789v1
880.00
20000000000.00
Abstract: "we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters"
3.962895376192635e+23
Calculated from architecture. Does not take into account the encoding and decoding of text and images, only the transformer stack. Table 1 shows, for the 20B model: 16 encoder layers, 64 decoder layers, d_model = 4096, d_hidden = 16384, 64 attention heads. Just below Table 1: "We use a maximum length of text tokens of 128, and the length of image tokens are fixed to 1024". I take the sequence length to be 100 for the encoder stack and 1024 for the decoder stack. Section 3, Training: "a total of 450
LAION-400M
FIT400M
JFT-4B
4800000000
Google TPU v4
We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language
Unreleased
Multinational
United States of America
Unreleased
"For these reasons, we have decided not to release our Parti models, code, or data for public use without further safeguards in place" https://sites.research.google/parti/
Industry
DeepSeek Coder 33B
Language
Code generation
DeepSeek
Peking University
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y.K. Li, Fuli Luo, Yingfei Xiong, Wenfeng Liang
2024-01-25
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
https://arxiv.org/abs/2401.14196
33000000000.00
33B
3.96e+23
"Step 1: Initially pre-trained with a dataset consisting of 87% code, 10% code-related language (Github Markdown and StackExchange), and 3% non-code-related Chinese language. Models are pre-trained using 1.8T tokens and a 4K window size in this step. Step 2: Further Pre-training using an extended 16K window size on an additional 200B tokens, resulting in foundational models (DeepSeek-Coder-Base). Step 3: Instruction Fine-tuning on 2B tokens of instruction data, resulting in instruction-tuned mod
"Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages."
2000000000000
"Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages." "The total data volume is 798 GB with 603 million files."
Likely
The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window
Open weights (restricted use)
China
China
Unreleased
code doesn't seem to be training code. deepseek license: https://github.com/deepseek-ai/DeepSeek-Coder/blob/main/LICENSE-MODEL
Industry
Academia
StarCoder 2 15B
Language
Code generation
Code autocompletion
Hugging Face
ServiceNow
NVIDIA
BigCode
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muenn
2024-02-29
StarCoder 2 and The Stack v2: The Next Generation
https://arxiv.org/abs/2402.19173
15000000000.00
15B
3.87e+23
estimation is given in Table 6
The Stack v2
See Table 4. The Stack V2 plus some extras. Created from repositorites from Github with permissive licences.
913230000000
from Table 4
Confident
The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a
Open weights (restricted use)
Multinational
United States of America
United States of America
United States of America
Unreleased
commercial use allowed, but various use cases restricted: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement code is fine-tune only: https://github.com/bigcode-project/starcoder2?tab=readme-ov-file#fine-tuning
Industry
Industry
Industry
FunSearch
Language
Search
Code generation
Google DeepMind
Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, Alhussein Fawzi
2023-12-14
Mathematical discoveries from program search with large language models
https://www.nature.com/articles/s41586-023-06924-6 https://deepmind.google/discover/blog/funsearch-making-new-discoveries-in-mathematical-sciences-using-large-language-models/
170.00
15000000000.00
From the section called "Pretrained LLM": "We use Codey, an LLM built on top of the PaLM2 model family... Because FunSearch relies on sampling from an LLM extensively, an important performance-defining tradeoff is between the quality of the samples and the inference speed of the LLM. In practice, we have chosen to work with a fast-inference model (rather than slower-inference, higher-quality)" Unclear which PaLM2 model was used (of Gecko, Otter, Bison, and Unicorn); above quote indicates it was
3.87e+23
Appendix A.5: "Finding the full-sized symmetric admissible set I(15, 10) required the generation and analysis of approximately two million programs... To reproduce admissible set experiments done above (generating 2 million samples) one would have to use 15 instances of StarCoder-15B running on A100 40 GB GPU each and 5 CPU servers (each running 32 evaluators in parallel) for two days. We estimate that when running on Google Cloud, the price of an experiment is around $800 – $1400, and the energ
"The experiments carried out in this paper do not require any data corpus other than the publicly available OR-Library bin packing benchmarks"
0
"The experiments carried out in this paper do not require any data corpus other than the publicly available OR-Library bin packing benchmarks"
Speculative
Large language models (LLMs) have demonstrated tremendous capabilities in solving complex tasks, from quantitative reasoning to understanding natural language. However, LLMs sometimes suffer from confabulations (or hallucinations), which can result in them making plausible but incorrect statements1,2. This hinders the use of current large models in scientific discovery. Here we introduce FunSearch (short for searching in the function space), an evolutionary procedure based on pairing a pretraine
Open weights (unrestricted)
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
PaLM 2
0B
No finetuning
Unreleased
Code to run FunSearch with an LLM of your choice is open source under Apache 2.0 (software) and CC-BY (all else). However, the actual LLM used in the main experiments is unknown and may or may not be one of the Codey models available via API access. (in other words code is available for the search tool but not for the model): https://github.com/google-deepmind/funsearch
Industry
Multi-Token Prediction 7B
Language
Code generation
Facebook AI Research
Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve
2024-04-30
Better & Faster Large Language Models via Multi-token Prediction
https://arxiv.org/abs/2404.19737
6700000000.00
6.7B (“7B”)
3.841092e+23
"training all models reported in the paper required around 500K GPU hours of computation on hardware of type A100-80GB and H100." A100-80 GB peak FLOP/s [assumed fp16 precision]: 77970000000000 H100 peak FLOP/s [assumed SXM5 TensorCore]: 989000000000000 assuming 50/50 usage: (77970000000000+989000000000000)*0.5*500000hours*3600s*0.3=2.880819e+23 for ALL models in the paper assuming this model has taken around 40% of all used compute (https://docs.google.com/spreadsheets/d/1Yc-HAdYgn6e9SUIliMaQ
CodeContests
Unspecified unreleased
250000000000
1T total tokens over 4 epochs (Table 1)
NVIDIA A100 SXM4 80 GB
NVIDIA H100 SXM5 80GB
Likely
Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved do
Open weights (non-commercial)
United States of America
Unreleased
https://huggingface.co/facebook/multi-token-prediction "we’re releasing the pre-trained models for code completion under a non-commercial/research-only license."
Industry
Arctic
Language
Language modeling/generation
Question answering
Code generation
Quantitative reasoning
Snowflake
Snowflake AI Research
2024-04-24
Snowflake Arctic: The Best LLM for Enterprise AI — Efficiently Intelligent, Truly Open
https://www.snowflake.com/en/blog/arctic-open-efficient-foundation-language-models-snowflake/
480000000000.00
" It combines a 10B dense transformer model with a residual 128x3.66B MoE MLP resulting in 480B total and 17B active parameters chosen using a top-2 gating."
3.8347175e+23
From the blog-post graph (compute relative to Arctic = 1x): 1.9x ≈ Llama 3 8B (7.2e23) ≈ Yi 34B (6.1e23) -> x = 6.1e23 / 1.9 = 3.2105263e+23; 3x ≈ Code Llama 70B (1.26e+24) -> x = 4.2e+23; 17.5x ≈ Llama 3 70B (7.861e24) -> x = 4.492e+23 (geometric mean of the 1.9x and 17.5x estimates: 3.7975893e+23). Operation counting (17B active parameters): 6ND = 6 × 17e9 × 3.5e12 = 3.57e+23. Geometric mean of all four estimates: (3.2105263e+23 × 4.492e+23 × 4.2e+23 × 3.57e+23)^(1/4) = 3.8347175e+23 FLOP (sketched in code after this record).
3500000000000
"Arctic was trained with a three-stage curriculum each with a different data composition focusing on generic skills in the first phase (1T Tokens), and enterprise-focused skills in the latter two phases (1.5T and 1T tokens). " 1+1.5+1 = 3.5
Confident
Built by Snowflake, Arctic is a family of enterprise-grade LLMs with leading performance in enterprise intelligence and breakthrough efficiency. Snowflake Arctic is a truly open, Apache 2.0 licensed model.
Open weights (unrestricted)
United States of America
Open source
Apache 2.0 license with ungated access to weights and code paired with open data recipe and research insights.
Industry
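A minimal Python sketch of the Arctic estimate above: three values read off the blog-post graph plus a 6ND estimate over the 17B active parameters, combined with a geometric mean. The graph readings and reference-model compute values are the ones quoted in the note.

```python
# Illustrative: combining the four Arctic compute estimates from the notes
# above with a geometric mean.

from math import prod

graph_estimates = [6.1e23 / 1.9, 1.26e24 / 3, 7.861e24 / 17.5]   # from the graph
flop_6nd = 6 * 17e9 * 3.5e12                   # 17B active params, 3.5T tokens
estimates = graph_estimates + [flop_6nd]
geo_mean = prod(estimates) ** (1 / len(estimates))
print(f"{geo_mean:.3e}")                       # ~3.83e23 FLOP
```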
Jurassic-1-Jumbo
Language
Language modeling/generation
Chat
AI21 Labs
Opher Lieber, Or Sharir, Barak Lenz, Yoav Shoham
2021-08-11
Jurassic-1: Technical Details and Evaluation
https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf
55.00
178000000000.00
"Jurassic-1 models come in two sizes, where the Jumbo version, at 178B parameters, is the largest and most sophisticated language model ever released for general use by developers."
3.7e+23
see here https://docs.google.com/document/d/1B8x6XYcmB1u6Tmq3VcbAtj5bzhDaj2TcIPyK6Wpupx4/edit
225000000000
"Our model was trained with the conventional self-supervised auto-regressive training objective on 300B tokens drawn from publicly available resources" 1 token ~ 0.75 words
NVIDIA A100
Jurassic-1 is a pair of auto-regressive language models recently released by AI21 Labs, consisting of J1-Jumbo, a 178B-parameter model, and J1-Large, a 7B-parameter model. We describe their architecture and training, and evaluate their performance relative to GPT-3. The evaluation is in terms of perplexity, as well as zero-shot and few-shot learning. To that end, we developed a zero-shot and few-shot test suite, which we made publicly available (https://github.com/ai21labs/lm-evaluation) as a sh
API access
Israel
Unreleased
Industry
BLOOM-176B
Language
Language modeling
Translation
Code generation
Hugging Face
BigScience
Margaret Mitchell, Giada Pistilli, Yacine Jernite, Ezinwanne Ozoani, Marissa Gerchick, Nazneen Rajani, Sasha Luccioni, Irene Solaiman, Maraim Masoud, Somaieh Nikpoor, Carlos Muñoz Ferrandis, Stas Bekman, Christopher Akiki, Danish Contractor, David Lansky, Angelina McMillan-Major, Tristan Thrush, Suzana Ilić, Gérard Dupont, Shayne Longpre, Manan Dey, Stella Biderman, Douwe Kiela, Emi Baylor, Teven Le Scao, Aaron Gokaslan, Julien Launay, Niklas Muennighoff
2022-07-11
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
https://arxiv.org/abs/2211.05100
1984.00
176247271424.00
See "Technical Specifications" on Hugging Face: https://huggingface.co/bigscience/bloom
3.65664e+23
https://bigscience.huggingface.co/blog/bloom The blog post says training took 117 days. 384 A100 GPUs × 314 TFLOP/s throughput per GPU × 117 days × 0.3 (utilization assumption) = 3.65664e23 FLOP (sketched in code after this record). https://www.wolframalpha.com/input?i=384+*+314+TFLOPS+*+117+days+*+0.3
BigScience ROOTS Corpus
In total, 1.6 terabytes of pre-processed text was converted into 350 billion unique tokens as BLOOM's training datasets. arXiv:2210.15424 "BLOOM was trained on the ROOTS corpus (Laurençon et al., 2022), a composite collection of 498 Hugging Face datasets (Lhoest et al., 2021) amounting to 1.61 terabytes of text that span 46 natural languages and 13 programming languages. A high-level overview of this dataset can be seen in Figure 3, while a detailed itemized list of every language along with i
379000000000
Table 3.5 https://arxiv.org/pdf/2211.05100 366B (pretrain) + 13B (finetune) = 379B tokens total
NVIDIA A100 SXM4 80 GB
Confident
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a d
Open weights (restricted use)
Multinational
United States of America
Multinational
France
384
Unreleased
responsible use restrictions: https://bigscience.huggingface.co/blog/the-bigscience-rail-license
Industry
Research collective
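A minimal Python sketch of the BLOOM-176B hardware-based estimate above, using the note's figures (384 A100s, 314 TFLOP/s throughput, 117 days, 30% assumed utilization).

```python
# Illustrative: BLOOM-176B compute from GPU count x throughput x time.

gpus, throughput, days, util = 384, 314e12, 117, 0.3
flop = gpus * throughput * days * 24 * 3600 * util    # ~3.66e23 FLOP
print(f"{flop:.4e}")
```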
GLaM
Language
Language modeling/generation
Question answering
Google
Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui
2021-12-13
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
https://arxiv.org/abs/2112.06905
597.00
1200000000000.00
1.2 trillion parameters
3.6363112434e+23
The network activates 96.6 billion parameters per token and was trained for 600B tokens: 6 × 600e9 × 96.6e9 = 3.478e23 FLOP. Digitizing Figure 4(d) indicates 139.67 TPU-years of training: 2.75e14 FLOP/s × 139.67 years × 365.25 × 24 × 3600 s × 0.3 utilization = 3.636e23 FLOP (both estimates are sketched in code after this record). Since these are close, we will use the 6ND estimate and derive hardware utilization from the training time information. Later they say they measured 326W power usage per chip, which could maybe be used to estimate utilization.
Wikipedia
GLaM dataset
"To train our model, we build a high-quality dataset of 1.6 trillion tokens that are representative of a wide range of natural language use cases. Web pages constitute the vast quantity of data in our unlabeled dataset. However, their quality ranges from professional writing to low-quality comment and forum pages."
600000000000
The dataset is made of 1.6 trillion tokens, but later in the paper they say they only trained the largest model for 600B tokens. 600B tokens × 0.75 words/token ≈ 450B words. "The complete GLaM training using 600B tokens consumes only 456 MWh and emits 40.2 net tCO2e."
Google TPU v4
Confident
Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to s
Unreleased
United States of America
1024
Unreleased
Industry
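A minimal Python sketch of the two GLaM estimates above: 6ND over the ~96.6B parameters activated per token, and the hardware estimate from the digitized ~139.67 TPU-years (275 TFLOP/s TPU v4 peak and 30% utilization, as assumed in the note).

```python
# Illustrative: the two GLaM compute estimates from the notes above.

flop_6nd = 6 * 600e9 * 96.6e9                             # ~3.48e23 FLOP
flop_hw = 275e12 * 139.67 * 365.25 * 24 * 3600 * 0.3      # ~3.64e23 FLOP
print(f"{flop_6nd:.3e}  {flop_hw:.3e}")
```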
Falcon 2 11B
Language
Language modeling/generation
Technology Innovation Institute
2024-05-09
Falcon2-11B
https://huggingface.co/tiiuae/falcon-11B
11000000000.00
11B
3.6e+23
Trained on 5.5T tokens: 6 × 11e9 × 5.5e12 = 3.6e23 FLOP.
RefinedWeb
"Falcon2-11B was trained over 5,000B tokens of RefinedWeb, a high-quality filtered and deduplicated web dataset which we enhanced with curated corpora. It followed a four stage training strategy. The first three stages were focused on increasing the context length, from to 2048 to 4096 and finally to 8192 tokens. The last stage aimed to further enhance performance using only high quality data." Possibly an updated version of RefinedWeb, which only had 3.5T tokens when Falcon 1 was released? not
5500000000000
5.5T tokens: https://falconllm.tii.ae/falcon-2.html
NVIDIA A100 SXM4 40 GB
Confident
Falcon2-11B is an 11B parameters causal decoder-only model built by TII and trained on over 5,000B tokens of RefinedWeb enhanced with curated corpora. The model is made available under the TII Falcon License 2.0, the permissive Apache 2.0-based software license which includes an acceptable use policy that promotes the responsible use of AI.
Open weights (restricted use)
United Arab Emirates
Open but has an acceptable use policy: https://falconllm-staging.tii.ae/falcon-2-acceptable-use-policy.html
Government
LaMDA
Language
Language modeling
Google
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos
2022-02-10
LaMDA: Language Models for Dialog Applications
https://arxiv.org/abs/2201.08239
1375.00
137000000000.00
"LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters"
3.55e+23
"The total FLOPS is 56.5% * 123 TFLOPS/s * 1024 chips * 57.7 days = 3.55E+23" From https://arxiv.org/pdf/2201.08239.pdf p.18
Infiniset
LaMDA's underlying dataset is called 'Infiniset', and besides the dialogue data it also includes Common Crawl, Wikipedia, a mixture of English and non-English web documents, and data from programming-related sites (so LaMDA models can also dabble in code).
1560000000000
"and are pre-trained on 1.56T words of public dialog data and web text"
Google TPU v3
Confident
We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvements on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improve
Unreleased
United States of America
1024
Unreleased
Industry
GLM-130B
Language
Language modeling/generation
Translation
Tsinghua University
Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, Jie Tang
2022-08-04
GLM-130B: An Open Bilingual Pre-trained Model
https://keg.cs.tsinghua.edu.cn/glm-130b/posts/glm-130b/
989.00
130000000000.00
Dense model
3.5490054945e+23
"96 NVIDIA A100 (40G * 8) servers for 2 months" 312 TFLOPS/GPU * 96 servers * 8 GPU/server * 2 months * 32.5% utilization = 4.037e23 utilization rate - citation from the paper: "we report hardware FLOPs utilization (HFU) of 43.3% and model FLOPs utilization (MFU) of 32.5% due to re-materialization." Aligns pretty well with 6ND: 6 * 400B * 130B = 3.12E23 Geometric mean: sqrt(4.037e23 * 3.12e23) = 3.549e23
The Pile
WuDao Corpora
"The pre-training data includes 1.2T Pile (train split) (Gao et al., 2020) English, 1.0T Chinese WudaoCorpora (Yuan et al., 2021), and 250G Chinese corpora (including online forums, encyclopedia, and QA) we crawl from the web, which form a balanced composition of English and Chinese contents"
400000000000
400B "We completed the 400B-token training and evaluation of GLM-130B in July, and subsequently released the model and pre-training details in August 2022. " from https://arxiv.org/pdf/2406.12793 "As of July 3rd, 2022, GLM-130B has been trained on over 400 billion text tokens (200B each for Chinese and English)"
NVIDIA A100 SXM4 40 GB
Confident
GLM-130B (ICLR 2023) is an open bilingual (English & Chinese) bidirectional dense model with 130 billion parameters, pre-trained using the General Language Model (GLM) algorithm1. It is designed to support inference tasks with the 130B parameters on a single A100 (40G * 8) or V100 (32G * 8) server. As of July 3rd, 2022, GLM-130B has been trained on over 400 billion text tokens (200B each for Chinese and English)
Open weights (non-commercial)
China
768
Unreleased
non commercial license. looks like inference but not training code: https://github.com/THUDM/GLM-130B/blob/main/MODEL_LICENSE
Academia
Luminous-supreme
Language
Language generation
Aleph Alpha
2022-08-15
Model Card Luminous
https://docs.aleph-alpha.com/docs/introduction/model-card/
70000000000.00
"~70B"
3.5461e+23
"~839000h" GPU-hours on A100s, per Environmental Impact section of model card. 312 trillion * 839000 * 3600 * 0.3 = 2.8e23 6ND = 6*70B*1069.30B = 4.49106e+23 sqrt(2.8e23*4.49106e+23) = 3.54612... × 10^23 reported here: 167TFLOPS https://docs.aleph-alpha.com/docs/Deprecated%20Luminous/Deprecated-Luminous/model-card/
"The Luminous family has been trained on a dataset compiled of sources in English, German, French, Spanish and Italian..." more details in model card https://docs.aleph-alpha.com/docs/introduction/model-card/
1069300000000
from the table Total Size: 2.77 + 0.79 + 0.18 + 0.07 + 0.06 + 0.02 = 3.89 TB Tokens: 761.41B + 217.15B + 49.47B + 19.29B + 16.49B + 5.49B = 1069.30B tokens
NVIDIA A100 SXM4 40 GB
NVIDIA A100 SXM4 80 GB
Confident
The Luminous series is a family of large language models. Large language models are powerful technological tools that can process and produce text. These capabilities emerge during model “training” where the model is exposed to significant amounts of human text data. Similar to a person who deliberately absorbs information while reading a whole library and half of the internet, large language models acquire structural understanding (and not necessarily also knowledge) of language and accumulated
API access
Germany
512
Unreleased
Industry
Yuan 1.0
Language
Language modeling
Inspur
Shaohua Wu, Xudong Zhao, Tong Yu, Rongguo Zhang, Chong Shen, Hongli Liu, Feng Li, Hong Zhu, Jiangang Luo, Liang Xu, Xuanwei Zhang, Jun Liu
2021-10-12
Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning
https://arxiv.org/abs/2110.04725
51.00
245730000000.00
Table 2: Parameters of Yuan models. "Parameters (billion)"
3.5380000000001e+23
Table 9: 4095 petaFLOPS-days which equals 3.538*10^23 FLOP https://www.wolframalpha.com/input?i=4095+petaFLOPS+*+1+day
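The petaFLOPS-day conversion above, written out (1 petaFLOPS-day = 1e15 FLOP/s sustained for 86,400 seconds):

# Convert the reported 4095 petaFLOPS-days into total FLOP.
petaflops_days = 4095
total_flop = petaflops_days * 1e15 * 24 * 3600
print(f"{total_flop:.3e}")     # ~3.538e23 FLOP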
Common Crawl
Wikipedia
Sogou News
"A Chinese corpus with 5TB high-quality text is built, which is sufficient to train Yuan 245B model without sampling the dataset twice." In order to obtain the high-quality dataset, we develop a Massive Data Filtering System (MDFS) built on Spark to clean and filter the raw data, and train a Bert-based model to select high quality samples. MDFS is consisted of three parts, data collection, coarse filtering and fine filtering (Fig. 5). The raw data is collected from Common Crawl, Sogou News, Sog
1000000000000
"Yuan 1.0 was trained on a new Chinese dataset of 5TB high-quality text that was built on 850TB raw data from Internet." 1 GB ~ 167M words in English or 333M words in Chinese. For a mixed dataset of mostly Chinese, 5TB may be equivalent to around 1T words. Table 2: 180B training tokens
Confident
Recent work like GPT-3 has demonstrated excellent performance of Zero-Shot and Few-Shot learning on many natural language processing (NLP) tasks by scaling up model size, dataset size and the amount of computation. However, training a model like GPT-3 requires huge amount of computational resources which makes it challengeable to researchers. In this work, we propose a method that incorporates large-scale distributed training performance into model architecture design. With this method, Yuan 1.0
API access
China
2128
Unreleased
https://github.com/Shawn-IEITSystems/Yuan-1.0
Industry
AlphaGo Zero
Games
Go
DeepMind
D Silver, J Schrittwieser, K Simonyan, I Antonoglou
2017-10-18
Mastering the game of Go without human knowledge
https://www.nature.com/articles/nature24270
8795.00
46400244.00
Quick calculation
3.41e+23
source: https://docs.google.com/spreadsheets/d/1Kj4Q5WADcDXtUJLIOfGTCE3tGvxNczEMwyy8QtgSkHk/edit#gid=54587040&fvid=1361937389 AGZ had two models, one of which was small and another of which was large. The compute for AGZ is for the large model, which has 40 residual blocks instead of 20. A second way of looking at this... we believe multiple TPUs were used for training. 29 million games * 211 moves per game on average * 0.8 seconds per move = 4.8952E+09 seconds of player-time across all TPUs.
5800000000
"Over the course of training, 29 million games of self-play were generated" Approx 200 moves per Go game on average https://homepages.cwi.nl/~aeb/go/misc/gostat.html Thus 200 * 29e6 = 5.8e9
Google TPU v1
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on rein
Unreleased
United Kingdom of Great Britain and Northern Ireland
Unreleased
Industry
Qwen1.5-14B
Language
Chat
Language modeling/generation
Quantitative reasoning
Code generation
Translation
Alibaba
Qwen Team
2024-02-04
Introducing Qwen1.5
https://huggingface.co/Qwen/Qwen1.5-14B
14000000000.00
14B
3.36e+23
6*14*10^9*4*10^12 = 3.36e+23
Unspecified unreleased
"We pretrained the models with a large amount of data, and we post-trained the models with both supervised finetuning and direct preference optimization."
4000000000000
4 trillion tokens from this response https://github.com/QwenLM/Qwen2/issues/97
Confident
Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. In comparison with the previous released Qwen, the improvements include: 8 model sizes, including 0.5B, 1.8B, 4B, 7B, 14B, 32B and 72B dense models, and an MoE model of 14B with 2.7B activated; Significant performance improvement in human preference for chat models; Multilingual support of both base and chat models; Stable support of 32K context length for models of all si
Open weights (unrestricted)
China
Unreleased
https://huggingface.co/Qwen/Qwen1.5-14B
Industry
Chameleon (7B)
Multimodal
Image generation
Vision
Language
Language modeling/generation
Vision-language generation
Visual question answering
Text-to-image
Facebook AI Research
Srinivasan Iyer, Bernie Huang, Lili Yu, Arun Babu, Chunting Zhou, Kushal Tirumala, Xi Victoria Lin, Hu Xu, Xian Li, Akshat Shrivastava, Omer Levy, Armen Aghajanyan, Ram Pasunuru, Andrew Cohen, Aram H. Markosyan, Koustuv Sinha, Xiaoqing Ellen Tan, Ivan Evtimov, Ping Yu, Tianlu Wang, Olga Golovneva, Asli Celikyilmaz, Pedro Rodriguez, Leonid Shamis, Vasu Sharma, Christine Jou, Karthik Padthe, Ching-Feng Yeh, Mingda Chen, Bapi Akula, Jacob Kahn, Daniel Li, Scott Yih, Barlas Oguz, Morteza Behrooz, Be
2024-05-16
Chameleon: Mixed-Modal Early-Fusion Foundation Models
https://arxiv.org/abs/2405.09818v1
7000000000.00
3.3399700602e+23
GPU method: Table 2 shows that 7B model pre-training uses 856481 GPU-hours, trained across 1024 A100s 3.12e14 * 856481 * 3600 * 0.3 = 2.89e23 Parameter-token method: Pre-training goes over 9.2T tokens, post-training only goes over 1.1B tokens (sum of tokens column in Table 3) 6 * 7B * 9.2T = 3.86e23 Geometric mean: sqrt(2.89e23 * 3.86e23) = 3.34e23
Unspecified unreleased
Pre-training: - 2.9 trillion tokens of pure text - 1.4 billion text-image pairs, which produces 1.5 trillion text-image tokens - Since each image is 1024 tokens, implies 1.43 trillion image tokens and 0.07 trillion text tokens - 400 billion tokens of image-text interleaved documents - Difficult to estimate image-to-text ratio, but references OBELICS paper which had 141 million web pages, 353 million associated images, and 115 billion text tokens. - 353 million * 1024 = 361.5 billion image tok
4400000000000
Slightly conflicting info. Pre-training data details describe different types of data that sum to 4.8 trillion tokens, but Table 1 indicates 4.4T. Using table values as this agrees with other statements about epochs and total tokens seen.
NVIDIA A100 SXM4 80 GB
Confident
We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-fo
Open weights (non-commercial)
United States of America
Not enough info to estimate. GPU time given for pretraining, and while we know # of fine-tuning tokens we don't know # of epochs.
Unreleased
https://ai.meta.com/resources/models-and-libraries/chameleon-downloads/?gk_enable=chameleon_web_flow_is_live "The models we’re releasing today were safety tuned and support mixed-modal inputs and text-only output to be used for research purposes. While we’ve taken steps to develop these models responsibly, we recognize that risks remain. At this time, we are not releasing the Chameleon image generation model."
Industry
Qwen2.5-3B
Language
Language modeling/generation
Question answering
Quantitative reasoning
Alibaba
Qwen Team
2024-09-19
Qwen2.5: A Party of Foundation Models!
https://qwenlm.github.io/blog/qwen2.5-llm/
3090000000.00
3.09B
3.3372e+23
Training dataset size was 18 trillion 6ND = 6 * 3.09 billion parameters * 18 trillion tokens = 3.3372e+23
Unspecified unreleased
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens."
18000000000000
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens"
Confident
In the past three months since Qwen2’s release, numerous developers have built new models on the Qwen2 language models, providing us with valuable feedback. During this period, we have focused on creating smarter and more knowledgeable language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5. We are announcing what might be the largest opensource release in history! Let’s get the party started!
Open weights (non-commercial)
China
Unreleased
Qwen Research license
Industry
Galactica
Language
Biology
Language modeling
Question answering
Meta AI
Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, Robert Stojnic
2022-11-16
Galactica: A Large Language Model for Science
https://arxiv.org/abs/2211.09085
599.00
120000000000.00
"The largest 120B model we train runs on a single NVIDIA A100 node"
3.24e+23
Authors state the model is trained on 450b tokens. Using 6 FLOP/token/parameter, this is 6*120b*450b = 3.24e23
Galactica Corpus
"Our corpus consists of 106 billion tokens from papers, reference material, encyclopedias and other scientific sources. We combine natural language sources, such as papers and textbooks, and natural sequences, such as protein sequences and chemical formulae. We process LATEX where we can capture it, and also include academic code to capture computational science"
106000000000
"Total dataset size = 106 billion tokens"
NVIDIA A100 SXM4 80 GB
Likely
Information overload is a major obstacle to scientific progress. The explosive growth in scientific literature and data has made it ever harder to discover useful insights in a large mass of information. Today scientific knowledge is accessed through search engines, but they are unable to organize scientific knowledge alone. In this paper we introduce Galactica: a large language model that can store, combine and reason about scientific knowledge. We train on a large scientific corpus of papers,
Open weights (non-commercial)
United States of America
128
Unreleased
cc-by-nc (non-commercial): https://huggingface.co/facebook/galactica-120b repo but no training code: https://github.com/paperswithcode/galai/blob/main/README.md
Industry
InstructGPT 175B
Language
Language modeling/generation
OpenAI
Long Ouyang, Pamela Mishkin, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, John Schulman, Amanda Askell, Fraser Kelton, Peter Welinder, Luke Miller, Maddie Simens, Paul Christiano, Ryan Lowe, Chong Zhang, Jacob Hilton, Sandhini Agarwal, Katarina Slama, Alex Ray, Jan Leike
2022-01-27
Training language models to follow instructions with human feedback
https://arxiv.org/pdf/2203.02155
9228.00
175000000000.00
"We train three model sizes (1.3B, 6B, and 175B parameters)"
3.19181e+23
"training our 175B PPO-ptx model requires 60 petaflops/s-days, compared to 3,640 petaflops/s-days for GPT-3 (Brown et al., 2020)" 60/3640 = +1.65% to base model compute base model was reported 3.14e+23 FLOP 3.14e+23 * 1.0165 = 319181000000000000000000
374000033207
Table 6 - describes **number of prompts** 26584 + 6623 = 33207 This is added to GPT-3 dataset size.
Confident
Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the
United States of America
GPT-3 175B (davinci)
5.181e+21
Industry
GPT-3 175B (davinci)
Language
Text autocompletion
Language modeling/generation
OpenAI
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
2020-05-28
Language Models are Few-Shot Learners
https://arxiv.org/abs/2005.14165
32643.00
175000000000.00
"we train GPT-3, an autoregressive language model with 175 billion parameters"
3.14e+23
Table D.1 https://arxiv.org/abs/2005.14165
Common Crawl
WebText2
Wikipedia
Books1
Books2
Table 2.2 (other datasets also used)
374000000000
From table 2.2, we determine that there are 410 + 19 + 12 + 55 + 3 = 499 billion tokens. We multiply this by 0.75 to give 374B words. 3.74e11 ======================== [Anson: I think the calculation below doesn't look at all the data, the CommonCrawl data only constitutes 60% of the data. Multiplying by 5/3 gives 4.75e11] "The CommonCrawl data was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB aft
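The token sum and word conversion above as a short calculation; the per-source token counts are the Table 2.2 figures from the note, matched to the dataset names listed for this record, and 0.75 words/token is the rule of thumb used throughout these notes:

# GPT-3 training tokens from Table 2.2, converted to words.
tokens_by_source = {"Common Crawl": 410e9, "WebText2": 19e9,
                    "Books1": 12e9, "Books2": 55e9, "Wikipedia": 3e9}
total_tokens = sum(tokens_by_source.values())      # 499e9 tokens
words = total_tokens * 0.75                        # ~374e9 words
print(f"{total_tokens:.0e} tokens, {words:.0e} words")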
NVIDIA Tesla V100 DGXS 32 GB
Confident
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to
API access
United States of America
10000
Unreleased
https://openai.com/blog/openai-api
Industry
InternLM2-20B
Language
Chat
Language modeling/generation
Question answering
Shanghai AI Lab
SenseTime
Chinese University of Hong Kong (CUHK)
Fudan University
Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li
2024-01-12
InternLM2 Technical Report
https://arxiv.org/abs/2403.17297
20000000000.00
20B
3.12e+23
6ND = 6 * 2600000000000 * 20000000000 = 3.12e+23
Unspecified unreleased
"The text data in our pre-training dataset can be categorized by source into web pages, papers, patents, and books. To transform these sources into a pre-training dataset, we first standardize all data into a specified format, categorize them by type and language, and store them in JSON Lines (jsonl) format. Then, for all data, we apply several processing steps including rule-based filtering, data deduplication, safety filtering, and quality filtering. This results in a rich, safe, and high-qual
2600000000000
"The total number of tokens used for pre-training the 1.8B, 7B, and 20B models ranges from 2.0T to 2.6T, and the pre-training process consists of three distinct phases. "
Confident
The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization tech
Open weights (restricted use)
China
Hong Kong
China
Hong Kong
China
China
Unreleased
need to apply for commercial license. there's a repo but doesn't look like there's pretraining code. https://github.com/InternLM/InternLM
Academia
Industry
Academia
Academia
CodeFuse-13B
Language
Code generation
Ant Group
Peng Di, Jianguo Li, Hang Yu, Wei Jiang, Wenting Cai, Yang Cao, Chaoyu Chen, Dajun Chen, Hongwei Chen, Liang Chen, Gang Fan, Jie Gong, Zi Gong, Wen Hu, Tingting Guo, Zhichao Lei, Ting Li, Zheng Li, Ming Liang, Cong Liao, Bingchang Liu, Jiachen Liu, Zhiwei Liu, Shaojun Lu, Min Shen, Guangpei Wang, Huan Wang, Zhi Wang, Zhaogui Xu, Jiawei Yang, Qing Ye, Gehao Zhang, Yu Zhang, Zelin Zhao, Xunjin Zheng, Hailian Zhou, Lifu Zhu, Xianying Zhu
2023-10-10
CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model
https://arxiv.org/abs/2310.06266
3.00
13000000000.00
3.09e+23
"CodeFuse-13B was trained using 512 Nvidia A100 GPU cards, with a Hardware FLOPs Utilization (HFU) of approximately 60%. The training process took approximately 40 days to complete." Later they state utilization of 56% 512 * 312 trillion * 40 * 24 * 3600 * 0.56 = 3.09e23 Using params*tokens, we have 13 billion * 1 trillion * 6 = 7.8e22. might be a sign of multiple epochs? 1T is the size of the dataset; they don't clearly state the number of training tokens
The Stack
GitHub
80% code, 10% English, 10% Chinese: "The pre-training data for CodeFuse consists of 196TB of code, 1.75TB of Chinese raw data, and 1.7TB of English raw data, totaling 200TB, that are tokenized into 800 billion tokens of code, 100 billion tokens of Chinese corpus, and 100 billion tokens of English corpus (see Section 3.1)." "We collected about 200+ TB of code-related data, and finally refined it to around 1.6TB (1T Token) of clean data suitable for pre-training."
1000000000000
1T tokens, mostly code but some Chinese/English
NVIDIA A100 SXM4 80 GB
Confident
Code Large Language Models (Code LLMs) have gained significant attention in the industry due to their wide applications in the full lifecycle of software engineering. However, the effectiveness of existing models in understanding non-English inputs for multi-lingual code-related tasks is still far from well studied. This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM. It is specifically designed for code-related tasks with both English and Chinese prompts and supports over 4
Open weights (unrestricted)
China
512
Unreleased
apache: https://github.com/codefuse-ai/codefuse-chatbot?tab=License-1-ov-file#readme
Industry
Gemma 1.1 7B Instruct
Language
Language modeling/generation
Question answering
Google
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot and et al.
2024-02-24
https://huggingface.co/google/gemma-1.1-7b-it
8540000000.00
Safetensors Model size 8.54B params
3.0744e+23
6ND = 6*6000000000000*8540000000=3.0744e+23
Unspecified unreleased
"These models were trained on a dataset of text data that includes a wide variety of sources, totaling 6 trillion tokens. Here are the key components: Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. Primarily English-language content. Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code or understand code-related
6000000000000
"These models were trained on a dataset of text data that includes a wide variety of sources, totaling 6 trillion tokens. "
Google TPU v5e
Confident
This is Gemma 1.1 7B (IT), an update over the original instruction-tuned Gemma release. Gemma 1.1 was trained using a novel RLHF method, leading to substantial gains on quality, coding capabilities, factuality, instruction following and multi-turn conversation quality. We also fixed a bug in multi-turn conversations, and made sure that model responses don't always start with "Sure,". We believe this release represents an improvement for most use cases, but we encourage users to test in their p
Open weights (restricted use)
United States of America
Unreleased
https://huggingface.co/google/gemma-1.1-7b-it "This repository is publicly accessible, but you have to accept the conditions to access its files and content."
Industry
Gemma 7B
Language
Language modeling/generation
Chat
Code generation
Question answering
Quantitative reasoning
Google DeepMind
Gemma Team, Google DeepMind
2024-02-21
Gemma: Open Models Based on Gemini Research and Technology
https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf
8538074112.00
Table 2, sum of embedding and non-embedding parameters
3.07e+23
6ND aproximation 6*8.54B*6T = 3.07e23 "Gemma 2B and 7B are trained on 2T and 6T tokens respectively of primarily-English data from web documents, mathematics, and code." As confirmation: "We estimate the carbon emissions from pretraining the Gemma models to be ∼ 131 𝑡𝐶𝑂2𝑒𝑞. " U.S. avg CO2 per kWh is ~0.87lbs 131 tCO2 * 2000 lb/t * (1 kWh/0.87lb) = 3.01e5 kWh Per SemiAnalysis TPU v5e uses ~ 5x less power than H100, so ~140 W TDP 3.01e5 kWh * 1000 W/kW * 1 TPUv5e/140 W = 2.15e6 TPUv5e-ho
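A sketch of the carbon-emission cross-check above; the 0.87 lb CO2/kWh grid intensity and the ~140 W per TPU v5e are the assumptions stated in the note:

# Rough TPU-hours implied by the reported ~131 tCO2e for Gemma pretraining.
tco2 = 131
lbs_co2 = tco2 * 2000                  # short tons to pounds
kwh = lbs_co2 / 0.87                   # assumed US-average 0.87 lb CO2 per kWh
tpu_v5e_watts = 140                    # assumed per-chip power draw
tpu_hours = kwh * 1000 / tpu_v5e_watts # ~2.15e6 TPUv5e-hours
print(f"{kwh:.2e} kWh, {tpu_hours:.2e} TPU-hours")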
Unspecified unreleased
"Gemma 2B and 7B are trained on 2T and 6T tokens respectively of primarily-English data from web documents, mathematics, and code."
6000000000000
"Gemma 2B and 7B are trained on 2T and 6T tokens respectively of primarily-English data from web documents, mathematics, and code." Not explicitly stated that this doesn't involve multiple epochs, but I expect it does not.
Google TPU v5e
Confident
Open weights (restricted use)
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
4096
Unreleased
https://ai.google.dev/gemma/terms no illegal use or abuse
Industry
Granite 20B
Language
Language modeling/generation
IBM Research
IBM Research
2024-05-31
Granite Foundation Models
https://www.ibm.com/downloads/documents/us-en/10a99803c92fdb35
20000000000.00
3.0000000000001e+23
6*2500000000000*20000000000=3e+23
Stack Exchange
Common Crawl
Wikimedia
"For English and code, we used Wikimedia, Stack Exchange, and commoncrawl. For multilingual data, we used portions of commoncrawl."
2500000000000
For pre-training, we used 0.5 trillion English, 0.4 trillion multilingual (es, fr, de, pt), and 1.6 trillion code tokens.
Confident
We introduce the Granite series of decoder-only foundation models for generative artificial intelligence (AI) tasks that are ready for enterprise use. We report on the architecture, capabilities, underlying data and data governance, training algorithms, compute infrastructure, energy and carbon footprint, testing and evaluation, socio-technical harms and mitigations, and usage policies.
Open weights (unrestricted)
United States of America
Multinational
Industry
Stable LM 2 12B
Language
Language modeling/generation
Translation
Stability AI
2024-04-08
Introducing Stable LM 2 12B
https://stability.ai/news/introducing-stable-lm-2-12b https://huggingface.co/stabilityai/stablelm-2-12b
12143605760.00
Precise number given in HF model card
2.91e+23
6ND = 6 * 12,143,605,760 params * 2T tokens * 2 epochs = 2.91e23. Trained on 384 H100s (AWS P5 instances).
RefinedWeb
RedPajama-Data
The Pile
StarCoder
CulturaX
The dataset is comprised of a filtered mixture of open-source large-scale datasets available on the HuggingFace Hub: Falcon RefinedWeb extract (Penedo et al., 2023), RedPajama-Data (Together Computer., 2023) and The Pile (Gao et al., 2020) both without the Books3 subset, and StarCoder (Li et al., 2023). We further supplement our training with multi-lingual data from CulturaX (Nguyen et al., 2023) and, in particular, from its OSCAR corpora, as well as restructured data in the style of Yuan & Liu
2000000000000
2T tokens
NVIDIA H100 SXM5 80GB
Confident
Introducing the latest additions to our Stable LM 2 language model series: a 12 billion parameter base model and an instruction-tuned variant, trained on 2 trillion tokens in seven languages: English, Spanish, German, Italian, French, Portuguese, and Dutch. This medium-sized model balances strong performance, efficiency, memory requirements, and speed, following our established Stable LM 2 1.6B framework as detailed in our previously released technical report. With this release, we’re extending
Open weights (restricted use)
Multinational
United Kingdom of Great Britain and Northern Ireland
Open source
Requires Stability AI Membership. Free for non-commercial use, $20/month for commercial use if less than $1M in annual revenue, $1M in institutional funding, and 1M monthly active users. Apache 2.0 license for repo, which includes detailed hyperparams and training details: https://github.com/Stability-AI/StableLM/blob/main/LICENSE
Industry
ST-MoE
Language
Language modeling/generation
Google
Google Brain
Google Research
Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus
2022-02-17
ST-MoE: Designing Stable and Transferable Sparse Expert Models
https://arxiv.org/abs/2202.08906v2
117.00
269000000000.00
269B. It is called ST-MoE-32B because its compute cost is comparable to that of a 32B dense model.
2.9000000000000005e+23
The paper claims "scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder". If this is true for training cost, then 6*32e9*1.5e12 = 2.9e23
C4
"The pre-training dataset used to train our Sparse 32B model is a mix of C4 (Raffel et al., 2019) and the dataset introduced in GLaM (Du et al., 2021)."
1500000000000
"We pre-train for 1.5T tokens on a mixture of English-only C4 dataset (Raffel et al., 2019) and the dataset from GLaM (Du et al., 2021) summarized in Appendix E"
Likely
Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a spars
Unreleased
United States of America
United States of America
Multinational
United States of America
Open source
Apache License 2.0 Code for our models is available at https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/moe.py
Industry
Industry
Industry
MegaScale (175B)
Language
Language modeling/generation
ByteDance
Peking University
Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu
2024-02-23
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
https://arxiv.org/abs/2402.15627
40.00
175000000000.00
Two models are trained for one epoch each to evaluate the MegaScale training system: one with 175B parameters and another with 530B. This entry reports the 175B model. There is a third production model mentioned, with fewer details.
2.7385671436e+23
Table 2 gives details for the 175B model. Looking at the largest 1-epoch run with 12288 GPUs: 2166.3 aggregate PFlops/sec * 1.75 days * 24 hours/day * 3600 seconds/hour = 3.275e23. This is consistent with the theoretical computation-counting estimate, if they factor the MFU rate into their 2166.3 figure: 2 × 175B params × 3 × 300B tokens × 1 epoch = 2.29e23. I use the geometric mean of these two: sqrt(3.275e23 * 2.29e23) = 2.74e23
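A sketch of the two estimates above and their geometric mean; the 2166.3 PFLOP/s aggregate throughput and the 1.75-day run come from Table 2 of the paper, and the counting estimate is taken as given in the note:

import math

# Hardware-throughput estimate for the 175B MegaScale run.
aggregate_flops = 2166.3e15           # aggregate FLOP/s across 12288 GPUs
seconds = 1.75 * 24 * 3600
hw_estimate = aggregate_flops * seconds          # ~3.3e23

# Theoretical counting estimate used in the note.
counting_estimate = 2.29e23

# Final figure: geometric mean of the two.
print(f"{math.sqrt(hw_estimate * counting_estimate):.3e}")   # ~2.74e23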
The 175B and 530B models trained for the paper use 300B tokens each.
225000000000
300B tokens * 0.75 words/token = 225B words
NVIDIA A100
Confident
We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pip
Unreleased
China
China
12288
Unreleased
repo, but no training code for the big model https://github.com/volcengine/vescale Model weights unreleased
Industry
Academia
LLaMA-33B
Language
Language modeling
Code generation
Meta AI
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample
2023-02-27
LLaMA: Open and Efficient Foundation Language Models
https://arxiv.org/abs/2302.13971
8872.00
32500000000.00
Table 2 in the paper
2.7300000000000996e+23
1.4T tokens * 32.5B params * 6 FLOP/token/param = 2.73e+23 FLOP
CCNet
GitHub
Wikipedia
books
arXiv
Stack Exchange
See Table 1
1340000000000
Table 1 indicates that the 1.4T training tokens involved sampling sub-datasets at more or less than one epoch. Correcting for this: (1.1 epochs * 3.3 TB) + (1.06 epochs * 0.783 TB) + ... = 5.24 epoch-TB of data sampled to produce the 1.4T tokens seen. At roughly 200M tokens/GB, 5.24 epoch-TB is about 1.05T epoch-tokens, i.e. the data was seen for slightly more than one epoch on average, so one full epoch is roughly 1.34T unique tokens.
Confident
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the resea
Open weights (non-commercial)
United States of America
Unreleased
"we are releasing our model under a noncommercial license focused on research use cases" https://ai.meta.com/blog/large-language-model-llama-meta-ai/
Industry
Whisper v3
Speech
Speech recognition
OpenAI
2023-11-06
https://huggingface.co/openai/whisper-large-v3
1550000000.00
2.7e+23
Could derive this in terms of Whisper v1, which according to the paper was trained for 680k hours for between 2-3 epochs. Whisper v3 was trained on 5 million hours for 2 epochs, or ~5-7x as much data, and has the same architecture. We have an estimate of 4.65e22 for Whisper 1. Assume Whisper v1 was trained on 2.5 epochs, or 2.5*680k = 1.7M hours. Whisper v3 was trained on 10M hours. 10/1.7 * 4.65e22 ~= 2.7e23
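A sketch of the scaling argument above, anchored on the Whisper v1 estimate; the 2.5-epoch assumption for v1 and the same-architecture assumption are taken from the note:

# Whisper v3 compute, scaled from the Whisper v1 estimate by audio hours seen.
whisper_v1_flop = 4.65e22
v1_hours_seen = 2.5 * 680e3        # assumed 2.5 epochs over 680k hours
v3_hours_seen = 2 * 5e6            # 2 epochs over 5M hours
whisper_v3_flop = whisper_v1_flop * v3_hours_seen / v1_hours_seen
print(f"{whisper_v3_flop:.1e}")    # ~2.7e23 FLOP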
Unspecified unreleased
"The Whisper large-v3 model is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper large-v2"
60000000000
English audio is roughly 228 wpm: https://docs.google.com/document/d/1G3vvQkn4x_W71MKg0GmHVtzfd9m0y3_Ofcoew0v902Q/edit#heading=h.sxcem9l5k3ce The dataset is multilingual and other languages seem to have lower wpms. So using 200 wpm, we have 200*60*5 million hours = 60,000,000,000 (60B) words
Likely
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning. Whisper was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. The original code repository can be found here.
Open weights (unrestricted)
United States of America
Unreleased
Apache 2.0: https://huggingface.co/openai/whisper-large-v3 this seems to be inference code not training: https://github.com/openai/whisper
Industry
Viking
Language
Language modeling/generation
Language generation
Translation
Silo AI
University of Turku
2024-04-04
Viking 33B is a 33B parameter decoder-only transformer pretrained on Finnish, English, Swedish, Danish, Norwegian, Icelandic and code. It is being trained on 2 trillion tokens (700 billion as of this release). Viking 33B is a fully open source model and is made available under the Apache 2.0 License.
https://huggingface.co/LumiOpen/Viking-33B
33000000000.00
2.574e+23
The plan is to train on 2 trillion tokens, but the most recent release is at 1.3T. 6 * 33B * 1.3 trillion = 2.574e23
2000000000000
Viking is being trained on a 2 trillion token mixed dataset of English, Finnish, Swedish, Danish, Norwegian, Icelandic and code.
AMD Radeon Instinct MI250X
Confident
Open weights (unrestricted)
Finland
Finland
1024
Open source
code here: https://github.com/LumiOpen/Megatron-DeepSpeed/blob/main/pretrain_viking_33B.sh
Industry
Academia
Qwen2.5-Coder (7B)
Language
Code generation
Code autocompletion
Quantitative reasoning
Question answering
Language modeling/generation
Alibaba
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, Junyang Lin
2024-09-18
Qwen2.5-Coder Technical Report
https://arxiv.org/abs/2409.12186
7610000000.00
Number of Parameters: 7.61B
2.5113e+23
6ND = 6*7610000000 parameters *5.5T tokens =2.5113e+23
GitHub
Common Crawl
"we constructed a dataset named Qwen2.5-Coder-Data. This dataset comprises five key data types: Source Code Data, Text-Code Grounding Data, Synthetic Data, Math Data, and Text Data."
5500000000000
Confident
In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes two models: Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities w
Open weights (unrestricted)
China
Apache 2.0 https://huggingface.co/Qwen/Qwen2.5-Coder-7B
Industry
Skywork-13B
Language
Language modeling
Language modeling/generation
Translation
Kunlun Inc.
Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Weiwei Lü, Rui Hu, Chenxia Li, Liu Yang, Xilin Luo, Xuejie Wu, Lunan Liu, Wenjun Cheng, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Lei Lin, Xiaokun Wang, Yutuan Ma, Chuanhai Dong, Yanqi Sun, Yifu Chen, Yongyi Peng, Xiaojuan Liang, Shuicheng Yan, Han Fang, Yahui Zhou
2023-10-30
Skywork: A More Open Bilingual Foundation Model
https://arxiv.org/abs/2310.19341
75.00
13000000000.00
13B
2.5e+23
"Our Skywork-13B is trained on a cluster of 64 NVIDIA-HGX-A800 nodes, a total of 512 A800-80G SXM GPUs... The training process of Skywork-13B spanned a total of 39 days." They note that "we achieved a token throughput of 1873 per GPU per second and a model flops utilization (MFU) of 56.5%... ". "MFU" was coined in the Palm paper (https://arxiv.org/pdf/2204.02311.pdf) and only counts operations used to train the model, not all operations observed on the hardware. MFU is lower than traditionall
SkyPile
"In order to train Skywork-13B, we build SkyPile, a vast, high quality corpus comprising more than 6 trillion tokens. A segment of the corpus, comprising over 150 billion tokens of web text, has been open sourced to facilitate research and training on Chinese LLMs"
3180000000000
The full SkyPile dataset is 6 trillion tokens, roughly half English and half Chinese: (https://huggingface.co/Skywork/Skywork-13B-base). The model is trained for the equivalent of 0.53 epochs on the full dataset, or 3.18 trillion unique tokens. This is around 2.78 trillion words, based on an average of 1 word/token for the Chinese portion and 0.75 word/token on the English portion.
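The word-count conversion above as a short calculation, assuming (as stated) roughly 1 word/token for the Chinese half and 0.75 words/token for the English half:

# Unique words in the ~3.18T unique training tokens, split half Chinese / half English.
unique_tokens = 3.18e12
chinese_words = (unique_tokens / 2) * 1.0    # ~1 word per token for Chinese
english_words = (unique_tokens / 2) * 0.75   # ~0.75 words per token for English
print(f"{chinese_words + english_words:.2e}")   # ~2.78e12 words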
NVIDIA A800
Confident
In this technical report, we present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts. This bilingual foundation model is the most extensively trained and openly published LLMs of comparable size to date. We introduce a two-stage training methodology using a segmented corpus, targeting general purpose training and then domain-specific enhancement training, respectively. We show that our model not only
Open weights (restricted use)
China
512
Open (restricted use)
commercial but restrictive license: https://github.com/SkyworkAI/Skywork/blob/main/LICENSE part of the training data is open, but only 2.5%: "In order to train Skywork-13B, we build SkyPile, a vast, high quality corpus comprising more than 6 trillion tokens. A segment of the corpus, comprising over 150 billion tokens of web text, has been open sourced to facilitate research and training on Chinese LLMs" training code: https://github.com/SkyworkAI/Skywork/blob/main/train/train.py
Industry
Qwen-14B
Language
Language modeling/generation
Alibaba
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zha
2023-09-28
Qwen Technical Report
https://arxiv.org/abs/2309.16609
169.00
14000000000.00
14B
2.5e+23
3T tokens per Table 1 14B*3T*6 = 2.5e23
"Our dataset is designed to meet these requirements and includes public web documents, encyclopedia, books, codes, etc. Additionally, our dataset is multilingual, with a significant portion of the data being in English and Chinese."
3000000000000
"We have pretrained the language models, namely QWEN, on massive datasets containing trillions of tokens" Table 1 indicates 3T tokens for Qwen-14B, and the above quote suggests the 3T aren't from multiple epochs on a smaller dataset.
Confident
Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignm
Open weights (restricted use)
China
Unreleased
commercial allowed, can't use to train models https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT
Industry
Granite 13B
Language
Chat
Language modeling/generation
Question answering
Text summarization
IBM
2023-11-30
Granite Foundation Models
https://www.ibm.com/downloads/cas/X9W4O6BM
13000000000.00
13 billion
2.44e+23
Estimate using hardware: "Granite.13b.v1 used 256 A100 GPUs for 1056 hours and 120 TFLOPs. Granite.13b.v2 was trained on the same infrastructure for an additional 1152 hours with 120 TFLOPS, bringing the total to 2208 hours" Seems like 120 TFLOPS is the output per GPU after utilization, though they don't explicitly explain that part. That's 38% utilization. 256 * 2208 * 3600 * 120 TFLOPS = 2.44e23 Using 6ND: "The second version of the granite.13b models leverages an updated base model train
Unspecified unreleased
Common Crawl
arXiv
OpenWebText
"To support the training of large enterprise-grade foundation models, including granite.13b, IBM curated a massive dataset of relevant unstructured language data from sources across academia, the internet, enterprise (e.g., financial, legal), and code." More breakdowns in paper, 20 sources in total https://www.ibm.com/docs/en/cloud-paks/cp-data/4.8.x?topic=models-granite-13b-v1-model-card
2500000000000
2.5T tokens, 1.875T words at 0.75 words/token https://www.ibm.com/docs/en/cloud-paks/cp-data/5.0.x?topic=models-granite-13b-chat-v2-model-card
NVIDIA A100
Likely
We introduce the Granite series of decoder-only foundation models for generative artificial intelligence (AI) tasks that are ready for enterprise use. We report on the architecture, capabilities, underlying data and data governance, training algorithms, compute infrastructure, energy and carbon footprint, testing and evaluation, socio-technical harms and mitigations, and usage policies.
API access
United States of America
Unreleased
Industry
Falcon-40B
Language
Language modeling
Technology Innovation Institute
2023-03-15
Abu Dhabi-based Technology Innovation Institute Introduces Falcon LLM: Foundational Large Language Model (LLM) outperforms GPT-3 with 40 Billion Parameters
https://arxiv.org/abs/2311.16867; https://www.tii.ae/news/abu-dhabi-based-technology-innovation-institute-introduces-falcon-llm-foundational-large
0.00
40000000000.00
Model comes in 7B and 40B variants.
2.4e+23
C = 6ND = 6 * 40B * 1000B = 2.4e+23 FLOP (assuming one epoch) Table 1 from https://arxiv.org/pdf/2311.16867 Falcon paper 2,800 petaflop-days * 1e15 * 24 * 3600 = 2.4192e+23 FLOPs
RefinedWeb
Falcon-40B was trained on 1,000B tokens of RefinedWeb, a high-quality filtered and deduplicated web dataset which we enhanced with curated corpora. Significant components from our curated copora were inspired by The Pile (Gao et al., 2020).
1000000000000
1000B tokens ~= 750B words
NVIDIA A100
Confident
Open weights (unrestricted)
United Arab Emirates
384
Unreleased
apache 2.0
Government
Nanbeige-16B
Language
Chat
Language modeling/generation
Code generation
Question answering
Nanbeige LLM Lab
2023-11-01
https://github.com/Nanbeige/Nanbeige/blob/main/README_EN.md
16000000000.00
16 billion
2.4e+23
"It uses 2.5T Tokens for pre-training". I think that's the number of tokens the model was trained on, not the dataset size, but I'm not sure. 16 billion * 2.5 trillion * 6 = 2.4e23
Unspecified unreleased
"The training data includes a large amount of high-quality internet corpus, various books, code, etc"
2500000000000
"It uses 2.5T Tokens for pre-training"
Likely
Nanbeige-16B is a 16 billion parameter language model developed by Nanbeige LLM Lab. It uses 2.5T Tokens for pre-training. The training data includes a large amount of high-quality internet corpus, various books, code, etc. It has achieved good results on various authoritative evaluation data sets. This release includes the Base, Chat, Base-32k and Chat-32k.
Open weights (unrestricted)
China
Open source
Apache 2.0 training code: https://github.com/Nanbeige/Nanbeige/blob/main/scripts/train.sh
Industry
LightOn Mini
Language
Language modeling/generation
Chat
LightOn
2023-03-21
LightOn's Large Language Model of 40 billion parameters: MINI
https://www.lighton.ai/blog/lighton-s-blog-4/lighton-s-large-language-model-of-40-billion-parameters-mini-19
40000000000.00
"Boasting an impressive 40 billion parameters, Mini is a formidable addition to the growing array of language models available in the market today."
2.4e+23
6ND aproximation: 6*40B*1T = 2.4e23
"The amount of data in Mini corpus is 1 trillion tokens. We mainly used data from the public web to pre-train our model, with strong filtering, toxicity reduction, and deduplication to ensure that only high-quality data is retained."
1000000000000
"The amount of data in Mini corpus is 1 trillion tokens. We mainly used data from the public web to pre-train our model, with strong filtering, toxicity reduction, and deduplication to ensure that only high-quality data is retained." assuming 0.75 words per token - 750000000000.0 words
Confident
Hosted access (no API)
France
Unreleased
Industry
BloombergGPT
Language
Language modeling
Bloomberg
Johns Hopkins University
Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, Gideon Mann
2023-03-30
BloombergGPT: A Large Language Model for Finance
https://arxiv.org/abs/2303.17564
556.00
50558868480.00
2.3599999999999997e+23
2.36e23 per Table 4 (using our usual hardware method, 512 A100s over 53 days would be 512 * 312 teraFLOP/s * 53 * 24 * 3600 * 0.3 = 2.19e23)
"To train BloombergGPT, we construct “FinPile”, a comprehensive dataset consisting of a range of English financial documents including news, filings, press releases, web-scraped financial documents, and social media drawn from the Bloomberg archives. These documents have been acquired through our business process over the past two decades. We augment FinPile with public data widely used to train LLMs. The result is a training corpus that is roughly half domain-specific text and half general-purp
532000000000
708.9 billion tokens. At 0.75 English words per token, that's 532B words
NVIDIA A100
Confident
The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion
Unreleased
United States of America
United States of America
512
Unreleased
Industry
Academia
YaLM
Language
Language modeling
Chat
Yandex
Mikhail Khrushchev, Ruslan Vasilev, Alexey Petrov, Nikolay Zinov
2022-06-23
Yandex Publishes YaLM 100B. It’s the Largest GPT-Like Neural Network in Open Source
https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6
100000000000.00
100B
2.2e+23
"It took us 65 days to train the model on a pool of 800 A100 graphics cards and 1.7 TB of online texts, books, and countless other sources."
The Pile
YaLM Russian Dataset
"25% The Pile — open English dataset by Eleuther AI team 75% Texts in Russian collected by our team (percentages of the whole dataset are given)" https://github.com/yandex/YaLM-100B?tab=readme-ov-file
300000000000
1.7TB of data, 300B tokens – from the GitHub repo https://github.com/yandex/YaLM-100B. I've assumed that 1 token corresponds to 1 word for Russian text.
NVIDIA A100
Likely
Open weights (unrestricted)
Russia
800
Unreleased
Apache 2.0 for weights. training details, but no code: https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6
Industry
Flamingo
Multimodal
Vision
Language
Video
Visual question answering
Image captioning
DeepMind
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan
2022-04-29
Flamingo: a Visual Language Model for Few-Shot Learning
https://arxiv.org/abs/2204.14198
2473.00
80000000000.00
"We obtain three models, Flamingo-3B, Flamingo-9B and Flamingo-80B" " The Flamingo-80B model builds on top of the frozen Chinchilla 70B language model [42]. Starting from the very first layer and before every seventh transformer blocks, we add a GATED XATTN-DENSE layer attending to the visual inputs; this accounts for 10B additional learned parameters. For simplicity, we refer to this model as simply Flamingo throughout the paper"
2.1897200000000104e+23
1536 TPU v4 chips for 15 days. Assuming 40% utilization: C = 1536 TPU * 275*10^12 FLOP/s/TPU * 15 day * 86400 s/day * 0.40 = 2.2*10^23 FLOP "All training and evaluation was performed on TPUv4 instances. The largest model containing 80 billion parameters is trained on 1536 chips for 15 days and sharded across 16 devices." "All trained parameters and optimizer accumulators are stored and updated in float32; all activations and gradients are computed in bfloat16 after downcasting of parameters fr
MultiModal MassiveWeb
LTIP
VTP
ALIGN
Flamingo was trained on a mixture of web-scraped datasets: 43M pages of text with interleaved images (MultiModal MassiveWeb dataset) 312M image-text pairs (LTIP dataset) 27M video-text pairs (VTP dataset) 1.8B image-alt text pairs (ALIGN dataset) Training dataset size is at least 2.1 billion.
Google TPU v4
Confident
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks t
Unreleased
United Kingdom of Great Britain and Northern Ireland
Chinchilla
1536
Unreleased
Industry
phi-3-small 7.4B
Language
Chat
Language modeling/generation
Microsoft
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Dan Iter, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin
2024-04-23
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
https://arxiv.org/abs/2404.14219
7400000000.00
7.4B
2.1312000000000003e+23
6ND = 6*7.4B parameters * 4.8T tokens =2.1312e+23
"4.8T tokens total as for phi-3-small"
4800000000000
Confident
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data a
United States of America
Industry
Poro 34B
Language
Code generation
Language modeling/generation
High-Performance Language Technologies (HPLT)
University of Turku
Risto Luukkonen, Jonathan Burdge, Elaine Zosa, Aarne Talman, Ville Komulainen, Väinö Hatanpää, Peter Sarlin, Sampo Pyysalo
2023-12-14
Poro 34B and the Blessing of Multilinguality
https://arxiv.org/abs/2404.01856
34200000000.00
https://huggingface.co/LumiOpen/Poro-34B
2.052e+23
6ND = 6*1T*34.2B = 2.052e+23 "This allowed total training cycle throughput of 49618 TFLOPs and 174378 tokens/second." The training took around 18 months (https://hplt-project.org/deliverables): 49618*18*30*24*3600*10^12 = 2.3149774e+24
mC4
SlimPajama
StarCoder
Dolma
https://huggingface.co/LumiOpen/Poro-34B "The Finnish dataset is a combination of many Finnish resources: Finnish Internet Parsebank mC4 multilingual colossal, cleaned Common Crawl Common Crawl Finnish Finnish Wikipedia Lönnrot Projekti Lönnrot Suomi24 The Suomi 24 Corpus 2001-2020 Reddit r/Suomi submissions and comments STT Finnish News Agency Archive 1992-2018 Yle Finnish News Archive 2011-2018 Yle Finnish News Archive 2019-2020 Yle News Archive Ea
1000000000000
1T tokens, assuming 0.75 word per token "Poro is a 34B parameter decoder-only transformer pretrained on Finnish, English and code. It is being trained on 1 trillion tokens. Poro is a fully open source model and is made available under the Apache 2.0 License."
AMD Radeon Instinct MI250X
Confident
The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing and that it should be possi
Open weights (unrestricted)
Multinational
Finland
512
Apache 2.0 https://huggingface.co/LumiOpen/Poro-34B
Research collective
Academia
AlexaTM 20B
Language
Language modeling
Translation
Question answering
Amazon
Saleh Soltan, Shankar Ananthakrishnan, Jack FitzGerald, Rahul Gupta, Wael Hamza, Haidar Khan, Charith Peris, Stephen Rawls, Andy Rosenbaum, Anna Rumshisky, Chandana Satya Prakash, Mukund Sridhar, Fabian Triefenbach, Apurv Verma, Gokhan Tur, Prem Natarajan
2022-08-02
AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model
https://arxiv.org/abs/2208.01448
73.00
19750000000.00
See Table 1 on p.3 of the paper
2.04374016e+23
Training throughput is reported as 154 TFLOP/s - see p.5 of the paper. "We relied on an internal and optimized version of DeepSpeed that we have since open-sourced (Chiu & Zheng, 2022) to obtain training throughput of up to 154 TFLOPS/GPU on 16 AWS p4d.24xlarge compute instances." Accelerator compute days are reported as 15,360 days - see Table 17 on p.18 of the paper.
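The note above uses reported sustained throughput rather than peak FLOP/s with an assumed utilization: compute ≈ per-GPU throughput × accelerator-days × 86400. A minimal sketch of that arithmetic:

throughput = 154e12        # reported sustained FLOP/s per GPU (p.5 of the paper)
accelerator_days = 15360   # Table 17
compute = throughput * accelerator_days * 86400
print(f"{compute:.6e}")    # ~2.043740e+23 FLOP, matching the recorded value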
mC4
Wikipedia
1319000000000
See Table 2 on p.3 of the paper. 119B Wikipedia tokens + 1.2T mC4 tokens = 1319000000000 tokens
NVIDIA A100
Confident
In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming a much larger 540B P
API access
United States of America
128
https://aws.amazon.com/about-aws/whats-new/2022/11/alexatm-20b-model-available-sagemaker-jumpstart/?nc1=h_ls
Industry
Baichuan2-13B
Language
Chat
Baichuan
Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, JunTao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xi
2023-09-06
Baichuan 2: Open Large-scale Language Models
https://huggingface.co/baichuan-inc/Baichuan2-13B-Base, https://arxiv.org/abs/2309.10305
13000000000.00
2.03e+23
They describe the dataset as having 2.6T tokens, but the checkpoint graph makes it clear that's also the number of tokens the model was trained on. 13b * 2.6t * 6 = 2.03e23
2.6 trillion tokens, bilingual. paper/model card don't give breakdown between English and Chinese
2275000000000
2.6T tokens, or ~2.3T words assuming that the dataset is roughly even English (0.75 words/token) and Chinese (1 word/token) 1.3T Chinese tokens * (1 word/token) = 1.3T Chinese words 1.3T English tokens * (0.75 words/token) = 0.975T English words total: 2.275T, or ~2.3T
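A minimal sketch of the token-to-word conversion in the note above (the 0.75 words/token for English, 1 word/token for Chinese, and the even split are the note's assumptions):

chinese_tokens = 1.3e12
english_tokens = 1.3e12
words = chinese_tokens * 1.0 + english_tokens * 0.75
print(f"{words:.3e}")  # ~2.275e+12 words, i.e. ~2.3T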
Confident
Open weights (restricted use)
China
1024
Unreleased
Baichuan community license, restrictive commercial: https://huggingface.co/baichuan-inc/Baichuan2-13B-Base
Industry
AlphaGo Master
Games
Go
DeepMind
D Silver, J Schrittwieser, K Simonyan, I Antonoglou
2017-10-19
Mastering the game of Go without human knowledge
https://www.nature.com/articles/nature24270
8795.00
2.0001000000000102e+23
This is a guess. There was no single journal publication accompanying this model that gave information about architecture, training time, etc. All I could find was that it has the same architecture as AlphaGo Zero, and that it had roughly the same power consumption as AGZ. See for instance: https://deepmind.com/blog/article/alphago-zero-starting-scratch Since AGZ reaches the ELO of AlphaGo Master in about 25-30 days (60-75% of the total training time), I estimate the compute to be aro
Google TPU v1
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on rein
Unreleased
United Kingdom of Great Britain and Northern Ireland
Unreleased
Industry
ViT-22B
Vision
Object detection
Image classification
Google
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodk
2023-02-10
Scaling Vision Transformers to 22 Billion Parameters
https://arxiv.org/abs/2302.05442v1
428.00
21743000000.00
21.743B, Table 1
1.93248e+23
"ViT-22B was trained using 256 visual tokens per image, where each token represents a 14 × 14 patch extracted from 224 × 224 sized images. ViT-22B is trained for 177k steps with batch size of 65k: approximately 3 epochs" "ViT-22B was trained on 1024 TPU V4 chips for 177K steps" "Using these techniques, ViT-22B processes 1.15k tokens per second per core during training (forward and backward pass) on TPUv4 [...] ViT-22B’s model flops utilization (MFU) is 54.9%" 256 * 177k * 65k = 2.945T tokens
JFT-4B
"Dataset. ViT-22B is trained on a version of JFT (Sun et al., 2017), extended to around 4B images (Zhai et al., 2022a). These images have been semi-automatically annotated with a class-hierarchy of 30k labels"
4000000000
"Dataset. ViT-22B is trained on a version of JFT (Sun et al., 2017), extended to around 4B images (Zhai et al., 2022a). These images have been semi-automatically annotated with a class-hierarchy of 30k labels"
Google TPU v4
Confident
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-2
Unreleased
United States of America
1024
Unreleased
don't see it here: https://github.com/google-research/vision_transformer?tab=readme-ov-file#available-vit-models
Industry
MPT-30B
Language
Language generation
Code generation
MosaicML
2023-06-22
https://huggingface.co/mosaicml/mpt-30b
30000000000.00
30B
1.8900000000001e+23
According to their blog post, "MPT-30B FLOPs ~= 6 * 30e9 [params] * 1.05e12 [tokens] = 1.89e23 FLOPs"
mC4
C4
RedPajama
The Stack
https://www.databricks.com/sites/default/files/inline-images/open-source-foundations-models-1.png
1050000000000
~4T tokens across sources, but only trained on 1.05T of these
NVIDIA H100 SXM5 80GB
Confident
MPT-30B is a decoder-style transformer pretrained from scratch on 1T tokens of English text and code. This model was trained by MosaicML. MPT-30B is part of the family of Mosaic Pretrained Transformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference. MPT-30B comes with special features that differentiate it from other LLMs, including an 8k token context window (which can be further extended via finetuning; see MPT-7B-StoryWriter), suppo
Open weights (unrestricted)
United States of America
512
Open source
apache 2.0 for weights. pretrain code here: https://github.com/mosaicml/llm-foundry/tree/main/scripts/train/yamls/pretrain
Industry
Mamba2-Hybrid
Language
Language modeling/generation
Question answering
NVIDIA
Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, Bryan Catanzaro
2024-06-12
An Empirical Study of Mamba-based Language Models
https://arxiv.org/abs/2406.07887
8660000000.00
Table 6
1.8186e+23
6ND = 6*8660000000.00 parameters * 3500000000000 tokens = 1.8186 × 10^23
Unspecified unreleased
"We train the models discussed in this report on 1.1T and 3.5T token datasets. Both datasets are predecessors of the dataset used to train Nemotron-4 and are comprised of 70% English, 15% non-English, and 15% code."
3500000000000
NVIDIA H100 SXM5 80GB
Likely
Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic computational complexity with sequence length and large inference-time memory requirements from the key-value cache. Moreover, recent studies have shown that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative. In a controlled setting (e.g., same data), however, studies so far have only presented small scale experiments
Open weights (unrestricted)
United States of America
1024
Open source
https://github.com/NVIDIA/Megatron-LM/tree/ssm/examples/mamba Apache 2.0 train script: https://github.com/NVIDIA/Megatron-LM/blob/ssm/examples/mamba/train.sh
Industry
Nemotron-3-8B
Language
Chat
Language generation
Language modeling/generation
Translation
Code generation
Question answering
NVIDIA
2023-11-15
NVIDIA AI Foundation Models: Build Custom Enterprise Chatbots and Co-Pilots with Production-Ready LLMs
https://developer.nvidia.com/blog/nvidia-ai-foundation-models-build-custom-enterprise-chatbots-and-co-pilots-with-production-ready-llms/ https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/nemotron-3-8b-base-4k
8000000000.00
1.8e+23
https://huggingface.co/nvidia/nemotron-3-8b-base-4k "This model was trained on a dataset containing 3.8 Trillion tokens of text" 8 billion * 3.8 trillion * 6 = 1.8e23 Also, using the hardware method: "1,024 A100s were used for 19 days to train the model." 19*1024 * 312 trillion * 24 * 3600 * 0.3 = 1.57e23
Unspecified unreleased
Flan
P3 (Public Pool of Prompts)
"NVIDIA models are trained on a diverse set of public and proprietary datasets. This model was trained on a dataset containing 3.8 Trillion tokens of text. The dataset contains 53 different human languages (including English, German, Russian, Spanish, French, Japanese, Chinese, Italian, and Dutch) and 37 programming languages. The model also uses the training subsets of downstream academic benchmarks from sources like FLANv2, P3, and NaturalInstructions v2"
3800000000000
NVIDIA A100
Confident
Large language models (LLMs) are revolutionizing data science, enabling advanced capabilities in natural language understanding, AI, and machine learning. Custom LLMs, tailored for domain-specific insights, are finding increased traction in enterprise applications. The NVIDIA Nemotron-3 8B family of foundation models is a powerful new tool for building production-ready generative AI applications for the enterprise–fostering innovations ranging from customer service AI chatbots to cutting-edge A
Open weights (restricted use)
United States of America
1024
can't use to train other models: https://developer.download.nvidia.com/ai-foundation-models/nvidia-ai-foundation-models-license-10Nov2023.pdf
Industry
Granite 3.0 2B
Language
Language modeling/generation
Question answering
Translation
Text summarization
Text classification
Code generation
IBM
Granite Team IBM
2024-10-21
Granite 3.0 Language Models
https://github.com/ibm-granite/granite-3.0-language-models/tree/main
2500000000.00
2.5B
1.8e+23
6ND = 6*2.5*10^9*12*10^12 = 1.8e+23 "All our Granite 3.0 models are trained using a compute budget of 8.35 × 10^23 FLOPS." 8.35 × 10^23 * 174.6 (this model's power consumption) / (174.6+757.0+64.5+121.2) = 1.304851e+23 Hardware estimation: 192030*3600*989500000000000*0.3 = 2.0521478e+23
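A minimal sketch of the apportionment in the note above: the reported family-wide compute budget is split across the Granite 3.0 variants in proportion to their reported power consumption. The four figures come from the note; which variant each of the other three figures belongs to is not stated here:

total_budget = 8.35e23                    # reported compute budget for all Granite 3.0 models, FLOP
this_model = 174.6                        # power-consumption figure for this model (from the note)
all_models = [174.6, 757.0, 64.5, 121.2]  # figures for the four models (from the note)
compute = total_budget * this_model / sum(all_models)
print(f"{compute:.6e}")                   # ~1.304851e+23 FLOP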
Unspecified unreleased
Granite 3.0 language models are trained using data from various sources such as unstructured natural language text and code data from the Web curated by IBM, a collection of synthetic datasets generated by IBM, and publicly available high-quality datasets with permissible licenses.
12000000000000
12T tokens
NVIDIA H100 SXM5 80GB
Confident
This report presents Granite 3.0, a new set of lightweight, state-of-the-art, open foundation models ranging in scale from 400 million to 8 billion active parameters. Equipped with native support of multilingual, coding, function calling, and strong safety performance, these models target enterprise use cases, including on-premise and on-device settings. Evaluations on a comprehensive set of tasks demonstrate that our models consistently reach state-of-the-art performance for their size (as sho
Open weights (unrestricted)
United States of America
768
Unreleased
Apache 2.0 license https://huggingface.co/ibm-granite/granite-3.0-2b-instruct
Industry
OLMo 2 Furious 7B
Language
Language modeling/generation
Question answering
Allen Institute for AI
University of Washington
New York University (NYU)
Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg,
2024-12-31
2 OLMo 2 Furious
https://arxiv.org/abs/2501.00656
7000000000.00
7B
1.8e+23
1.8*10^23 FLOPs (Table 6 - developers calculated using 6ND formula)
OLMo-Mix-1124
Dolmino-Mix-1124
Tulu 3
4000000000000
Pretraining Stage 1 (OLMo-Mix-1124): 4 trillion tokens (= 1 epoch). Pretraining Stage 2 (Dolmino-Mix-1124): 50B tokens (3 runs, merged). Post-training (Tulu 3 SFT OLMo mix): SFT + DPO + PPO (preference mix).
NVIDIA H100 SXM5 80GB
Confident
We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes dense autoregressive models with improved architecture and training recipe, pretraining data mixtures, and instruction tuning recipes. Our modified model architecture and training recipe achieve both better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities
Open weights (unrestricted)
United States of America
United States of America
United States of America
Open source
apache 2 https://huggingface.co/allenai/OLMo-2-1124-7B https://github.com/allenai/OLMo
Research collective
Academia
Academia
Yuan 2.0
Language
Language modeling/generation
Translation
Code generation
Inspur
Shaohua Wu, Xudong Zhao, Shenling Wang, Jiangang Luo, Lingjun Li, Xi Chen, Bing Zhao, Wei Wang, Tong Yu, Rongguo Zhang, Jiahua Zhang, Chao Wang
2023-11-27
YUAN 2.0: A Large Language Model with Localized Filtering-based Attention
https://arxiv.org/abs/2311.15786v1
102600000000.00
102.6 billion
1.78e+23
Trained on 288B tokens 6*102.6b*288b = 1.78e23
"The pretraining corpus includes a mix of books, codes, and encyclopedia in both Chinese and English (Table 2)" with synthetic code data: "Code (CN). Considering the diversity of programming tasks, we also build a synthesized instruction dataset with 4 million code samples in Chinese. To cover the concepts involved in programming tasks as many as possible, we collect 15,000 words of programming, computer science, mathematics, and other relevant topics from the Sogou input dictionary. Two topic
288000000000
Most likely the 288B tokens do not represent multiple epochs. As a sense check, Table 2 appears to indicate that 5.73% of pre-training tokens come from synthetically generated text output by GPT-3.5. If the full training corpus is 288B tokens, this would imply ~$24k in API costs at $1.50/1M tokens to generate the data, which seems plausible.
Confident
In this work, the Localized Filtering-based Attention (LFA) is introduced to incorporate prior knowledge of local dependencies of natural language into Attention. Based on LFA, we develop and release Yuan 2.0, a large language model with parameters ranging from 2.1 billion to 102.6 billion. A data filtering and generation method is presented to build pretraining and fine-tuning dataset in high quality. A distributed training method with non-uniform pipeline parallel, data parallel, and optimizer
Open weights (restricted use)
China
commercial ok, but nothing that "may cause harm to the country and society, or for any services that have not undergone security assessment and filing" https://huggingface.co/IEITYuan/Yuan2-102B-hf
Industry
InternVL
Vision
Language
Visual question answering
Image classification
Image captioning
Shanghai AI Lab
Nanjing University
The University of Hong Kong
Tsinghua University
SenseTime
University of Science and Technology of China
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai
2024-01-15
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
https://arxiv.org/abs/2312.14238
14000000000.00
14B
1.744956e+23
Trainable / total parameters: stage 1: 13B / 13B; stage 2: 1B / 14B. Training tokens: stage 1: (28.7-0.5)*0.5*(196/16)^2 + 0.5*(224/16)^2 = 2213B; stage 2: 1.6*(224/16)^2 = 313.6B. Compute: 6*13B*2213B + 6*1B*313.6B = 174495.6*10^18 = 1.744956e+23
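A minimal Python sketch of the staged estimate in the note above: visual tokens per image follow the note's (resolution/16)^2 convention, stage 1 masks 50% of image tokens for most samples, and the two stages use different trainable parameter counts:

tok_196 = (196 / 16) ** 2   # ~150 visual tokens per image at 196x196 (note's convention)
tok_224 = (224 / 16) ** 2   # 196 visual tokens per image at 224x224
# Stage 1: 28.2B samples at 196px with 50% of tokens masked, then 0.5B samples at 224px unmasked
stage1_tokens = (28.7e9 - 0.5e9) * 0.5 * tok_196 + 0.5e9 * tok_224
# Stage 2: 1.6B samples at 224px
stage2_tokens = 1.6e9 * tok_224
# 13B trainable parameters in stage 1, 1B in stage 2
compute = 6 * 13e9 * stage1_tokens + 6 * 1e9 * stage2_tokens
print(f"{compute:.4e}")     # ~1.75e+23 FLOP (the note rounds stage-1 tokens to 2213B, giving 1.744956e+23)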
LAION-COCO
COYO-700M
SBU
Conceptual Captions (CC3M)
Conceptual Captions 12M (CC12M)
Wukong
LAION
2527000000000
Stage 1 " The training involves a total batch size of 164K across 640 A100 GPUs, extending over 175K iterations to process about 28.7 billion samples. To enhance efficiency, we initially train at a 196×196 resolution, masking 50% of image tokens [87], and later switch to 224×224 resolution without masking for the final 0.5 billion samples." Stage 2 1.6B samples "The input images are processed at a resolution of 224×224. For optimization, the AdamW optimizer [98] is employed with β1 = 0.9, β2
NVIDIA A100 SXM4 80 GB
Likely
The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data fr
Open weights (unrestricted)
China
China
Hong Kong
China
China
Hong Kong
China
China
InternViT-6B
LLaMA-7B
640
https://huggingface.co/OpenGVLab/InternVL-14B-224px MIT license
Academia
Academia
Academia
Academia
Industry
Academia
Llama 3.2 3B
Language
Language modeling/generation
Text summarization
Question answering
Quantitative reasoning
Translation
Meta AI
2024-09-24
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/
3210000000.00
https://huggingface.co/meta-llama/Llama-3.2-1B
1.7334e+23
6ND = 6*3210000000.00*9000000000000 = 1.7334e+23. Hardware method: 460000 hours * 3600 s * 133800000000000 FLOP/s * 0.3 = 6.647184e+22
Unspecified unreleased
9000000000000
"Llama 3.2 was pretrained on up to 9 trillion tokens of data from publicly available sources."
NVIDIA H100 SXM5 80GB
Confident
Today, we’re releasing Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B), and lightweight, text-only models (1B and 3B) that fit onto edge and mobile devices, including pre-trained and instruction-tuned versions. The Llama 3.2 1B and 3B models support context length of 128K tokens and are state-of-the-art in their class for on-device use cases like summarization, instruction following, and rewriting tasks running locally at the edge. These models are enabled on day one f
Open weights (restricted use)
United States of America
Unreleased
LLAMA 3.2 COMMUNITY LICENSE AGREEMENT https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE
Industry
Konan LLM 41B
Language
Vision
Language modeling/generation
Konan Technology
Yang Seung-hyun, Wiretin, Changmin, Kim Jong-tae
2023-12-15
Konan LLM: A Korean Large Language Model
https://en.konantech.com/en/llm/konanllm https://techfinch.kr/ai/konan-technology-unveils-konan-llm--its-own-ai-language-model https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE11610127
41000000000.00
1.722e+23
=41000000000*700000000000*6=1.722 × 10^23
Unspecified unreleased
700000000000
https://www.konantech.com/pr/press?number=2628&pn=1&stype2=&sfi=subj&sword= Since 2007, over 20.5 billion pieces of data have been independently collected via the real-time AI analysis service pulseK. Of these, only 2 billion high-quality, large-scale data points were used for training.
Likely
Konan LLM is a large language model developed in-house by Konan Technology. Optimized for super-large AI training, it leverages high-quality, large-scale data and over 20 years of expertise in natural language processing. Konan LLM supports all corporate documentation and creative tasks, leading the way in workplace innovation.
Hosted access (no API)
Korea (Republic of)
Unreleased
Industry
Ovis-7B
Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Han-Jia Ye
2024-06-17
Ovis: Structural Embedding Alignment for Multimodal Large Language Model
https://arxiv.org/abs/2405.20797
12.00
7000000000.00
1.7e+23
Fine-tune: 989500000000000*128*0.3*15*60*60 = 2.0518272e+21. Qwen1.5-7B pretraining FLOP: 1.68e+23. Total: 1.7e+23
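A minimal sketch of the note above: for a finetuned model, the recorded training compute is the base model's pretraining compute plus a hardware-based estimate of the finetune itself (the H100 peak FLOP/s, 128 GPUs, 0.3 utilization, and 15 hours are the note's figures):

finetune = 989.5e12 * 128 * 0.3 * 15 * 3600   # ~2.05e+21 FLOP for the finetune
base = 1.68e23                                # estimated Qwen1.5-7B pretraining compute
total = finetune + base
print(f"{total:.2e}")                         # ~1.70e+23 FLOP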
15M datapoints (image-text pairs)
NVIDIA H100 SXM5 80GB
Unverified
We propose Ovis, a novel MLLM architecture designed to structurally align visual and textual embeddings. Ovis integrates an additional learnable visual embedding table into the visual encoder’s process. To capture rich visual semantics, each image patch indexes the visual embedding table multiple times, resulting in a final visual embedding that is a probabilistic combination of the indexed embeddings.
Qwen1.5-7B
128
PaLI
Language
Vision
Multimodal
Visual question answering
Language modeling/generation
Google
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
2022-09-14
PaLI: A Jointly-Scaled Multilingual Language-Image Model
https://arxiv.org/abs/2209.06794v4
567.00
16900000000.00
3.9b Image Encoder, 14b Multimodal Encoder-Decoder
1.69e+23
Pre-training the ViT component involved 1.1 million steps (they train over 1M steps but run the last 100k twice and then average the two resulting models). Batch size is 16384 and the inputs are 224x224. Table 8 indicates a forward pass with ViT-e/14 on a 224 image takes 1980 GFLOPs, so total training compute for the ViT-e/14 model is: 1980e9 * 16384 * 1.1 million * 3 (account for backward passes) = 1.07e23 In the "overall model" section, they then say: "The largest model, PaLI-17B, is pretraine
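The ViT part of the note above estimates compute from a measured per-example forward-pass cost rather than from parameter counts: forward FLOPs × batch size × steps × 3, where the factor 3 approximates forward plus backward passes. A minimal sketch:

forward_flops = 1980e9   # FLOPs per ViT-e/14 forward pass on a 224x224 image (Table 8)
batch_size = 16384
steps = 1.1e6
compute = forward_flops * batch_size * steps * 3   # x3 ~ forward + backward
print(f"{compute:.3e}")                            # ~1.07e+23 FLOP for the ViT component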
WebLI
"we introduce WebLI, a multilingual imagelanguage dataset built from images and texts available on the public web... Due to the abundance of multilingual content on the internet, the collection process for the WebLI dataset can be scaled to cover 10 billion images and 12 billion alt-texts. In addition to annotation with web text, we use publicly available automatic service to extract OCR annotations on all images, resulting in 29 billion image-OCR pairs. To balance quality and retain scale, we f
1600000000
"During training, the model passes over 1.6B images, one epoch over the entire pretraining dataset"
Google TPU v4
Likely
Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs).
Unreleased
United States of America
1024
Unreleased
Industry
Qwen1.5-7B
Language
Chat
Language modeling/generation
Quantitative reasoning
Code generation
Translation
Alibaba
Qwen Team
2024-02-04
Introducing Qwen1.5
https://huggingface.co/Qwen/Qwen1.5-7B
7000000000.00
7B
1.68e+23
6*7*10^9*4*10^12 = 1.68e+23
Unspecified unreleased
"We pretrained the models with a large amount of data, and we post-trained the models with both supervised finetuning and direct preference optimization."
4000000000000
4 trillion tokens from this response https://github.com/QwenLM/Qwen2/issues/97
Confident
Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. In comparison with the previous released Qwen, the improvements include: 8 model sizes, including 0.5B, 1.8B, 4B, 7B, 14B, 32B and 72B dense models, and an MoE model of 14B with 2.7B activated; Significant performance improvement in human preference for chat models; Multilingual support of both base and chat models; Stable support of 32K context length for models of all si
Open weights (unrestricted)
China
Unreleased
https://huggingface.co/Qwen/Qwen1.5-7B
Industry
Jiutian
Language
Language modeling/generation
China Mobile
2023-10-12
https://www.globaltimes.cn/page/202310/1299716.shtml
13900000000.00
A 13.9B parameter model is mentioned prominently at https://jiutian.10086.cn/portal/#/home (accessed 2025-01-13).
1.668e+23
6*13.9e9*2e12=1.668e23
2000000000000
"Designed to enhance efficiency, the model has trained over 2 trillion tokens"
Likely
China Mobile, the largest telecom operator in the world by subscribers, unveiled its "Jiutian" artificial intelligence (AI) large-scale model on Thursday, which has reportedly won support from large enterprises including China Ocean Shipping (Group) Co and China Railway Construction Co.
China
Industry
Qwen2.5-1.5B
Language
Language modeling/generation
Question answering
Quantitative reasoning
Alibaba
Qwen Team
2024-09-19
Qwen2.5-LLM: Extending the boundary of LLMs
https://qwenlm.github.io/blog/qwen2.5-llm/
1540000000.00
1.54B
1.6632e+23
Training dataset size was 18 trillion tokens. 6ND = 6 * 1.54 billion parameters * 18 trillion tokens = 1.6632e+23
Unspecified unreleased
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens."
18000000000000
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens"
Confident
In the past three months since Qwen2’s release, numerous developers have built new models on the Qwen2 language models, providing us with valuable feedback. During this period, we have focused on creating smarter and more knowledgeable language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5. We are announcing what might be the largest opensource release in history! Let’s get the party started!
Open weights (unrestricted)
China
Unreleased
Apache 2.0
Industry
AlphaCode
Language
Code generation
DeepMind
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, Oriol Vinyals
2022-02-02
Competition-Level Code Generation with AlphaCode
https://arxiv.org/abs/2203.07814
1013.00
41100000000.00
41.1B. Table 3
1.63944e+23
Figure 7 (a) shows a maximum training compute budget of approx 23000 TPU-days per model. 23000 days * 24 h/day * 3600 sec/h * 2.75e14 FLOP/s * 0.3 utilization = 1.64e23 FLOP
CodeContests
Unspecified unreleased
Looks like evaluation data is released but not pretraining data: "We use large transformer language models to generate code, pre-training them on selected GitHub code and fine-tuning on our curated set of competitive programming problems... A core part of developing our system was ensuring that submissions are rigorously evaluated and that evaluation problems are truly unseen during training, so difficult problems cannot be solved by copying from the training set. Towards this goal, we release
Appendix A of the paper gives details on the pretraining data.
Google TPU v4
Programming is a powerful and ubiquitous problem-solving tool. Developing systems that can assist programmers or even generate programs independently could make programming more productive and accessible, yet so far incorporating innovations in AI has proven challenging. Recent large-scale language models have demonstrated an impressive ability to generate code, and are now able to complete simple programming tasks. However, these models still perform poorly when evaluated on more complex, unsee
Unreleased
United Kingdom of Great Britain and Northern Ireland
3750
Unreleased
Industry
FinGPT-13B
Language
Named entity recognition
Sentiment classification
Language modeling/generation
University of California Los Angeles (UCLA)
Columbia University
New York University (NYU)
Neng Wang, Hongyang Yang, Christina Dan Wang
2023-10-07
FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets
https://arxiv.org/abs/2310.04793; https://github.com/AI4Finance-Foundation/FinGPT
33.00
13000000000.00
Finetunes using LoRA, so only trains 3.67 million parameters
1.6e+23
From Llama 2-13B
Financial sentiment data (for fine-tuning): https://huggingface.co/datasets/FinGPT/fingpt-sentiment-train
NVIDIA GeForce RTX 3090
Likely
In the swiftly expanding domain of Natural Language Processing (NLP), the potential of GPT-based models for the financial sector is increasingly evident. However, the integration of these models with financial datasets presents challenges, notably in determining their adeptness and relevance. This paper introduces a distinctive approach anchored in the Instruction Tuning paradigm for open-source large language models, specifically adapted for financial contexts. Through this methodology, we capi
Open weights (unrestricted)
United States of America
United States of America
United States of America
Llama 2-13B
653248800B
Fine-tuned Llama 2-13B on an RTX 3090 for 17 hours, at a cost of $17. 35.5 trillion FLOP/s * 17 * 3600 * 0.3 = 6.532488e+17
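A minimal sketch of the single-GPU LoRA finetune estimate above (the ~35.5 TFLOP/s RTX 3090 throughput and 0.3 utilization are the note's assumptions); the Llama 2-13B pretraining compute dominates the recorded total:

peak = 35.5e12           # assumed RTX 3090 FLOP/s from the note
hours = 17
utilization = 0.3
finetune = peak * hours * 3600 * utilization     # ~6.5e+17 FLOP
base = 1.6e23                                    # Llama 2-13B pretraining compute
print(f"{finetune:.2e}, {finetune / base:.1e}")  # finetune is ~4e-6 of the base model's compute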
1
Open source
MIT license (though probably subject to Llama 2 license too) https://github.com/AI4Finance-Foundation/FinGPT/blob/master/LICENSE train code: https://github.com/AI4Finance-Foundation/FinGPT/blob/master/fingpt/FinGPT_Benchmark/train.sh
Academia
Academia
Academia
Llama 2-13B
Language
Language modeling
Meta AI
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Mar
2023-07-18
Llama 2: Open Foundation and Fine-Tuned Chat Models
https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/ https://arxiv.org/abs/2307.09288
8056.00
13000000000.00
Llama 2 has been released in 7B, 13B, and 70B variants.
1.6e+23
13 billion * 2 trillion * 6 = 1.6e23
Llama 2 dataset
2 trillion tokens of publicly available text, with no text from Meta's products. "Our training corpus includes a new mix of data from publicly available sources, which does not include data from Meta’s products or services. We made an effort to remove data from certain sites known to contain a high volume of personal information about private individuals. We trained on 2 trillion tokens of data as this provides a good performance–cost trade-off, up-sampling the most factual sources in an effort
2000000000000
2 trillion tokens ~= 1.5 trillion words
NVIDIA A100 SXM4 80 GB
Confident
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closedsource models. We provide a detailed description of our approach to
Open weights (restricted use)
United States of America
Unreleased
Llama 2 license. can't use outputs to train models. https://github.com/meta-llama/llama/blob/main/LICENSE
Industry
Llama Guard
Language
Chat
Meta AI
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Davide Testuggine, Madian Khabsa
2023-12-07
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
https://arxiv.org/abs/2312.06674
201.00
7000000000.00
7B
1.6e+23
1.7e17 finetune compute, plus Llama 2-13B pretrain compute (1.6e+23)
Dataset of prompt-response pairs of human-AI conversations "We leverage the human preference data about harmlessness from Anthropic (Ganguli et al., 2022). From this dataset, we pick the first human prompt and discard the corresponding response from the assistant, as well as all the other turns to create an initial single-turn prompt dataset. Next, we use one of our internal Llama checkpoints to generate a mix of cooperating and refusing responses for these prompts. We employ our expert, in-hou
4096000
14k prompt-response pairs. Based on training details it's trained on ~4M tokens, which is stated to be ~1 epoch: 2 * 4096 * 500 = 4,096,000 (batch size) * (sequence length) * (steps)
NVIDIA A100 SXM4 80 GB
Confident
We introduce Llama Guard, an LLM-based input-output safeguard model geared towards Human-AI conversation use cases. Our model incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks found in LLM prompts (i.e., prompt classification). This taxonomy is also instrumental in classifying the responses generated by LLMs to these prompts, a process we refer to as response classification. For the purpose of both prompt and response classification, we have met
Open weights (restricted use)
United States of America
Llama 2-7B
170000000B
"We train on a single machine with 8xA100 80GB GPUs using a batch size of 2, with sequence length of 4096, using model parallelism of 1 and a learning rate of 2 × 10−6. We train for 500 steps, which corresponds to ∼1 epoch over our training set." 6 * 2*4096*500 * 7 billion = 1.7e17
Unreleased
Llama 2 license https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard
Industry
SparseOPT-175B
Language
Language modeling/generation
Question answering
Institute of Science and Technology Austria (ISTA)
Neural Magic
Elias Frantar, Dan Alistarh
2023-01-02
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
https://arxiv.org/abs/2301.00774
406.00
87500000000.00
1.58e+23
This is a one-shot pruned version of OPT (no retraining); see the OPT dataset.
NVIDIA A100 SXM4 80 GB
We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. We can execute SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, in under 4.5 hours, and can reach 60% unstructured spar
Unreleased
Austria
United States of America
OPT-175B
312000000000000 FLOP/sec [A100 with assumed bf16 precision] * 1 GPU * 4 hours * 3600 sec/hour * 0.3 [assumed utilization] = 1.34784e+18 FLOP OPT-175 estimated compute: 4.3e+23 FLOP
1
Open source
code is Apache 2.0 (but OPT, which you'd need to recreate this model, is non-commercial) https://github.com/IST-DASLab/sparsegpt/blob/master/opt.py
Academia
Industry
StarCoder 2 7B
Language
Code generation
Code autocompletion
Hugging Face
ServiceNow
NVIDIA
BigCode
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muenn
2024-02-29
StarCoder 2 and The Stack v2: The Next Generation
https://arxiv.org/abs/2402.19173
7000000000.00
7B
1.55e+23
estimation is given in Table 6
The Stack v2
See Table 4. The Stack V2 plus some extras. Created from repositories on GitHub with permissive licences.
658580000000
from Table 4
NVIDIA H100 SXM5 80GB
Confident
The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a
Open weights (restricted use)
Multinational
United States of America
United States of America
United States of America
https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement
Industry
Industry
Industry
Multi-Token Prediction 13B
Language
Code generation
Facebook AI Research
Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve
2024-04-30
Better & Faster Large Language Models via Multi-token Prediction
https://arxiv.org/abs/2404.19737
13000000000.00
13B (Figure 1)
1.5364368e+23
"training all models reported in the paper required around 500K GPU hours of computation on hardware of type A100-80GB and H100." A100-80 GB peak FLOP/s [assumed fp16 precision]: 77970000000000 H100 peak FLOP/s [assumed SXM5 TensorCore]: 989000000000000 assuming 50/50 usage: (77970000000000+989000000000000)*0.5*500000hours*3600s*0.3=2.880819e+23 for ALL models in the paper assuming this model has taken around 16% of all used compute (https://docs.google.com/spreadsheets/d/1Yc-HAdYgn6e9SUIliMaQ
209700000000
209.7B (Table S13)
NVIDIA A100 SXM4 80 GB
NVIDIA H100 SXM5 80GB
Likely
Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved do
Unreleased
United States of America
Unreleased
Industry
Hunyuan Video
Video
Video generation
Tencent
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, Weiyan Wang, Wenqing Yu, Xinchi Deng, Yang Li, Yi Chen, Yutao Cui, Yuanbo Peng, Zhentao Yu, Zhiyu He, Zhiyong Xu, Zixiang Zhou, Zunnan Xu, Yang
2024-12-03
HunyuanVideo: A Systematic Framework For Large Video Generative Models
https://www.arxiv.org/abs/2412.03603
13000000000.00
13b
1.4814814999999999e+23
from Figure 10: the optimal model has 13b parameters, 5.8e+07PF (image training) + 7.0e+07PF (video training) of compute and 740B (image tokens) + 928B (video tokens) 5.8e+07PF + 7.0e+07PF = 12.8e+07PF = 12.8*10^7*10^20/(24*3600) = 1.4814815e+23 FLOPs 6ND = 6*13*10^9*(740+928)*10^9 = 1.30104e+23
Unspecified unreleased
"We employ various filters for data filtering and progressively increase their thresholds to build 4 training datasets, i.e., 256p, 360p, 540p, and 720p, while the final SFT dataset is built through manual annotation."
Confident
Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models
Open weights (restricted use)
China
Unreleased
"THIS LICENSE AGREEMENT DOES NOT APPLY IN THE EUROPEAN UNION, UNITED KINGDOM AND SOUTH KOREA" also requires additional licensing in case of massive commercial use https://huggingface.co/tencent/HunyuanVideo/blob/main/LICENSE the code seems to be just inference code not training code
Industry
HyperCLOVA 82B
Language
Language modeling/generation
Chat
Translation
Text classification
NAVER
Search Solutions
Boseop Kim, HyoungSeok Kim, Sang-Woo Lee, Gichang Lee, Donghyun Kwak, Dong Hyeon Jeon, Sunghyun Park, Sungju Kim, Seonhoon Kim, Dongpil Seo, Heungsub Lee, Minyoung Jeong, Sungjae Lee, Minsub Kim, Suk Hyun Ko, Seokhun Kim, Taeyong Park, Jinuk Kim, Soyoung Kang, Na-Hyeon Ryu, Kang Min Yoo, Minsuk Chang, Soobin Suh, Sookyo In, Jinseong Park, Kyungduk Kim, Hiun Kim, Jisu Jeong, Yong Goo Yeo, Donghoon Ham, Dongju Park, Min Young Lee, Jaewook Kang, Inho Kang, Jung-Woo Ha, Woomyoung Park, Nako Sung
2021-09-10
What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers
https://arxiv.org/abs/2109.04650
92.00
82000000000.00
"We introduce a Korean in-context large-scale LM with 82B parameters, i.e., HyperCLOVA. This is the first discovery on near 100B-scale non-English LM." According to media reports, HyperCLOVA has 204B parameters (i.e. a different version than in the paper) https://m.koreaherald.com/view.php?ud=20210525000824
1.476e+23
"For experiments in Section 4, the model trained with 150B is used for fair comparison, because not all models are finished training at the same iteration. However, experiments in Section 5.2 use the model trained with 300B tokens, as HyperCLOVA Studio provided the 39B and 82B models trained with 300B tokens." 82e9 connections * 2 FLOP/connection * 300e9 tokens * 3 backward pass = 1.476e23 FLOP Calculation using GPU time corroborates this: - "Our model is based on megatron-LM (Shoeybi et al.,
Unspecified unreleased
Blog corpus: 273.6 billion tokens; Cafe corpus (online community): 83.3 billion tokens; News corpus: 73.8 billion tokens; Comments (crawled from various platforms): 41.1 billion tokens; KiN (Korean QnA website): 27.3 billion tokens; Modu (collection of five datasets): 6.0 billion tokens; WikiEn, WikiJp (foreign Wikipedia): 5.2 billion tokens; Other unspecified sources: 51.5 billion tokens
300000000000
"However, experiments in Section 5.2 use the model trained with 300B tokens, as HyperCLOVA Studio provided the 39B and 82B models trained with 300B tokens." "We introduce HyperCLOVA, a large-scale Korean in-context learning-based LM with nearly 100B parameters, by constructing a large Korean-centric corpus of 560B tokens." Based on tokenizing the Hyperclova article itself using OpenAI's tiktoken BPE tokenizer (https://github.com/openai/tiktoken), there are 3285 tokens for 1069 words - about 3
NVIDIA A100
Confident
GPT-3 shows remarkable in-context learning ability of large-scale language models (LMs) trained on hundreds of billion scale data. Here we address some remaining issues less reported by the GPT-3 paper, such as a non-English LM, the performances of different sized models, and the effect of recently introduced prompt optimization on in-context learning. To achieve this, we introduce HyperCLOVA, a Korean variant of 82B GPT-3 trained on a Korean-centric corpus of 560B tokens. Enhanced by our Korean
API access
Korea (Republic of)
Korea (Republic of)
1024
Unreleased
"We introduce HyperCLOVA Studio, an interactive prompt engineering interface which provides GUI and API interfaces like the OpenAI playground1"
Industry
Industry
Movie Gen Audio
Audio
Audio generation
Meta AI
Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Mitesh Kumar Singh, Peizhao Zhang, Peter Vajda, Quentin Duval, Rohit Girdhar, Roshan Sumbaly, Sai Saketh Rambhatla, Sam Tsai, Samaneh Azadi, Samyak Datta, Sanyuan Chen, Sea
2024-10-04
Movie Gen: A Cast of Media Foundation Models
https://ai.meta.com/static-resource/movie-gen-research-paper
13000000000.00
13B
1.4e+23
Pretrained for 14 days on 384 H100s, assuming a 0.3 utilization rate: 384 * 989.5e12 FLOP/s * 14 days * 86400 s/day * 0.3 ≈ 1.4e+23 FLOP
It was trained on O(1k) hours of audio
NVIDIA H100 SXM5 80GB
Confident
We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user’s image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generatio
Unreleased
United States of America
384
Industry
Yi 6B
Language
Chat
Language modeling/generation
Translation
Code generation
01.AI
Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai
2023-11-02
Yi: Open Foundation Models by 01.AI
https://arxiv.org/abs/2403.04652
6000000000.00
6B
1.26e+23
6*7*10^9*3*10^12 = 1.26e+23
Unspecified unreleased
3100000000000
"language models pretrained from scratch on 3.1T highly-engineered large amount of data, and finetuned on a small but meticulously polished alignment data."
Confident
The Yi series models are large language models trained from scratch by developers at 01.AI.
Open weights (restricted use)
China
Unreleased
llama license https://huggingface.co/01-ai/Yi-6B no training code
Industry
VARCO LLM 2.0 base
Language
Language modeling/generation
Chat
Translation
Question answering
NCSOFT
2023-08-16
VARCO LLM 2.0 is NCSOFT's large language model that can be applied to the development of natural language processing-based AI services.
https://ncsoft.github.io/ncresearch/varco-llm-details/ https://aws.amazon.com/marketplace/pp/prodview-d7amr4yxpibew?sr=0-3&ref_=beagle&applicationId=AWSMPContessa
13000000000.00
1.248e+23
=1600000000000*6*13000000000=1.248×10^23
"Our LLM is trained with datasets that are either publicly available for pretraining, collected from the Internet or internally constructed,” Jehee Lee, CRO of NCSOFT, told Engadget via email.
1600000000000
https://ncsoft.github.io/ncresearch/varco-llm-details/
Likely
VARCO LLM 2.0 is NCSOFT's large language model that can be applied to the development of various natural language processing-based AI services such as text generation, question answering, chatbots, summarization, and information extraction. NCSOFT's VARCO LLM 2.0 was developed with our own technology, including data construction, pre-training, instruction tuning and alignment tuning. We evaluated VARCO LLM 2.0 on various NLP tasks and its performance has significantly improved compared to VARCO
API access
Korea (Republic of)
Industry
Phi-4-Multimodal
Multimodal
Language
Vision
Speech
Language modeling/generation
Question answering
Visual question answering
Speech recognition
Translation
Audio question answering
Character recognition
Microsoft
Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mahmoud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee, Jinyu Li, Yunsheng Li, Chen Liang, Xihui Lin, Zeqi Lin, Mengchen Liu, Yang Liu, Gilsin
2025-03-03
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
https://arxiv.org/abs/2503.01743
5600000000.00
5.6B 1. base: Phi-4 Mini (3.8b parameters) 2. "The audio encoder and projector introduce 460M parameters while LoRA_A consumes another 460M parameters." 3. "The image encoder and projector introduce 440M model parameters while the vision adapter LoRA_V consumes another 370M model parameters."
1.2117239999999998e+23
1.14e+23 (base model training compute) + 7.1724e+21 (finetune compute) = 1.211724e+23
Unspecified unreleased
"The Phi-4-Multimodal model’s pre-training phase involves a rich and varied dataset, encompassing interleaved image-text documents, image-text pairs, image grounding data, synthetic datasets from OCR of PDFs and realistic images, and synthesized datasets for chart comprehension" "For vision-speech data, Phi-4-Multimodal model is trained on a diverse set of synthetic vision-speech data, covering single-frame and multi-frame scenarios. "
1100000000000
"The pre-training process involves a total of 0.5T tokens, combining both visual and textual elements." "To pre-train the adapter and reduce the modality gap between the speech and text sequences, we curate a dataset of approximately 2M hours of anonymized in-house speech-text pairs with strong/weak ASR supervisions, covering the eight supported languages" "Note that the speech token rate is 80ms, indicating 750 tokens for 1-minute audio." 2*10^6 hours * 60 min / hour * 750 tokens = 90B token
Likely
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding dat
API access
United States of America
Phi-4 Mini
7172400000000B
3.8B frozen parameters (Base LM) 1. Vision-Language Training (0.5T tokens) 810M (440M Image Encoder + Projector + 370M LoRA_V) 6ND = 6*0.5*10^12*810*10^6 = 2.43e+21 2. Multimodal SFT (0.3T tokens) 810M 6ND = 6*0.3*10^12*810*10^6 = 1.458e+21 3. Speech Pre-training (2M hours = 90B tokens, see dataset size notes) 460M (Audio Encoder + Projector) 6ND = 6*90*10^9*460*10^6 = 2.484e+20 4. Speech Post-training (100M samples ~ 1.1T tokens, see dataset size notes) 460M (LoRA_A) 6ND = 6*1.1*1
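A minimal sketch of the staged finetune sum in the note above. Stages 1-3 use the figures given in the note; the stage 4 figures (roughly 1.1T speech post-training tokens against the 460M LoRA_A parameters) are inferred from the truncated text and the recorded 7.1724e+21 total, so treat them as an assumption:

stage1 = 6 * 0.5e12 * 810e6    # vision-language training: 0.5T tokens, 810M trainable params
stage2 = 6 * 0.3e12 * 810e6    # multimodal SFT: 0.3T tokens, 810M trainable params
stage3 = 6 * 90e9 * 460e6      # speech pre-training: 90B tokens, 460M trainable params
stage4 = 6 * 1.1e12 * 460e6    # speech post-training: ~1.1T tokens, 460M params (inferred)
total = stage1 + stage2 + stage3 + stage4
print(f"{total:.4e}")          # ~7.1724e+21 FLOP, matching the recorded finetune compute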
Industry
UL2
Language
Language modeling/generation
Google Research
Google Brain
Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler
2022-05-10
Unifying Language Learning Paradigms
https://arxiv.org/abs/2205.05131v1
253.00
20000000000.00
Taken from Directory of LLMs
1.2e+23
Trained on 1T tokens 20B * 1T * 6 = 1.2e23 Second source: Section 5.1 says model was trained on 512 TPUv4 chips, and took slightly over 1 month 512 * 2.75e14 * 31 * 24 * 3600 * 0.3 = 1.13e23
C4
'The model is trained on a total of 1 trillion tokens on C4 (2 million steps).'
1000000000000
1T tokens
Google TPU v4
Confident
Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspecti
Open weights (unrestricted)
Multinational
United States of America
United States of America
512
Apache 2.0
Industry
Industry
PLaMo-13B
Language
Language modeling/generation
Chat
Question answering
Preferred Networks Inc
Preferred Networks, Inc
2023-09-28
PLaMo-13B
https://huggingface.co/pfnet/plamo-13b
13000000000.00
1.17e+23
6ND = 6*13e9*1.5e12 = 1.17e+23 from https://huggingface.co/pfnet/plamo-13b#model-details 480 GPUs * 30 days [assumed, likely less] * 24 hours * 3600 s * 77970000000000 FLOP/s * 0.41 [reported 41% utilization] = 3.9772934e+22
C4
Project Gutenberg
RedPajama
mC4
Wikipedia (ja)
from https://huggingface.co/pfnet/plamo-13b#training-dataset
1500000000000
Trained tokens: 1.5T tokens (English: 1.32T tokens, Japanese: 0.18T tokens) from https://huggingface.co/pfnet/plamo-13b#model-details 0.75*1.32T + 0.18T = 1170000000000 (0.75 words per token for English, 1 for Japanese)
NVIDIA A100 SXM4 40 GB
Confident
Open weights (unrestricted)
Japan
480
Unreleased
Apache 2.0 for weights. Open data
Industry
IDEFICS-80B
Multimodal
Language
Vision
Language modeling
Image captioning
Visual question answering
Hugging Face
Hugo Laurencon, Daniel van Strien, Stas Bekman, Leo Tronchon, Lucile Saulnier, Thomas Wang, Siddharth Karamcheti, Amanpreet Singh, Giada Pistilli, Yacine Jernite, Victor Sanh
2023-08-22
Introducing IDEFICS: An Open Reproduction of State-of-the-Art Visual Language Model
https://huggingface.co/blog/idefics
80000000000.00
IDEFICS... comes in two variants—the base version and the instructed version. Each variant is available at the 9 billion and 80 billion parameter sizes.
1.1593580544e+23
FLOP = 512 * 312e12 * 28*24*3600 * 0.3 = 1.159e23, i.e. (num GPUs) * (peak performance) * (time in seconds) * (assumed utilization rate). "The IDEFICS models were trained on an AWS SageMaker cluster with 8x80GB A100 GPUs nodes and EFA network. IDEFICS-80B took ~28 days of training on 64 nodes (512 GPUs)." https://huggingface.co/HuggingFaceM4/idefics-80b-instruct Trained on 150B text tokens + images; alternatively, 6ND = 6*734000000000*80*10^9 = 3.5232e+23
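A minimal Python sketch of the hardware-time estimate and the alternative 6ND figure quoted above (0.3 utilization is the note's assumption):
# IDEFICS-80B: hardware-time estimate vs. the alternative 6ND figure (sketch).
c_hw = 512 * 312e12 * 28 * 24 * 3600 * 0.3   # ~1.16e23 FLOP (the recorded estimate)
c_6nd = 6 * 734e9 * 80e9                     # ~3.52e23 FLOP (alternative, not used)
print(f"{c_hw:.2e} {c_6nd:.2e}")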
Wikipedia
Public Multimodal Dataset (PMD)
LAION
OBELICS
IDEFICS was trained on a mixture of openly available datasets: Wikipedia, Public Multimodal Dataset, and LAION, as well as a new 115B token dataset called OBELICS that we created. OBELICS consists of 141 million interleaved image-text documents scraped from the web and contains 353 million images.
734000000000
IDEFICS was trained on a mixture of openly available datasets: Wikipedia, Public Multimodal Dataset, and LAION, as well as a new 115B token dataset called OBELICS that we created. OBELICS consists of 141 million interleaved image-text documents scraped from the web and contains 353 million images. See https://huggingface.co/HuggingFaceM4/idefics-80b-instruct 149.6B tokens and 1.582B images in total. Effective batch size (# of tokens): 3.67M; max training steps: 200K. 3.67*10^6 * 200000 = 734000000000
NVIDIA A100
Confident
We are excited to release IDEFICS (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS), an open-access visual language model. IDEFICS is based on Flamingo, a state-of-the-art visual language model initially developed by DeepMind, which has not been released publicly. Similarly to GPT-4, the model accepts arbitrary sequences of image and text inputs and produces text outputs.
Open weights (non-commercial)
Multinational
United States of America
LLaMA-65B
CLIP ViT-H/14 - LAION-2B
512
Llama license (non commercial)
Industry
Phi-4 Mini
Language
Language modeling/generation
Visual question answering
Code generation
Quantitative reasoning
Translation
Microsoft
Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mahmoud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee, Jinyu Li, Yunsheng Li, Chen Liang, Xihui Lin, Zeqi Lin, Mengchen Liu, Yang Liu, Gilsin
2025-03-03
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
https://arxiv.org/abs/2503.01743
3800000000.00
3.8-billion
1.14e+23
6ND = 6 * 3800000000 parameters * 5000000000000 tokens = 1.14e+23
Unspecified unreleased
5000000000000
"With these techniques, we built the 5 trillion pre-training data corpus"
Confident
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding dat
API access
United States of America
Industry
Meena
Language
Text autocompletion
Google Brain
Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, Quoc V. Le
2020-01-28
Towards a Human-like Open-Domain Chatbot
https://arxiv.org/abs/2001.09977
879.00
2600000000.00
"We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token."
1.12e+23
https://arxiv.org/ftp/arxiv/papers/2104/2104.10350.pdf Table 4 In the paper: "We trained our best model for 30 days on a TPUv3 Pod (2,048 TPU cores) on the Meena dataset containing 40B words (or 61B BPE tokens) [...] by the end of training, the model had traversed the full training set 164 times (or epochs) and observed a total of about 10T tokens" Hardware: 30 * 24 * 3600 * (2048/2) * 1.23e14 * 0.3 = 9.794e22 Ops counting: 6 * 10T * 2.6B = 1.56e23 Geometric mean: sqrt(9.79e22*1.56e23) = 1.24e23
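A minimal Python sketch of the hardware-based and ops-counting estimates and their geometric mean, as in the note:
# Meena: hardware-time and 6ND estimates combined via geometric mean (sketch; 0.3 utilization assumed in the note).
import math
c_hw = 30 * 24 * 3600 * (2048 / 2) * 1.23e14 * 0.3   # ~9.79e22 FLOP
c_ops = 6 * 10e12 * 2.6e9                            # ~1.56e23 FLOP
print(f"{math.sqrt(c_hw * c_ops):.2e}")              # ~1.24e23 FLOP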
40000000000
"The final Meena dataset contains 341GB of text (40B words)" Converting from GB to words yields 6.8e10, which is in the same OOM
Google TPU v3
Confident
We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. We also propose a human evaluation metric called Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation. Our experiments show strong correlation between perplexity and SSA. The fact that the best perplexi
Unreleased
United States of America
1024
Unreleased
Industry
WizardCoder-15.5B
Language
Code generation
Microsoft
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Daxin Jiang
2023-06-14
WizardCoder: Empowering Code Large Language Models with Evol-Instruct
https://arxiv.org/abs/2306.08568
449.00
15500000000.00
15.5B
1.12e+23
1.12e23 base compute (StarCoder estimate) + 1.95e19 finetune compute (see below) ~= 1.12e23
synthetic code data: "To construct the training dataset, we initialized it with the 20K instruction-following dataset called Code Alpaca5. We iteratively employ the Evol-Instruct technique on this dataset consisting of 20,000 samples to produce evolved data"
"The evolved dataset consists of approximately 78k samples" Not sure how big the samples are.
Likely
Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction finetuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, Hu
Open weights (restricted use)
United States of America
StarCoder
19503513600B
"The StarCoder [11] serves as our basic foundation model. The evolved dataset consists of approximately 78k samples. To fine-tune the basic models, we employ specific configurations, including a batch size of 512, a sequence length of 2048, 200 fine-tuning steps, 30 warmup steps, a learning rate of 2e-5, a Cosine learning rate scheduler, and fp16 mixed precision." 512*2048*200 = 209,715,200 training tokens 209715200 * 15.5B * 6 = 1.95e19
Open source
commercial, responsible use restrictions: https://github.com/nlpxucan/WizardLM/blob/main/WizardCoder/MODEL_WEIGHTS_LICENSE code is apache: https://github.com/nlpxucan/WizardLM/blob/main/WizardCoder/CODE_LICENSE training code here: https://github.com/nlpxucan/WizardLM/blob/main/WizardCoder/src/train_wizardcoder.py data non-commercial: https://github.com/nlpxucan/WizardLM/blob/main/WizardCoder/DATA_LICENSE
Industry
OPT-66B
Language
Language modeling
Chat
Language modeling/generation
Question answering
Meta AI
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer
2022-06-21
OPT: Open Pre-trained Transformer Language Models
https://arxiv.org/abs/2205.01068
2932.00
66000000000.00
1.100000000001e+23
OPT-66B was trained for 140k steps, using a batch size of 2M tokens (see the OPT baselines logbook and Table 1 in Zhang et al. (2022), respectively), so training took 140e3 * 2e6 * 66e9 * 6 = 1.1e23 FLOP
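A minimal Python sketch of the calculation above:
# OPT-66B: total tokens from steps x batch size, then 6ND (sketch).
tokens = 140e3 * 2e6                 # 2.8e11 tokens
print(f"{6 * 66e9 * tokens:.2e}")    # ~1.1e23 FLOP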
The Pile
BookCorpus (BooksCorpus, Toronto Book Corpus)
CC-Stories
Pushshift Reddit
C.2 Composition section: – BookCorpus (Zhu et al., 2015) consists of more than 10K unpublished books – CC-Stories (Trinh and Le, 2018) contains a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas – The Pile (Gao et al., 2021a) from which the following was included: * Pile-CC * OpenWebText2 * USPTO * Project Gutenberg * OpenSubtitles * Wikipedia * DM Mathematics * HackerNews – Pushshift.io Reddit dataset that was developed in Baumgartner et al. (2020) and proce
180000000000
"Our final corpus contains roughly 180B tokens."
NVIDIA A100 SXM4 80 GB
Confident
Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M t
Open weights (non-commercial)
United States of America
Open source
non-commercial for weights: https://ai.meta.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/ training code (MIT) https://github.com/facebookresearch/metaseq/blob/main/docs/training.md
Industry
Whisper v2
Speech
Speech recognition
OpenAI
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever
2022-12-05
Robust Speech Recognition via Large-Scale Weak Supervision
https://huggingface.co/openai/whisper-large-v2 https://arxiv.org/abs/2212.04356
2240.00
1550000000.00
1550M
1.1e+23
"Compared to the Whisper large model, the large-v2 model is trained for 2.5x more epochs with added regularization for improved performance." We (roughly) estimated Whisper v1 as 4.65e22. 2.5x that is 1.16e23 or ~1.1e23
Unspecified unreleased
"The models are trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. 65% of this data (or 438,000 hours) represents English-language audio and matched English transcripts, roughly 18% (or 126,000 hours) represents non-English audio and English transcripts, while the final 17% (or 117,000 hours) represents non-English audio and the corresponding transcript. This non-English data represents 98 different languages."
9302400000
"When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zeroshot transfer setting without the need for any finetuning." 13,680 words/h (estimate) * 680,000h = 9,302,400,000 words
Likely
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning. Whisper was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. The original code repository can be found here. Compared to the Whisper large model, the large-v2 model is trained
Open weights (unrestricted)
United States of America
Unreleased
Apache 2.0 for weights code for v1 is MIT: https://github.com/openai/whisper
Industry
Code Llama-7B
Language
Code generation
Meta AI
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Ellen Tan, Yossef (Yossi) Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Defossez, Jade Copet, Faisal Azhar, Hugo Touvron, Gabriel Synnaeve, Louis Martin, Nicolas Usunier, Thomas Scialom
2023-08-14
Code Llama: Open Foundation Models for Code
https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/ https://arxiv.org/abs/2308.12950
1297.00
7000000000.00
7B
1.1e+23
2.5e22 finetune compute + 8.4e22 base compute for Llama 2-7B, for ~1.1e23 compute overall Table 26: "In aggregate, training all 12 Code Llama models required 1400K GPU hours of computation on hardware of type A100-80GB" Suggests all versions took a combined 4.7e23 FLOPs: 3.12e14 * 1400000 * 3600 * 0.3 = 4.7e23 Assuming this refers to the finetune compute only, agrees with our finetune estimate if compute is proportional to parameter count: 7 / (7+13+34+70) = 0.056 0.056 * 4.7e23 = 2.65e22
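A minimal Python sketch of the apportionment above (0.3 utilization and compute proportional to parameter count are the note's assumptions):
# Code Llama-7B: apportioning the aggregate 1400K A100-hours across model sizes (sketch).
total = 3.12e14 * 1_400_000 * 3600 * 0.3   # ~4.7e23 FLOP for all 12 Code Llama models
share_7b = 7 / (7 + 13 + 34 + 70)          # ~0.056, assuming compute ~ parameter count
finetune_7b = share_7b * total             # ~2.65e22 FLOP
print(f"{finetune_7b + 8.4e22:.2e}")       # plus Llama 2-7B base compute ~= 1.1e23 FLOP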
Unspecified unreleased
"As shown in Table 1, Code Llama is trained predominantly on a near-deduplicated dataset of publicly available code. We also source 8% of our samples data from natural language datasets related to code. This dataset contains many discussions about code and code snippets included in natural language questions or answers. To help the model retain natural language understanding skills, we also sample a small proportion of our batches from a natural language dataset"
600000000000
Llama 2 used 2T tokens, and "We train Code Llama on 500B additional tokens and Code Llama - Python further on 100B tokens" 2T + 500B + 100B = 2600000000000
NVIDIA A100 SXM4 80 GB
Confident
We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters
Open weights (restricted use)
United States of America
Llama 2-7B
25000000000000B
Code Llama-base is trained from Llama 2 with 500B tokens: "We train Code Llama on 500B tokens during the initial phase, starting from the 7B, 13B, and 34B versions of Llama 2" Code Llama-Python required an additional 100B tokens in fine-tuning: "We train Code Llama on 500B additional tokens and Code Llama - Python further on 100B tokens." Code Llama-Instruct is fine-tuned on 5B tokens: "For Code Llama - Instruct, we train with a batch size of 524,288 tokens and on approx. 5B tokens in total."
Unreleased
Llama 2 license. can't use outputs to train models. https://github.com/meta-llama/llama/blob/main/LICENSE
Industry
OpenVLA
Robotics
Vision
Language
Robotic manipulation
Stanford University
University of California (UC) Berkeley
Toyota Research Institute
Google DeepMind
Massachusetts Institute of Technology (MIT)
Physical Intelligence
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, Chelsea Finn
2024-06-13
OpenVLA: An Open-Source Vision-Language-Action Model
https://openvla.github.io/ https://arxiv.org/abs/2406.09246
73.00
7188100000.00
Based on a Prismatic-7B VLM backbone, which itself is comprised of 600M parameter vision encoder (DinoV2 + SigLIP) plus Llama-2 7B. Table 1 indicates 7.1881 billion trainable parameters
1.1e+23
The majority of the compute is from the pre-training embedded in Prismatic-7B and its constituent models. The fine-tuning compute used in this paper is "64 A100 GPUs for 14 days, or a total of 21,500 A100-hours": 21500 * 3600 * 3.12e14 * 0.4 = 9.66e21 Prismatic-7B training took "less than 9 hours" on 8 A100s: 9 * 3600 * 8 * 3.12e14 * 0.4 = 3.23e19 Add in the pre-trained components: - DinoV2 = 7.42e21, per our database - The SigLIP model in question is SoViT-400m/14 from the cited Alabdulmohsin et al
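A minimal Python sketch of the two GPU-time estimates quoted above (0.4 utilization is the note's assumption; pre-trained component compute is added on top as described):
# OpenVLA: GPU-time estimates for the two training runs quoted in the note (sketch).
vla_finetune = 21500 * 3600 * 3.12e14 * 0.4   # ~9.66e21 FLOP (64 A100s, 14 days)
prismatic = 9 * 3600 * 8 * 3.12e14 * 0.4      # ~3.2e19 FLOP (8 A100s, under 9 hours)
print(f"{vla_finetune:.2e} {prismatic:.2e}")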
Open X-Embodiment
"The full OpenX dataset, at the time of writing, consists of more than 70 individual robot datasets, with more than 2M robot trajectories [...] we apply multiple steps of data curation to the raw dataset."
970000
"OpenVLA consists of a pretrained visually-conditioned language model backbone that captures visual features at multiple granularities, fine-tuned on a large, diverse dataset of 970k robot manipulation trajectories from the Open-X Embodiment [1] dataset" Filtered from 2M total in OpenX.
NVIDIA A100
Confident
Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control. Yet, widespread adoption of VLAs for robotics has been challenging as 1) existing VLAs are largely closed and inaccessible to the public, and 2) prior w
Open weights (unrestricted)
United States of America
United States of America
United States of America
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
United States of America
United States of America
Llama 2-7B
9660000000000B
"64 A100 GPUs for 14 days, or a total of 21,500 A100-hours" 21500 * 3600 * 3.12e14 * 0.4 = 9.66e21
64
Open source
"OpenVLA uses multiple pretrained model components: SigLIP [9] and DinoV2 [25] vision encoders and a Llama 2 [10] language model backbone. For all three models, weights are open, but not their training data or code. We release training data, code and model weights for reproducing OpenVLA on top of these components." All published material is on an MIT license. train code: https://github.com/openvla/openvla/blob/main/scripts/pretrain.py
Academia
Academia
Industry
Industry
Academia
Industry
BlueLM 7B
Language
Chat
Translation
Language modeling/generation
Question answering
Code generation
vivo AI lab
2023-10-31
BlueLM: An Open Multilingual 7B Language Model
https://github.com/vivo-ai-lab/BlueLM/blob/main/BlueLM_technical_report.pdf
7000000000.00
"BlueLM is a large-scale pre-trained language model independently developed by vivo AI Global Research Institute. This release includes 7B base (base) model and 7B conversation (chat) model. At the same time, we have open sourced the long text base (base) model that supports 32K and conversation (chat) model." from GitHub https://github.com/vivo-ai-lab/BlueLM
1.0920000000001e+23
C = 6DN = 6 * 2.6T * 7B = 1.092*10^23 FLOP https://www.wolframalpha.com/input?i=6*7+billion+*+2.6+trillion (assuming 1 epoch) Figure 1 gives compute of 10^12 FLOPs but this seems improbable Training over 2.59T tokens took approximately 26 days using the vivolm system, with a throughput of 3150 tokens/sec/GPU.
Unspecified unreleased
2592000000000
"Larger amounts of high-quality data : high-quality corpus for training, reaching a scale of 2.6 trillion tokens. The corpus includes Chinese, English and a small amount of Japanese and Korean data" from GitHub see 2.1 https://github.com/vivo-ai-lab/BlueLM/blob/main/BlueLM_technical_report.pdf
Confident
BlueLM is a large-scale open-source language model independently developed by the vivo AI Lab. This release includes 2K and 32K context length versions for both Base and Chat models. High-quality Data: BlueLM is trained on a high-quality data with 2.6 trillion tokens. Our train corpus mainly consists of Chinese and English data, with a small amount of Japanese and Korean data. Stronger Performance: BlueLM-7B-Chat achieves a strong competitive performance in C-Eval and CMMLU benchmarks of the sa
Open weights (restricted use)
China
Unreleased
https://github.com/vivo-ai-lab/BlueLM/blob/main/MODEL_LICENSE_EN.pdf Our code is licensed under the Apache-2.0 and Community License for BlueLM Model. The BlueLM weights are completely open for academic research, and free commercial use is allowed after completing the questionnaire. "BlueLM weights are open for academic research and commercial use."
Industry
Baichuan 2-7B
Baichuan
Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xiaoxi Chen
2023-09-20
Baichuan 2: Open Large-scale Language Models
https://arxiv.org/pdf/2309.10305
405.00
7000000000.00
1.0919999999999998e+23
7B * 2.6T * 6 = 1.092e23. Also mentions 1,024 NVIDIA A800 GPUs at 180 TFLOPS per GPU
2600000000000
NVIDIA A800
Unverified
In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch, on 2.6 trillion tokens.
China
1024
Industry
ruGPT-3.5 13B
Language
Chat
Language modeling/generation
Sber
2023-04-24
ruGPT-3.5 13B
https://huggingface.co/ai-forever/ruGPT-3.5-13B
13000000000.00
1.0699776e+23
"Model was trained using Deepspeed and Megatron libraries, on 300B tokens dataset for 3 epochs, around 45 days on 512 V100. After that model was finetuned 1 epoch with sequence length 2048 around 20 days on 200 GPU A100 on additional data" 512 GPUs * 125000000000000 FLOPs/s [peak] * 45 days * 24 hours * 3600 s * 0.3 + 200 GPUs * 312000000000000 FLOPs/s [peak for fp16] * 20 days * 24 hours * 3600 s * 0.3 = 1.0699776e+23 they probably used fp16 as in their similar project: https://habr.com/ru/co
300000000000
NVIDIA A100
NVIDIA Tesla V100 SXM2
Confident
Open weights (unrestricted)
Russia
512
Unreleased
MIT license https://huggingface.co/ai-forever/ruGPT-3.5-13B/discussions
Industry
Government
OLMo-7B
Language
Language modeling/generation
Chat
Allen Institute for AI
University of Washington
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, N
2024-02-01
OLMo: Accelerating the Science of Language Models
https://arxiv.org/abs/2402.00838v1
7000000000.00
1.0332e+23
Direct calculation: 6 * 7B * 2.46 trillion = 1.0332 × 10^23 (calculation also reproduced by the developers in https://arxiv.org/pdf/2501.00656)
Dolma
2000000000000
"We built our training dataset out of a 2T-token sample from our open dataset, Dolma [...] All of our released models have been trained to at least 2T tokens (a single epoch over our training data), and some have been trained beyond that by starting a second epoch over the data with a different shuffling order" Table 1 indicates total tokens seen are 2.46T for the 7B parameter model, though note that a later release in July 2024 has been trained to 2.75T tokens: https://github.com/allenai/OLMo
AMD Radeon Instinct MI250X
NVIDIA A100
Confident
Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community
Open weights (unrestricted)
United States of America
United States of America
Open source
License: The code and model are released under Apache 2.0. Weights https://huggingface.co/allenai/OLMo-7B Code https://github.com/allenai/OLMo
Research collective
Academia
DeepSeekMath 7B
Language
Quantitative reasoning
DeepSeek
Tsinghua University
Peking University
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo
2024-02-05
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
https://arxiv.org/abs/2402.03300
7000000000.00
"Our model is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024) and trained for 500B tokens."
1.014e+23
8.04e+22 (base model) + 2.1e+22 (fine-tune) = 1.014e+23
DeepSeekMath Corpus
arXiv
GitHub
Common Crawl
"By implementing a meticulously designed data selection pipeline, we successfully construct the DeepSeekMath Corpus, a high-quality dataset of 120B tokens from web pages filtered for mathematical content, which is almost 7 times the size of the math web pages used by Minerva (Lewkowycz et al., 2022a) and 9 times the size of the recently released OpenWebMath (Paster et al., 2023)." "The distribution of the data is as follows: 56% is from the DeepSeekMath Corpus, 4% from AlgebraicStack, 10% from
500000000000
"Our model is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024) and trained for 500B tokens."
Confident
Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the perfor
Open weights (restricted use)
China
China
China
DeepSeek Coder 6.7B
21000000000000B
6*7B*500B = 2.1e+22
Unreleased
deepseek license https://huggingface.co/deepseek-ai/deepseek-math-7b-base
Industry
Academia
Academia
Qwen-7B
Language
Language modeling/generation
Translation
Alibaba
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zha
2023-09-28
Qwen Technical Report
https://arxiv.org/abs/2309.16609, https://huggingface.co/Qwen/Qwen-7B
169.00
7000000000.00
7B
1.0099999999999998e+23
2.4T tokens per Table 1. 7B * 2.4T * 6 = 1.01e23
Unspecified unreleased
"Our dataset is designed to meet these requirements and includes public web documents, encyclopedia, books, codes, etc. Additionally, our dataset is multilingual, with a significant portion of the data being in English and Chinese."
2400000000000
"We have pretrained the language models, namely QWEN, on massive datasets containing trillions of tokens" Table 1 indicates 2.4T tokens for Qwen-7B, and the above quote suggests the 2.4T aren't from multiple epochs on a smaller dataset.
Confident
Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignm
Open weights (restricted use)
China
Unreleased
commercial allowed, can't use to train models https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT
Industry
Evo 2 7B
Biology
Protein or nucleotide language model (pLM/nLM)
Arc Institute
Stanford University
NVIDIA
Liquid
University of California (UC) Berkeley
Goodfire
Columbia University
University of California San Francisco
Garyk Brixi, Matthew G. Durrant, Jerome Ku, Michael Poli, Greg Brockman, Daniel Chang, Gabriel A. Gonzalez, Samuel H. King, David B. Li, Aditi T. Merchant, Mohsen Naghipourfar, Eric Nguyen, Chiara Ricci-Tam, David W. Romero, Gwanggyu Sun, Ali Taghibakshi, Anton Vorontsov, Brandon Yang, Myra Deng, Liv Gorton, Nam Nguyen, Nicholas K. Wang, Etowah Adams, Stephen A. Baccus, Steven Dillmann, Stefano Ermon, Daniel Guo, Rajesh Ilango, Ken Janik, Amy X. Lu, Reshma Mehta, Mohammad R.K. Mofrad, Madelena Y
2025-02-19
Genome modeling and design across all domains of life with Evo 2
https://arcinstitute.org/manuscripts/Evo2
7000000000.00
1.008e+23
7e9 parameters * 2.4e12 training datapoints * 6 = 1.008e23
OpenGenome 2
2400000000000
"We trained two versions of Evo 2: a smaller version at 7B parameters trained on 2.4 trillion tokens and a full version at 40B parameters trained on 9.3 trillion tokens."
Unverified
All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes. We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedente
Open weights (unrestricted)
United States of America
United States of America
United States of America
United States of America
United States of America
United States of America
Academia
Industry
Industry
Academia
Academia
Academia
Luminous-extended
Language
Language modeling/generation
Aleph Alpha
2022-08-15
https://docs.aleph-alpha.com/docs/Deprecated%20Luminous/Deprecated-Luminous/model-card/
30000000000.00
~30B (~42B with multi-modality)
1.0019457e+23
311840000000000*360000*3600*0.3 = 1.2124339e+23 6ND = 6*30*10^9*460000000000 = 8.28e+22 sqrt(8.28e+22*1.2124339e+23) = 1.0019457e+23
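A minimal Python sketch of the geometric-mean estimate above (the 360,000 factor, apparently GPU-hours, and the 0.3 utilization are taken verbatim from the note):
# Luminous-extended: geometric mean of hardware-time and 6ND estimates (sketch).
import math
c_hw = 3.1184e14 * 360000 * 3600 * 0.3   # ~1.21e23 FLOP
c_6nd = 6 * 30e9 * 460e9                 # ~8.28e22 FLOP
print(f"{math.sqrt(c_hw * c_6nd):.2e}")  # ~1.0e23 FLOP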
"""The Luminous family has been trained on a dataset compiled of sources in English, German, French, Spanish and Italian..."" more details in model card https://docs.aleph-alpha.com/docs/introduction/model-card/"
460000000000
~460B tokens; 230,000 iterations
NVIDIA A100 SXM4 40 GB
Confident
Aleph Alpha luminous-extended is the second-largest model, faster and cheaper than Luminous-supreme. The model can perform information extraction and language simplification, and has multimodal image description capability. You can try Aleph Alpha models with predefined examples for free. Go to the Jumpstart page on their site and click through the examples on Classification and Labelling, Generation, Information Extraction, Translation and Conversion, and Multimodal. Aleph Alpha are ba
API access
Germany
512
Unreleased
Industry
225 records
