Compute-intensive models
Model
Domain
Task
Organization
Authors
Publication date
Reference
Link
Citations
Parameters
Parameters notes
Training compute (FLOP)
Training compute notes
Training dataset
Training dataset notes
Training dataset size (datapoints)
Dataset size notes
Training hardware
Confidence
Abstract
Model accessibility
Country (from Organization)
Base model
Finetune compute (FLOP)
Finetune compute notes
Hardware quantity
Training code accessibility
Accessibility notes
Organization categorization (from Organization)
Grok-3
Language
Vision
Multimodal
Chat
Language modeling/generation
Question answering
Code generation
Visual question answering
xAI
2025-02-17
Grok 3 Beta — The Age of Reasoning Agents
https://x.ai/blog/grok-3
4.64e+26
Estimate based on training time for a cluster of 100,000 H100s, and xAI's statement that Grok 2 was trained on more compute than GPT-4 (2.1e25) and that Grok 3 was trained on around 15 times more compute than Grok 2. Full estimate here: https://docs.google.com/document/d/1C_dABuZrAqYE_ui4_GZ4bRLtq3TBjIGoBSktaPElhEU/edit?usp=sharing
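A minimal sketch of the cluster-based part of this estimate, assuming an illustrative 90-day training window and 30% utilization (neither is stated in this note). The 4.64e+26 figure above also folds in xAI's ~15x Grok-2 comparison, so this sketch is not expected to reproduce it exactly.

```python
# Hedged sketch of a hardware-based training-compute estimate.
# Cluster size and H100 peak throughput come from the note above;
# the training duration and utilization are illustrative assumptions,
# not figures confirmed by xAI or the linked estimate document.
H100_PEAK_FLOP_PER_S = 989.5e12  # dense BF16 peak for H100 SXM5
num_gpus = 100_000               # cluster size stated in the note
utilization = 0.3                # assumed model FLOP utilization
training_days = 90               # assumed training duration

training_compute = num_gpus * H100_PEAK_FLOP_PER_S * utilization * training_days * 24 * 3600
print(f"Estimated training compute: {training_compute:.2e} FLOP")  # ~2.3e+26 with these placeholders
```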
Unspecified unreleased
NVIDIA H100 SXM5 80GB
Confident
Hosted access (no API)
United States of America
100000
Unreleased
Industry
Gemini 1.0 Ultra
Multimodal
Language
Vision
Language modeling
Visual question answering
Chat
Translation
Google DeepMind
Gemini Team
2023-12-06
Gemini: A Family of Highly Capable Multimodal Models
https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf
633.00
5e+25
This number is an estimate based on limited evidence. In particular, we combine information about the performance of Gemini Ultra on various benchmarks compared to other models, and guesstimates about the hardware setup used for training to arrive at our estimate. Our reasoning and calculations are detailed in this Colab notebook. https://colab.research.google.com/drive/1sfG91UfiYpEYnj_xB5YRy07T5dv-9O_c
Unspecified unreleased
"Gemini models are trained on a dataset that is both multimodal and multilingual. Our pretraining dataset uses data from web documents, books, and code, and includes image, audio, and video data... We find that data quality is critical to a highlyperforming model, and believe that many interesting questions remain around finding the optimal dataset distribution for pretraining."
Google TPU v4
Speculative
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks — notably being the first mode
API access
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
57000
Unreleased
API access: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models
Industry
GPT-4o
Multimodal
Language
Audio
Speech
Vision
Chat
Image generation
Audio generation
Vision-language generation
Table tasks
Language modeling/generation
Question answering
Speech recognition
OpenAI
Aidan Clark, Alex Paino, Jacob Menick, Liam Fedus, Luke Metz, Clemens Winter, Lia Guy, Sam Schoenholz, Daniel Levy, Nitish Keskar, Alex Carney, Alex Paino, Ian Sohl, Qiming Yuan, Reimar Leike, Arka Dhar, Brydon Eastman, Mia Glaese, Ben Sokolowsky, Andrew Kondrich, Felipe Petroski Such, Henrique Ponde de Oliveira Pinto, Jiayi Weng, Randall Lin, Youlong Cheng, Nick Ryder, Lauren Itow, Barret Zoph, John Schulman, Mianna Chen, Adam Lerer, Adam P. Goucher, Adam Perelman, Akila Welihinda, Alec Radford
2024-05-13
Hello GPT-4o
https://openai.com/index/hello-gpt-4o/ https://openai.com/index/gpt-4o-system-card/
Not known. Inference costs in the API are half those of GPT-4 Turbo
3.810001e+25
Training compute estimated from benchmark scores.
Unspecified unreleased
"With GPT-4o, we trained a single new model end-to-end across text, vision, and audio."
Speculative
We’re announcing GPT-4o, our new flagship model that can reason across audio, vision, and text in real time. GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time(opens in a new window) in a
API access
United States of America
Definitely a new model, not a GPT-4 finetune
Unreleased
Industry
Llama 3.1-405B
Language
Language modeling/generation
Meta AI
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu,
2024-07-23
The Llama 3 Herd of Models
https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
405000000000.00
405B
3.8e+25
Stated in paper. Also, 6 * 405B * 15.6T training tokens = 3.8e25
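A one-line check of the 6ND approximation used above (training compute ≈ 6 × parameters × training tokens), with the values stated in the note:

```python
# 6ND approximation: roughly 6 FLOP per parameter per training token.
params = 405e9        # 405B parameters
tokens = 15.6e12      # 15.6T training tokens
compute = 6 * params * tokens
print(f"{compute:.2e} FLOP")  # ~3.79e+25, matching the stated 3.8e25
```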
Llama 3 dataset
15600000000000
15.6T tokens
NVIDIA H100 SXM5 80GB
Confident
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models s
Open weights (restricted use)
United States of America
16384
Open (restricted use)
Llama 3.1 model license: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE must seek separate license if over 700m monthly users, acceptable use restrictions training code here: https://github.com/meta-llama/llama-recipes/blob/main/src/llama_recipes/utils/train_utils.py#L70
Industry
Claude 3.5 Sonnet
Multimodal
Language
Vision
Chat
Image captioning
Code generation
Language modeling/generation
Anthropic
2024-06-20
Claude 3.5 Sonnet
https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf
3.650001e+25
Training compute estimated from benchmark scores. Blog post by Dario Amodei includes some info on 3.5 Sonnet compute: https://darioamodei.com/on-deepseek-and-export-controls "Claude 3.5 Sonnet is a mid-sized model that cost a few $10M's to train (I won't give an exact number). Also, 3.5 Sonnet was not trained in any way that involved a larger or more expensive model (contrary to some rumors)."
Unspecified unreleased
Training data cutoff Apr 2024
Speculative
API access
United States of America
Unreleased
Industry
GLM-4-Plus
Language
Language modeling
Zhipu AI
Zhipu AI
2024-08-29
GLM-4-Plus
https://bigmodel.cn/dev/howuse/glm-4
3.6e+25
Estimated using benchmark imputation
Unknown
At the KDD International Conference on Data Mining and Knowledge Discovery, the Zhipu GLM team unveiled the new generation of base large model—GLM-4-Plus. As the latest version of Zhipu’s fully self-developed GLM large model, GLM-4-Plus signifies Zhipu AI’s continuous dedication in the field of general artificial intelligence, advancing the independent and autonomous innovation of large model technology.
API access
China
Industry
Claude 3.7 Sonnet
Language
Vision
Multimodal
Language modeling/generation
Question answering
Code generation
Quantitative reasoning
Translation
Instruction interpretation
Visual question answering
Anthropic
2025-02-24
Claude 3.7 Sonnet
https://www.anthropic.com/news/claude-3-7-sonnet
3.35e+25
https://docs.google.com/spreadsheets/d/10bhwdVrfHI8tysVIz62ZxtvQ30L-HojYvmU18_b-WIM/edit?gid=0#gid=0
Unspecified unreleased
"Claude 3.7 Sonnet is trained on a proprietary mix of publicly available information on the Internet, as well as non-public data from third parties, data provided by data labeling services and paid contractors, and data we generate internally. While trained on publicly available information on the internet through November 2024, Claude 3.7 Sonnet’s knowledge cut-off date is the end of October 2024. This means the model’s knowledge base is most extensive and reliable on information and events up
Likely
Today, we’re announcing Claude 3.7 Sonnet1, our most intelligent model to date and the first hybrid reasoning model on the market. Claude 3.7 Sonnet can produce near-instant responses or extended, step-by-step thinking that is made visible to the user. API users also have fine-grained control over how long the model can think for. Claude 3.7 Sonnet shows particularly strong improvements in coding and front-end web development.
API access
United States of America
Unreleased
Industry
Grok-2
Language
Vision
Multimodal
Chat
Language modeling/generation
Question answering
Code generation
Visual question answering
xAI
2024-08-13
Grok-2 Beta Release
https://x.ai/blog/grok-2
2.96e+25
Estimate based on xAI statements comparing Grok-2 compute to GPT-4 and Grok-3. Full estimate here: https://docs.google.com/document/d/1C_dABuZrAqYE_ui4_GZ4bRLtq3TBjIGoBSktaPElhEU/edit?usp=sharing
Unspecified unreleased
NVIDIA H100 SXM5 80GB
Confident
Grok-2 is our frontier language model with state-of-the-art reasoning capabilities. This release includes two members of the Grok family: Grok-2 and Grok-2 mini. Both models are now being released to Grok users on the 𝕏 platform.
Hosted access (no API)
United States of America
Unreleased
Industry
Doubao-pro
Language
Language modeling/generation
Question answering
Text summarization
Text classification
ByteDance
2024-10-28
Doubao General Model Pro (Doubao-pro)
https://www.volcengine.com/docs/6360/1264663
500000000000.00
[Speculative] Doubao's large language model has scaled up from 35 billion parameters to 800 billion, with 500 billion and 800 billion parameter models currently under training. https://xueqiu.com/9637001584/309910396?md5__1038=7qmx2DyDuie4cDBqDTQEWqDtMvO4iTphD
2.505e+25
6ND = 6 * 500*10^9 * 8350*10^9 = 2.505e+25
Unspecified unreleased
Doubao's data sources primarily rely on proprietary business data, accounting for 50-60%; externally sourced data comprises 15-20%; and synthetic data has been used since June of this year, although Doubao is cautious in feeding synthetic data due to its uncertain quality.
8350000000000
[Speculative] Doubao's pre-training data volume is approximately 500TB, with only about 10% of this data actually used for training. The current version employs a non-Mixture-of-Experts (MoE) architecture. In the future, MoE architecture may be introduced to increase parameter count and performance, while also integrating multimodal data solutions. So this model is dense, and the training data is probably all text tokens, not multimodal. 50TB * 167M tokens/GB ~= 8.35 trillion tokens
Speculative
A professional-grade, self-developed LLM supporting up to 128k tokens, enabling fine-tuning across the entire series.
API access
China
Unreleased
Industry
GPT-4 Turbo
Multimodal
Vision
Language
Image generation
Chat
Language modeling/generation
Image generation
Speech synthesis
Table tasks
Visual question answering
Image captioning
OpenAI
2023-11-06
New models and developer products announced at DevDay
https://openai.com/blog/new-models-and-developer-products-announced-at-devday
Not known. Maybe smaller/sparser than GPT-4.
2.2e+25
Estimated using benchmark imputation
Unspecified unreleased
Unknown
Today, we shared dozens of new additions and improvements, and reduced pricing across many parts of our platform. These include: New GPT-4 Turbo model that is more capable, cheaper and supports a 128K context window
API access
United States of America
Unreleased
Industry
Mistral Large 2
Language
Language modeling/generation
Translation
Code generation
Mistral AI
Albert Jiang, Alexandre Sablayrolles, Alexis Tacnet, Alok Kothari, Antoine Roux, Arthur Mensch, Audrey Herblin-Stoop, Augustin Garreau, Austin Birky, Bam4d, Baptiste Bout, Baudouin de Monicault, Blanche Savary, Carole Rambaud, Caroline Feldman, Devendra Singh Chaplot, Diego de las Casas, Diogo Costa, Eleonore Arcelin, Emma Bou Hanna, Etienne Metzger, Gaspard Blanchet, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Harizo Rajaona, Henri Roussez, Hichem Sattouf, Ian Mack, Jean-Malo Delignon, Je
2024-07-24
Top-tier reasoning for high-complexity tasks, for your most sophisticated needs.
https://mistral.ai/news/mistral-large-2407/
123000000000.00
2.13e+25
Details are sparse, but we can hazard a guess based on evidence about the training cluster they may have used, the scale up in compute they likely would have used relative to Mistral Large 1, and from the model's MMLU score. Extended reasoning given here: https://docs.google.com/document/d/1I2ZWBLFMpRZYcdMMUfKAGZFJrOJpduNDS9ZeVFIHnd8/edit?usp=sharing
Unspecified unreleased
Likely
Today, we are announcing Mistral Large 2, the new generation of our flagship model. Compared to its predecessor, Mistral Large 2 is significantly more capable in code generation, mathematics, and reasoning. It also provides a much stronger multilingual support, and advanced function calling capabilities.
Open weights (non-commercial)
France
Unreleased
"We are releasing Mistral Large 2 under the Mistral Research License, that allows usage and modification for research and non-commercial usages. For commercial usage of Mistral Large 2 requiring self-deployment, a Mistral Commercial License must be acquired by contacting us."
Industry
GPT-4
Multimodal
Language
Vision
Image generation
Language modeling
OpenAI
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie C
2023-03-15
GPT-4 Technical Report
https://arxiv.org/abs/2303.08774
8281.00
2.1e+25
90% CI: 8.2E+24 to 4.4E+25 NOTE: this is a rough estimate based on public information, much less information than most other systems in the database. Calculation and confidence intervals here: https://colab.research.google.com/drive/1O99z9b1I5O66bT78r9ScslE_nOj5irN9?usp=sharing
Unspecified unreleased
4900000000000
Speculative. Reported secondhand by online sources such as Semianalysis, but not verified by OpenAI. If total number of tokens seen was 13T, text was repeated for 2 epochs, and text was the majority of tokens, then dataset size roughly is 13T*0.75/2 = 4.9T words. Note this examines only the text dataset, since GPT-4 was first and foremost a language model. However, the vision component had its own vision dataset, which we believe accounted for a much smaller part of the compute budget.
NVIDIA A100 SXM4 40 GB
Speculative
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results
API access
United States of America
25000
Unreleased
Industry
Nemotron-4 340B
Language
Language modeling/generation
Chat
Question answering
NVIDIA
Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, Robert Hero, Jining Huang, Vibhu Jawa, Joseph Jennings, Aastha Jhunjhunwala, John Kamalu, Sadaf Khan, Oleksii Kuchaiev, Patrick LeGresley, Hui Li, Jiwe
2024-06-14
NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models
https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/
340000000000.00
340B
1.8e+25
9 trillion tokens for training 6 * 340B * 9T = 1.8E25 alternatively, can do a hardware estimate with a few extra steps: According to the technical report, Nemotron-4 340B was trained using up to 6144 H100 GPUs. Helpfully, they also report the model FLOP utilization (MFU), which was 41-42% (Table 2). This is the ratio of the actual output of their GPUs, in FLOP used for training, relative to their theoretical max of 989 teraFLOP/s per GPU. Unfortunately, the report omits the last ingredient, w
Unspecified unreleased
The technical report for the 340B model cites the report for the 15B version (https://arxiv.org/pdf/2402.16819 ) from that paper: "We train Nemotron-4 15B on a pre-training dataset consisting of 8 trillion tokens. At a high-level, the data blend is split into three different types of data: English natural language data (70%), multilingual natural language data (15%), and source-code data (15%). The English corpus consists of curated documents from a variety of sources and domains including web
6750000000000
9T training tokens. They first train on an 8T token dataset and then an additional 1T tokens, it's slightly unclear if that's more data or a partial second epoch 6.75T words using 1 token = 0.75 words
NVIDIA H100 SXM5 80GB
Confident
We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4- 340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and its outputs. These models perform competitively to open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We
Open weights (unrestricted)
United States of America
6144
Unreleased
Permissive commercial license: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
Industry
Claude 3 Opus
Multimodal
Language
Vision
Chat
Image captioning
Code generation
Language modeling/generation
Anthropic
2024-03-04
The Claude 3 Model Family: Opus, Sonnet, Haiku
https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
1.640001e+25
Training compute estimated from benchmark scores.
Unspecified unreleased
Claude 3 models are trained on a proprietary mix of publicly available information on the Internet as of August 2023, as well as non-public data from third parties, data provided by data labeling services and paid contractors, and data we generate internally. We employ several data cleaning and filtering methods, including deduplication and classification. The Claude 3 suite of models have not been trained on any user prompt or output data submitted to us by users or customers, including free us
Speculative
We introduce Claude 3, a new family of large multimodal models – Claude 3 Opus, our most capable offering, Claude 3 Sonnet, which provides a combination of skills and speed, and Claude 3 Haiku, our fastest and least expensive model. All new models have vision capabilities that enable them to process and analyze image data. The Claude 3 family demonstrates strong performance across benchmark evaluations and sets a new standard on measures of reasoning, math, and coding. Claude 3 Opus achieves sta
API access
United States of America
Unreleased
Industry
Gemini 1.5 Pro
Language
Multimodal
Language modeling
Visual question answering
Google DeepMind
Gemini Team
2024-02-15
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf
MoE architecture
1.580001e+25
Training compute imputed from benchmark scores.
Unspecified unreleased
Google TPU v4
Speculative
API access
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
Unreleased
API access: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models
Industry
GLM-4 (0116)
Language
Language modeling/generation
Question answering
Code generation
Quantitative reasoning
Translation
Zhipu AI
Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zh
2024-01-17
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
https://arxiv.org/abs/2406.12793 https://zhipuai.cn/en/devday
ChatGLM was 130B parameters, and the paper implies GLM-4 was scaled larger than previous models.
1.2e+25
0116 has slightly worse performance than 0520. “The GLM-4 models are pre-trained on ten trillions of tokens.” No information about parameters or compute was found; one speculative outside estimate puts GLM-4 at 200B parameters (which seems plausible), though no source is provided. “GLM-4 gets close to the state-of-the-art models (GPT-4-Turbo, Gemini 1.5 Pro, and Claude 3 Opus)”, but none of these models has disclosed parameters or a compute estimate. 6ND = 6 * 10000000000000 tokens * 200000000000 parameters = 1.2e+25
"To date, the GLM-4 models are pre-trained on ten trillions of tokens mostly in Chinese and English, along with a small set of corpus from 24 languages, and aligned primarily for Chinese and English usage"
10000000000000
Likely
We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models that are trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained on ten trillions of tokens mostly in Chinese and English, along with a small set of corpus from 24
API access
China
GLM-4 (0116) has been made available through the GLM-4 API at https://bigmodel.cn
Industry
Mistral Large
Language
Chat
Mistral AI
2024-02-26
Mistral Large, our new flagship model
https://mistral.ai/news/mistral-large/
1.12e+25
Mistral spent <20 million euro (meaning approximately 20 million?) to train Mistral Large: https://www.wsj.com/tech/ai/the-9-month-old-ai-startup-challenging-silicon-valleys-giants-ee2e4c48. "Assuming this is on H100s with @Scaleway who are €1.9/hour => 10m H100 hours (c 30m A100 hrs), 3 months at 4k H100s :timer_clock:" - Emad Mostaque, https://x.com/EMostaque/status/1762152740938031484?s=20. Assuming bf16 or fp16, H100 SXM performance is 989 TFLOPS. At 1.9 euro per H100-hour and 30% utilization,
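A rough sketch of the cost-based estimate described in this note, using the note's own figures (≈€20M total, €1.9 per H100-hour, 989 TFLOP/s peak, 30% utilization). The rental price and utilization are the note's assumptions, not disclosed values.

```python
# Convert an approximate training budget into GPU-hours, then into FLOP.
total_cost_eur = 20e6            # "<20 million euro", treated as ~20M
price_per_h100_hour = 1.9        # Scaleway price quoted in the note
h100_peak_flop_per_s = 989e12    # H100 SXM bf16/fp16 peak
utilization = 0.3                # assumed utilization

gpu_hours = total_cost_eur / price_per_h100_hour           # ~10.5M H100-hours
compute = gpu_hours * 3600 * h100_peak_flop_per_s * utilization
print(f"{gpu_hours:.2e} H100-hours -> {compute:.2e} FLOP")  # ~1.1e+25 FLOP
```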
NVIDIA H100 SXM5 80GB
Likely
API access
France
Unreleased
Industry
Aramco Metabrain AI
Language
Language modeling/generation
Saudi Aramco
Saudi Aramco
2024-03-04
Saudi Aramco unveils industry’s first generative AI model
https://www.offshore-technology.com/news/saudi-aramco-unveils-industry-first-generative-ai-model/
250000000000.00
"It has 250 billion parameters that are adjustable during training to generate outputs or make predictions."
1.05e+25
6*250B*7T=1.05e+25
"The AI was trained using seven trillion data points, collecting more than 90 years of company history."
7000000000000
Likely
Unreleased
Saudi Arabia
Industry
Government
Inflection-2
Language
Language modeling
Language modeling/generation
Chat
Question answering
Inflection AI
2023-11-22
Inflection-2: The Next Step Up
https://inflection.ai/inflection-2
1e+25
"Inflection-2 was trained on 5,000 NVIDIA H100 GPUs in fp8 mixed precision for ~10²⁵ FLOPs"
Unspecified unreleased
NVIDIA H100 SXM5 80GB
Confident
Today we are proud to announce that we have completed training of Inflection-2, the best model in the world for its compute class and the second most capable LLM in the world today. Our mission at Inflection is to create a personal AI for everyone. Just a few months ago, we announced Inflection-1 — a best-in-class language model that currently powers Pi. Our new model, Inflection-2, is substantially more capable than Inflection-1, demonstrating much improved factual knowledge, better stylistic c
Hosted access (no API)
United States of America
5000
Unreleased
via Pi, no API
Industry
Inflection-2.5
Language
Chat
Inflection AI
2024-03-07
Inflection-2.5: meet the world's best personal AI
https://inflection.ai/inflection-2-5
1.0001e+25
"Inflection-1 used approximately 4% the training FLOPs of GPT-4 and, on average, performed at approximately 72% GPT-4 level on a diverse range of IQ-oriented tasks. Inflection-2.5, now powering Pi, achieves more than 94% the average performance of GPT-4 despite using only 40% the training FLOPs." This is a weird one - we estimated GPT-4 at 2.1e25 FLOP (which could be off somewhat, or Inflection could believe a different number). 40% of that is ~8e24. But Inflection 2, the previous model, was tr
NVIDIA H100 SXM5 80GB
Speculative
At Inflection, our mission is to create a personal AI for everyone. Last May, we released Pi—a personal AI, designed to be empathetic, helpful, and safe. In November we announced a new major foundation model, Inflection-2, the second best LLM in the world at the time. Now we are adding IQ to Pi’s exceptional EQ. We are launching Inflection-2.5, our upgraded in-house model that is competitive with all the world's leading LLMs like GPT-4 and Gemini. It couples raw capability with our signature p
Hosted access (no API)
United States of America
Unreleased
Industry
Grok-1.5
Language
Language modeling
Chat
xAI
2024-03-28
Introducing Grok-1.5, our latest model capable of long context understanding and advanced reasoning. Grok-1.5 will be available to our early testers and existing Grok users on the 𝕏 platform in the coming days.
https://x.ai/blog/grok-1.5
9.26e+24
Lower bound is taken from the Grok-1 estimate; upper bound is taken from the Grok-2 estimate. Geometric mean: sqrt(2.9 * 29.6) * 10^24 = 9.26e+24
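A sketch of the interpolation described above, using the Grok-1 and Grok-2 compute estimates from this database as the bounds:

```python
import math

grok1_compute = 2.9e24   # lower bound (Grok-1 estimate)
grok2_compute = 2.96e25  # upper bound (Grok-2 estimate)

# The geometric mean is the midpoint of the two bounds on a log scale.
grok15_compute = math.sqrt(grok1_compute * grok2_compute)
print(f"{grok15_compute:.2e} FLOP")  # ~9.26e+24
```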
Unspecified unreleased
Speculative
Hosted access (no API)
United States of America
Unreleased
Musk noted that Grok-1.5 will power xAI’s ChatGPT-challenging chatbot on the X platform, while Grok-2, the successor of the new model, is still in the training phase
Industry
Reka Core
Multimodal
Language
Vision
Chat
Language modeling/generation
Image captioning
Code generation
Code autocompletion
Reka AI
Aitor Ormazabal, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, Zhihui Xie
2024-04-15
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
https://publications.reka.ai/reka-core-tech-report.pdf
67000000000.00
8.40001e+24
No direct information about Reka Core model ("Reka Core has not finished training and is still improving.") The smaller dense model Reka Flash has 21B parameters and was trained on 5 trillion language tokens. There is information about compute: "Our setup comprises of clusters from a mixture of vendors with our peak compute being approximately 2.5K H100s and 2.5K A100s." If we assume 2 months of training with 2.5k H100s and 2.5k A100s at utilization 0.5 we get 8.4e24 FLOP (2500*9.9e14+2500*3.
Wikipedia
Unspecified unreleased
The training data comprises a mixture of publicly available and proprietary/licensed datasets with a dataset knowledge cutoff of November 2023. The dataset ingested by our model comprises of text, images, videos, and audio clips. Reka Flash and Reka Edge were trained on approximately 5 trillion and 4.5 trillion extensively deduplicated and filtered language tokens, respectively. While the classification of corpora is not strictly defined to one class or category, approximately 25% of our pretrai
NVIDIA A100
NVIDIA H100 SXM5 80GB
Speculative
API access
United States of America
Unreleased
Industry
SEA-LION V3 Llama3.1 70B
Language
Language modeling/generation
AI Singapore
2024-12-19
SEA-LION V3
https://huggingface.co/aisingapore/llama3.1-8b-cpt-sea-lionv3-base
70000000000.00
8.0103891e+24
Llama 3.1-70B base model: 7.929e+24. Additional pretraining compute: Stage 1: 200*60*60*989500000000000*64*0.3 = 1.36788e+22. Stage 2: 495*60*60*989500000000000*128*0.3 = 6.77103e+22. Total: 7.929e+24 + 1.36788e+22 + 6.77103e+22 = 8.0103891e+24
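A sketch reproducing this note's arithmetic: base-model compute plus two continued-pretraining stages, each estimated as hours × peak FLOP/s × GPU count × utilization. The stage durations and GPU counts come from the note; the 30% utilization is the note's assumption.

```python
H100_PEAK = 989.5e12   # FLOP/s, H100 SXM5 dense BF16 peak
UTIL = 0.3             # assumed utilization from the note

def stage_flop(hours, num_gpus):
    """Compute for one continued-pretraining stage."""
    return hours * 3600 * H100_PEAK * num_gpus * UTIL

base = 7.929e24                              # Llama 3.1-70B base model estimate
stage1 = stage_flop(hours=200, num_gpus=64)  # ~1.37e+22
stage2 = stage_flop(hours=495, num_gpus=128) # ~6.77e+22
total = base + stage1 + stage2
print(f"stage1={stage1:.3e}, stage2={stage2:.3e}, total={total:.4e}")  # total ~8.0104e+24
```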
The Stack v2
Dolma
Trained on a mix of datasets including StackV2 and Dolma (see https://huggingface.co/aisingapore/llama3.1-70b-cpt-sea-lionv3-base#data)
200000000000
"pre-trained on 200B tokens"
NVIDIA H200 SXM
NVIDIA H100 SXM5 80GB
Unverified
Our SEA-LION v3 Llama3.1 8B and 70B base models have been continued pre-trained on top of the Llama3.1 8B and 70B models respectively. Both have a context length of 128K, making them the SEA-LION models with the longest context length to date.
Open weights (unrestricted)
Llama 3.1-70B
192
Llama 3.1-70B
Language
Language modeling/generation
Meta AI
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu,
2024-07-23
The Llama 3 Herd of Models
https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
70000000000.00
70B
7.929e+24
Huggingface page says 3.1-70B used 7.0M H100 hours and trained over 15T tokens: https://huggingface.co/meta-llama/Llama-3.1-70B. The paper also says that 3.1-405B got MFU of between 38-43%; presumably 70B was around the same or a bit higher, so I'll assume utilization of 40%. 6ND: 6 * 15T * 70B = 6.3e24 FLOP. Hardware: 7M * 9.9e14 * 3600 * 0.4 = 9.98e24 FLOP. Geometric mean: sqrt(6.3e24 * 9.98e24) = 7.929e24. Note that Llama 3-70B also said it used 15T tokens, but only 6.4M H100 hours. This sugges
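A sketch of the two-method estimate in this note: a 6ND estimate and a GPU-hours estimate, combined with a geometric mean. The 40% utilization is the note's own assumption.

```python
import math

# Method 1: 6ND approximation from parameters and training tokens.
flop_6nd = 6 * 70e9 * 15e12              # 6 * params * tokens = 6.3e24

# Method 2: reported GPU-hours * peak throughput * assumed utilization.
flop_hw = 7e6 * 3600 * 9.9e14 * 0.4      # 7.0M H100-hours at 40% MFU ~= 1.0e25

# Combine the two independent estimates with a geometric mean.
estimate = math.sqrt(flop_6nd * flop_hw)
print(f"6ND={flop_6nd:.2e}, hardware={flop_hw:.2e}, combined={estimate:.3e}")  # combined ~7.93e+24
```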
Llama 3 dataset
Confident
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models s
Open weights (restricted use)
United States of America
Open (restricted use)
Llama 3.1 license: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE must seek separate license if over 700m monthly users, acceptable use restrictions code here: https://github.com/meta-llama/llama3/tree/main
Industry
Llama-3.1-Nemotron-70B-Instruct
Language
Language modeling
NVIDIA
Meta AI
Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy Zhang, Makesh Narsimhan Sreedhar, Oleksii Kuchaiev
2024-06-12
HelpSteer2: Open-source dataset for training top-performing reward models
https://www.semanticscholar.org/paper/HelpSteer2%3A-Open-source-dataset-for-training-reward-Wang-Dong/f590d8926dd12345a3bd22253461850f5ca4b3ed
200,000 monthly HF downloads as of Nov 2024
7.929e+24
Taken from Llama 3.1 70B as the finetuning compute is multiple orders of magnitude lower
Llama 3 dataset
Overall, we have 7,118 preference pairs with 6,766 pairs in the training set and 352 pairs in the validation set.
Unverified
High-quality preference datasets are essential for training reward models that can effectively guide large language models (LLMs) in generating high-quality responses aligned with human preferences. As LLMs become stronger and better aligned, permissively licensed preference datasets, such as Open Assistant, HH-RLHF, and HelpSteer need to be updated to remain effective for reward modeling. Methods that distil preference data from proprietary LLMs such as GPT-4 have restrictions on commercial usa
Open weights (restricted use)
United States of America
United States of America
Llama 3.1-70B
1.0259136e+20
Llama 3.1 70B: 7.929e+24. FT (see Appendix F): 32+64 = 96 hours on a single H100. Compute: 96*60*60*989500000000000*0.3 = 1.0259136e+20 ≈ 1e20
Industry
Industry
Llama 3-70B
Language
Chat
Language modeling/generation
Code generation
Meta AI
Aaditya Singh; Aaron Grattafiori; Abhimanyu Dubey; Abhinav Jauhri; Abhinav Pandey; Abhishek Kadian; Adam Kelsey; Adi Gangidi; Ahmad Al-Dahle; Amit Sangani; Ahuva Goldstand; Aiesha Letman; Ajay Menon; Akhil Mathur; Alan Schelten; Alex Vaughan; Amy Yang; Andrei Lupu; Andres Alvarado; Andrew Gallagher; Andrew Gu; Andrew Ho; Andrew Poulton; Andrew Ryan; Angela Fan; Ankit Ramchandani; Anthony Hartshorn; Archi Mitra; Archie Sravankumar; Artem Korenev; Arun Rao; Ashley Gabriel; Ashwin Bharambe; Assaf E
2024-04-18
Introducing Meta Llama 3: The most capable openly available LLM to date
https://ai.meta.com/blog/meta-llama-3/
70000000000.00
7.861e+24
Arithmetic calculation: 6 * 15T tokens * 70B parameters = 6.3e24. GPU calculation: https://huggingface.co/meta-llama/Meta-Llama-3-70B indicates training took 6.4M GPU-hours. We also know their larger-scale training runs for 405B were getting between 0.38-0.41 MFU. Presumably the 70B model gets at least 0.43 utilization (405B has to be split across two nodes, while 70B should fit on one). 990 TFLOPS per GPU * 6.4 million GPU-hours * 3600s * 0.43 = 9.808e24. Geometric mean: sqrt(6.3e24 * 9.808e24) = 7.861e24
Llama 3 dataset
"Llama 3 is pretrained on over 15T tokens that were all collected from publicly available sources. Our training dataset is seven times larger than that used for Llama 2, and it includes four times more code. To prepare for upcoming multilingual use cases, over 5% of the Llama 3 pretraining dataset consists of high-quality non-English data that covers over 30 languages."
15000000000000
NVIDIA H100 SXM5 80GB
Confident
Open weights (restricted use)
United States of America
Unreleased
https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md License A custom commercial license is available at: https://llama.meta.com/llama3/license
Industry
Qwen2.5 Instruct (72B)
Language
Code generation
Code autocompletion
Quantitative reasoning
Question answering
Language modeling/generation
Alibaba
Qwen Team
2024-09-19
Qwen2.5: A Party of Foundation Models!
https://qwenlm.github.io/blog/qwen2.5/
72700000000.00
Number of Parameters: 72.7B. Number of Parameters (Non-Embedding): 70.0B
7.8516e+24
6ND = 6*72700000000 parameters *18000000000000 tokens = 7.8516e+24
Unspecified unreleased
18000000000000
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens."
Confident
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2: Significantly more knowledge and has greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains. Significant improvements in instruction following, generating long texts (over 8K tokens), unde
Open weights (restricted use)
China
Qwen2.5-72B
requires permission to use in applications with 100K+ users https://huggingface.co/Qwen/Qwen2.5-72B-Instruct
Industry
Qwen2.5-72B
Language
Language modeling/generation
Question answering
Quantitative reasoning
Alibaba
Qwen Team
2024-09-19
Qwen2.5: A Party of Foundation Models!
https://qwenlm.github.io/blog/qwen2.5/
72700000000.00
72.7B
7.8e+24
Training dataset size was 18 trillion 6ND = 6 * 72.7 billion parameters * 18 trillion tokens = 7.8e24
Unspecified unreleased
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens."
18000000000000
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens"
Confident
In the past three months since Qwen2’s release, numerous developers have built new models on the Qwen2 language models, providing us with valuable feedback. During this period, we have focused on creating smarter and more knowledgeable language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5. We are announcing what might be the largest opensource release in history! Let’s get the party started!
Open weights (unrestricted)
China
Unreleased
license: allows commercial. weights only https://huggingface.co/Qwen/Qwen2.5-72B/blob/main/LICENSE
Industry
GPT-4o mini
Language
Multimodal
Vision
Chat
Language modeling/generation
Code generation
Visual question answering
OpenAI
Pre-training leads: Aidan Clark, Alex Paino, Jacob Menick. Post-training leads: Liam Fedus, Luke Metz. Architecture leads: Clemens Winter, Lia Guy. Optimization leads: Sam Schoenholz, Daniel Levy. Long-context lead: Nitish Keskar. Pre-training data leads: Alex Carney, Alex Paino, Ian Sohl, Qiming Yuan. Tokenizer lead: Reimar Leike. Human data leads: Arka Dhar, Brydon Eastman, Mia Glaese. Eval lead: Ben Sokolowsky. Data flywheel lead: Andrew Kondrich. Inference lead: Felipe Petroski Such. Inference Producti
2024-07-18
GPT-4o mini: advancing cost-efficient intelligence
https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/
7.36001e+24
Training compute estimated from benchmark scores. 90% CI [3.23e+24, 2.05e+25]
Unspecified unreleased
Speculative
OpenAI is committed to making intelligence as broadly accessible as possible. Today, we're announcing GPT-4o mini, our most cost-efficient small model. We expect GPT-4o mini will significantly expand the range of applications built with AI by making intelligence much more affordable. GPT-4o mini scores 82% on MMLU and currently outperforms GPT-41 on chat preferences in LMSYS leaderboard(opens in a new window). It is priced at 15 cents per million input tokens and 60 cents per million output toke
API access
United States of America
Unreleased
Industry
PaLM 2
Language
Language modeling
Language modeling/generation
Google
Andrew M. Dai, David R. So, Dmitry Lepikhin, Jonathan H. Clark, Maxim Krikun, Melvin Johnson, Nan Du, Rohan Anil, Siamak Shakeri, Xavier Garcia, Yanping Huang, Yi Tay, Yong Cheng, Yonghui Wu, Yuanzhong Xu, Yujing Zhang, Zachary Nado, Bryan Richter, Alex Polozov, Andrew Nystrom, Fangxiaoyu Feng, Hanzhao Lin, Jacob Austin, Jacob Devlin, Kefan Xiao, Orhan Firat, Parker Riley, Steven Zheng, Yuhuai Wu, Zhongtao Liu, Jiahui Yu, Guy Gur-Ari, Weikang Zhou, Sneha Kudugunta, Sunipa Dev, Frederick Liu, Gus
2023-05-10
PaLM 2 Technical Report
https://arxiv.org/abs/2305.10403
950.00
340000000000.00
Model Architecture: "PaLM-2 is a new state-of-the-art language model. We have small, medium, and large variants that use stacked layers based on the Transformer architecture, with varying parameters depending on model size. Further details of model size and architecture are withheld from external publication." However, the parameter count was leaked to CNBC: https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html
7.34e+24
Compute Requirements "Not reported." Paper suggests heuristic of C=6ND. Based on 340B parameters and 3.6T tokens, training compute would be around 7.3*10^24 FLOP.
"The PaLM 2 pre-training corpus is composed of a diverse set of sources: web documents, books, code, mathematics, and conversational data. The pre-training corpus is significantly larger than the corpus used to train PaLM (Chowdhery et al., 2022). PaLM 2 is trained on a dataset that includes a higher percentage of non-English data than previous large language models, which is beneficial for multilingual tasks" (page 9)
2700000000000
"The pre-training corpus is significantly larger than the corpus used to train PaLM" so greater than 6e+11. According to the leaked documents viewed by CNBC, the corpus was 3.6 trillion tokens or around 2.7*10^12 words. https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html
Google TPU v4
Likely
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM (Chowdhery et al., 2022). PaLM 2 is a Transformer-based model trained using a mixture of objectives similar to UL2 (Tay et al., 2023). Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model
API access
United States of America
Unreleased
Industry
Telechat2-115B
Language
Language modeling
China Telecom
Zihan Wang and Xinzhang Liu and Shixuan Liu and Yitong Yao and Yuyao Huang and Zhongjiang He and Xuelong Li and Yongxiang Li and Zhonghao Che and Zhaoxi Zhang and Yan Wang and Xin Wang and Luwen Pu and Huihan Xu and Ruiyu Fang and Yu Zhao and Jie Zhang and Xiaomeng Huang and Zhilong Lu and Jiaxin Peng and Wenjun Zheng and Shiquan Wang and Bingkai Yang and Xuewei he and Zhuoru Jiang and Qiyi Xie and Yanhan Zhang and Zhongqiu Li and Lingling Shi and Weiwei Fu and Yin Zhang and Zilu Huang and Sishi
2024-09-20
TeleChat Technical Report
https://huggingface.co/Tele-AI/TeleChat2-115B
115000000000.00
6.9e+24
6ND: 6 * 115B * 10T = 6.9e24
10000000000000
The open source TeleChat2-115B model is trained using 10 trillion tokens of high-quality Chinese and English corpus
Unverified
Open weights (restricted use)
China
Industry
Llama 3.3
Language
Language modeling/generation
Question answering
Translation
Code generation
Meta AI
2024-12-06
Meta Llama 3.3 multilingual large language model (LLM)
https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/
70000000000.00
70B
6.8649768e+24
6ND = 6 * 70*10^9 * 15*10^12 = 6.3e+24. Hardware: 7000000 * 3600 * 989500000000000 * 0.3 = 7.48062e+24. Geometric mean: sqrt(7.48062e+24 * 6.3e+24) = 6.8649768e+24
Unspecified unreleased
"A new mix of publicly available online data."
15000000000000
"Overview: Llama 3.3 was pretrained on ~15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 25M synthetically generated examples. Data Freshness: The pretraining data has a cutoff of December 2023."
NVIDIA H100 SXM5 80GB
Confident
The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction tuned generative model in 70B (text in/text out). The Llama 3.3 instruction tuned text only model is optimized for multilingual dialogue use cases and outperforms many of the available open source and closed chat models on common industry benchmarks. Model developer: Meta Model Architecture: Llama 3.3 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions
Open weights (restricted use)
United States of America
Unreleased
License A custom commercial license, the Llama 3.3 Community License Agreement, is available at: https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/LICENSE "Llama 3.3 is intended for commercial and research use in multiple languages."
Industry
Cosmos-1.0-Diffusion-14B Video2World
Robotics
Vision
Video
Robotic manipulation
Self-driving car
Video generation
NVIDIA
NVIDIA: Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Huan Ling, Min
2025-01-07
Cosmos World Foundation Model Platform for Physical AI
https://arxiv.org/abs/2501.03575
14000000000.00
14B
6.1554816e+24
989500000000000 * 0.4 * 10000 * 3600 * 3 * 30 * 24 = 3.0777408e+25 (total training compute). Assuming this model is 1/5 of it: 3.0777408e+25 / 5 = 6.1554816e+24 (Likely confidence)
Unspecified unreleased
9000000000000000
"Suite of first-generation video models trained on 9,000 trillion tokens, including 20 million hours of robotics and driving data - generating high-quality videos from multimodal inputs like images, text, or video."
NVIDIA H100 SXM5 80GB
Unverified
Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pr
Open weights (restricted use)
United States of America
10000
NVIDIA Open Model License Agreement Under the NVIDIA Open Model License, NVIDIA confirms: Models are commercially usable. You are free to create and distribute Derivative Models. NVIDIA does not claim ownership to any outputs generated using the Models or Derivative Models. Important Note: If you bypass, disable, reduce the efficacy of, or circumvent any technical limitation, safety guardrail or associated safety guardrail hyperparameter, encryption, security, digital rights management, or aut
Industry
Amazon Nova Pro
Multimodal
Language
Video
Language modeling/generation
Retrieval-augmented generation
Video generation
Amazon
2024-12-03
Introducing Amazon Nova foundation models: Frontier intelligence and industry leading price performance
https://aws.amazon.com/es/blogs/aws/introducing-amazon-nova-frontier-intelligence-and-industry-leading-price-performance/
6.00001e+24
"probably just below 1e25 stemming from the Llama 70B serving speed. If Llama 70B is trained proportionally to 405B, then it's at ~ 6.6e24. Nova Pro is served at 100tk/s, while Llama 70B is served at 70tk/s on average, and 100tk/s by together.ai at FP8. So Nova Pro would be >1e25 if they roughly 2x the amount of training compared to Llama 70B which [seems unlikely]"
Speculative
API access
United States of America
Industry
DeepSeek-R1
Language
Language modeling/generation
Code generation
Quantitative reasoning
Question answering
DeepSeek
2025-01-20
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
https://api-docs.deepseek.com/news/news250120
671000000000.00
671B total 37B activated https://github.com/deepseek-ai/DeepSeek-R1/tree/main
5.17e+24
4.56e+24 FLOP (estimated base model Deepseek V3 training compute) + 6.1e23 FLOP = 5.17e+24 FLOP
Unspecified unreleased
RL + SFT When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round. Unlike the initial cold-start data, which primarily focuses on reasoning, this stage incorporates data from other domains to enhance the model’s capabilities in writing, role-playing, and other general-purpose tasks.
Confident
We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhan
Open weights (unrestricted)
China
DeepSeek-V3
6.1e+23
6.1e23 FLOP from these estimations: https://epoch.ai/gradient-updates/what-went-into-training-deepseek-r1
Unreleased
MIT licensed
Industry
Amazon Titan
Language
Image generation
Semantic search
Image generation
Language modeling/generation
Code generation
Chat
Text-to-image
Translation
Amazon
2023-09-28
https://aws.amazon.com/bedrock/titan/
200000000000.00
200B dense model https://importai.substack.com/p/import-ai-365-wmd-benchmark-amazon
4.8e+24
Trained using NVIDIA NeMo: https://blogs.nvidia.com/blog/nemo-amazon-titan/. 13,760 NVIDIA A100 chips (using 1,720 P4d nodes); it took 48 days to train, from https://importai.substack.com/p/import-ai-365-wmd-benchmark-amazon. Counting operations: 6*200000000000*4000000000000 = 4.8e+24. GPU usage: 312000000000000 (FLOP/s) * 0.3 * 13760 * 1152 * 3600 = 5.3413281792e+24
4000000000000
4T tokens of data, based on comments from amazon engineer James Hamilton at a 2024 talk: https://perspectives.mvdirona.com/2024/01/cidr-2024/ Also cited here: https://lifearchitect.ai/titan/
NVIDIA A100
Likely
API access
United States of America
13760
Unreleased
Industry
DeepSeek-V3
Language
Language modeling/generation
Code generation
Quantitative reasoning
Question answering
DeepSeek
2024-12-24
DeepSeek-V3 Technical Report
https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
671000000000.00
Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token.
4.56e+24
6*37B*14.8T = 3.2856e+24 Alternatively, they say: "DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training." and we know they trained in FP8. H800s get 1.513e15 FLOP/s in FP8: 2.788M * 3600 * 1.513e15 * 0.3 = 4.56e24 Utilization may be somewhat lower for FP8. Upper bound estimate: 50% utilization would mean 7.59e24
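A sketch of the hardware-based figure in this note, using the reported 2.788M H800 GPU-hours and H800 FP8 peak throughput. The 30% utilization is the note's assumption, not a DeepSeek-reported number.

```python
gpu_hours = 2.788e6          # H800 GPU-hours reported for full training
h800_fp8_peak = 1.513e15     # FLOP/s, H800 dense FP8 peak
utilization = 0.3            # assumed; 50% would give ~7.6e24 instead

compute = gpu_hours * 3600 * h800_fp8_peak * utilization
print(f"{compute:.2e} FLOP")  # ~4.6e+24, vs. the 6ND estimate of ~3.3e+24
```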
14800000000000
"We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities"
NVIDIA H800 SXM5
Confident
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-tr
Open weights (restricted use)
China
2048
MIT and deepseek license https://github.com/deepseek-ai/DeepSeek-V3?tab=readme-ov-file
Industry
AFM-server
Language
Language modeling/generation
Apple
Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chong Wang, Chung-Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek, Ruoming Pang, Sam Wiseman, Syd Evans, Tao Lei, Tom Gunter, Vivek Rathod, Xiang Kong, Xianzhi Du, Yanghao Li, Yongqiang Wang, Yuan Gao, Zaid Ahmed, Zhaoyang Xu, Zhiyun Lu, Zirui Wang, Al Rashid, Albin Madappally Jose, Ale
2024-07-29
Apple Intelligence Foundation Language Models
https://machinelearning.apple.com/research/apple-intelligence-foundation-language-models
4.3e+24
"The AFM base models are dense decoder-only models that build on the Transformer architecture" "We train AFM-server from scratch for 6.3T tokens on 8192 TPUv4 chips, using a sequence length of 4096 and a batch-size of 4096 sequences." "For both models we perform continued pre-training at a sequence length of 8192, with another 1T tokens from a mixture that upweights math and code, and down-weights the bulk web-crawl." "The sustained model-flop-utilization (MFU) for this training run was appro
6.3T tokens of web text, code, and math, plus another 1T in the second stage and 100B in the third. See section 3.1 for details.
7400000000000
Not explicitly mentioned, but I assume the 7.4T tokens do not involve multiple epochs.
Google TPU v4
Likely
Hosted access (no API)
United States of America
8192
Unreleased
Industry
MegaScale (Production)
Language
Language modeling/generation
ByteDance
Peking University
Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu
2024-02-23
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
https://arxiv.org/abs/2402.15627
40.00
530000000000.00
Production run is stated to have "hundreds of billions of parameters". Since the authors also do a number of experiments with a 530B model, I speculate they've used 530B for the production model.
3.9e+24
Speculative. The model is stated to have trained for "several weeks". Assuming 530B parameters and "several" = 3, compute can be estimated from the 175B model's stated PFLOP/sec: 2166.3 aggregate PFlops/sec * 3 weeks * 7 days/week * 24 hours/day * 3600 seconds/hour = 3.9e+24. As an upper bound, say 8e+24.
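A sketch of the speculative calculation above: aggregate cluster throughput multiplied by an assumed wall-clock duration ("several weeks" read as 3 weeks, as in the note).

```python
aggregate_pflops = 2166.3        # aggregate PFLOP/s stated for the 175B-model run
weeks = 3                        # "several weeks" assumed to mean 3

seconds = weeks * 7 * 24 * 3600
compute = aggregate_pflops * 1e15 * seconds
print(f"{compute:.2e} FLOP")      # ~3.9e+24
```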
Speculative. Authors note production system was trained on "multi-trillions of tokens". This could refer to training for multiple epochs on the same 300B tokens used to train the 175B and 530B models outlined in more detail in the paper. Alternatively, it could refer to a larger dataset of perhaps 3-9 trillion tokens.
NVIDIA A100
Speculative
We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pip
Unreleased
China
China
12288
Unreleased
Code for MegaScale (also called veScale) training system are released under Apache Licence: https://github.com/volcengine/vescale The model itself is unreleased.
Industry
Academia
SenseChat
Language
Chat
SenseTime
2023-04-10
SenseTime Launches “SenseNova” Foundation Model Sets and AI Computing Systems, Advancing AGI Development
https://www.sensetime.com/en/news-detail/51166397?categoryId=1072
180000000000.00
https://www.thepaper.cn/newsDetail_forward_22639611 Translation: "SenseTime launched the "SenseNova" large model system, which includes natural language generation, image generation services, pre-labeling for perception models, and model development. The "SenseChat" application platform, powered by a 180-billion parameter Chinese language model, supports ultra-long text comprehension and offers capabilities such as question answering, understanding, and generation in Chinese." Link says "hundre
3.8899999999999995e+24
“Over the course of five years, SenseTime has built SenseCore, a leading AI infrastructure with 27,000 GPUs, capable of delivering a total computational power of 5,000 petaflops”. Assuming they used this entire cluster for 30 days of training (a rough average of frontier-model training times since 2016) at a 30% utilization rate: 5000e15 * 0.3 * 30 * 24 * 60 * 60 = 3.89e24 FLOP. Assuming instead that the model is dense and trained Chinchilla-optimally: 20 tokens/parameter * (180e9 parameters)**2 * 6 = 3.89e24 FLOP.
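A minimal sketch of the two estimates above (Python; the 30-day duration, 30% utilization, and Chinchilla-optimal training are our assumptions, as noted):
# Method 1: cluster throughput x assumed training time x assumed utilization
compute_cluster = 5000e15 * 0.3 * 30 * 24 * 3600     # SenseCore at 5,000 PFLOP/s for 30 days
# Method 2: dense model trained Chinchilla-optimally (20 tokens per parameter)
params = 180e9
compute_chinchilla = 6 * params * (20 * params)
print(f"{compute_cluster:.2e} {compute_chinchilla:.2e}")   # both ~3.89e24 FLOP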
Speculative
SenseTime hosted a Tech Day event, sharing their strategic plan for advancing AGI (Artificial General Intelligence) development through the combination of “foundation models + large-scale computing” systems. Under this strategy, SenseTime unveiled the “SenseNova” foundation model set, introducing a variety of foundation models and capabilities in natural language processing, content generation, automated data annotation, and custom model training. At the event, SenseTime not only showcased their
Hong Kong
China
Industry
Claude 2
Language
Language modeling
Chat
Language modeling/generation
Question answering
Anthropic
2023-07-11
https://www.anthropic.com/index/claude-2, https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf
3.866e+24
https://colab.research.google.com/drive/1MdPuhS4Emaf23VXYZ-ooExDW-5GXZkw0#scrollTo=Ds0Q5X8aMnOY
Unspecified unreleased
From model card: "Claude models are trained on a proprietary mix of publicly available information from the Internet, datasets that we license from third party businesses, and data that our users affirmatively share or that crowd workers provide. Some of the human feedback data used to finetune Claude was made public [12] alongside our RLHF [2] and red-teaming [4] research. Claude 2’s training data cuts off in early 2023, and roughly 10 percent of the data included was non-English."
Speculative
API access
United States of America
Unreleased
Industry
Falcon-180B
Language
Language modeling
Technology Innovation Institute
Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, Guilherme Penedo
2023-09-06
The Falcon Series of Open Language Models
https://falconllm.tii.ae/falcon-180b.html; https://arxiv.org/abs/2311.16867
261.00
180000000000.00
"Falcon 180B is a super-powerful language model with 180 billion parameters"
3.76e+24
43,500 petaflop-days per Table 1 of the paper: 43500 * 1e15 * 24 * 3600 = 3.76e24 FLOP. Cross-check: C = 6ND = 6 FLOP/token/parameter * 3.5 trillion tokens * 180 billion parameters = 3.78e24 FLOP.
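A minimal sketch of this cross-check (Python; 43,500 petaflop-days from Table 1, 180B parameters and 3.5T tokens from the entry above):
# Reported budget in petaflop-days, converted to FLOP
compute_reported = 43500 * 1e15 * 24 * 3600    # ~3.76e24
# Cross-check with C = 6ND
compute_6nd = 6 * 180e9 * 3.5e12               # ~3.78e24
print(f"{compute_reported:.2e} {compute_6nd:.2e}")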
RefinedWeb
"The Falcon series is made of three causal decoder-only models trained on up to 4,096 A100. We assembled a pretraining dataset of 3,500 billion tokens, predominantly sourced from our work on RefinedWeb (Penedo et al., 2023)–a massive filtered and deduplicated web dataset" Training dataset composition is described in Table 3. Falcon was trained for 1 epoch.
2625000000000
3.5 trillion tokens * (~3 words per 4 tokens) ~= 2.625 trillion words
NVIDIA A100 SXM4 40 GB
Confident
Falcon 180B is a super-powerful language model with 180 billion parameters, trained on 3.5 trillion tokens. It's currently at the top of the Hugging Face Leaderboard for pre-trained Open Large Language Models and is available for both research and commercial use. This model performs exceptionally well in various tasks like reasoning, coding, proficiency, and knowledge tests, even beating competitors like Meta's LLaMA 2. Among closed source models, it ranks just behind OpenAI's GPT 4, and perfo
Open weights (restricted use)
United Arab Emirates
4096
Unreleased
"Falcon 180b can be commercially used but under very restrictive conditions, excluding any "hosting use"." https://huggingface.co/blog/falcon-180b
Government
QwQ-32B
Language
Language modeling/generation
Question answering
Quantitative reasoning
Code generation
Alibaba
Qwen Team
2025-03-06
QwQ-32B: Embracing the Power of Reinforcement Learning
https://qwenlm.github.io/blog/qwq-32b/
32500000000.00
Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias. Number of Parameters: 32.5B. Number of Parameters (Non-Embedding): 31.0B. Number of Layers: 64. Number of Attention Heads (GQA): 40 for Q and 8 for KV.
3.51e+24
Assuming the same dataset size as for Qwen2.5 training (18T tokens): 6ND = 6 * 32.5e9 parameters * 18e12 tokens = 3.51e24 FLOP. 'Speculative' confidence.
Unspecified unreleased
Speculatively, might be similar to the Qwen2.5 models (18T tokens).
Speculative
QwQ is the reasoning model of the Qwen series. Compared with conventional instruction-tuned models, QwQ, which is capable of thinking and reasoning, can achieve significantly enhanced performance in downstream tasks, especially hard problems. QwQ-32B is the medium-sized reasoning model, which is capable of achieving competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1, o1-mini.
Open weights (unrestricted)
China
Qwen2.5-Coder (32B)
Unreleased
https://huggingface.co/Qwen/QwQ-32B Apache 2
Industry
Qwen2.5-32B
Language
Language modeling/generation
Question answering
Quantitative reasoning
Alibaba
Qwen Team
2024-09-17
Qwen2.5: A Party of Foundation Models!
https://qwenlm.github.io/blog/qwen2.5/
32500000000.00
32.5B
3.51e+24
6 * 32.5B parameters * 18 trillion tokens = 3.51 × 10^24
Unspecified unreleased
18000000000000
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens"
Confident
In the past three months since Qwen2’s release, numerous developers have built new models on the Qwen2 language models, providing us with valuable feedback. During this period, we have focused on creating smarter and more knowledgeable language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5. We are announcing what might be the largest opensource release in history! Let’s get the party started! The Qwen2.5-7B model surpasses its predecessors and counte
Open weights (unrestricted)
China
Unreleased
Apache 2.0 https://huggingface.co/Qwen/Qwen2.5-32B
Industry
Hunyuan-Large
Language
Language modeling/generation
Question answering
Code generation
Translation
Tencent
Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Xuemeng Huang, Fengzong Lian, Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Jun Xia, Tao Yang, Suncong Zheng, Kan Wu, Dian Jiao, Jinbao Xue, Xipeng Zhang, Decheng Wu, Kai Liu, Dengpeng Wu, Guanghui Xu, Shaohua Chen, Shuang Chen, Xiao Feng, Yigeng Hong, Junqiang Zheng, Chengcheng Xu, Zongwei Li, Xiong Kuang, Jianglu
2024-11-06
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
https://arxiv.org/abs/2411.02265
389000000000.00
"a total of 389 billion parameters and 52 billion activation parameters"
3.49237e+24
52B activated parameters. 6ND = 6 * 52e9 * 7e12 = 2.184e24 FLOP. The authors also suggest a more precise formula for the MoE compute budget: 9.59ND + 2.3e8 * D = 9.59 * 52e9 * 7e12 + 2.3e8 * 7e12 = 3.49237e24 FLOP, which seems closer to the projected compute in Figure 3.
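A minimal sketch comparing the two formulas in the note above (Python; N = activated parameters, D = training tokens):
N, D = 52e9, 7e12                       # activated parameters, training tokens
compute_6nd = 6 * N * D                 # ~2.18e24 FLOP
compute_moe = 9.59 * N * D + 2.3e8 * D  # paper's MoE budget formula, ~3.49e24 FLOP
print(f"{compute_6nd:.3e} {compute_moe:.3e}")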
Unspecified unreleased
7000000000000
"# Trained Tokens 7T" Table 1
Confident
In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outp
Open weights (restricted use)
China
Open (restricted use)
the license doesn't regulate usage in the EU also requires additional licensing in case of massive commercial use
Industry
NVLM-X 72B
Vision
Language
Language modeling/generation
Vision-language generation
Question answering
Code generation
Translation
Quantitative reasoning
NVIDIA
Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
2024-10-22
NVLM: Open Frontier-Class Multimodal LLMs
https://arxiv.org/abs/2409.11402
72000000000.00
72B
3.0398181e+24
3.02e24 FLOP (Qwen2-72B compute) + 1.9818086e22 FLOP (multimodal training, see finetune compute notes) = 3.0398181e24 FLOP
COCO
Conceptual Captions (CC3M)
SBU
VQAv2
VisualGenome
TextVQA
OCR-VQA
Captioning COCO [72], CC3M [127], SBU [114], LAION-115M (sanitized) [123; 66] VQA (natural image) VQAv2 [38], Visual Genome [59] Chart DVQA [51] Document Docmatix [90] OCR / Scene-Text OCR-VQA [98], COCO-Text [144], TextOCR [132], ReCTs [170], RRC-ArT [22], RRC-LSVT [134] RCTW [128], synthdog-en [57], pdfa-eng-wds [117] Math CLEVR-Math [73]
45875200000
Pre-training: global batch size 2,048; sequence length in the LLM decoder 512; downsampling of visual tokens 1024->256; 256 visual tokens per tile; 1 tile; 20K training steps. 2048 * (512 + 256 * 1) * 20000 = 31,457,280,000. SFT: global batch size 256; sequence length in the LLM decoder 1,024; 256 visual tokens per tile; 6+1 tiles; 20K training steps. 256 * (1,024 + 256 * 7) * 20000 = 14,417,920,000. Total: 31,457,280,000 + 14,417,920,000 = 45,875,200,000.
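A minimal sketch of the token-count arithmetic above (Python; tokens seen = batch size x (text tokens + visual tokens per tile x tiles) x steps):
def tokens_seen(batch, text_len, vis_per_tile, tiles, steps):
    # total image/text tokens processed during a training stage
    return batch * (text_len + vis_per_tile * tiles) * steps

pretrain = tokens_seen(2048, 512, 256, 1, 20_000)   # 31,457,280,000
sft = tokens_seen(256, 1024, 256, 7, 20_000)        # 14,417,920,000
print(pretrain + sft)                               # 45,875,200,000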
NVIDIA H100 SXM5 80GB
Likely
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cro
Open weights (non-commercial)
United States of America
Qwen2-72B
InternViT-6B
19818086000000B
6*72B*45875200000 = 1.9818086e+22
128
Industry
Qwen2-72B
Language
Chat
Language modeling/generation
Alibaba
Qwen Team
2024-06-07
Hello Qwen2
https://qwenlm.github.io/blog/qwen2/ https://arxiv.org/abs/2407.10671
72710000000.00
72.71B parameters in total, of which 70.21B are non-embedding parameters
3.02e+24
72 billion params, 7 trillion tokens 6 * 72 billion * 7 trillion ~= 3.02e24
Unspecified unreleased
"All models were pre-trained on a high-quality, large-scale dataset comprising over 7 trillion tokens, covering a wide range of domains and languages. Compared to previous editions of Qwen, Qwen2 includes a broader spectrum of linguistic data, enhancing the quantity and quality of code and mathematics content. "
7000000000000
"All models were pre-trained on a high-quality, large-scale dataset comprising over 7 trillion tokens, covering a wide range of domains and languages."
Confident
After months of efforts, we are pleased to announce the evolution from Qwen1.5 to Qwen2. This time, we bring to you: - Pretrained and instruction-tuned models of 5 sizes, including Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B; - Having been trained on data in 27 additional languages besides English and Chinese; - State-of-the-art performance in a large number of benchmark evaluations; - Significantly improved performance in coding and mathematics; - Extended context length sup
Open weights (unrestricted)
China
Unreleased
Apache 2.0
Industry
NVLM-D 72B
Vision
Language
Language modeling/generation
Vision-language generation
Question answering
Code generation
Translation
Quantitative reasoning
NVIDIA
Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
2024-10-22
NVLM: Open Frontier-Class Multimodal LLMs
https://arxiv.org/abs/2409.11402
72000000000.00
72B
3.02e+24
Uses Qwen2-72B as a backbone, which was trained with 3.02e24 FLOP, as well as InternViT-6B. It's unclear how many FLOP were spent training InternViT-6B, but probably negligible; e.g. PaLI trained ViT-e with ~4B parameters using 1.07e23 FLOP. Fine-tuning FLOP: 57,016,320,000 image/text tokens over all stages; 6 * 72B * 57,016,320,000 = 2.463e22.
COCO
Conceptual Captions (CC3M)
SBU
VQAv2
VisualGenome
TextVQA
OCR-VQA
Captioning COCO [72], CC3M [127], SBU [114], LAION-115M (sanitized) [123; 66] VQA (natural image) VQAv2 [38], Visual Genome [59] Chart DVQA [51] Document Docmatix [90] OCR / Scene-Text OCR-VQA [98], COCO-Text [144], TextOCR [132], ReCTs [170], RRC-ArT [22], RRC-LSVT [134] RCTW [128], synthdog-en [57], pdfa-eng-wds [117] Math CLEVR-Math [73]
57016320000
Pre-training: global batch size 2,048; sequence length in the LLM decoder 512; downsampling of visual tokens 1024->256; 256 visual tokens per tile; 1 tile; 20K training steps. 2048 * (512 + 256 * 1) * 20000 = 31,457,280,000. SFT: global batch size 128; sequence length in the LLM decoder 3,200; 256 visual tokens per tile; 6+1 tiles; 40K training steps. 128 * (3200 + 256 * 7) * 40000 = 25,559,040,000. Total: 31,457,280,000 + 25,559,040,000 = 57,016,320,000.
NVIDIA H100 SXM5 80GB
Confident
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cro
Open weights (non-commercial)
United States of America
Qwen2-72B
InternViT-6B
24630000000000B
Fine-tuning FLOPs: 57,016,320,000 image/text tokens over all stages 6 * 72B * 57,016,320,000 = 2.463e22
128
Open (non-commercial)
https://huggingface.co/nvidia/NVLM-D-72B Creative Commons Attribution: Non-Commercial 4.0 International *training code "coming soon"
Industry
NVLM-H 72B
Vision
Language
Language modeling/generation
Vision-language generation
Question answering
Code generation
Translation
Quantitative reasoning
NVIDIA
Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
2024-10-22
NVLM: Open Frontier-Class Multimodal LLMs
https://arxiv.org/abs/2409.11402
72000000000.00
72B
3.02e+24
Additional compute in this paper is negligible relative to the compute used to train the language model backbone (Qwen2-72B at 3.02e24 FLOP)
COCO
Conceptual Captions (CC3M)
SBU
VQAv2
VisualGenome
TextVQA
OCR-VQA
Captioning COCO [72], CC3M [127], SBU [114], LAION-115M (sanitized) [123; 66] VQA (natural image) VQAv2 [38], Visual Genome [59] Chart DVQA [51] Document Docmatix [90] OCR / Scene-Text OCR-VQA [98], COCO-Text [144], TextOCR [132], ReCTs [170], RRC-ArT [22], RRC-LSVT [134] RCTW [128], synthdog-en [57], pdfa-eng-wds [117] Math CLEVR-Math [73]
125829120000
Pre-training: global batch size 2,048; sequence length in the LLM decoder 512; downsampling of visual tokens 1024->256; 256 visual tokens per tile; 6+1 tiles; 20K training steps. 2048 * (512 + 256 * 7) * 20000 = 94,371,840,000. SFT: global batch size 256; sequence length in the LLM decoder 1,280; 256 visual tokens per tile; 6+1 tiles; 40K training steps. 256 * (1280 + 256 * 7) * 40000 = 31,457,280,000. Total: 94,371,840,000 + 31,457,280,000 = 125,829,120,000.
NVIDIA H100 SXM5 80GB
Likely
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cro
Open weights (non-commercial)
United States of America
Qwen2-72B
InternViT-6B
54360000000000B
6ND = 6*125,829,120,000*72000000000.00 = 5.436e22
128
Industry
Grok-1
Language
Language modeling
Chat
xAI
2023-11-04
Announcing Grok
https://x.ai/model-card/, https://x.ai/blog/grok-os
314000000000.00
"314B parameter Mixture-of-Experts model with 25% of the weights active on a given token". So effectively 78B parameters Mixture of 8 experts: https://github.com/xai-org/grok-1
2.90000000001e+24
"On these benchmarks, Grok-1 displayed strong results, surpassing all other models in its compute class, including ChatGPT-3.5 and Inflection-1. It is only surpassed by models that were trained with a significantly larger amount of training data and compute resources like GPT-4" Per table, Grok-1 is surpassed by Palm 2, Claude 2, GPT-4, so it required less compute than these three models. Palm 2 was trained on 7e24 FLOP. GPT-3.5 is ~2.6e24. Inflection-1's compute is not public/known by us but
Unspecified unreleased
"Base model trained on a large amount of text data, not fine-tuned for any particular task." "The training data used for the release version of Grok-1 comes from both the Internet up to Q3 2023 and the data provided by our AI Tutors."
6200000000000
(Speculative confidence, see compute notes)
Likely
Grok is an AI modeled after the Hitchhiker’s Guide to the Galaxy, so intended to answer almost anything and, far harder, even suggest what questions to ask! Grok is designed to answer questions with a bit of wit and has a rebellious streak, so please don’t use it if you hate humor! A unique and fundamental advantage of Grok is that it has real-time knowledge of the world via the 𝕏 platform. It will also answer spicy questions that are rejected by most other AI systems. Grok is still a very e
Open weights (unrestricted)
United States of America
Unreleased
apache 2.0
Industry
Minerva (540B)
Language
Quantitative reasoning
Google
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, Vedant Misra
2022-06-29
Solving Quantitative Reasoning Problems with Language Models
https://arxiv.org/abs/2206.14858
585.00
540350000000.00
"To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM)." Our approach is to start with the PaLM pretrained decoder-only transformer language models Chowdhery et al. (2022), and further train (finetune) them on our mathematical dataset using an autoregressive objective. Table 2 contains the main model and training hyperparameters. See Table 2
2.7415e+24
Minerva was fine-tuned from PaLM using the same hardware. Assume the same model FLOPs utilization rate for pre-training and fine-tuning. PaLM pretraining time: 6144 TPUs for 1200 hours + 3072 TPUs for 336 hours = 8,404,992 TPU-hours. Minerva finetuning time: 1024 TPUs for 696 hours = 712,704 TPU-hours. So fine-tuning added ~8.5% more compute. Minerva total compute = PaLM pretraining compute * (712704 + 8404992) / 8404992 = 2.7415e24 FLOP. https://www.wolframalpha.com/input?i=%28712704%2B8404992%29%2F%
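A minimal sketch of the scaling argument above (Python; assumes the same MFU for PaLM pre-training and Minerva fine-tuning):
palm_compute = 2.5272e24                       # PaLM pretraining FLOP (see PaLM entry)
palm_tpu_hours = 6144 * 1200 + 3072 * 336      # 8,404,992 TPU-hours
minerva_tpu_hours = 1024 * 696                 # 712,704 TPU-hours
total = palm_compute * (palm_tpu_hours + minerva_tpu_hours) / palm_tpu_hours
print(f"{total:.4e} FLOP")                     # ~2.7415e24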
arXiv
PaLM, finetuned on arxiv
26000000000
"Our models were trained on a dataset of 38.5B tokens" + PaLM upd 38.5B tokens - sie of the dataset, the model saw 26B tokens in 399k steps (see Table 2)
Google TPU v4
Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning, such as solving mathematics, science, and engineering problems at the college level. To help close this gap, we introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content. The model achieves state-o
Unreleased
United States of America
PaLM (540B)
214290000000000B
1024
Unreleased
Industry
DBRX
Language
Chat
Code generation
Databricks
Mosaic Research Team
2024-03-27
Introducing DBRX: A New State-of-the-Art Open LLM
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
132000000000.00
132B mixture of experts. 36B parameters active per inference
2.6e+24
Mixture of Experts (MoE): 36 billion active params * 12 trillion tokens * 6 ~= 2.6e24 https://www.wolframalpha.com/input?i=6+FLOP+*+36+billion+*+12+trillion Also, it was trained on 3072 NVIDIA H100s, but with an unclear timeframe (the end-to-end process was three months, including evals and red-teaming).
12T tokens, text and code "It was pre-trained on 12T tokens of text and code data... DBRX was pretrained on 12T tokens of carefully curated data and a maximum context length of 32k tokens. We estimate that this data is at least 2x better token-for-token than the data we used to pretrain the MPT family of models" from HF: https://huggingface.co/databricks/dbrx-base The training mix used for DBRX contains both natural-language and code examples. The vast majority of our training data is in the
9000000000000
12T tokens is equivalent to 9T words. Though it includes code data, so not very literally 9T words
NVIDIA H100 SXM5 80GB
Confident
Today, we are excited to introduce DBRX, an open, general-purpose LLM created by Databricks. Across a range of standard benchmarks, DBRX sets a new state-of-the-art for established open LLMs. Moreover, it provides the open community and enterprises building their own LLMs with capabilities that were previously limited to closed model APIs; according to our measurements, it surpasses GPT-3.5, and it is competitive with Gemini 1.0 Pro. It is an especially capable code model, surpassing specialized
Open weights (restricted use)
United States of America
Unreleased
license: https://www.databricks.com/legal/open-model-license conditions based on monthly users
Industry
GPT-3.5
Language
Language modeling
OpenAI
2022-11-28
https://platform.openai.com/docs/models/gpt-3-5
Parameter count may be 175B based on OpenAI's statements that text-davinci-003 is in the GPT-3.5 series of models. It was also stated to be 175B in the Microsoft CODEFUSION paper, but the paper was reportedly retracted because the authors did not know the parameter count.
2.578e+24
https://colab.research.google.com/drive/1QSxa8YCWjEBQU7mrXLhw6TP1VX5oqgdW#scrollTo=Gt6Z6oZ26clI
NVIDIA A100 SXM4 40 GB
Speculative
API access
United States of America
Unreleased
Industry
U-PaLM (540B)
Language
Language generation
Google
Yi Tay, Jason Wei, Hyung Won Chung, Vinh Q. Tran, David R. So, Siamak Shakeri, Xavier Garcia, Huaixiu Steven Zheng, Jinfeng Rao, Aakanksha Chowdhery, Denny Zhou, Donald Metzler, Slav Petrov, Neil Houlsby, Quoc V. Le, Mostafa Dehghani
2022-10-20
Transcending Scaling Laws with 0.1% Extra Compute
https://arxiv.org/abs/2210.11399
61.00
540000000000.00
2.53e+24
"The total number of extra tokens we train on for the 540B model is approximately 1.3 Billion which constitutes 0.16% extra computation... Training an U-PaLM 540B model only consumes 512 TPUv4 chips and finishes in about 5 days which is considered to be lightweight." original PaLM was 2.527e+24. adding 0.16% is ~2.53e24
"To keep things consistent, we train this model with the same data mixture as PaLM and do not rely on additional sources of data (labeled or unlabeled)."
Google TPU v4
Confident
Scaling language models improves performance but comes with significant computational costs. This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute. The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) on a few more steps with UL2's mixture-of-denoiser objective. We show that, with almost negligible extra computational costs and no new sources of data, we
Unreleased
United States of America
PaLM (540B)
4000000000000B
"The total number of extra tokens we train on for the 540B model is approximately 1.3 Billion which constitutes 0.16% extra computation... Training an U-PaLM 540B model only consumes 512 TPUv4 chips and finishes in about 5 days which is considered to be lightweight." PaLM was 2.5e24 0.16% of that is 4e21
512
Unreleased
Industry
PaLM (540B)
Language
Language modeling
Code generation
Translation
Google Research
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev,, Henryk Michalewski, Xav
2022-04-04
PaLM: Scaling Language Modeling with Pathways
https://arxiv.org/abs/2204.02311
5064.00
540350000000.00
"To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM)."
2.5272e+24
See Table 20. 6144 TPUv4 for 1200 hours + 3072 TPUv4 for 336 hours. Equivalent to 6144 TPUv4 for 1368 hours. 46.2% model FLOPs utilization "The 540B-parameter PaLM model sustained a remarkable 57.8% of the peak hardware floating point performance over 50 days while training on TPU v4 supercomputers. " https://cloud.google.com/blog/topics/systems/tpu-v4-enables-performance-energy-and-co2e-efficiency-gains
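For reference, the listed figure also matches a simple 6ND cross-check using the entry's parameter and token counts; a minimal sketch (Python):
# 6ND cross-check for PaLM 540B
params = 540e9      # ~540.35B densely activated parameters, rounded
tokens = 780e9      # pretraining tokens
print(f"{6 * params * tokens:.4e} FLOP")   # 2.5272e24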
Wikipedia
GLaM dataset
LaMBDA dataset
GitHub
585000000000
"The PaLM pretraining dataset consists of a high-quality corpus of 780 billion tokens that represent a wide range of natural language use cases." 1 token ~ 0.75 words
Google TPU v4
Confident
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 c
Unreleased
Multinational
United States of America
6144
Unreleased
Industry
Flan-PaLM 540B
Language
Language modeling/generation
Google
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei
2022-10-20
Scaling Instruction-Finetuned Language Models
https://arxiv.org/abs/2210.11416
2506.00
540000000000.00
540B
2.5e+24
0.2% greater than PaLM 540B, which used 2.5e24 FLOP
Flan
Various instruction examples for many tasks: "Our final set of finetuning tasks is sourced from a combination of tasks from FLAN, T0, Natural Instructions, along with some dialog, program synthesis, and chain-of-thought reasoning tasks, as described in Figure 2. We provide specific pointers and citations in Table 24. All data sources are publicly available. We also remove all MMLU tasks from Natural Instructions to preserve its role as a broad benchmark of 57 held-out tasks for evaluation. In t
Google TPU v4
Confident
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups
Unreleased
United States of America
PaLM (540B)
5600000000000B
5.6e21 per Table 2. "we only use 0.2% of the pre-training compute to instruction-finetune Flan-PaLM 540B (approximately 512 v4 TPU chips for 37 hours)". 512 * 37 * 3600 * 275 teraflops * 0.3 = 5.6e21 (so the 30% utilization assumption was correct).
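A minimal sketch of the chip-hours estimate above (Python; 275 TFLOP/s is TPU v4 peak per chip, 0.3 the assumed utilization):
chips, hours = 512, 37
peak = 275e12                # TPU v4 peak FLOP/s per chip
utilization = 0.3
print(f"{chips * hours * 3600 * peak * utilization:.1e} FLOP")   # ~5.6e21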
512
Unreleased
Industry
Gemma 3 27B
Language
Vision
Multimodal
Language modeling/generation
Question answering
Translation
Chat
Quantitative reasoning
Visual question answering
Code generation
Google DeepMind
Core contributors: Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai Zhai, Anton Tsitsulin, Robert Busa-Fekete, Alex Feng, Noveen Sachdeva, Benjamin Coleman, Yi Gao, Basil Mustafa, Iain Barr, E
2025-03-12
Gemma 3 Technical Report
https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf
27000000000.00
Vision Encoder: 417M Embedding Parameters: 1,416M Non-embedding Parameters: 25,600M
2.268e+24
6ND = 6 * 27B parameters * 14T training tokens = 2.268 × 10^24 FLOP
Unspecified unreleased
14000000000000
14T
Google TPU v5p
Confident
We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context – at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local atten
Open weights (restricted use)
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
SigLIP 400M
6144
Unreleased
https://huggingface.co/google/gemma-3-27b-it Gemma License
Industry
Evo 2 40B
Biology
Protein or nucleotide language model (pLM/nLM)
Arc Institute
Stanford University
NVIDIA
Liquid
University of California (UC) Berkeley
Goodfire
Columbia University
University of California San Francisco
Garyk Brixi, Matthew G. Durrant, Jerome Ku, Michael Poli, Greg Brockman, Daniel Chang, Gabriel A. Gonzalez, Samuel H. King, David B. Li, Aditi T. Merchant, Mohsen Naghipourfar, Eric Nguyen, Chiara Ricci-Tam, David W. Romero, Gwanggyu Sun, Ali Taghibakshi, Anton Vorontsov, Brandon Yang, Myra Deng, Liv Gorton, Nam Nguyen, Nicholas K. Wang, Etowah Adams, Stephen A. Baccus, Steven Dillmann, Stefano Ermon, Daniel Guo, Rajesh Ilango, Ken Janik, Amy X. Lu, Reshma Mehta, Mohammad R.K. Mofrad, Madelena Y
2025-02-19
Genome modeling and design across all domains of life with Evo 2
https://arcinstitute.org/manuscripts/Evo2
40300000000.00
Table 1 lists 40.3B parameters as the model size.
2.2500000000000004e+24
40.3e9 parameters * 9.3e12 training datapoints * 6 = 2.25e24. The same FLOP estimate is given by the authors in Table 1.
OpenGenome 2
9300000000000
"We trained two versions of Evo 2: a smaller version at 7B parameters trained on 2.4 trillion tokens and a full version at 40B parameters trained on 9.3 trillion tokens."
Unverified
All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes. We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedente
Open weights (unrestricted)
United States of America
United States of America
United States of America
United States of America
United States of America
United States of America
Academia
Industry
Industry
Academia
Academia
Academia
Gemma 2 27B
Language
Language modeling/generation
Chat
Code generation
Question answering
Quantitative reasoning
Google DeepMind
Gemma Team, Google DeepMind
2024-06-24
Gemma 2 offers best-in-class performance, runs at incredible speed across different hardware and easily integrates with other AI tools.
https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf
27000000000.00
2.106e+24
"For the 27B model, we train on an 8x24x32 configuration of TPUv5p, totaling 6144 chips" trained on 13T tokens 6ND = 6*27000000000*13000000000000=2.106e+24
Unspecified unreleased
Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. Primarily English-language content. Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code or understand code-related questions. Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical quer
13000000000000
"We train Gemma 2 27B on 13 trillion tokens of primarily-English data"
Google TPU v5p
Confident
Now we’re officially releasing Gemma 2 to researchers and developers globally. Available in both 9 billion (9B) and 27 billion (27B) parameter sizes, Gemma 2 is higher-performing and more efficient at inference than the first generation, with significant safety advancements built in. In fact, at 27B, it offers competitive alternatives to models more than twice its size, delivering the kind of performance that was only possible with proprietary models as recently as December. And that’s now achie
Open weights (restricted use)
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
6144
Unreleased
Gemma 2 is available under our commercially-friendly Gemma license, giving developers and researchers the ability to share and commercialize their innovations.
Industry
FLAN 137B
Language
Language modeling
Question answering
Language modeling/generation
Google Research
Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le
2021-09-03
Finetuned Language Models Are Zero-Shot Learners
https://arxiv.org/abs/2109.01652
2994.00
137000000000.00
Abstract: "We take a 137B parameter pretrained language model and instruction tune it on over 60 NLP datasets verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types." Many models seem to be using the same 137B base transformer model?
2.047e+24
From section 2.4: pretraining was done over 2.49T tokens. 6 * 2.49T * 137B = 2.047e24 FLOP. Also, "instruction tuning takes around 60 hours on a TPUv3 with 128 cores"; 128 TPUv3 cores = 64 TPUv3 chips. The environmental considerations section claims this took less than 2% of the total: 1.23e14 * 64 * 60 * 3600 * 0.3 = 5.10e20 FLOP.
Wikipedia
Unspecified unreleased
Abstract: "We take a 137B parameter pretrained language model and instruction tune it on over 60 NLP datasets"
2490000000000
"Model architecture and pretraining. In our experiments, we use LaMDA-PT, a dense left-to-right, decoder-only transformer language model of 137B parameters (Thoppilan et al., 2022). This model is pretrained on a collection of web documents (including those with computer code), dialog data, and Wikipedia, tokenized into 2.49T BPE tokens with a 32k vocabulary using the SentencePiece library (Kudo & Richardson, 2018). Around 10% of the pretraining data was non-English. Note that LaMDA-PT only has
Google TPU v3
Confident
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning—finetuning language models on a collection of datasets described via instructions—substantially improves zeroshot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction tune it on over 60 NLP datasets verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on uns
Unreleased
Multinational
United States of America
LaMDA
"In our experiments, we use LaMDA-PT, a dense left-to-right, decoder-only transformer language model of 137B parameters (Thoppilan et al., 2022) [...] Note that LaMDA-PT only has language model pretraining (c.f. LaMDA, which was finetuned for dialog)." In our entry for LaMDA we only measured pre-training compute, so we just specify LaMDA as the base model of FLAN 137B.
Unreleased
Industry
Gemini 1.0 Pro
Multimodal
Language
Vision
Language modeling
Visual question answering
Chat
Translation
Google DeepMind
Gemini Team
2023-12-06
Gemini: A Family of Highly Capable Multimodal Models
https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf
633.00
1.8300010000000002e+24
Training compute estimated from benchmark scores. Our reasoning and calculations for Gemini 1 Ultra are detailed in this Colab notebook. https://colab.research.google.com/drive/1sfG91UfiYpEYnj_xB5YRy07T5dv-9O_c
Unspecified unreleased
"Gemini models are trained on a dataset that is both multimodal and multilingual. Our pretraining dataset uses data from web documents, books, and code, and includes image, audio, and video data... We find that data quality is critical to a highlyperforming model, and believe that many interesting questions remain around finding the optimal dataset distribution for pretraining."
Google TPU v4
Speculative
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks — notably being the first mode
API access
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
Unreleased
API access: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models
Industry
Yi-Large
Language
Chat
Language modeling/generation
01.AI
Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai
2024-05-13
100000000000.00
"Yi-Large is a software over-the-air-driven closed-source large model with a parameter of over 100 billion tokens." from https://www.chinadaily.com.cn/a/202405/13/WS6641abd1a31082fc043c6ccd.html
1.7999999999999996e+24
6ND = 6*100000000000*3000000000000=1.8e+24 (speculative confidence because training dataset size is very uncertain)
3000000000000
3T tokens for previous Yi models: "Targeted as a bilingual language model and trained on 3T multilingual corpus, the Yi series models become one of the strongest LLM worldwide, showing promise in language understanding, commonsense reasoning, reading comprehension, and more."
Speculative
API access
China
Unreleased
Industry
DeepSeek-V2.5
Language
Language modeling/generation
Chat
Code generation
DeepSeek
DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J.L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jin Chen, Jingyang Yuan, Junjie Qiu, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Lean Wang, Lecong Z
2024-09-06
DeepSeek-V2.5
https://huggingface.co/deepseek-ai/DeepSeek-V2.5
236000000000.00
21B active params, 236B total
1.7892000000000004e+24
V2.5 is a merge of V2-Coder and V2-Chat. V2-Coder was trained for 6T additional tokens from an intermediate checkpoint of V2, which had been trained for 4.2T tokens (total: 10.2T). V2-Chat is fine-tuned from V2, which saw 8.2T tokens in pre-training. Unique tokens: 8.2T + 6T = 14.2T. FLOP: 6 * 21B * 14.2T = 1.7892e24.
GitHub
Common Crawl
The original V2 had a dataset of 8.1T unique tokens, and coder-V2 added an additional 1.391T unique tokens of code and math. But it appears no additional training was done to combine them into this model.
Confident
Open weights (restricted use)
China
Unreleased
Industry
EXAONE 1.0
Multimodal
Language
Vision
Translation
Language modeling/generation
Visual question answering
LG
2021-12-14
https://www.lgcorp.com/media/release/27387#:~:text=LG%20AI%20Research%20proposes%20EXAONE,performance%20while%20learning%20fewer%20parameters.
300000000000.00
1.6955999999999996e+24
No indication of how images are processed. Supposing they used something like ViT-H/14, and training images were 512x512 (they state "EXAONE shows remarkable performance such as [...] offering 1024x1024 sized image output", but typically this size of image training would only be done during a relatively short, final stage of pre-training), there would be 37x37 = 1,369 patches per image; 1,369 * 250 million images = around 342 billion image patch embeddings. 6 * 300B parameters * (342 billion + 600 billion) datapoints ≈ 1.6956e24 FLOP.
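A minimal sketch of this speculative estimate (Python; the ViT-H/14-style patching, 512x512 training images, and the 250M-image / 600B-token counts are the assumptions described above):
import math

patches_per_image = math.ceil(512 / 14) ** 2      # 37 x 37 = 1,369 patches
image_tokens = patches_per_image * 250e6          # ~3.42e11 image patch embeddings
text_tokens = 600e9                               # "600 billion corpora"
print(f"{6 * 300e9 * (image_tokens + text_tokens):.3e} FLOP")   # ~1.70e24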
Unspecified unreleased
"To create multi-modal AI, LG AI Research Institute learned from 600 billion corpora, the world's largest, and more than 250 million high-resolution images combining language and images. It is also differentiated in that it is a bilingual AI that understands and speaks Korean and English at the level of a native speaker." 600000000000+250000000=600250000000
600250000000
Speculative
[Dec 2021] EXAONE is a bilingual artificial intelligence that has learned the characteristics of both Korean and English languages at the same time. Since the initial development last June, it has completed learning of 1.3 billion, 13 billion, 39 billion, and 175 billion parameter models, and it is currently learning 300 billion parametric models. EXAONE shows remarkable performance such as obtaining the highest FID score, offering 1024x1024 sized image output, and achieving purpose conversatio
Unreleased
Korea (Republic of)
Industry
Movie Gen Video
Video
Video generation
Meta AI
Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Mitesh Kumar Singh, Peizhao Zhang, Peter Vajda, Quentin Duval, Rohit Girdhar, Roshan Sumbaly, Sai Saketh Rambhatla, Sam Tsai, Samaneh Azadi, Samyak Datta, Sanyuan Chen, Sea
2024-10-04
Movie Gen: A Cast of Media Foundation Models
https://ai.meta.com/static-resource/movie-gen-research-paper
30000000000.00
30B
1.65e+24
Model size = 30B. Broken down by training stage (Table 3): 256px T2I: samples seen = 1.94e9; sample token length = 256; FLOP = 6ND = 8.94e22. 256px T2I/V: samples seen = 3.95e8; sample token length = 8,192; FLOP = 6ND = 5.82e23. 768px T2I/V: samples seen = 7.38e7; sample token length = 73,728; FLOP = 6ND = 9.79e23. Total FLOP = 1.65e24.
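A minimal sketch of the per-stage 6ND sum above (Python; D for each stage = samples seen x tokens per sample, from Table 3):
params = 30e9
stages = [(1.94e9, 256),      # 256px T2I
          (3.95e8, 8192),     # 256px T2I/V
          (7.38e7, 73_728)]   # 768px T2I/V
total = sum(6 * params * samples * length for samples, length in stages)
print(f"{total:.2e} FLOP")    # ~1.65e24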
26600000000
O(1B) images; O(100M) videos, each with 256 frames ≈ 25.6B frames; ≈ 26.6B datapoints in total
NVIDIA H100 SXM5 80GB
Confident
We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user’s image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generatio
Unreleased
United States of America
6144
Industry
Chameleon-34B
Multimodal
Image generation
Language
Vision
Language modeling/generation
Vision-language generation
Visual question answering
Text-to-image
Facebook AI Research
Srinivasan Iyer, Bernie Huang, Lili Yu, Arun Babu, Chunting Zhou, Kushal Tirumala, Xi Victoria Lin, Hu Xu, Xian Li, Akshat Shrivastava, Omer Levy, Armen Aghajanyan, Ram Pasunuru, Andrew Cohen, Aram H. Markosyan, Koustuv Sinha, Xiaoqing Ellen Tan, Ivan Evtimov, Ping Yu, Tianlu Wang, Olga Golovneva, Asli Celikyilmaz, Pedro Rodriguez, Leonid Shamis, Vasu Sharma, Christine Jou, Karthik Padthe, Ching-Feng Yeh, Mingda Chen, Bapi Akula, Jacob Kahn, Daniel Li, Scott Yih, Barlas Oguz, Morteza Behrooz, Be
2024-05-16
Chameleon: Mixed-Modal Early-Fusion Foundation Models
https://arxiv.org/abs/2405.09818v1
34000000000.00
1.6453571041e+24
GPU method: Table 2 shows that 34B model pre-training uses 4,282,407 GPU-hours, trained across 3072 A100s. 3.12e14 * 4282407 * 3600 * 0.3 = 1.44e24. Parameter-token method: pre-training goes over 9.2T tokens; post-training only goes over 1.1B tokens (sum of tokens column in Table 3). 6 * 34B * 9.2T = 1.88e24. Geometric mean: sqrt(1.44e24 * 1.88e24) = 1.65e24.
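A minimal sketch of the two methods and their geometric mean (Python; 3.12e14 is A100 peak FLOP/s, 0.3 the assumed utilization):
import math

compute_gpu = 3.12e14 * 4_282_407 * 3600 * 0.3   # GPU-hours method, ~1.44e24
compute_6nd = 6 * 34e9 * 9.2e12                  # parameter-token method, ~1.88e24
print(f"{math.sqrt(compute_gpu * compute_6nd):.2e} FLOP")   # ~1.65e24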
Unspecified unreleased
Pre-training: - 2.9 trillion tokens of pure text - 1.4 billion text-image pairs, which produces 1.5 trillion text-image tokens - Since each image is 1024 tokens, implies 1.43 trillion image tokens and 0.07 trillion text tokens - 400 billion tokens of image-text interleaved documents - Difficult to estimate image-to-text ratio, but references OBELIKS paper which had 141 million web pages, 353 million associated images, and 115 billion text tokens. - 353 million * 1024 = 361.5 billion image tok
4400000000000
Slightly conflicting info. Pre-training data details describe different types of data that sum to 4.8 trillion tokens, but Table 1 indicates 4.4T. Using table values as this agrees with other statements about epochs and total tokens seen.
NVIDIA A100 SXM4 80 GB
Confident
We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-fo
Open weights (non-commercial)
United States of America
Not enough info to estimate. GPU time given for pretraining, and while we know # of fine-tuning tokens we don't know # of epochs.
3072
Unreleased
https://ai.meta.com/resources/models-and-libraries/chameleon-downloads/?gk_enable=chameleon_web_flow_is_live "The models we’re releasing today were safety tuned and support mixed-modal inputs and text-only output to be used for research purposes. While we’ve taken steps to develop these models responsibly, we recognize that risks remain. At this time, we are not releasing the Chameleon image generation model."
Industry
Yi-Lightning
Language
Language modeling/generation
01.AI
2024-10-18
Yi-Lightning
https://www.lingyiwanwu.com/en https://platform.lingyiwanwu.com/
1.5e+24
The CEO of 01.AI tweeted that Yi-Lightning was trained for 1 month on 2000 H100s: https://x.com/kaifulee/status/1846310645849047524 Assuming this is accurate: (9.9e14 * 2000) FLOP/s * 1 month * 30.5 days/month * 24hr/day * 3600 s/hr * 0.3 utilization assumption = 1.565e24
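A minimal sketch of this estimate (Python; 9.9e14 FLOP/s per H100 and 0.3 utilization are the assumptions used above):
gpus, days = 2000, 30.5
peak = 9.9e14                                    # assumed per-H100 peak, FLOP/s
print(f"{gpus * peak * days * 24 * 3600 * 0.3:.3e} FLOP")   # ~1.565e24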
Unspecified unreleased
NVIDIA H100 SXM5 80GB
Confident
API access
China
2000
Unreleased
https://platform.lingyiwanwu.com/
Industry
Qwen-72B
Language
Chat
Code generation
Alibaba
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zha
2023-11-30
https://huggingface.co/Qwen/Qwen-72B
72000000000.00
72B
1.3e+24
72 billion params, 3 trillion tokens: 72B * 3T * 6 ≈ 1.3e24
"It is pretrained on over 3 trillion tokens, including Chinese, English, multilingual texts, code, and mathematics, covering general and professional fields"
3000000000000
Assuming not trained for multiple epochs.
Confident
Qwen-72B is the 72B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-72B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. Additionally, based on the pretrained Qwen-72B, we release Qwen-72B-Chat, a large-model-based AI assistant, which is trained with alignment techniques.
Open weights (restricted use)
China
Unreleased
up to 100m active users: https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT
Industry
Qwen1.5-72B
Language
Chat
Language modeling/generation
Quantitative reasoning
Code generation
Translation
Alibaba
Qwen Team
2024-02-04
Introducing Qwen1.5
https://qwenlm.github.io/blog/qwen1.5/
72000000000.00
72B
1.3e+24
3T training tokens: https://github.com/QwenLM/Qwen2/issues/97 6 * 72 billion * 3 trillion = ~1.3e24
Unspecified unreleased
"We pretrained the models with a large amount of data, and we post-trained the models with both supervised finetuning and direct preference optimization."
3000000000000
3 trillion tokens from this response https://github.com/QwenLM/Qwen2/issues/97
Confident
In recent months, our focus has been on developing a “good” model while optimizing the developer experience. As we progress towards Qwen1.5, the next iteration in our Qwen series, this update arrives just before the Chinese New Year. With Qwen1.5, we are open-sourcing base and chat models across six sizes: 0.5B, 1.8B, 4B, 7B, 14B, and 72B. In line with tradition, we’re also providing quantized models, including Int4 and Int8 GPTQ models, as well as AWQ and GGUF quantized models. To enhance the d
Open weights (restricted use)
China
Unreleased
restriction on >100m monthly users: https://huggingface.co/Qwen/Qwen1.5-72B/blob/main/LICENSE
Industry
DeepSeek-Coder-V2 236B
Language
Code generation
Code autocompletion
DeepSeek
Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi Du, Wenjun Gao, Xuan Lu, Qinyu Chen, Yaohui Wang, Chengqi Deng, Jiashi Li, Chenggang Zhao, Chong Ruan, Fuli Luo, Wenfeng Liang
2024-06-17
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
https://github.com/deepseek-ai/DeepSeek-Coder-V2
236000000000.00
Mixture of experts model. 21B parameters activated per token.
1.2852e+24
Trained on a total of 10.2T tokens. 6ND: 6 * 10.2T tokens * 21B active parameters = 1.285e24 FLOP
GitHub
Common Crawl
See Section 2. "In the pre-training phase, the dataset of DeepSeek-Coder-V2 is created with a composition of 60% source code, 10% math corpus, and 30% natural language corpus"
3191000000000
"In the pre-training phase, the dataset of DeepSeek-Coder-V2 is created with a composition of 60% source code, 10% math corpus, and 30% natural language corpus ... The source code consists of 1,170B code-related tokens sourced from GitHub and CommonCrawl... For the math corpus, we collect 221B math-related tokens sourced from CommonCrawl... In total, DeepSeek-Coder-V2 has been exposed to 10.2T training tokens, where 4.2 trillion tokens originate from the DeepSeek V2 dataset, while the remaining
Confident
We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general l
Open weights (restricted use)
China
DeepSeek-V2 (MoE-236B)
Unreleased
license has some harmful use restrictions: https://github.com/deepseek-ai/DeepSeek-Coder-V2/blob/main/LICENSE-MODEL no training code
Industry
EXAONE 3.5-R 32B
Language
Language modeling/generation
Question answering
Translation
LG AI Research
2025-03-14
32000000000.00
32B
1.2692e+24
1.25 × 10^24 (base model reported training compute) + 1.92 × 10^22 (finetune compute) = 1.2692e+24 FLOP
Confident
Unreleased
Korea (Republic of)
EXAONE 3.5 32B
19200000000000B
1.92e22
Unreleased
Industry
Code Llama-70B
Language
Code generation
Meta AI
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Ellen Tan, Yossef (Yossi) Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Defossez, Jade Copet, Faisal Azhar, Hugo Touvron, Gabriel Synnaeve, Louis Martin, Nicolas Usunier, Thomas Scialom
2024-01-29
Code Llama: Open Foundation Models for Code
https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/ https://arxiv.org/abs/2308.12950
1297.00
70000000000.00
70B
1.26e+24
The base model saw 2T tokens; Code Llama-70B was trained on an additional 1T. 6ND: 6 * 3T tokens * 70B = 1.26e24 FLOP
Unspecified unreleased
We are releasing four sizes of Code Llama with 7B, 13B, 34B, and 70B parameters respectively. Each of these models is trained with 500B tokens of code and code-related data, apart from 70B, which is trained on 1T tokens.
1000000000000
Llama 70B training dataset was 2 trillion tokens. Code Llama finetuning dataset was 1 trillion tokens of code.
NVIDIA A100 SXM4 80 GB
Confident
We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters
Open weights (restricted use)
United States of America
Llama 2-70B
420000000000000B
Fine tuning from base model uses 1T tokens. 70B * 1T * 6 = 4.2E23
400
Unreleased
Llama 2 license. can't use outputs to train models. https://github.com/meta-llama/llama/blob/main/LICENSE
Industry
EXAONE Deep 32B
Language
Language modeling/generation
Question answering
Quantitative reasoning
Code generation
LG AI Research
LG AI Research, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun
2025-03-16
EXAONE Deep: LLMs with Enhanced Reasoning Performance
https://arxiv.org/abs/2503.12524
32000000000.00
32B
1.26e+24
1.25 × 10^24 (base model reported training compute) + 7.04 × 10^21 (finetune compute) = 1.26 × 10^24 FLOP (Table 1)
Unspecified unreleased
12000000000
"To enhance the reasoning capabilities of language models, we have utilized 1.6M instances for SFT and 20K instances of preference data for DPO. The SFT dataset contains approximately 12B tokens"
NVIDIA H100 SXM5 80GB
Confident
We present EXAONE Deep series, which exhibits superior capabilities in various reasoning tasks, including math and coding benchmarks. We train our models mainly on the reasoning-specialized dataset that incorporates long streams of thought processes. Evaluation results show that our smaller models, EXAONE Deep 2.4B and 7.8B, outperform other models of comparable size, while the largest model, EXAONE Deep 32B, demonstrates competitive performance against leading open-weight models. All EXAONE Dee
Open weights (non-commercial)
Korea (Republic of)
EXAONE 3.5 32B
7040000000000B
Table 1 (reported): 7.04 × 10^21 FLOP. 6ND = 6 * 32B parameters * 12B tokens = 2.304e+21 FLOP.
512
Unreleased
https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-32B Exaone License
Industry
EXAONE 3.5 32B
Language
Language modeling/generation
Question answering
Translation
LG AI Research
Soyoung An, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Yountae Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Woohyung Lim, Sangha Park, Sooyoun Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Hyeongu Yun
2024-12-09
EXAONE 3.5: Series of Large Language Models for Real-world Use Cases
https://arxiv.org/abs/2412.04862
32000000000.00
32B
1.25e+24
1.25 × 10^24 (Table 2)
Unspecified unreleased
6500000000000
6.5T tokens (Table 2)
Confident
This technical report introduces the EXAONE 3.5 instruction-tuned language models, developed and released by LG AI Research. The EXAONE 3.5 language models are offered in three configurations: 32B, 7.8B, and 2.4B. These models feature several standout capabilities: 1) exceptional instruction following capabilities in real-world scenarios, achieving the highest scores across seven benchmarks, 2) outstanding long-context comprehension, attaining the top performance in four benchmarks, and 3) compe
Open weights (non-commercial)
Korea (Republic of)
Unreleased
Exaone license (allows only non-commercial usage)
Industry
XVERSE-65B-2
Language
Chat
Language modeling/generation
XVERSE Technology
Shenzhen Yuanxiang Technology
2023-12-08
https://github.com/xverse-ai/XVERSE-65B/blob/main/README_EN.md
65000000000.00
Based on the name. Exact count unknown but may be listed on Hugging Face.
1.248e+24
C = 6ND = 6 * 3.2T tokens * 65B params = 1.248e24 FLOP
[2023/12/08] Released the XVERSE-65B-2 base model. This model builds upon its predecessor through Continual Pre-Training, reaching a total training volume of 3.2 trillion tokens.
3200000000000
Training Data: The model has been thoroughly trained on a diversified and high-quality dataset consisting of 2.6 trillion of tokens, including more than 40 languages such as Chinese, English, Russian, and Spanish. The sampling ratio of different types of data is finely set, which makes the performance of Chinese and English excellent, and also takes into account the effect of other languages. Assume 0.85 words per token on average for the mix of languages.
Confident
Open weights (restricted use)
China
China
Open source
https://github.com/xverse-ai/XVERSE-65B/blob/main/README_EN.md license info: "The use of the source code in this repository must follow the Apache-2.0 open-source license, while the use of the model weights of XVERSE-65B needs to adhere to the Model License Agreement. The XVERSE-65B model weights are fully open to academic research and support free commercial use. To apply for a commercial license, please fill in the application form. For other questions or collaborations, please contact opens
Industry
Industry
SEA-LION V3 Llama3.1 8B
Language
Language modeling/generation
AI Singapore
2024-12-19
SEA-LION V3
https://huggingface.co/aisingapore/llama3.1-8b-cpt-sea-lionv3-base
8000000000.00
1.23330162e+24
Llama 3.1-8B base model: 1.224e+24 FLOP. Additional pretraining compute: 136 hours * 3600 s/hour * 9.895e14 FLOP/s * 64 GPUs * 0.3 utilization = 9.30162e+21 FLOP. Total: 9.30162e+21 + 1.224e+24 = 1.23330162e+24 FLOP (see the worked example after this record).
The Stack v2
Dolma
Trained on a mix of datasets including StackV2 and Dolma (see https://huggingface.co/aisingapore/llama3.1-8b-cpt-sea-lionv3-base#data)
200000000000
"pre-trained on 200B tokens"
NVIDIA H200 SXM
Unverified
Our SEA-LION v3 Llama3.1 8B and 70B base models have been continued pre-trained on top of the Llama3.1 8B and 70B models respectively. Both have a context length of 128K, making them the SEA-LION models with the longest context length to date.
Open weights (unrestricted)
Llama 3.1-8B
64
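A minimal sketch (not from the source) of the SEA-LION V3 estimate above: a hardware-time term (peak throughput × GPU count × wall-clock time × assumed utilization) added on top of the Llama 3.1-8B base-model compute. The constant and function names are ours; the numbers are those in the note.

```python
# Continued-pretraining compute added to the base model's compute.
H100_PEAK_BF16 = 989.5e12   # FLOP/s, dense BF16 peak used in the note

def hardware_flop(gpus: int, hours: float, peak: float, utilization: float) -> float:
    return gpus * hours * 3600 * peak * utilization

base_model_flop = 1.224e24   # Llama 3.1-8B estimate (separate record)
cpt_flop = hardware_flop(gpus=64, hours=136, peak=H100_PEAK_BF16, utilization=0.3)

print(f"continued pretraining: {cpt_flop:.4e} FLOP")                   # ~9.302e+21
print(f"total:                 {base_model_flop + cpt_flop:.4e} FLOP")  # ~1.233e+24
```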
Llama 3.1-8B
Language
Language modeling/generation
Meta AI
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu,
2024-07-23
The Llama 3 Herd of Models
https://ai.meta.com/research/publications/the-llama-3-herd-of-models/
8000000000.00
8B
1.224e+24
Huggingface page says 3.1-8B used 1.46M H100 hours and trained over 15T tokens. https://huggingface.co/meta-llama/Llama-3.1-70B The paper also says that 3.1-405B got MFU of between 38-43%; presumably 8B was around the same or a bit higher, so assume utilization of 40%. 6ND: 6 * 15T * 8B = 7.2e23 FLOP. Hardware: 1.46M * 9.9e14 * 3600 * 0.4 = 2.08e24 FLOP. Geometric mean: sqrt(7.2e23 * 2.08e24) = 1.224e24. Note that Llama 3-8B also said it used 15T tokens, but only 1.3M H100 hours. This sugges
Llama 3 dataset
Unverified
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models s
Open weights (restricted use)
United States of America
Open (restricted use)
Llama 3.1 license: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE must seek separate license if over 700m monthly users, acceptable use restrictions code here: https://github.com/meta-llama/llama3/tree/main
Industry
Mi:dm 200B
Language
Language modeling/generation
KT
2023-10-31
https://genielabs.ai/midm/about
200000000000.00
200B
1.2e+24
6ND = 6 * 200B parameters * 1T tokens = 1.2e+24 FLOP
1000000000000
Mi:dm is the first Korean LLM trained on over 1 trillion tokens.
Confident
TL;DR: KT Corp introduces Mi:dm, a massive AI model aimed at diverse sectors. Mi:dm is the first Korean LLM trained on over 1 trillion tokens. It offers four models, from basic to large, with up to 200 billion parameters. KT plans to share Mi:dm’s foundational model with other companies. Three advanced technologies reduce AI hallucinations by up to 70%. Collaborations with AI startups, including Upstage, aim to conquer the global generative AI market.
API access
Korea (Republic of)
Unreleased
KT said it will open up the foundation model of Mi:dm to other companies, providing a full AI development package, including KT Cloud's hyperscale AI computing service and AI chip startup Rebellions Inc.'s neural processing unit infrastructure, fostering the development of various AI services.
Industry
Hunyuan
Language
Image generation
Multimodal
Language modeling/generation
Image generation
Question answering
Tencent
2023-09-07
Tencent Unveils Hunyuan, its Proprietary Large Foundation Model on Tencent Cloud
https://www.tencent.com/en-us/articles/2201685.html
100000000000.00
"Presently, the Hunyuan model has over 100 billion parameters, with more than two trillion tokens in pre-training data."
1.2e+24
6ND = 6*100*10^9*2*10^12 = 1.2*10^24
Unspecified unreleased
2000000000000
"Presently, the Hunyuan model has over 100 billion parameters, with more than two trillion tokens in pre-training data."
Confident
Enterprises in China may now access Hunyuan via Tencent’s public cloud platform and finetune it to their specific needs. The platform features strong Chinese language processing abilities, advanced logical reasoning, and comes with reliable task execution abilities. Tencent’s foundation model supports a wide array of functions spanning the creation of images, copywriting, text recognition, and customer service, to name a few. These will be instrumental in key industries like finance, public ser
API access
China
Unreleased
Industry
PLaMo-100B
Language
Language modeling/generation
Preferred Networks Inc
Preferred Elements (PFE)
2024-06-14
Pre-training of the proprietary LLM "PLaMo-100B" with 100 billion parameters
https://tech.preferred.jp/ja/blog/plamo-100b/
100000000000.00
1.2e+24
6*100B*2T=1.2e24
"The pre-trained model of PLaMo-100B developed this time was trained on a total of 2T tokens of both Japanese and English text data."
Unverified
Preferred Elements (PFE), a subsidiary of Preferred Networks (PFN), has been developing a 100 billion (100B) parameter LLM called "PLaMo-100B" since February. The pre-training part of the development of PLaMo-100B was completed in May, so in this article we will introduce the pre-training part of this model.
API access
Japan
Industry
Luca 2.0
Mianbi Intelligence
2023-08-29
https://www.163.com/dy/article/IDBGA8840511FQO9.html
100000000000.00
https://www.leiphone.com/category/ai/23kbzQXj60xZgUgO.html English translation: "Li Dahai: From a technical point of view, the CPM2 (Chinese Pretrained Model) 100 billion model we launched at that time was a sparse model of MoE, which is different from the 100 billion model we are promoting now." This suggests it is a dense model.
1.2e+24
Assume a Chinchilla-optimal dataset size of 20 tokens per parameter: 6 * 100B params * (20 * 100B) tokens = 1.2e24 FLOP (see the worked example after this record).
Unverified
China
Industry
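A minimal sketch (not from the source) of the assumption used for Luca 2.0 above: with no reported dataset size, the note assumes a Chinchilla-optimal ratio of roughly 20 training tokens per parameter before applying 6ND. The constant name is ours.

```python
# Chinchilla-optimal assumption: ~20 tokens per parameter, then 6ND.
TOKENS_PER_PARAM = 20   # assumed ratio

params = 100e9
tokens = TOKENS_PER_PARAM * params   # 2e12 assumed training tokens
compute = 6 * params * tokens

print(f"assumed tokens: {tokens:.1e}")         # 2.0e+12
print(f"compute:        {compute:.2e} FLOP")   # 1.20e+24, as in the note
```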
Megatron-Turing NLG 530B
Language
Language modeling
Microsoft
NVIDIA
Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro
2021-10-11
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
https://arxiv.org/abs/2201.11990
657.00
530000000000.00
1.17e+24
https://www.lesswrong.com/posts/bGuMrzhJdENCo8BxX/nvidia-and-microsoft-releases-530b-parameter-transformer?commentId=HSJSNspKp94tFcSCx source: https://lair.lighton.ai/akronomicon/ 9938 PF-days * 3600 * 24 * 10^15 = 8.586432e+23
Common Crawl
The Pile
CC-Stories
Realnews
In addition to Common Crawl data, we leveraged a number of other previously generated datasets. From The Pile, we selected Books3, OpenWebText2, Stack Exchange, PubMed Abstracts, Wikipedia, Gutenberg (PG-19), BookCorpus2, NIH ExPorter, and Pile-CC datasets. We also included the CC-Stories and RealNews datasets used to train Megatron
270000000000
"Our training dataset consists of 339 billion tokens and we trained MT-NLG on 270 billions tokens by blending the 15 training datasets as described above. We also set aside 2% of our data for validation." 1 token ~ 0.75 words
NVIDIA A100 SXM4 80 GB
Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of
Unreleased
United States of America
United States of America
4480
Unreleased
Industry
Industry
Mistral Small 3
Language
Language modeling/generation
Question answering
Quantitative reasoning
Code generation
Translation
Mistral AI
2025-01-30
Mistral Small 3, a latency-optimized 24B-parameter model released under the Apache 2.0 license.
https://mistral.ai/news/mistral-small-3/
24000000000.00
24B
1.152e+24
6ND = 6*8T tokens * 24B parameters = 1.152e+24 FLOP
Unspecified unreleased
"Notably, Mistral Small 3 was developed without reinforcement learning or synthetic training data, techniques commonly used by competitors. Lample said this “raw” approach helps avoid embedding unwanted biases that could be difficult to detect later."
8000000000000
8 trillion tokens Source: https://venturebeat.com/ai/mistral-small-3-brings-open-source-ai-to-the-masses-smaller-faster-and-cheaper/
Confident
Mistral Small 3 is competitive with larger models such as Llama 3.3 70B or Qwen 32B, and is an excellent open replacement for opaque proprietary models like GPT4o-mini. Mistral Small 3 is on par with Llama 3.3 70B instruct, while being more than 3x faster on the same hardware. Mistral Small 3 is a pre-trained and instructed model catered to the ‘80%’ of generative AI tasks—those that require robust language and instruction following performance, with very low latency. We designed this new mode
Open weights (unrestricted)
France
Unreleased
Apache 2.0 https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501
Industry
Qwen2.5-Coder (32B)
Language
Language modeling/generation
Code generation
Alibaba
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, Junyang Lin
2024-11-12
Qwen2.5-Coder Technical Report
https://arxiv.org/abs/2409.12186
32500000000.00
32.5B (31B - non emb)
1.0725e+24
Assuming 1 epoch: 6ND = 6 * 32.5e9 parameters * 5.5e12 tokens = 1.0725e+24 FLOP
GitHub
Common Crawl
Unspecified unreleased
"We collected public repositories from GitHub created before February 2024" "We curated a large-scale and high-quality text-code mixed dataset from Common Crawl, which includes code-related documentation, tutorials, blogs, and more" "We used CodeQwen1.5, the predecessor of Qwen2.5-Coder, to generate large-scale synthetic datasets."
5500000000000
"As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens."
Confident
In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes six models: Qwen2.5-Coder-(0.5B/1.5B/3B/7B/14B/32B). As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities w
Open weights (unrestricted)
China
Unreleased
Apache 2.0 https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct though they have apache 2.0 github repository it seems to be inference code rather than training code
Industry
ESM3 (98B)
Biology
Protein generation
EvolutionaryScale
University of California (UC) Berkeley
Thomas Hayes, Roshan Rao, Halil Akin, Nicholas James Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Quy Tran, Jonathan Deaton, Marius Wiggert, Rohil Badkundri, Irhum Shafkat, Jun Gong, Alexander Derry, Raul Santiago Molina, Neil Thomas, Yousuf Khan, Chetan Mishra, Carolyn Kim, Liam J Bartie, Patrick D Hsu, Tom Sercu, Salvatore Candido, Alexander Rives
2024-06-25
ESM3: Simulating 500 million years of evolution with a language model
https://www.evolutionaryscale.ai/blog/esm3-release
98500000000.00
98.5 billion (Table S1)
1.07e+24
"ESM3 at its largest scale was trained with 1.07×10^24 FLOPs on 2.78 billion proteins and 771 billion unique tokens, and has 98 billion parameters." per Table 1, trained 98B model on 1.8T training tokens. 98 billion * 1800 billion * 6 = 1.06e24. Likely some rounding, so will go with developer's reported count.
ESM3 Dataset
771000000000
771 billion tokens
Confident
More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained on tokens generated by evolution can act as evolutionary simulators to generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. ESM3 can follow complex prompts combining its modalities and is
Unreleased
United States of America
United States of America
Unreleased
only small version released
Industry
Academia
BlueLM 175B
Language
Chat
Language modeling/generation
Question answering
vivo AI lab
2023-11-02
https://baijiahao.baidu.com/s?id=1781445143383237948&wfr=spider&for=pc
175000000000.00
1.05e+24
6ND = 6*175B*1000B=1.05e+24
Unspecified unreleased
1000B text data, 10B image data, 100M video data, 100M knowledge graph (from the conference handout)
Confident
Unreleased
China
Unreleased
information about the model is from their paper catalogue and not found on the internet
Industry
ERNIE 3.0 Titan
Language
Language modeling
Language modeling/generation
Baidu
Peng Cheng Laboratory
Shuohuan Wang, Yu Sun, Yang Xiang, Zhihua Wu, Siyu Ding, Weibao Gong, Shikun Feng
2021-12-23
ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
https://arxiv.org/abs/2112.12731
70.00
260000000000.00
"[We] developed... distributed training technology, including fine-grained parallelism, heterogeneous hardware-aware training, and fault tolerance mechanism to train the 260B model on both Nvidia V100 GPU and Ascend 910 NPU clusters." See also: https://twitter.com/BaiduResearch/status/1468633977242243078?t=6q4zuLNdTSc4GUBe9OM5Aw&s=19
1.0421e+24
The paper suggests that ERNIE 3.0 Titan uses more compute than GPT-3. This is consistent with the 6ND approximation. C = 6ND = 6 (FLOP/param/token) * (260B params) * (668B tokens) = 1.0421*10^24 FLOP
ERNIE 3.0 Corpus
668000000000
"To ensure the success of the pre-training of ERNIE 3.0 Titan, we utilize the ERNIE 3.0 Corpus [ 2 ], a large-scale, wide-variety, and high-quality Chinese text corpora amounting to 4TB" Assuming 167M words/tokens per GB
NVIDIA Tesla V100 DGXS 32 GB
Huawei Ascend 910
Confident
Pre-trained language models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. GPT-3 has shown that scaling up pre-trained language models can further exploit their enormous potential. A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge enhanced models and trained a model with 10 billion parameters. ERNIE 3.0 outperformed the state-of-the-art models on various NLP tasks. In order to explore the performance of sc
Hosted access (no API)
China
China
1920
Unreleased
The Ernie 3.0 Titan model was used in Ernie bot. Today, ERNIE has been widely deployed across finance, healthcare, insurance, equity, Internet, logistics, and other fields. http://research.baidu.com/Blog/index-view?id=165
Industry
Academia
TigerBot-70B
Language
Chat
Language generation
Language modeling/generation
Question answering
Tigerobo
Ye Chen, Wei Cai, Liangmin Wu, Xiaowei Li, Zhanxuan Xin, Cong Fu
2023-09-06
TigerBot: An Open Multilingual Multitask LLM
https://github.com/TigerResearch/TigerBot/blob/main/README_en.md, https://arxiv.org/abs/2312.08688
70000000000.00
70B
1.02e+24
~1.02e24. Tigerobo did ~2.1e23 FLOP of additional pre-training on top of Llama 2-70B, which we estimated was trained on 8.1e23 FLOP.
"Tigerbot-70b is further pre-trained on the foundation of Llama-2-70b using high-quality multi-language data of 300 billion tokens. " "We collected data from Chinese books, the internet, and encyclopedia-type data based on the distribution of GPT3 pretraining data, and filtered the data through source quality control and tf-idf soft deduplication. From 20TB of data, we filtered down to 2TB, maintaining the proportion of language and categories. On this basis, we randomly sampled 100G of data an
300000000000
NVIDIA A100
Confident
(translated from https://github.com/TigerResearch/TigerBot/wiki/TigerBot%E2%80%9070B%E5%8F%91%E5%B8%83%EF%BC%81) We are pleased to release Tigerbot-70b, which continues to be open source and free for commercial use, including: Tigerbot-70b-base: Continuing pre-training on the basis of Llama-2-70b, the model's comprehensive capabilities are better than Llama-2-70b in 10 mainstream benchmark tests such as mmlu, reaching SOTA in the industry. a. Using high-quality multi-lingual data of 300 billi
Open weights (restricted use)
China
Llama 2-70B
126000000000000B
6ND = 6 * 70B params * 300B tokens = 1.26e+23 FLOP
512
Open source
Apache 2.0 https://github.com/TigerResearch/TigerBot/blob/main/README_en.md but it's also a Llama 2 finetune. training code here: https://github.com/TigerResearch/TigerBot/tree/main/train They released a 5% sample of training data: " On this basis, we randomly sampled 100G of data and released it open source."
Industry
DeepSeek-V2 (MoE-236B)
Language
Language modeling/generation
Chat
Code generation
DeepSeek
DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J.L. Cai, Jian Liang, Jianzhong Guo, Jiaqi Ni, Jiashi Li, Jin Chen, Jingyang Yuan, Junjie Qiu, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Lean Wang, Lecong Z
2024-05-07
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
https://arxiv.org/abs/2405.04434 https://github.com/deepseek-ai/DeepSeek-V2
236000000000.00
21B active params, 236B total
1.02e+24
21b active params * 8.1 trillion * 6 = 1.02e24
Unspecified unreleased
"We construct a high-quality and multi-source pre-training corpus consisting of 8.1T tokens. Compared with the corpus used in DeepSeek 67B (our previous release) (DeepSeek-AI, 2024), this corpus features an extended amount of data, especially Chinese data, and higher data quality. We first pretrain DeepSeek-V2 on the full pre-training corpus"
8100000000000
8.1 Trillion
NVIDIA H800 SXM5
Confident
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE e
Open weights (restricted use)
China
Unreleased
open weights with harmful use restrictions: https://github.com/deepseek-ai/DeepSeek-V2/blob/main/LICENSE-MODEL
Industry
Inflection-1
Language
Language modeling
Inflection AI
2023-06-23
Inflection-1 technical memo
https://inflection.ai/assets/Inflection-1.pdf
1.0001e+24
<= 2.5e24. They define two "compute classes": one for models with more compute than PaLM 540B (i.e., GPT-4 and PaLM 2), and one for models with as much compute or less (i.e., GPT-3.5, Chinchilla, LLaMA, and Inflection-1). PaLM 540B required 2.5e24 FLOP to train (confirmed by Google).
"Inflection-1 was trained using thousands of NVIDIA H100 GPUs on a very large dataset."
NVIDIA H100 SXM5 80GB
Speculative
Large language models (LLMs) based on the Transformer architecture have been shown to possess a range of advanced capabilities in language generation and understanding. These capabilities have paved the way for deployment of LLMs in products like OpenAI’s Chat-GPT and Google’s Bard. At Inflection AI, our mission is to create personal AIs for everyone, and in May 2023 we released Pi (pi.ai) – an LLM-based personal AI which is designed to be empathetic, useful, and safe. In this work we introduce
Hosted access (no API)
United States of America
Unreleased
Industry
MegaScale (530B)
Language
Language modeling/generation
ByteDance
Peking University
Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu
2024-02-23
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
https://arxiv.org/abs/2402.15627
40.00
530000000000.00
Two models were trained to evaluate the MegaScale training system: one with 175B and another with 530B parameters. This entry reports the 530B model. A third production model is also mentioned, with fewer details.
9.691e+23
The 175B model uses 3.2e23 FLOP (Table 2, bottom row). With constant dataset size and utilization, FLOP should scale linearly with the number of parameters, so: 3.2e23 * (530/175) = 9.7e23 (see the worked example after this record).
The 175B and 530B models trained for the paper use 300B tokens each.
300000000000
300B tokens
NVIDIA A100
Confident
We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pip
Unreleased
China
China
11200
Unreleased
they open-sourced their framework but don't see training code for their big model. https://github.com/volcengine/vescale Model weights are unreleased
Industry
Academia
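A minimal sketch (not from the source) of the scaling argument in the MegaScale compute note above: with the 300B-token dataset and utilization held fixed, training FLOP scales linearly with parameter count, so the 530B figure is extrapolated from the reported 175B figure. The 6ND cross-check at the end is ours.

```python
# Linear scaling of training FLOP with parameter count at fixed tokens.
flop_175b = 3.2e23                    # reported for the 175B model (Table 2)
flop_530b = flop_175b * (530 / 175)   # scale to 530B parameters

print(f"530B estimate: {flop_530b:.3e} FLOP")   # ~9.69e+23

# Rough cross-check with 6ND at 300B tokens (same order of magnitude):
print(f"6ND check:     {6 * 530e9 * 300e9:.3e} FLOP")   # ~9.54e+23
```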
Phi-4
Language
Language modeling/generation
Question answering
Code generation
Quantitative reasoning
Microsoft Research
Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu, Cyril Zhang, Yi Zhang
2024-12-12
Phi-4 Technical Report
https://arxiv.org/abs/2412.08905
14000000000.00
14B
9.3202015e+23
6ND = 6 * 14*10^9 parameters * 10*10^12 tokens = 8.4e+23 FLOP. Hardware: 989500000000000 FLOP/s [assumed bf16 precision] * 1920 GPUs * 504 hours * 3600 sec/hour * 0.3 [assumed utilization] = 1.0341209e+24 FLOP. Geometric mean: sqrt(8.4e+23 * 1.0341209e+24) = 9.3202015e+23 (see the worked example after this record).
Unspecified unreleased
"The pretraining phase of phi-4 relies heavily on synthetic datasets generated through a variety of techniques. " "We collected a wide variety of high-quality organic data sources for phi-4, prioritizing reasoning-dense and nuanced material (e.g., academic papers, educational forums, and programming tutorials)." "Our post-training data is composed of: • Supervised Fine-Tuning (SFT) Datasets • Direct Preference Optimization (DPO)
10000000000000
"The model was pretrained for approximately 10T tokens using linear warm-up and decay schedules with peak learning rate of 0.0003, constant weight decay of 0.1, and global batch size of 5760. " Table 5: Web 15% 1.3T unique tokens 1.2 epochs Web rewrites 15% 290B unique tokens 5.2 epochs Synthetic 40% 290B unique tokens 13.8 epochs Code data 20% 820B unique tokens 2.4 epochs Acquired sources 10% 580B unique tokens 1.7 epochs
NVIDIA H100 SXM5 80GB
Confident
We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on ST
Open weights (unrestricted)
United States of America
United Kingdom of Great Britain and Northern Ireland
1920
Unreleased
"Phi-4 is currently available on Azure AI Foundry under a Microsoft Research License Agreement (MSRLA) and will be available on Hugging Face next week. " Hugging Face: MIT license https://huggingface.co/microsoft/phi-4
Industry
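A minimal sketch (not from the source) of the phi-4 estimate above: a 6ND count and a hardware-time count disagree somewhat, so the note takes their geometric mean. The BF16 peak and 30% utilization are the assumptions already flagged in the note.

```python
# Geometric mean of a 6ND estimate and a hardware-time estimate.
import math

nd_flop = 6 * 14e9 * 10e12                     # 6ND: 8.4e+23
hw_flop = 989.5e12 * 1920 * 504 * 3600 * 0.3   # peak * GPUs * hours * s/hour * util ~ 1.034e+24

estimate = math.sqrt(nd_flop * hw_flop)
print(f"6ND:       {nd_flop:.3e} FLOP")
print(f"hardware:  {hw_flop:.4e} FLOP")
print(f"geo. mean: {estimate:.4e} FLOP")        # ~9.32e+23, as recorded
```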
Gemma 3 12B
Language
Vision
Multimodal
Language modeling/generation
Question answering
Translation
Chat
Quantitative reasoning
Visual question answering
Code generation
Google DeepMind
Core contributors: Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Beyer, Xiaohai Zhai, Anton Tsitsulin, Robert Busa-Fekete, Alex Feng, Noveen Sachdeva, Benjamin Coleman, Yi Gao, Basil Mustafa, Iain Barr, E
2025-03-12
Gemma 3 Technical Report
https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf
12000000000.00
Vision encoder: 417M; embedding parameters: 1,012M; non-embedding parameters: 10,759M
8.64e+23
6ND = 6 * 12B parameters * 12T training tokens = 8.64 × 10^23 FLOP
Unspecified unreleased
12000000000000
12T
Google TPU v4
Confident
We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context – at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local atten
Open weights (restricted use)
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
SigLIP 400M
6144
Unreleased
https://huggingface.co/google/gemma-3-12b-it Gemma License
Industry
Baichuan2-53B
Language
Language modeling/generation
Chat
Baichuan
2023-08-09
Chinese AI startup Baichuan rolls out third LLM in four months
https://technode.com/2023/08/09/chinese-ai-startup-baichuan-rolls-out-third-llm-in-four-months/
53000000000.00
8.268e+23
Given that it was announced at a similar time to the other Baichuan2 models, this assumes that the dataset size is the same at 2.6T tokens while the parameter count was scaled up. This would be consistent with many other model releases, such as Meta's Llama models. 53b * 2.6t * 6 = 8.268e23
Likely
On Tuesday, four-month-old AI startup Baichuan Intelligent Technology unveiled its first closed-source model equipped with 53 billion parameters. Following the Chinese company’s rapid release of two open-source large language models since its founding in April, the new model demonstrates the firm’s fast pace in delivering pre-trained models for larger parameters. The freshly introduced model, Baichuan-53B, is mainly for corporate clients and focused on text generation. A ChatGPT-like chat servic
China
Industry
Qwen2.5 Instruct (7B)
Language
Code generation
Code autocompletion
Quantitative reasoning
Question answering
Language modeling/generation
Alibaba
Qwen Team
2024-09-19
Qwen2.5: A Party of Foundation Models!
https://qwenlm.github.io/blog/qwen2.5/
7610000000.00
8.2188e+23
6ND = 6 * 7.61e9 parameters * 18e12 tokens = 8.2188e+23 FLOP (might be less if the entire training dataset was not used)
Unspecified unreleased
18000000000000
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens."
Likely
Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2: Significantly more knowledge and has greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains. Significant improvements in instruction following, generating long texts (over 8K tokens), unde
Open weights (restricted use)
China
Qwen2.5-7B
requires permission to use in applications with 100K+ users https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
Industry
Qwen2.5-7B
Language
Language modeling/generation
Question answering
Quantitative reasoning
Alibaba
Qwen Team
2024-09-19
Qwen2.5: A Party of Foundation Models!
https://qwenlm.github.io/blog/qwen2.5/
7610000000.00
7.61B
8.2188e+23
Training dataset size was 18 trillion tokens. 6ND = 6 * 7.61 billion parameters * 18 trillion tokens = 8.2188e+23 FLOP.
Unspecified unreleased
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens."
18000000000000
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens"
Confident
In the past three months since Qwen2’s release, numerous developers have built new models on the Qwen2 language models, providing us with valuable feedback. During this period, we have focused on creating smarter and more knowledgeable language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5. We are announcing what might be the largest opensource release in history! Let’s get the party started! The Qwen2.5-7B model surpasses its predecessors and counte
Open weights (unrestricted)
China
Unreleased
Apache 2.0
Industry
Llama 2-70B
Language
Language modeling
Meta AI
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Mar
2023-07-18
Llama 2: Open Foundation and Fine-Tuned Chat Models
https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/ https://arxiv.org/abs/2307.09288
8056.00
70000000000.00
Llama has been released in 7B, 13B, 34B, and 70B variants.
8.1e+23
"Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB" of which 1720320 GPU hours were used to train the 70B model. 311.84 BF16 TFLOP/s * 1720320 hours * 0.40 utilization = 7.725e+23 FLOP. Alternatively: the model was trained for 1 epoch on 2 trillion tokens and has 70B parameters. C = 6ND = 6*70B*2T = 8.4e+23 FLOP.
Llama 2 dataset
2 trillion tokens of publicly available text, with no text from Meta's products. "Our training corpus includes a new mix of data from publicly available sources, which does not include data from Meta’s products or services. We made an effort to remove data from certain sites known to contain a high volume of personal information about private individuals. We trained on 2 trillion tokens of data as this provides a good performance–cost trade-off, up-sampling the most factual sources in an effort
1500000000000
2 trillion tokens ~= 1.5 trillion words
NVIDIA A100 SXM4 80 GB
Confident
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closedsource models. We provide a detailed description of our approach to
Open weights (restricted use)
United States of America
1000
Unreleased
Llama 2 license. can't use outputs to train models. https://github.com/meta-llama/llama/blob/main/LICENSE
Industry
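A minimal sketch (not from the source) of the two Llama 2-70B estimates in the note above: one from reported A100-80GB GPU hours at the assumed 40% utilization, one from 6ND. The constant name is ours.

```python
# Hardware-hours estimate vs. 6ND estimate for Llama 2-70B.
A100_PEAK_BF16 = 311.84e12   # FLOP/s, value used in the note

gpu_hours = 1_720_320        # reported GPU hours for the 70B model
hw_flop = A100_PEAK_BF16 * gpu_hours * 3600 * 0.40   # ~7.7e+23
nd_flop = 6 * 70e9 * 2e12                            # 8.4e+23

print(f"hardware: {hw_flop:.3e} FLOP")
print(f"6ND:      {nd_flop:.3e} FLOP")
```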
DeepSeek LLM 67B
Language
Chat
Language modeling/generation
Question answering
DeepSeek
Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y.K. Li, Wenfeng Liang, Fangyun Lin, A.X. Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, T
2024-01-05
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
https://arxiv.org/abs/2401.02954, https://github.com/deepseek-ai/DeepSeek-LLM
67000000000.00
67B
8.04e+23
67B * 2T * 6 = 8.04e23
Unspecified unreleased
"We collect 2 trillion tokens for pre-training, primarily in Chinese and English." "We have gained valuable insights from reputable sources such as (Computer, 2023; Gao et al., 2020; Penedo et al., 2023; Touvron et al., 2023a)... We adopted an aggressive deduplication strategy, expanding the deduplication scope. Our analysis revealed that deduplicating the entire Common Crawl corpus results in higher removal of duplicate instances compared to deduplicating within a single dump"
2000000000000
"We collect 2 trillion tokens for pre-training, primarily in Chinese and English"
Confident
The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing ope
Open weights (restricted use)
China
Unreleased
repo with inference code and details, but no training code: https://github.com/deepseek-ai/deepseek-LLM/blob/main/LICENSE-MODEL
Industry
BlueLM 130B
Language
Chat
Language modeling/generation
Question answering
vivo AI lab
2023-11-02
https://baijiahao.baidu.com/s?id=1781445143383237948&wfr=spider&for=pc
130000000000.00
7.8e+23
6ND = 6*130B*1000B=7.8e+23
Unspecified unreleased
1000B text data, 10B image data, 100M video data, 100M knowledge graph (from the conference handout)
Confident
Unreleased
China
Unreleased
information about the model is from their paper catalogue and not found on the internet
Industry
Nemotron-4 15B
Language
Language modeling/generation
Code generation
Question answering
Translation
Quantitative reasoning
NVIDIA
Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, Annika Brundyn, James Maki, Miguel Martinez, Jiaxuan You, John Kamalu, Patrick LeGresley, Denys Fridman, Jared Casper, Ashwath Aithal, Oleksii Kuchaiev, Mohammad Shoeybi, Jonathan Cohen, Bryan Catanzaro
2024-02-27
Nemotron-4 15B Technical Report
https://arxiv.org/abs/2402.16819
15000000000.00
15b
7.5005116e+23
6ND = 6 FLOP/token/parameter * 15*10^9 parameters * 8*10^12 tokens = 7.2e+23 FLOP "Nemotron-4 was trained using 384 DGX H100 nodes; each node contains 8 H100 80GB SXM5 GPUs based on the NVIDIA Hopper architecture (NVIDIA, 2022). Each H100 GPU has a peak throughput of 989 teraFLOP/s when doing 16-bit floating point (bfloat16) arithmetic without sparsity. Table 2 reports more detailed training schedule: 989*10^12 FLOP/sec * 3600 sec/hour * 24 hours * (768 gpus * 0.343 [reported utilization] * 0
Unspecified unreleased
"At a high-level, the data blend is split into three different types of data: English natural language data (70%), multilingual natural language data (15%), and source-code data (15%)."
8000000000000
"15-billion-parameter large multilingual language model trained on 8 trillion text tokens"
NVIDIA H100 SXM5 80GB
Confident
We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and achieves competitive performance to the leading open models in the remaining ones. Specifically, Nemotron-4 15B exhibits the best multilingual capabilities of all similarly
Unreleased
United States of America
3072
Unreleased
Industry
Yi-1.5-34B
Language
Chat
Language modeling/generation
Translation
Code generation
01.AI
2024-05-13
Yi-1.5 is an upgraded version of Yi, delivering stronger performance in coding, math, reasoning, and instruction-following capability.
https://huggingface.co/01-ai/Yi-1.5-34B
34000000000.00
34b
7.344e+23
6*34*10^9*3.6*10^12 = 7.344e+23
Unspecified unreleased
assuming same as Yi 34 - Chinese and English dataset
3600000000000
500b "Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples." 3.6T total pre-trained tokens
Confident
Yi-1.5 is an upgraded version of Yi. It is continuously pre-trained on Yi with a high-quality corpus of 500B tokens and fine-tuned on 3M diverse fine-tuning samples. Compared with Yi, Yi-1.5 delivers stronger performance in coding, math, reasoning, and instruction-following capability, while still maintaining excellent capabilities in language understanding, commonsense reasoning, and reading comprehension. Yi-1.5 comes in 3 model sizes: 34B, 9B, and 6B. For model details and benchmarks, see M
Open weights (restricted use)
China
Unreleased
no training code the model https://huggingface.co/01-ai/Yi-1.5-34B Apache 2.0 "If you create derivative works based on this model, please include the following attribution in your derivative works:"
Industry
SEA-LION V2 8B
Language
Language modeling/generation
AI Singapore
2024-07-30
SEA-LION V2
https://sea-lion.ai/our-models/
8000000000.00
"pretrained on top of the Llama3 base model that is 8 billion parameters"
7.23e+23
Llama 3 base model: 7.2e+23 FLOP. SEA-LION extended pre-training: 2 days * 24 * 60 * 60 s * 64 GPUs * 9.895e14 FLOP/s * 0.3 utilization = 3.28292352e+21 FLOP. Total: 7.2328292e+23 FLOP.
Dolma
Mix of datasets, mainly Dolma (see https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2-base#data)
48000000000
Llama3 8B CPT SEA-LIONv2 base model was continued pre-trained on 48B tokens
NVIDIA H100 SXM5 80GB
Unverified
SEA-LION version 2 has been continued-pretrained on top of the Llama3 base model that is 8 billion parameters in size
Open weights (unrestricted)
Llama 3-8B
64
Llama 3-8B
Language
Chat
Language modeling/generation
Code generation
Meta AI
Aaditya Singh; Aaron Grattafiori; Abhimanyu Dubey; Abhinav Jauhri; Abhinav Pandey; Abhishek Kadian; Adam Kelsey; Adi Gangidi; Ahmad Al-Dahle; Amit Sangani; Ahuva Goldstand; Aiesha Letman; Ajay Menon; Akhil Mathur; Alan Schelten; Alex Vaughan; Amy Yang; Andrei Lupu; Andres Alvarado; Andrew Gallagher; Andrew Gu; Andrew Ho; Andrew Poulton; Andrew Ryan; Angela Fan; Ankit Ramchandani; Anthony Hartshorn; Archi Mitra; Archie Sravankumar; Artem Korenev; Arun Rao; Ashley Gabriel; Ashwin Bharambe; Assaf E
2024-04-18
Introducing Meta Llama 3: The most capable openly available LLM to date
https://ai.meta.com/blog/meta-llama-3/
8000000000.00
7.2e+23
Counting operations: 15,000,000,000,000 tokens * 8,000,000,000 parameters * 6 = 7.2e+23. GPU calculation: 400 TFLOP/s per GPU * 1.3M GPU hours * 3600 s/hour = 1.872e+24 (it is not certain that 400 TFLOP/s applies to the Llama 3-8B training run).
Llama 3 dataset
15000000000000
NVIDIA H100 SXM5 80GB
Confident
Open weights (restricted use)
United States of America
Unreleased
https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md License A custom commercial license is available at: https://llama.meta.com/llama3/license
Industry
Gopher (280B)
Language
Language modeling
Question answering
DeepMind
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buch
2021-12-08
"Scaling Language Models: Methods, Analysis & Insights from Training Gopher"
https://arxiv.org/abs/2112.11446
1122.00
280000000000.00
Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across
6.31e+23
Table A26: 6.31E+08 train PFLOP = 6.31e+23 FLOP
MassiveText
300000000000
"We train all models for 300 billion tokens with a 2048 token context window, using the Adam (Kingma and Ba, 2014) optimiser." 1 token ~ 0.75 words
Google TPU v3
Confident
We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen Bert retriever, a dif
Unreleased
United Kingdom of Great Britain and Northern Ireland
4096
Unreleased
Industry
Reka Flash
Multimodal
Language
Vision
Chat
Language modeling/generation
Image captioning
Code generation
Code autocompletion
Reka AI
Aitor Ormazabal Che Zheng Cyprien de Masson d’Autume Dani Yogatama Deyu Fu Donovan Ong Eric Chen Eugenie Lamprecht Hai Pham Isaac Ong Kaloyan Aleksiev Lei Li Matthew Henderson Max Bain Mikel Artetxe Nishant Relan Piotr Padlewski Qi Liu Ren Chen Samuel Phua Yazheng Yang Yi Tay Yuqi Wang Zhongkai Zhu Zhihui Xie
2024-04-15
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
https://publications.reka.ai/reka-core-tech-report.pdf
21000000000.00
6.3e+23
Reka Flash has 21B parameters and was trained on 5 trillion language tokens: 6 * 21B * 5 trillion = 6.3e+23. This agrees with the GPU details: "Reka Flash and Edge were trained on several hundreds of H100s across a period of several weeks." 3 weeks * 7 days/week * 24 hours/day * 3600 s/hour * 300 H100s * 9.9e14 FLOP/GPU-s = 5.4e23. Not enough info to estimate SFT and RLHF post-training FLOP.
Unspecified unreleased
The training data comprises a mixture of publicly available and proprietary/licensed datasets with a dataset knowledge cutoff of November 2023. The dataset ingested by our model comprises of text, images, videos, and audio clips. Reka Flash and Reka Edge were trained on approximately 5 trillion and 4.5 trillion extensively deduplicated and filtered language tokens, respectively. While the classification of corpora is not strictly defined to one class or category, approximately 25% of our pretrai
5000000000000
NVIDIA A100
NVIDIA H100 SXM5 80GB
Likely
API access
United States of America
Unreleased
Industry
xTrimoPGLM -100B
Biology
Proteins
Protein or nucleotide language model (pLM/nLM)
Tsinghua University
BioMap Research
Bo Chen, Xingyi Cheng, Yangli-ao Geng, Shen Li, Xin Zeng, Boyan Wang, Jing Gong, Chiming Liu, Aohan Zeng, Yuxiao Dong, Jie Tang, Le Song
2023-07-06
xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
https://www.biorxiv.org/content/10.1101/2023.07.05.547496v4
65.00
100000000000.00
Abstract: "training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens"
6.2e+23
"xTrimoPGLM-100B is trained on a cluster of 96 DGX-A100 GPU (8×40G) servers in FP16 precision from January 18 to June 30, 2023. During this time, xTrimoPGLM-100B has consumed 1 trillion tokens from the dataset consisting of Uniref90 and ColAbFoldDB. As of the current date, xTrimoPGLM-100B continues its pre-training process to pass through as many tokens as possible" 6 * 100 billion params * 1T tokens = 6e23 8*96 * 312 trillion * 163 days * 24 * 3600 * 0.3 ~= 1e24 directly given in the paper (
UniRef50
~24M protein sequences
NVIDIA A100 SXM4 40 GB
Confident
Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. This paper proposes a unified protein language model, xTrimoPGLM, to address these two types of tasks simultaneously through an innovative pre-training framework. Our key technical contr
Unreleased
China
China
768
Unreleased
Academia
Industry
Yi-34B
Language
Chat
Language modeling/generation
Translation
Code generation
01.AI
Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai
2023-11-02
Yi: Open Foundation Models by 01.AI
https://arxiv.org/abs/2403.04652
34000000000.00
34b
6.1e+23
"The dataset we use contains Chinese & English only. We used approximately 3T tokens" sounds like this means it was trained on 3T tokens, not necessarily that the dataset contains 3T tokens? If so, 34b * 3T * 6 = 6.1e23
Unspecified unreleased
Chinese and English dataset "For pretraining, we construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline. For finetuning, we polish a small scale (less than 10K) instruction dataset over multiple iterations such that every single instance has been verified directly by our machine learning engineers. For vision-language, we combine the chat language model with a vision transformer encoder and train the model to align visual
3100000000000
"language models pretrained from scratch on 3.1T highly-engineered large amount of data, and finetuned on a small but meticulously polished alignment data."
NVIDIA A100
Confident
The Yi series models are large language models trained from scratch by developers at 01.AI.
Open weights (restricted use)
China
128
Unreleased
apply for commercial license: no training code https://github.com/01-ai/Yi/blob/main/MODEL_LICENSE_AGREEMENT.txt the model https://huggingface.co/01-ai/Yi-34B-Chat Apache 2.0 "If you create derivative works based on this model, please include the following attribution in your derivative works: ...."
Industry
Qwen2-VL-72B
Language
Vision
Multimodal
Visual question answering
Video description
Language modeling/generation
Translation
Question answering
Character recognition
Quantitative reasoning
Alibaba
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
2024-09-18
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
https://arxiv.org/abs/2409.12191
72000000000.00
72 billion (language model) and 675M (vision encoder)
6.048e+23
6ND = 6×1.4×10^12×7.2×10^10 = 6.048e+23
Unspecified unreleased
"The model is pre-trained on a diverse dataset that includes image-text pairs, optical character recognition (OCR) data, interleaved image-text articles, visual question answering datasets, video dialogues, and image knowledge datasets. Our data sources primarily comprise cleaned web pages, open-source datasets, and synthetic data. The cutoff date for our data knowledge is June 2023."
1400000000000
"Throughout the pre-training stages, Qwen2-VL processes a cumulative total of 1.4 trillion tokens. Specifically, these tokens encompass not only text tokens but also image tokens"
Likely
We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The mo
Open weights (unrestricted)
China
https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct
Industry
InternLM
Language
Language modeling
Shanghai AI Lab
SenseTime
2023-07-06
https://internlm.org/
100000000000.00
Pre-training a bilingual 100B Foundation model on data with over a trillion tokens
6.000001e+23
6 * 100b * 1t = 6e23
1000000000000
"Pre-training a bilingual 100B Foundation model on data with over a trillion tokens"
NVIDIA A100 SXM4 80 GB
Confident
Pre-training a bilingual 100B Foundation model on data with over a trillion tokens, the model exhibits excellent performance in scenarios such as Chinese, English, and coding due to the appropriate data ratio. Based on the foundation model, the application of high-quality human annotated dialogue data combined with RLHF technology enables the InternLM large language model to respond to complex commands during human interaction, while also demonstrating responses in line with human morality and v
China
Hong Kong
China
Academia
Industry
Granite 3.0 8B
Language
Language modeling/generation
Question answering
Translation
Text summarization
Text classification
Code generation
IBM
Granite Team IBM
2024-10-21
Granite 3.0 Language Models
https://github.com/ibm-granite/granite-3.0-language-models/tree/main
8100000000.00
8.1B
5.832e+23
6ND = 6 * 8.1*10^9 * 12*10^12 = 5.832e+23. "All our Granite 3.0 models are trained using a compute budget of 8.35 × 10^23 FLOPS." Apportioning by the model's share of reported power consumption: 8.35e+23 * 757.0 / (174.6 + 757.0 + 64.5 + 121.2) = 5.6573436e+23. Hardware estimation: 832,102 GPU hours * 3600 s/hour * 9.895e14 FLOP/s * 0.3 utilization = 8.8923412e+23 (see the worked example after this record).
Unspecified unreleased
Granite 3.0 language models are trained using data from various sources such as unstructured natural language text and code data from the Web curated by IBM, a collection of synthetic datasets generated by IBM, and publicly available high-quality datasets with permissible licenses.
12000000000000
12T tokens
NVIDIA H100 SXM5 80GB
Confident
This report presents Granite 3.0, a new set of lightweight, state-of-the-art, open foundation models ranging in scale from 400 million to 8 billion active parameters. Equipped with native support of multilingual, coding, function calling, and strong safety performance, these models target enterprise use cases, including on-premise and on-device settings. Evaluations on a comprehensive set of tasks demonstrate that our models consistently reach state-of-the-art performance for their size (as sho
Open weights (unrestricted)
United States of America
256
Unreleased
Apache 2.0 license https://huggingface.co/ibm-granite/granite-3.0-8b-instruct
Industry
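A minimal sketch (not from the source) of the Granite 3.0 8B apportionment above: IBM reports one compute budget for the whole family, which the note splits across the four models in proportion to their reported power consumption, alongside a 6ND estimate and a GPU-hour estimate (assumed 30% utilization). Variable names are ours.

```python
# Three estimates for Granite 3.0 8B: budget share, 6ND, and hardware hours.
family_budget = 8.35e23                    # FLOP, reported for all Granite 3.0 models
power_8b = 757.0                           # power figure attributed to the 8B model in the note
power_all = [174.6, 757.0, 64.5, 121.2]    # figures for the four models, per the note

share_8b = family_budget * power_8b / sum(power_all)   # ~5.66e+23
nd_8b = 6 * 8.1e9 * 12e12                              # 5.832e+23
hw_8b = 832_102 * 3600 * 989.5e12 * 0.3                # ~8.89e+23

for name, value in [("budget share", share_8b), ("6ND", nd_8b), ("hardware", hw_8b)]:
    print(f"{name:>12}: {value:.3e} FLOP")
```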
Chinchilla
Language
Language modeling
DeepMind
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre
2022-03-29
Training Compute-Optimal Large Language Models
https://arxiv.org/abs/2203.15556
1486.00
70000000000.00
"We test this hypothesis by training a predicted compute-optimal model, \chinchilla, that uses the same compute budget as \gopher but with 70B parameters and 4× more more data. \chinchilla uniformly and significantly outperforms \Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks."
5.76e+23
"Both Chinchilla and Gopher have been trained for the same number of FLOPs but differ in the size of the model and the number of training tokens." We see the number of flops in table 3
MassiveWeb
C4
MassiveWeb, Books, C4, News, Github, Wikipedia (Table A1)
1050000000000
Table 1 shows Chinchilla was trained on 1.4 trillion tokens; 1 token ≈ 0.75 words.
Google TPU v4
Google TPU v3
Confident
We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model s
Unreleased
United Kingdom of Great Britain and Northern Ireland
Unreleased
Industry
BIG-G 137B
Language
Language modeling/generation
Google
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmü
2022-06-09
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
https://arxiv.org/abs/2206.04615
1394.00
137000000000.00
137B. Table App.1
5.6e+23
"BIG-G models were trained at Google. We use 13 dense decoder-only Transformer models (Vaswani et al., 2017) with gated activation layers (Dauphin et al., 2017) and GELU activations based on the LaMDA architectures (Thoppilan et al., 2022). These models were trained on a dataset consisting of a mixture of web documents, code, dialog, and Wikipedia data, with approximately three billion documents tokenized to 2.8 trillion BPE tokens using a 32k-token SentencePiece vocabulary" Appendix: "We use
GLaM dataset
"These models were trained on a dataset consisting of a mixture of web documents, code, dialog, and Wikipedia data, with approximately three billion documents tokenized to 2.8 trillion BPE tokens using a 32k-token SentencePiece vocabulary"
681000000000
The full dataset comprises 2.8 trillion tokens, but a calculation based on batch size and steps suggests the model was trained on only 681 billion tokens.
Confident
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyon
Unreleased
United States of America
Unreleased
Industry
LLaMA-65B
Language
Language modeling
Code generation
Meta AI
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample
2023-02-24
LLaMA: Open and Efficient Foundation Language Models
https://arxiv.org/abs/2302.13971
8872.00
65200000000.00
Model card, table 1: https://github.com/facebookresearch/llama/blob/53011c3d7946dadb8274a4c5c7586ab54edf792d/MODEL_CARD.md
5.5e+23
1.4e12 tokens × 6.52e10 parameters × 6 FLOP/token/parameter = 5.5e23 FLOP. Compared to 2048 A100 GPUs, each with 311.84 TFLOP/s peak performance, running for 21 days, this implies 47% utilization (sketched in code after this record). https://www.wolframalpha.com/input?i=5.5*10%5E23+FLOP+%2F+%282048+*+311.84+teraFLOPS+*+21+days%29
CCNet
GitHub
Wikipedia
books
arXiv
Stack Exchange
The model was trained using the following source of data: CCNet [67%], C4 [15%], GitHub [4.5%], Wikipedia [4.5%], Books [4.5%], ArXiv [2.5%], Stack Exchange[2%]. The Wikipedia and Books domains include data in the following languages: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk. See the paper for more details about the training set and corresponding preprocessing.
1340000000000
Table 1 indicates that the 1.4T training tokens involved sampling sub-datasets at more or less than one epoch. Correcting for this: (1.1 epochs × 3.3 TB) + (1.06 epochs × 0.783 TB) + ... = 5.24 epoch-TB for the 1.4T training tokens; 5.24 epoch-TB × 1000 GB/TB × 200M tokens/GB ≈ 1.05T epoch-tokens for the 1.4T training tokens; so 1 epoch ≈ 1.34T tokens.
NVIDIA A100
Confident
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the resea
Open weights (non-commercial)
United States of America
2048
Unreleased
"we are releasing our model under a noncommercial license focused on research use cases" https://ai.meta.com/blog/large-language-model-llama-meta-ai/
Industry
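A minimal Python sketch of the LLaMA-65B estimate above: the 6ND approximation and the implied hardware utilization, using only the figures quoted in the training compute notes (2048 A100s at 311.84 TFLOP/s peak for 21 days).

```python
# Illustrative: LLaMA-65B 6ND compute estimate and implied GPU utilization,
# using the figures quoted in the notes above.

tokens, params = 1.4e12, 6.52e10
flop = 6 * tokens * params                      # ~5.5e23 FLOP

a100_peak = 311.84e12                           # FLOP/s per A100 (quoted)
available = 2048 * a100_peak * 21 * 24 * 3600   # FLOP available in 21 days
print(f"{flop:.2e} FLOP, implied utilization {flop / available:.0%}")  # ~47%
```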
Guanaco-65B
Language
Chat
University of Washington
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
2023-05-23
QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314; https://github.com/artidoro/qlora
1578.00
65000000000.00
from Llama-65B (also 33B, 13B, 7B variants)
5.5e+23
Fine-tune of LLaMA-65B. The finetuning appears to have been run on a "professional grade GPU" with 48GB VRAM (likely an A6000) for 24 hours. Fine-tune compute is negligible compared to pretraining (5.5e23 FLOP for LLaMA-65B).
Fine-tuned on instruction datasets such as GLUE and Super-NaturalInstructions
Confident
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters~(LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only
Open weights (non-commercial)
United States of America
LLaMA-65B
8000000000B
"using a single professional GPU over 24 hours we achieve 99.3% with our largest model" no model specified, but if it's an A100, 312 tflop/s * 24 * 3600 * 0.3 utilization = 8e18
Open source
LLaMA license, non-commercial for weights. code is MIT code: https://github.com/artidoro/qlora/blob/main/scripts/finetune_guanaco_65b.sh
Academia
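A minimal Python sketch of the Guanaco-65B finetune-compute ballpark from the notes above, under that note's assumptions (an A100-class GPU at 312 TFLOP/s peak, 24 hours, 30% utilization); the actual GPU model is not specified in the paper.

```python
# Illustrative: QLoRA finetuning compute for Guanaco-65B under the
# assumptions stated in the notes above (GPU model not specified).

peak = 312e12              # FLOP/s, A100-class peak (assumed)
seconds = 24 * 3600        # 24 hours
utilization = 0.3
finetune_flop = peak * seconds * utilization    # ~8e18 FLOP
print(f"{finetune_flop:.1e} FLOP (negligible vs. ~5.5e23 for LLaMA-65B pretraining)")
```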
Code Llama-34B
Language
Code generation
Meta AI
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Ellen Tan, Yossef (Yossi) Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Defossez, Jade Copet, Faisal Azhar, Hugo Touvron, Gabriel Synnaeve, Louis Martin, Nicolas Usunier, Thomas Scialom
2023-08-14
Code Llama: Open Foundation Models for Code
https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/ https://arxiv.org/abs/2308.12950
1297.00
34000000000.00
34B
5.3e+23
1.22e23 finetune compute, or ~5.3e23 including Llama-2 34B base compute. See finetune compute notes for calculation.
Unspecified unreleased
"As shown in Table 1, Code Llama is trained predominantly on a near-deduplicated dataset of publicly available code. We also source 8% of our samples data from natural language datasets related to code. This dataset contains many discussions about code and code snippets included in natural language questions or answers. To help the model retain natural language understanding skills, we also sample a small proportion of our batches from a natural language dataset"
600000000000
Llama 2 used 2T tokens, and "We train Code Llama on 500B additional tokens and Code Llama - Python further on 100B tokens". 2T + 500B + 100B = 2.6T tokens in total.
NVIDIA A100 SXM4 80 GB
Confident
We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters
Open weights (restricted use)
United States of America
Llama 2-34B
122000000000000B
Training the nine Code Llama models took 400k A100-hours across all the models, per model card. It's nine models because there are three base models at 7B, 13B, 34B, and then Instruct and Python models across all three sizes. I'll calculate for Code Llama Python-34B since it's the most trained. Code Llama-base is trained from Llama 2 with 500B tokens: "We train Code Llama on 500B tokens during the initial phase, starting from the 7B, 13B, and 34B versions of Llama 2" Code Llama-Python required
Unreleased
Llama 2 license. can't use outputs to train models. https://github.com/meta-llama/llama/blob/main/LICENSE
Industry
MM1-30B
Multimodal
Language
Vision
Chat
Image captioning
Apple
Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Ankur Jain, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang
2024-03-14
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
https://arxiv.org/abs/2403.09611
122.00
30000000000.00
30B
4.86e+23
Pre-trained on ~2B image-text pairs and 2T tokens (Table 2). Each image is 144 tokens, so the images are ~300B tokens. Then additional multimodal training for 400B tokens, for a total of ~2.7T tokens. This is the final training recipe: "We initialize both the image encoder and the underlying LLM decoder weights for MM1 from in-house pre-trained models. We then perform multimodal pre-training on the above data mix for 200k steps (approx. 400B tokens)." Compute = 6ND = 6 × 2.7 trillion tokens × 30 billion parameters ≈ 4.86e23 FLOP.
Conceptual Captions (CC3M)
Conceptual Captions 12M (CC12M)
COYO-700M
Unspecified unreleased
OBELICS
Text, captioned images. See Table 2
1500000000000
at least 2T tokens
Likely
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and
Unreleased
United States of America
Unreleased
Industry
PanGu-Σ
Language
Code generation
Language modeling
Translation
Question answering
Huawei Noah's Ark Lab
Xiaozhe Ren, Pingyi Zhou, Xinfan Meng, Xinjing Huang, Yadao Wang, Weichao Wang, Pengfei Li, Xiaoda Zhang, Alexander Podolskiy, Grigory Arshinov, Andrey Bout, Irina Piontkovskaya, Jiansheng Wei, Xin Jiang, Teng Su, Qun Liu, Jun Yao
2023-03-20
PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing
https://arxiv.org/abs/2303.10845
48.00
1085000000000.00
"In this work, we present PanGu-Σ , a large language model with sparse architecture containing 1.085 trillion parameters."
4.669999999999999e+23
It has a sparse architecture, so we can't use C = 6ND. "We develop PanGu-Σ model under the framework of MindSpore and train it on a cluster with only 512 Ascend 910 AI Accelerators with 329 billion tokens over 100 days." 100 days × 512 processors × 320 TFLOP/s per processor × 33% utilization = 4.67e+23 FLOP (sketched in code after this record). https://www.wolframalpha.com/input?i=100+days+*+512+*+320+terahertz+*+0.33
"329B tokens in more than 40 natural and programming languages"
246750000000
329B tokens ~= 247B words
Huawei Ascend 910
Confident
The scaling of large language models has greatly improved natural language understanding, generation, and reasoning. In this work, we develop a system that trained a trillion-parameter language model on a cluster of Ascend 910 AI processors and MindSpore framework, and present the language model with 1.085T parameters named PanGu-{\Sigma}. With parameter inherent from PanGu-{\alpha}, we extend the dense Transformer model to sparse one with Random Routed Experts (RRE), and efficiently train the m
Unreleased
China
512
Unreleased
Industry
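A minimal Python sketch of the hardware-based PanGu-Σ estimate above (6ND is not used because the architecture is sparse); the 320 TFLOP/s Ascend 910 peak and 33% utilization are the assumptions carried over from the note.

```python
# Illustrative: PanGu-Σ hardware-time compute estimate from the notes above.

chips = 512                 # Ascend 910 accelerators
peak = 320e12               # FLOP/s per chip (assumed peak)
days = 100
utilization = 0.33
flop = chips * peak * days * 24 * 3600 * utilization   # ~4.67e23 FLOP
print(f"{flop:.2e}")
```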
OLMo 2 Furious 13B
Language
Language modeling/generation
Question answering
Allen Institute for AI
University of Washington
New York University (NYU)
Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg,
2024-12-31
2 OLMo 2 Furious
https://arxiv.org/abs/2501.00656
13000000000.00
13B
4.600000000000001e+23
4.6e23 FLOP (Table 6; the developers calculated this using the 6ND formula).
OLMo-Mix-1124
Dolmino-Mix-1124
Tulu 3
4000000000000
Pretraining Stage 1 (OLMo-Mix-1124): 5 trillion tokens (= 1.2 epochs). Pretraining Stage 2 (Dolmino-Mix-1124): 100B tokens (3 runs) and 300B tokens (1 run), merged. Post-training (Tulu 3 SFT OLMo mix): SFT + DPO + PPO (preference mix).
NVIDIA H100 SXM5 80GB
Confident
We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes dense autoregressive models with improved architecture and training recipe, pretraining data mixtures, and instruction tuning recipes. Our modified model architecture and training recipe achieve both better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities
Open weights (unrestricted)
United States of America
United States of America
United States of America
Open source
apache 2 https://huggingface.co/allenai/OLMo-2-1124-13B https://github.com/allenai/OLMo
Research collective
Academia
Academia
AFM-on-device
Language
Language modeling/generation
Apple
Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chong Wang, Chung-Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek, Ruoming Pang, Sam Wiseman, Syd Evans, Tao Lei, Tom Gunter, Vivek Rathod, Xiang Kong, Xianzhi Du, Yanghao Li, Yongqiang Wang, Yuan Gao, Zaid Ahmed, Zhaoyang Xu, Zhiyun Lu, Zirui Wang, Al Rashid, Albin Madappally Jose, Ale
2024-07-29
Apple Intelligence Foundation Language Models
https://machinelearning.apple.com/research/apple-intelligence-foundation-language-models
2730000000.00
Table 1, sum of non-embedding and embedding parameters
4.5126e+23
Model was initialized from a pruned version of a 6.4B parameter model trained using the same recipe as AFM-server. Assuming "same recipe" involves training for the full 6.3T tokens, this implies 6 * 6.3T * 6.4B = 2.42e23 FLOP. The pruning masks are learned by training over 188B tokens, which suggests 6 * 188B * 6.4B = 7.22e21 FLOPs. Pretraining is then run over 6.3T tokens; however, labels are a convex combination of true labels and the predicted labels from the unpruned 6.4B model. Since thi
188B of tokens are used to train a pruning mask to reduce a 6.4B model to the 2.73B used for AFM-on-device. Main pre-training data is 6.3T tokens of web text, code, and math, plus another 1T in the second pre-training stage and 100B in the third. See section 3.1 for details. Post-training details do not give details on dataset size.
7588000000000
Not explicitly mentioned, but I assume the 7.588T tokens do not involve multiple epochs.
Google TPU v5p
Confident
We present foundation language models developed to power Apple Intelligence features, including a ∼3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute [Apple, 2024b]. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference
Hosted access (no API)
United States of America
2048
Unreleased
Industry
SEA-LION V3 Gemma2 9B
Language
Language modeling/generation
AI Singapore
2024-12-19
SEA-LION V3
https://huggingface.co/aisingapore/gemma2-9b-cpt-sea-lionv3-base
9000000000.00
4.484146e+23
Gemma 2 9B base model: 4.32e+23 FLOP. Additional continued pretraining: 10 days × 24 × 60 × 60 s × 989.5e12 FLOP/s × 64 GPUs × 0.3 utilization = 1.64146e+22 FLOP. Total: 1.64146e+22 + 4.32e+23 = 4.484146e+23 FLOP (sketched in code after this record).
The Stack v2
Dolma
Trained on a mix of datasets including StackV2 and Dolma (see https://huggingface.co/aisingapore/gemma2-9b-cpt-sea-lionv3-base#data)
200000000000
"pre-trained on 200B tokens"
NVIDIA H100 SXM5 80GB
Unverified
Our SEA-LIONv3 Gemma2-9B base model has been continued pre-trained on top of the Gemma2 base model that is 9 billion parameters in size, and has a context length of 8192.
Open weights (unrestricted)
Gemma 2 9B
64
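A minimal Python sketch of the SEA-LION V3 Gemma2-9B estimate above: base-model compute plus continued pretraining, with the note's assumed 10 days on 64 H100s at 989.5 TFLOP/s peak and 30% utilization.

```python
# Illustrative: SEA-LION V3 Gemma2-9B total compute = Gemma 2 9B base compute
# plus continued pretraining, per the assumptions in the notes above.

base_flop = 4.32e23                                   # Gemma 2 9B pretraining
cpt_flop = 10 * 24 * 3600 * 989.5e12 * 64 * 0.3       # ~1.64e22 FLOP
print(f"{cpt_flop:.3e} continued pretraining, {base_flop + cpt_flop:.3e} total")
```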
Pharia-1-LLM-7B
Language
Language modeling/generation
Translation
Question answering
Aleph Alpha
2024-08-26
Introducing Pharia-1-LLM: transparent and compliant
https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control
7041544704.00
4.4299999999999995e+23
reported by the authors: 2.75*10^23 + 1.68*10^23 = 4.43*10^23 FLOP https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control#compute--training-efficiency
Common Crawl
The training data of our models comprises two components: web-crawled data and structured datasets with a total size of 7.7T, with a cutoff date 04/2023. We performed some additional web scraping to augment these datasets. Web-crawled data was obtained by filtering and deduplicating data available in public datasets, derived from Common Crawl, in the following languages: English, French, German, Italian, Spanish, Dutch, Portuguese. To deduplicate the data, we applied a Bloomfilter for exact do
7700000000000
4.7T + 3T = 7.7T tokens
NVIDIA A100 SXM4 80 GB
NVIDIA H100 SXM5 80GB
Confident
We are pleased to announce our new foundation model family that includes Pharia-1-LLM-7B-control and Pharia-1-LLM-7B-control-aligned, now publicly available under the Open Aleph License, which explicitly allows for non-commercial research and educational use. Pharia-1-LLM-7B-control is engineered to deliver concise, length-controlled responses that match the performance of leading open-source models in the 7B to 8B parameter range and is culturally and linguistically optimized for German, French
Open weights (non-commercial)
Germany
256
Industry
Gemma 2 9B
Language
Language modeling/generation
Chat
Code generation
Question answering
Quantitative reasoning
Google DeepMind
Gemma Team, Google DeepMind
2024-06-24
Gemma 2 offers best-in-class performance, runs at incredible speed across different hardware and easily integrates with other AI tools.
https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf
9000000000.00
4.32e+23
"For the 9B model, we train on an 8x16x32 configuration of TPUv4, totaling 4096 chips" 6ND = 6*9000000000*8000000000000=4.32e+23
Unspecified unreleased
Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. Primarily English-language content. Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code or understand code-related questions. Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and to address mathematical quer
8000000000000
"the 9B model on 8 trillion tokens"
Google TPU v4
Confident
Now we’re officially releasing Gemma 2 to researchers and developers globally. Available in both 9 billion (9B) and 27 billion (27B) parameter sizes, Gemma 2 is higher-performing and more efficient at inference than the first generation, with significant safety advancements built in. In fact, at 27B, it offers competitive alternatives to models more than twice its size, delivering the kind of performance that was only possible with proprietary models as recently as December. And that’s now achie
Open weights (restricted use)
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
4096
Unreleased
Gemma 2 is available under our commercially-friendly Gemma license, giving developers and researchers the ability to share and commercialize their innovations.
Industry
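A minimal Python sketch of the 6ND approximation used for Gemma 2 9B above (9B parameters, 8T tokens, as quoted in the notes).

```python
# Illustrative: Gemma 2 9B compute via the 6ND approximation.

params, tokens = 9e9, 8e12
print(f"{6 * params * tokens:.2e} FLOP")   # 4.32e+23
```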
OPT-175B
Language
Language modeling
Chat
Language modeling/generation
Question answering
Meta AI
Susan Zhang∗ , Stephen Roller∗ , Naman Goyal∗ , Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott† , Sam Shleifer† , Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer
2022-05-02
OPT: Open Pre-trained Transformer Language Models
https://arxiv.org/abs/2205.01068
2932.00
175000000000.00
"In line with Meta AI’s commitment to open science, we are sharing Open Pretrained Transformer (OPT-175B), a language model with 175 billion parameters trained on publicly available data sets"
4.3e+23
https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/final_update.md "As of yesterday, at 12:46pm PST on January 6, our 175B model finally completed its training run on 300B tokens. This required ~4.30E+23 FLOPs of compute"
The Pile
BookCorpus (BooksCorpus, Toronto Book Corpus)
CC-Stories
Pushshift Reddit
"The pre-training corpus contains a concatenation of datasets used in RoBERTa (Liu et al., 2019b), the Pile (Gao et al., 2021a), and PushShift.io Reddit (Baumgartner et al., 2020; Roller et al., 2021)" ... "RoBERTa We included the BookCorpus (Zhu et al., 2015) and Stories (Trinh and Le, 2018) subsets of the RoBERTa corpus and utilized an updated version of CCNews, containing news stories crawled through September 28, 2021. This CCNews v2 corpus was preprocessed the same way as the original RoBER
180000000000
"The training data contains 180B tokens corresponding to 800 GB of data" 1 token ~ 0.75 words
NVIDIA A100 SXM4 80 GB
Confident
Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M t
Open weights (non-commercial)
United States of America
1024
Open source
non-commercial for weights: https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/MODEL_LICENSE.md training code (MIT) https://github.com/facebookresearch/metaseq/blob/main/docs/training.md
Industry
OPT-IML (175B)
Language
Language modeling
Meta AI
Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, Xian Li, Brian O'Horo, Gabriel Pereyra, Jeff Wang, Christopher Dewan, Asli Celikyilmaz, Luke Zettlemoyer, Ves Stoyanov
2022-12-22
OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
https://arxiv.org/abs/2212.12017
236.00
175000000000.00
4.3e+23
Fine-tuned from OPT-175B (4.3e23 FLOP), with an estimated 2.1e21 FLOP for fine-tuning. "During fine-tuning, our models saw approximately 2 billion tokens, which is only 0.6% of the pre-training budget of OPT"
OPT-IML Bench
(fine-tuning dataset) "To this end, we create OPT-IML Bench: a large benchmark for Instruction MetaLearning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks"
2000000000
"During fine-tuning, our models saw approximately 2 billion tokens, which is only 0.6% of the pre-training budget of OPT"
NVIDIA A100
Likely
Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and w
Open weights (non-commercial)
United States of America
OPT-175B
2100000000000B
"We fine-tune all 30B models on 64 40GB A100s, and 175B models on 128 40GB A100s", no timeframe specified fine-tuned on 2B tokens. 2B * 175B * 6 = 2.1e21
128
Unreleased
unclear license https://huggingface.co/facebook/opt-iml-30b
Industry
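A minimal Python sketch of the OPT-IML (175B) finetune estimate above: 6ND over the ~2B finetuning tokens, added to the OPT-175B pretraining compute.

```python
# Illustrative: OPT-IML 175B finetuning compute per the notes above.

finetune_flop = 6 * 2e9 * 175e9        # ~2.1e21 FLOP
total_flop = 4.3e23 + finetune_flop    # dominated by OPT-175B pretraining
print(f"{finetune_flop:.1e} finetune FLOP, {total_flop:.2e} total FLOP")
```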
BlenderBot 3
Language
Chat
McGill University
Meta AI
Mila - Quebec AI (originally Montreal Institute for Learning Algorithms)
Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, Morteza Behrooz, William Ngan, Spencer Poff, Naman Goyal, Arthur Szlam, Y-Lan Boureau, Melanie Kambadur, Jason Weston
2022-08-10
BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage
https://arxiv.org/abs/2208.03188, https://github.com/facebookresearch/ParlAI/blob/main/parlai/zoo/bb3/model_card.md training code: https://parl.ai/projects/bb3/
218.00
175000000000.00
4.3e+23
(taken from OPT-175 base)
BlenderBot 3 Data
Fine-tuned from OPT-175B. "The fine-tuning data for BB3 comprises roughly 4 million source/target examples spread across the various training modules. This corresponds to around 1.13B training tokens. When fine-tuning the OPT-based BB3 models, we additionally included 600k examples (~170m tokens) of pre-training data to help with training stability. Table 16 and Table 17 enumerate the breakdown by module."
1300000000
NVIDIA A100 SXM4 40 GB
Likely
We present BlenderBot 3, a 175B parameter dialogue model capable of open-domain conversation with access to the internet and a long-term memory, and having been trained on a large number of user defined tasks. We release both the model weights and code, and have also deployed the model on a public web page to interact with organic users. This technical report describes how the model was built (architecture, model and training scheme), and details of its deployment, including safety mechanisms. H
Open weights (non-commercial)
Canada
United States of America
Canada
OPT-175B
1500000000000B
"The 30B and 175B parameter BlenderBot 3 models were each trained for one epoch of the training data on 64 (30B) or 128 (175B) x 40gb A100 GPUs; we found that the model (especially the 175B version) overfit significantly when seeing the training data more than once. The 175B model was trained with a batch size of 2^18 and the 30B model was trained with a batch size of 2^19, resulting in roughly 5600 updates and 2800 updates respectively." 175b params * 5600 * 2^18 * 6 = 1.5e21
128
Open source
weights have a non-commercial license, must go through request form: https://docs.google.com/forms/d/e/1FAIpQLSfRzw8xVzxaxgRyuodTZtkcYADAjzYjN5gcxx6DMa4XaGwwhQ/viewform meanwhile training code is here. repo is MIT-licensed https://github.com/facebookresearch/ParlAI/blob/main/parlai/scripts/train_model.py
Academia
Industry
Academia
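A minimal Python sketch of the BlenderBot 3 (175B) finetune estimate above, recovering the token count from the reported ~5600 updates at a batch size of 2^18 tokens.

```python
# Illustrative: BlenderBot 3 (175B) finetuning compute from updates x batch size.

params = 175e9
tokens = 5600 * 2**18                  # ~1.47e9 finetuning tokens
flop = 6 * params * tokens             # ~1.5e21 FLOP
print(f"{tokens:.2e} tokens, {flop:.1e} FLOP")
```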
EXAONE 3.5-R 7.8B
Language
Language modeling/generation
Question answering
Translation
LG AI Research
2025-03-14
7800000000.00
7.8B
4.2568e+23
4.21 × 10^23 (base model reported training compute) + 4.68 × 10^21 (finetune compute) = 4.2568e+23 FLOP
Confident
Unreleased
Korea (Republic of)
EXAONE 3.5 7.8B
4680000000000B
4.68e21
Unreleased
Industry
EXAONE Deep 7.8B
Language
Language modeling/generation
Question answering
Quantitative reasoning
Code generation
LG AI Research
LG AI Research, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun
2025-03-16
EXAONE Deep: LLMs with Enhanced Reasoning Performance
https://arxiv.org/abs/2503.12524
7800000000.00
7.8B
4.23e+23
4.21 × 10^23 (base model reported training compute) + 1.71 × 10^21 (finetune compute) = 4.23 × 10^23 FLOP Table 1
Unspecified unreleased
12000000000
"To enhance the reasoning capabilities of language models, we have utilized 1.6M instances for SFT and 20K instances of preference data for DPO. The SFT dataset contains approximately 12B tokens"
NVIDIA H100 SXM5 80GB
Confident
We present EXAONE Deep series, which exhibits superior capabilities in various reasoning tasks, including math and coding benchmarks. We train our models mainly on the reasoning-specialized dataset that incorporates long streams of thought processes. Evaluation results show that our smaller models, EXAONE Deep 2.4B and 7.8B, outperform other models of comparable size, while the largest model, EXAONE Deep 32B, demonstrates competitive performance against leading open-weight models. All EXAONE Dee
Open weights (non-commercial)
Korea (Republic of)
EXAONE 3.5 7.8B
1710000000000B
Table 1 (reported): 1.71 × 10^21 FLOP 6ND = 6*7.8B parameters * 12B tokens = 5.616e+20 FLOP
Unreleased
https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-7.8B Exaone License
Industry
EXAONE 3.5 7.8B
Language
Language modeling/generation
Question answering
Translation
LG AI Research
Soyoung An, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Yountae Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Woohyung Lim, Sangha Park, Sooyoun Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Hyeongu Yun
2024-12-09
EXAONE 3.5: Series of Large Language Models for Real-world Use Cases
https://arxiv.org/abs/2412.04862
7800000000.00
7.8B
4.209999999999999e+23
4.21 × 10^23 (Table 2)
Unspecified unreleased
9000000000000
9T tokens (Table 2)
Confident
This technical report introduces the EXAONE 3.5 instruction-tuned language models, developed and released by LG AI Research. The EXAONE 3.5 language models are offered in three configurations: 32B, 7.8B, and 2.4B. These models feature several standout capabilities: 1) exceptional instruction following capabilities in real-world scenarios, achieving the highest scores across seven benchmarks, 2) outstanding long-context comprehension, attaining the top performance in four benchmarks, and 3) compe
Open weights (non-commercial)
Korea (Republic of)
Unreleased
Exaone license (allows only non-commercial usage)
Industry
BlueLM 70B
Language
Chat
Language modeling/generation
Question answering
vivo AI lab
2023-11-02
https://baijiahao.baidu.com/s?id=1781445143383237948&wfr=spider&for=pc
70000000000.00
4.1999999999999996e+23
6ND = 6*70B*1000B=4.2e+23
Unspecified unreleased
1000B text data; 10B image data; 100M video data; 100M knowledge graph (from the conference handout).
Confident
Unreleased
China
Unreleased
Information about the model is from their paper catalogue and is not found on the internet.
Industry
Llama 2-34B
Language
Language modeling
Meta AI
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Mar
2023-07-18
Llama 2: Open Foundation and Fine-Tuned Chat Models
https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/ https://arxiv.org/abs/2307.09288
8056.00
34000000000.00
Llama 2 was trained in 7B, 13B, 34B, and 70B variants; the 34B variant was not publicly released.
4.08e+23
All model sizes were trained on 2.0T tokens, per Table 1: 2e12 tokens × 34e9 parameters × 6 = 4.08e23 FLOP. The model was also trained on 1,038,336 A100-hours, which corresponds to 3.5e23 FLOP at 30% utilization, so the utilization was probably around 35% (sketched in code after this record).
Llama 2 dataset
2 trillion tokens of publicly available text, with no text from Meta's products. "Our training corpus includes a new mix of data from publicly available sources, which does not include data from Meta’s products or services. We made an effort to remove data from certain sites known to contain a high volume of personal information about private individuals. We trained on 2 trillion tokens of data as this provides a good performance–cost trade-off, up-sampling the most factual sources in an effort
NVIDIA A100 SXM4 80 GB
Confident
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closedsource models. We provide a detailed description of our approach to
Unreleased
United States of America
Unreleased
Industry
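A minimal Python sketch of the Llama 2-34B estimate above: 6ND over 2T tokens plus the utilization cross-check against the reported 1,038,336 A100-hours (312 TFLOP/s peak assumed).

```python
# Illustrative: Llama 2-34B 6ND estimate and implied utilization from the
# reported A100-hours, per the notes above.

flop_6nd = 6 * 2e12 * 34e9                    # ~4.08e23 FLOP
available = 1_038_336 * 3600 * 312e12         # FLOP available at peak
print(f"{flop_6nd:.2e} FLOP, implied utilization {flop_6nd / available:.0%}")  # ~35%
```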
phi-3-medium 14B
Language
Chat
Language modeling/generation
Microsoft
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Dan Iter, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin
2024-04-23
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
https://arxiv.org/abs/2404.14219
14000000000.00
14B
4.032e+23
Counting operations: 6 × 4.8e12 tokens × 14e9 parameters ≈ 4.032e23 FLOP.
Phi-3 Dataset
we also trained phi-3-medium, a model with 14B parameters using the same tokenizer and architecture of phi-3-mini, and trained on the same data for slightly more epochs (4.8T tokens total as for phi-3-small)
4800000000000
Likely
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data a
Open weights (unrestricted)
United States of America
Unreleased
MIT license for weights: https://huggingface.co/microsoft/Phi-3-medium-128k-instruct
Industry
EXAONE 3.0
Language
Language modeling/generation
Code generation
Question answering
LG AI Research
LG AI Research: Soyoung An, Kyunghoon Bae, Eunbi Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Yeonjung Hong, Junwon Hwang, Hyojin Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Yountae Jung, Euisoon Kim, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Moontae Lee, Seungjun Lee, Woohyung Lim, Sangha Park, Sooyoun Park, Yongmin Park, Boseong Seo, Sihoon Yang, Heuiyeen Y
2024-08-07
EXAONE 3.0 7.8B Instruction Tuned Language Model
https://arxiv.org/abs/2408.03541
7820000000.00
7.8B
4.0000000000000003e+23
6ND = 6 × 7.8B parameters × 8T tokens = 3.744e+23 FLOP. "EXAONE language models were trained using Google Cloud Platform and a cluster powered by NVIDIA H100 GPUs and NVIDIA NeMo Framework. Then, they were optimized by NVIDIA TensorRT-LLM. The total amount of computation used for model training was about 4 × 10^23 FLOPS"
Unspecified unreleased
8T training data (tokens); the token-per-word ratio is 2.46, as given in the paper.
8000000000000
NVIDIA H100 SXM5 80GB
Confident
We introduce EXAONE 3.0 instruction-tuned language model, the first open model in the family of Large Language Models (LLMs) developed by LG AI Research. Among different model sizes, we publicly release the 7.8B instruction-tuned model to promote open research and innovations. Through extensive evaluations across a wide range of public and in-house benchmarks, EXAONE 3.0 demonstrates highly competitive real-world performance with instruction-following capability against other state-of-the-art op
Open weights (non-commercial)
Korea (Republic of)
https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct Exaone license (allows only non-commercial usage)
Industry
Parti
Image generation
Text-to-image
Image generation
Google Research
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu
2022-06-22
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
https://arxiv.org/abs/2206.10789v1
880.00
20000000000.00
Abstract: "we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters"
3.962895376192635e+23
Calculated from architecture. Does not take into account the encoding and decoding of text and images, only the transformer stack. Table 1 shows, for the 20B model: 16 encoder layers, 64 decoder layers, d_model = 4096, d_hidden = 16384, 64 attention heads. Just below Table 1: "We use a maximum length of text tokens of 128, and the length of image tokens are fixed to 1024". I take the sequence length to be 100 for the encoder stack and 1024 for the decoder stack. Section 3, Training: "a total of 450
LAION-400M
FIT400M
JFT-4B
4800000000
Google TPU v4
We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language
Unreleased
Multinational
United States of America
Unreleased
"For these reasons, we have decided not to release our Parti models, code, or data for public use without further safeguards in place" https://sites.research.google/parti/
Industry
DeepSeek Coder 33B
Language
Code generation
DeepSeek
Peking University
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y.K. Li, Fuli Luo, Yingfei Xiong, Wenfeng Liang
2024-01-25
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
https://arxiv.org/abs/2401.14196
33000000000.00
33B
3.96e+23
"Step 1: Initially pre-trained with a dataset consisting of 87% code, 10% code-related language (Github Markdown and StackExchange), and 3% non-code-related Chinese language. Models are pre-trained using 1.8T tokens and a 4K window size in this step. Step 2: Further Pre-training using an extended 16K window size on an additional 200B tokens, resulting in foundational models (DeepSeek-Coder-Base). Step 3: Instruction Fine-tuning on 2B tokens of instruction data, resulting in instruction-tuned mod
"Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages."
2000000000000
"Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages." "The total data volume is 798 GB with 603 million files."
Likely
The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window
Open weights (restricted use)
China
China
Unreleased
code doesn't seem to be training code. deepseek license: https://github.com/deepseek-ai/DeepSeek-Coder/blob/main/LICENSE-MODEL
Industry
Academia
StarCoder 2 15B
Language
Code generation
Code autocompletion
Hugging Face
ServiceNow
NVIDIA
BigCode
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muenn
2024-02-29
StarCoder 2 and The Stack v2: The Next Generation
https://arxiv.org/abs/2402.19173
15000000000.00
15B
3.87e+23
estimation is given in Table 6
The Stack v2
See Table 4. The Stack V2 plus some extras. Created from repositorites from Github with permissive licences.
913230000000
from Table 4
Confident
The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a
Open weights (restricted use)
Multinational
United States of America
United States of America
United States of America
Unreleased
commercial use allowed, but various use cases restricted: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement code is fine-tune only: https://github.com/bigcode-project/starcoder2?tab=readme-ov-file#fine-tuning
Industry
Industry
Industry
FunSearch
Language
Search
Code generation
Google DeepMind
Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, Alhussein Fawzi
2023-12-14
Mathematical discoveries from program search with large language models
https://www.nature.com/articles/s41586-023-06924-6 https://deepmind.google/discover/blog/funsearch-making-new-discoveries-in-mathematical-sciences-using-large-language-models/
170.00
15000000000.00
From the section called "Pretrained LLM": "We use Codey, an LLM built on top of the PaLM2 model family... Because FunSearch relies on sampling from an LLM extensively, an important performance-defining tradeoff is between the quality of the samples and the inference speed of the LLM. In practice, we have chosen to work with a fast-inference model (rather than slower-inference, higher-quality)" Unclear which PaLM2 model was used (of Gecko, Otter, Bison, and Unicorn); above quote indicates it was
3.87e+23
Appendix A.5: "Finding the full-sized symmetric admissible set I(15, 10) required the generation and analysis of approximately two million programs... To reproduce admissible set experiments done above (generating 2 million samples) one would have to use 15 instances of StarCoder-15B running on A100 40 GB GPU each and 5 CPU servers (each running 32 evaluators in parallel) for two days. We estimate that when running on Google Cloud, the price of an experiment is around $800 – $1400, and the energ
"The experiments carried out in this paper do not require any data corpus other than the publicly available OR-Library bin packing benchmarks"
0
"The experiments carried out in this paper do not require any data corpus other than the publicly available OR-Library bin packing benchmarks"
Speculative
Large language models (LLMs) have demonstrated tremendous capabilities in solving complex tasks, from quantitative reasoning to understanding natural language. However, LLMs sometimes suffer from confabulations (or hallucinations), which can result in them making plausible but incorrect statements1,2. This hinders the use of current large models in scientific discovery. Here we introduce FunSearch (short for searching in the function space), an evolutionary procedure based on pairing a pretraine
Open weights (unrestricted)
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
PaLM 2
0B
No finetuning
Unreleased
Code to run FunSearch with an LLM of your choice is open source under Apache 2.0 (software) and CC-BY (all else). However, the actual LLM used in the main experiments is unknown and may or may not be one of the Codey models available via API access. (in other words code is available for the search tool but not for the model): https://github.com/google-deepmind/funsearch
Industry
Multi-Token Prediction 7B
Language
Code generation
Facebook AI Research
Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve
2024-04-30
Better & Faster Large Language Models via Multi-token Prediction
https://arxiv.org/abs/2404.19737
6700000000.00
6.7B (“7B”)
3.841092e+23
"training all models reported in the paper required around 500K GPU hours of computation on hardware of type A100-80GB and H100." A100-80 GB peak FLOP/s [assumed fp16 precision]: 77970000000000 H100 peak FLOP/s [assumed SXM5 TensorCore]: 989000000000000 assuming 50/50 usage: (77970000000000+989000000000000)*0.5*500000hours*3600s*0.3=2.880819e+23 for ALL models in the paper assuming this model has taken around 40% of all used compute (https://docs.google.com/spreadsheets/d/1Yc-HAdYgn6e9SUIliMaQ
CodeContests
Unspecified unreleased
250000000000
1T total tokens over 4 epochs (Table 1)
NVIDIA A100 SXM4 80 GB
NVIDIA H100 SXM5 80GB
Likely
Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved do
Open weights (non-commercial)
United States of America
Unreleased
https://huggingface.co/facebook/multi-token-prediction "we’re releasing the pre-trained models for code completion under a non-commercial/research-only license."
Industry
Arctic
Language
Language modeling/generation
Question answering
Code generation
Quantitative reasoning
Snowflake
Snowflake AI Research
2024-04-24
Snowflake Arctic: The Best LLM for Enterprise AI — Efficiently Intelligent, Truly Open
https://www.snowflake.com/en/blog/arctic-open-efficient-foundation-language-models-snowflake/
480000000000.00
" It combines a 10B dense transformer model with a residual 128x3.66B MoE MLP resulting in 480B total and 17B active parameters chosen using a top-2 gating."
3.8347175e+23
From the blog-post graph (compute relative to Arctic = 1x): 1.9x ≈ Llama 3 8B (7.2e23) ≈ Yi 34B (6.1e23) -> x = 6.1e23 / 1.9 = 3.2105263e+23; 3x ≈ Code Llama 70B (1.26e+24) -> x = 4.2e+23; 17.5x ≈ Llama 3 70B (7.861e24) -> x = 4.492e+23 (geometric mean of the 1.9x and 17.5x estimates: 3.7975893e+23). Operation counting (17B active parameters): 6ND = 6 × 17e9 × 3.5e12 = 3.57e+23. Geometric mean of all four estimates: (3.2105263e+23 × 4.492e+23 × 4.2e+23 × 3.57e+23)^(1/4) = 3.8347175e+23 FLOP (sketched in code after this record).
3500000000000
"Arctic was trained with a three-stage curriculum each with a different data composition focusing on generic skills in the first phase (1T Tokens), and enterprise-focused skills in the latter two phases (1.5T and 1T tokens). " 1+1.5+1 = 3.5
Confident
Built by Snowflake, Arctic is a family of enterprise-grade LLMs with leading performance in enterprise intelligence and breakthrough efficiency. Snowflake Arctic is a truly open, Apache 2.0 licensed model.
Open weights (unrestricted)
United States of America
Open source
Apache 2.0 license with ungated access to weights and code paired with open data recipe and research insights.
Industry
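A minimal Python sketch of the Arctic estimate above: three values read off the blog-post graph plus a 6ND estimate over the 17B active parameters, combined with a geometric mean. The graph readings and reference-model compute values are the ones quoted in the note.

```python
# Illustrative: combining the four Arctic compute estimates from the notes
# above with a geometric mean.

from math import prod

graph_estimates = [6.1e23 / 1.9, 1.26e24 / 3, 7.861e24 / 17.5]   # from the graph
flop_6nd = 6 * 17e9 * 3.5e12                   # 17B active params, 3.5T tokens
estimates = graph_estimates + [flop_6nd]
geo_mean = prod(estimates) ** (1 / len(estimates))
print(f"{geo_mean:.3e}")                       # ~3.83e23 FLOP
```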
Jurassic-1-Jumbo
Language
Language modeling/generation
Chat
AI21 Labs
Opher Lieber, Or Sharir, Barak Lenz, Yoav Shoham
2021-08-11
Jurassic-1: Technical Details and Evaluation
https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf
55.00
178000000000.00
"Jurassic-1 models come in two sizes, where the Jumbo version, at 178B parameters, is the largest and most sophisticated language model ever released for general use by developers."
3.7e+23
see here https://docs.google.com/document/d/1B8x6XYcmB1u6Tmq3VcbAtj5bzhDaj2TcIPyK6Wpupx4/edit
225000000000
"Our model was trained with the conventional self-supervised auto-regressive training objective on 300B tokens drawn from publicly available resources" 1 token ~ 0.75 words
NVIDIA A100
Jurassic-1 is a pair of auto-regressive language models recently released by AI21 Labs, consisting of J1-Jumbo, a 178B-parameter model, and J1-Large, a 7B-parameter model. We describe their architecture and training, and evaluate their performance relative to GPT-3. The evaluation is in terms of perplexity, as well as zero-shot and few-shot learning. To that end, we developed a zero-shot and few-shot test suite, which we made publicly available (https://github.com/ai21labs/lm-evaluation) as a sh
API access
Israel
Unreleased
Industry
BLOOM-176B
Language
Language modeling
Translation
Code generation
Hugging Face
BigScience
Margaret Mitchell, Giada Pistilli, Yacine Jernite, Ezinwanne Ozoani, Marissa Gerchick, Nazneen Rajani, Sasha Luccioni, Irene Solaiman, Maraim Masoud, Somaieh Nikpoor, Carlos Muñoz Ferrandis, Stas Bekman, Christopher Akiki, Danish Contractor, David Lansky, Angelina McMillan-Major, Tristan Thrush, Suzana Ilić, Gérard Dupont, Shayne Longpre, Manan Dey, Stella Biderman, Douwe Kiela, Emi Baylor, Teven Le Scao, Aaron Gokaslan, Julien Launay, Niklas Muennighoff
2022-07-11
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
https://arxiv.org/abs/2211.05100
1984.00
176247271424.00
See "Technical Specifications" on Hugging Face: https://huggingface.co/bigscience/bloom
3.65664e+23
https://bigscience.huggingface.co/blog/bloom The blog post says training took 117 days. 384 A100 GPUs × 314 TFLOP/s throughput per GPU × 117 days × 0.3 (utilization assumption) = 3.65664e23 FLOP (sketched in code after this record). https://www.wolframalpha.com/input?i=384+*+314+TFLOPS+*+117+days+*+0.3
BigScience ROOTS Corpus
In total, 1.6 terabytes of pre-processed text was converted into 350 billion unique tokens as BLOOM's training datasets. arXiv:2210.15424 "BLOOM was trained on the ROOTS corpus (Laurençon et al., 2022), a composite collection of 498 Hugging Face datasets (Lhoest et al., 2021) amounting to 1.61 terabytes of text that span 46 natural languages and 13 programming languages. A high-level overview of this dataset can be seen in Figure 3, while a detailed itemized list of every language along with i
379000000000
Table 3.5 https://arxiv.org/pdf/2211.05100 366B (pretrain) + 13B (finetune) = 379B tokens total
NVIDIA A100 SXM4 80 GB
Confident
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a d
Open weights (restricted use)
Multinational
United States of America
Multinational
France
384
Unreleased
responsible use restrictions: https://bigscience.huggingface.co/blog/the-bigscience-rail-license
Industry
Research collective
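A minimal Python sketch of the BLOOM-176B hardware-based estimate above, using the note's figures (384 A100s, 314 TFLOP/s throughput, 117 days, 30% assumed utilization).

```python
# Illustrative: BLOOM-176B compute from GPU count x throughput x time.

gpus, throughput, days, util = 384, 314e12, 117, 0.3
flop = gpus * throughput * days * 24 * 3600 * util    # ~3.66e23 FLOP
print(f"{flop:.4e}")
```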
GLaM
Language
Language modeling/generation
Question answering
Google
Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui
2021-12-13
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
https://arxiv.org/abs/2112.06905
597.00
1200000000000.00
1.2 trillion parameters
3.6363112434e+23
The network activates 96.6 billion parameters per token and was trained for 600B tokens: 6 × 600e9 × 96.6e9 = 3.478e23 FLOP. Digitizing Figure 4(d) indicates 139.67 TPU-years of training: 2.75e14 FLOP/s × 139.67 years × 365.25 × 24 × 3600 s × 0.3 utilization = 3.636e23 FLOP (both estimates are sketched in code after this record). Since these are close, we will use the 6ND estimate and derive hardware utilization from the training time information. Later they say they measured 326W power usage per chip, which could maybe be used to estimate utilization.
Wikipedia
GLaM dataset
"To train our model, we build a high-quality dataset of 1.6 trillion tokens that are representative of a wide range of natural language use cases. Web pages constitute the vast quantity of data in our unlabeled dataset. However, their quality ranges from professional writing to low-quality comment and forum pages."
600000000000
The dataset is made of 1.6 trillion tokens, but later in the paper they say they only trained the largest model for 600B tokens. 600B tokens × 0.75 words/token ≈ 450B words. "The complete GLaM training using 600B tokens consumes only 456 MWh and emits 40.2 net tCO2e."
Google TPU v4
Confident
Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to s
Unreleased
United States of America
1024
Unreleased
Industry
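A minimal Python sketch of the two GLaM estimates above: 6ND over the ~96.6B parameters activated per token, and the hardware estimate from the digitized ~139.67 TPU-years (275 TFLOP/s TPU v4 peak and 30% utilization, as assumed in the note).

```python
# Illustrative: the two GLaM compute estimates from the notes above.

flop_6nd = 6 * 600e9 * 96.6e9                             # ~3.48e23 FLOP
flop_hw = 275e12 * 139.67 * 365.25 * 24 * 3600 * 0.3      # ~3.64e23 FLOP
print(f"{flop_6nd:.3e}  {flop_hw:.3e}")
```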
Falcon 2 11B
Language
Language modeling/generation
Technology Innovation Institute
2024-05-09
Falcon2-11B
https://huggingface.co/tiiuae/falcon-11B
11000000000.00
11B
3.6e+23
Trained on 5.5T tokens: 6 × 11e9 × 5.5e12 = 3.6e23 FLOP.
RefinedWeb
"Falcon2-11B was trained over 5,000B tokens of RefinedWeb, a high-quality filtered and deduplicated web dataset which we enhanced with curated corpora. It followed a four stage training strategy. The first three stages were focused on increasing the context length, from to 2048 to 4096 and finally to 8192 tokens. The last stage aimed to further enhance performance using only high quality data." Possibly an updated version of RefinedWeb, which only had 3.5T tokens when Falcon 1 was released? not
5500000000000
5.5T tokens: https://falconllm.tii.ae/falcon-2.html
NVIDIA A100 SXM4 40 GB
Confident
Falcon2-11B is an 11B parameters causal decoder-only model built by TII and trained on over 5,000B tokens of RefinedWeb enhanced with curated corpora. The model is made available under the TII Falcon License 2.0, the permissive Apache 2.0-based software license which includes an acceptable use policy that promotes the responsible use of AI.
Open weights (restricted use)
United Arab Emirates
Open but has an acceptable use policy: https://falconllm-staging.tii.ae/falcon-2-acceptable-use-policy.html
Government
LaMDA
Language
Language modeling
Google
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos
2022-02-10
LaMDA: Language Models for Dialog Applications
https://arxiv.org/abs/2201.08239
1375.00
137000000000.00
"LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters"
3.55e+23
"The total FLOPS is 56.5% * 123 TFLOPS/s * 1024 chips * 57.7 days = 3.55E+23" From https://arxiv.org/pdf/2201.08239.pdf p.18
Infiniset
LaMDA's underlying dataset is called 'Infiniset', and besides the dialogue data it also includes Common Crawl, Wikipedia, a mixture of English and non-English web documents, and data from programming-related sites (so LaMDA models can also dabble in code).
1560000000000
"and are pre-trained on 1.56T words of public dialog data and web text"
Google TPU v3
Confident
We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvements on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improve
Unreleased
United States of America
1024
Unreleased
Industry
GLM-130B
Language
Language modeling/generation
Translation
Tsinghua University
Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, Jie Tang
2022-08-04
GLM-130B: An Open Bilingual Pre-trained Model
https://keg.cs.tsinghua.edu.cn/glm-130b/posts/glm-130b/
989.00
130000000000.00
Dense model
3.5490054945e+23
"96 NVIDIA A100 (40G * 8) servers for 2 months" 312 TFLOPS/GPU * 96 servers * 8 GPU/server * 2 months * 32.5% utilization = 4.037e23 utilization rate - citation from the paper: "we report hardware FLOPs utilization (HFU) of 43.3% and model FLOPs utilization (MFU) of 32.5% due to re-materialization." Aligns pretty well with 6ND: 6 * 400B * 130B = 3.12E23 Geometric mean: sqrt(4.037e23 * 3.12e23) = 3.549e23
The Pile
WuDao Corpora
"The pre-training data includes 1.2T Pile (train split) (Gao et al., 2020) English, 1.0T Chinese WudaoCorpora (Yuan et al., 2021), and 250G Chinese corpora (including online forums, encyclopedia, and QA) we crawl from the web, which form a balanced composition of English and Chinese contents"
400000000000
400B "We completed the 400B-token training and evaluation of GLM-130B in July, and subsequently released the model and pre-training details in August 2022. " from https://arxiv.org/pdf/2406.12793 "As of July 3rd, 2022, GLM-130B has been trained on over 400 billion text tokens (200B each for Chinese and English)"
NVIDIA A100 SXM4 40 GB
Confident
GLM-130B (ICLR 2023) is an open bilingual (English & Chinese) bidirectional dense model with 130 billion parameters, pre-trained using the General Language Model (GLM) algorithm1. It is designed to support inference tasks with the 130B parameters on a single A100 (40G * 8) or V100 (32G * 8) server. As of July 3rd, 2022, GLM-130B has been trained on over 400 billion text tokens (200B each for Chinese and English)
Open weights (non-commercial)
China
768
Unreleased
non commercial license. looks like inference but not training code: https://github.com/THUDM/GLM-130B/blob/main/MODEL_LICENSE
Academia
Luminous-supreme
Language
Language generation
Aleph Alpha
2022-08-15
Model Card Luminous
https://docs.aleph-alpha.com/docs/introduction/model-card/
70000000000.00
"~70B"
3.5461e+23
"~839000h" GPU-hours on A100s, per Environmental Impact section of model card. 312 trillion * 839000 * 3600 * 0.3 = 2.8e23 6ND = 6*70B*1069.30B = 4.49106e+23 sqrt(2.8e23*4.49106e+23) = 3.54612... × 10^23 reported here: 167TFLOPS https://docs.aleph-alpha.com/docs/Deprecated%20Luminous/Deprecated-Luminous/model-card/
"The Luminous family has been trained on a dataset compiled of sources in English, German, French, Spanish and Italian..." more details in model card https://docs.aleph-alpha.com/docs/introduction/model-card/
1069300000000
from the table Total Size: 2.77 + 0.79 + 0.18 + 0.07 + 0.06 + 0.02 = 3.89 TB Tokens: 761.41B + 217.15B + 49.47B + 19.29B + 16.49B + 5.49B = 1069.30B tokens
NVIDIA A100 SXM4 40 GB
NVIDIA A100 SXM4 80 GB
Confident
The Luminous series is a family of large language models. Large language models are powerful technological tools that can process and produce text. These capabilities emerge during model “training” where the model is exposed to significant amounts of human text data. Similar to a person who deliberately absorbs information while reading a whole library and half of the internet, large language models acquire structural understanding (and not necessarily also knowledge) of language and accumulated
API access
Germany
512
Unreleased
Industry
Yuan 1.0
Language
Language modeling
Inspur
Shaohua Wu, Xudong Zhao, Tong Yu, Rongguo Zhang, Chong Shen, Hongli Liu, Feng Li, Hong Zhu, Jiangang Luo, Liang Xu, Xuanwei Zhang, Jun Liu
2021-10-12
Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning
https://arxiv.org/abs/2110.04725
51.00
245730000000.00
Table 2: Parameters of Yuan models. "Parameters (billion)"
3.5380000000001e+23
Table 9: 4095 petaFLOPS-days which equals 3.538*10^23 FLOP https://www.wolframalpha.com/input?i=4095+petaFLOPS+*+1+day
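The petaFLOPS-day conversion above, written out (1 petaFLOPS-day = 1e15 FLOP/s sustained for 86,400 seconds):

# Convert the reported 4095 petaFLOPS-days into total FLOP.
petaflops_days = 4095
total_flop = petaflops_days * 1e15 * 24 * 3600
print(f"{total_flop:.3e}")     # ~3.538e23 FLOP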
Common Crawl
Wikipedia
Sogou News
"A Chinese corpus with 5TB high-quality text is built, which is sufficient to train Yuan 245B model without sampling the dataset twice." In order to obtain the high-quality dataset, we develop a Massive Data Filtering System (MDFS) built on Spark to clean and filter the raw data, and train a Bert-based model to select high quality samples. MDFS is consisted of three parts, data collection, coarse filtering and fine filtering (Fig. 5). The raw data is collected from Common Crawl, Sogou News, Sog
1000000000000
"Yuan 1.0 was trained on a new Chinese dataset of 5TB high-quality text that was built on 850TB raw data from Internet." 1 GB ~ 167M words in English or 333M words in Chinese. For a mixed dataset of mostly Chinese, 5TB may be equivalent to around 1T words. Table 2: 180B training tokens
Confident
Recent work like GPT-3 has demonstrated excellent performance of Zero-Shot and Few-Shot learning on many natural language processing (NLP) tasks by scaling up model size, dataset size and the amount of computation. However, training a model like GPT-3 requires huge amount of computational resources which makes it challengeable to researchers. In this work, we propose a method that incorporates large-scale distributed training performance into model architecture design. With this method, Yuan 1.0
API access
China
2128
Unreleased
https://github.com/Shawn-IEITSystems/Yuan-1.0
Industry
AlphaGo Zero
Games
Go
DeepMind
D Silver, J Schrittwieser, K Simonyan, I Antonoglou
2017-10-18
Mastering the game of Go without human knowledge
https://www.nature.com/articles/nature24270
8795.00
46400244.00
Quick calculation
3.41e+23
source: https://docs.google.com/spreadsheets/d/1Kj4Q5WADcDXtUJLIOfGTCE3tGvxNczEMwyy8QtgSkHk/edit#gid=54587040&fvid=1361937389 AGZ had two models, one of which was small and another of which was large. The compute for AGZ is for the large model, which has 40 residual blocks instead of 20. A second way of looking at this... we believe multiple TPUs were used for training. 29 million games * 211 moves per game on average * 0.8 seconds per move = 4.8952E+09 seconds of player-time across all TPUs.
5800000000
"Over the course of training, 29 million games of self-play were generated" Approx 200 moves per Go game on average https://homepages.cwi.nl/~aeb/go/misc/gostat.html Thus 200 * 29e6 = 5.8e9
Google TPU v1
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on rein
Unreleased
United Kingdom of Great Britain and Northern Ireland
Unreleased
Industry
Qwen1.5-14B
Language
Chat
Language modeling/generation
Quantitative reasoning
Code generation
Translation
Alibaba
Qwen Team
2024-02-04
Introducing Qwen1.5
https://huggingface.co/Qwen/Qwen1.5-14B
14000000000.00
14B
3.36e+23
6*14*10^9*4*10^12 = 3.36e+23
Unspecified unreleased
"We pretrained the models with a large amount of data, and we post-trained the models with both supervised finetuning and direct preference optimization."
4000000000000
4 trillion tokens from this response https://github.com/QwenLM/Qwen2/issues/97
Confident
Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. In comparison with the previous released Qwen, the improvements include: 8 model sizes, including 0.5B, 1.8B, 4B, 7B, 14B, 32B and 72B dense models, and an MoE model of 14B with 2.7B activated; Significant performance improvement in human preference for chat models; Multilingual support of both base and chat models; Stable support of 32K context length for models of all si
Open weights (unrestricted)
China
Unreleased
https://huggingface.co/Qwen/Qwen1.5-14B
Industry
Chameleon (7B)
Multimodal
Image generation
Vision
Language
Language modeling/generation
Vision-language generation
Visual question answering
Text-to-image
Facebook AI Research
Srinivasan Iyer, Bernie Huang, Lili Yu, Arun Babu, Chunting Zhou, Kushal Tirumala, Xi Victoria Lin, Hu Xu, Xian Li, Akshat Shrivastava, Omer Levy, Armen Aghajanyan, Ram Pasunuru, Andrew Cohen, Aram H. Markosyan, Koustuv Sinha, Xiaoqing Ellen Tan, Ivan Evtimov, Ping Yu, Tianlu Wang, Olga Golovneva, Asli Celikyilmaz, Pedro Rodriguez, Leonid Shamis, Vasu Sharma, Christine Jou, Karthik Padthe, Ching-Feng Yeh, Mingda Chen, Bapi Akula, Jacob Kahn, Daniel Li, Scott Yih, Barlas Oguz, Morteza Behrooz, Be
2024-05-16
Chameleon: Mixed-Modal Early-Fusion Foundation Models
https://arxiv.org/abs/2405.09818v1
7000000000.00
3.3399700602e+23
GPU method: Table 2 shows that 7B model pre-training uses 856481 GPU-hours, trained across 1024 A100s 3.12e14 * 856481 * 3600 * 0.3 = 2.89e23 Parameter-token method: Pre-training goes over 9.2T tokens, post-training only goes over 1.1B tokens (sum of tokens column in Table 3) 6 * 7B * 9.2T = 3.86e23 Geometric mean: sqrt(2.89e23 * 3.86e23) = 3.34e23
Unspecified unreleased
Pre-training: - 2.9 trillion tokens of pure text - 1.4 billion text-image pairs, which produces 1.5 trillion text-image tokens - Since each image is 1024 tokens, implies 1.43 trillion image tokens and 0.07 trillion text tokens - 400 billion tokens of image-text interleaved documents - Difficult to estimate image-to-text ratio, but references OBELICS paper which had 141 million web pages, 353 million associated images, and 115 billion text tokens. - 353 million * 1024 = 361.5 billion image tok
4400000000000
Slightly conflicting info. Pre-training data details describe different types of data that sum to 4.8 trillion tokens, but Table 1 indicates 4.4T. Using table values as this agrees with other statements about epochs and total tokens seen.
NVIDIA A100 SXM4 80 GB
Confident
We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-fo
Open weights (non-commercial)
United States of America
Not enough info to estimate. GPU time given for pretraining, and while we know # of fine-tuning tokens we don't know # of epochs.
Unreleased
https://ai.meta.com/resources/models-and-libraries/chameleon-downloads/?gk_enable=chameleon_web_flow_is_live "The models we’re releasing today were safety tuned and support mixed-modal inputs and text-only output to be used for research purposes. While we’ve taken steps to develop these models responsibly, we recognize that risks remain. At this time, we are not releasing the Chameleon image generation model."
Industry
Qwen2.5-3B
Language
Language modeling/generation
Question answering
Quantitative reasoning
Alibaba
Qwen Team
2024-09-19
Qwen2.5: A Party of Foundation Models!
https://qwenlm.github.io/blog/qwen2.5-llm/
3090000000.00
3.09B
3.3372e+23
Training dataset size was 18 trillion 6ND = 6 * 3.09 billion parameters * 18 trillion tokens = 3.3372e+23
Unspecified unreleased
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens."
18000000000000
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens"
Confident
In the past three months since Qwen2’s release, numerous developers have built new models on the Qwen2 language models, providing us with valuable feedback. During this period, we have focused on creating smarter and more knowledgeable language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5. We are announcing what might be the largest opensource release in history! Let’s get the party started!
Open weights (non-commercial)
China
Unreleased
Qwen Research license
Industry
Galactica
Language
Biology
Language modeling
Question answering
Meta AI
Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, Robert Stojnic
2022-11-16
Galactica: A Large Language Model for Science
https://arxiv.org/abs/2211.09085
599.00
120000000000.00
"The largest 120B model we train runs on a single NVIDIA A100 node"
3.24e+23
Authors state the model is trained on 450b tokens. Using 6 FLOP/token/parameter, this is 6*120b*450b = 3.24e23
Galactica Corpus
"Our corpus consists of 106 billion tokens from papers, reference material, encyclopedias and other scientific sources. We combine natural language sources, such as papers and textbooks, and natural sequences, such as protein sequences and chemical formulae. We process LATEX where we can capture it, and also include academic code to capture computational science"
106000000000
"Total dataset size = 106 billion tokens"
NVIDIA A100 SXM4 80 GB
Likely
Information overload is a major obstacle to scientific progress. The explosive growth in scientific literature and data has made it ever harder to discover useful insights in a large mass of information. Today scientific knowledge is accessed through search engines, but they are unable to organize scientific knowledge alone. In this paper we introduce Galactica: a large language model that can store, combine and reason about scientific knowledge. We train on a large scientific corpus of papers,
Open weights (non-commercial)
United States of America
128
Unreleased
cc-by-nc (non-commercial): https://huggingface.co/facebook/galactica-120b repo but no training code: https://github.com/paperswithcode/galai/blob/main/README.md
Industry
InstructGPT 175B
Language
Language modeling/generation
OpenAI
Long Ouyang, Pamela Mishkin, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, John Schulman, Amanda Askell, Fraser Kelton, Peter Welinder, Luke Miller, Maddie Simens, Paul Christiano, Ryan Lowe, Chong Zhang, Jacob Hilton, Sandhini Agarwal, Katarina Slama, Alex Ray, Jan Leike
2022-01-27
Training language models to follow instructions with human feedback
https://arxiv.org/pdf/2203.02155
9228.00
175000000000.00
"We train three model sizes (1.3B, 6B, and 175B parameters)"
3.19181e+23
"training our 175B PPO-ptx model requires 60 petaflops/s-days, compared to 3,640 petaflops/s-days for GPT-3 (Brown et al., 2020)" 60/3640 = +1.65% to base model compute base model was reported 3.14e+23 FLOP 3.14e+23 * 1.0165 = 319181000000000000000000
374000033207
Table 6 - describes **number of prompts** 26584 + 6623 = 33207 This is added to GPT-3 dataset size.
Confident
Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the
United States of America
GPT-3 175B (davinci)
5.181e+21
Industry
GPT-3 175B (davinci)
Language
Text autocompletion
Language modeling/generation
OpenAI
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
2020-05-28
Language Models are Few-Shot Learners
https://arxiv.org/abs/2005.14165
32643.00
175000000000.00
"we train GPT-3, an autoregressive language model with 175 billion parameters"
3.14e+23
Table D.1 https://arxiv.org/abs/2005.14165
Common Crawl
WebText2
Wikipedia
Books1
Books2
Table 2.2 (other datasets also used)
374000000000
From table 2.2, we determine that there are 410 + 19 + 12 + 55 + 3 = 499 billion tokens. We multiply this by 0.75 to give 374B words. 3.74e11 ======================== [Anson: I think the calculation below doesn't look at all the data, the CommonCrawl data only constitutes 60% of the data. Multiplying by 5/3 gives 4.75e11] "The CommonCrawl data was downloaded from 41 shards of monthly CommonCrawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB aft
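The token sum and word conversion above as a short calculation; the per-source token counts are the Table 2.2 figures from the note, matched to the dataset names listed for this record, and 0.75 words/token is the rule of thumb used throughout these notes:

# GPT-3 training tokens from Table 2.2, converted to words.
tokens_by_source = {"Common Crawl": 410e9, "WebText2": 19e9,
                    "Books1": 12e9, "Books2": 55e9, "Wikipedia": 3e9}
total_tokens = sum(tokens_by_source.values())      # 499e9 tokens
words = total_tokens * 0.75                        # ~374e9 words
print(f"{total_tokens:.0e} tokens, {words:.0e} words")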
NVIDIA Tesla V100 DGXS 32 GB
Confident
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to
API access
United States of America
10000
Unreleased
https://openai.com/blog/openai-api
Industry
InternLM2-20B
Language
Chat
Language modeling/generation
Question answering
Shanghai AI Lab
SenseTime
Chinese University of Hong Kong (CUHK)
Fudan University
Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li
2024-01-12
InternLM2 Technical Report
https://arxiv.org/abs/2403.17297
20000000000.00
20B
3.12e+23
6ND = 6 * 2600000000000 * 20000000000 = 3.12e+23
Unspecified unreleased
"The text data in our pre-training dataset can be categorized by source into web pages, papers, patents, and books. To transform these sources into a pre-training dataset, we first standardize all data into a specified format, categorize them by type and language, and store them in JSON Lines (jsonl) format. Then, for all data, we apply several processing steps including rule-based filtering, data deduplication, safety filtering, and quality filtering. This results in a rich, safe, and high-qual
2600000000000
"The total number of tokens used for pre-training the 1.8B, 7B, and 20B models ranges from 2.0T to 2.6T, and the pre-training process consists of three distinct phases. "
Confident
The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization tech
Open weights (restricted use)
China
Hong Kong
China
Hong Kong
China
China
Unreleased
need to apply for commercial license. there's a repo but doesn't look like there's pretraining code. https://github.com/InternLM/InternLM
Academia
Industry
Academia
Academia
CodeFuse-13B
Language
Code generation
Ant Group
Peng Di, Jianguo Li, Hang Yu, Wei Jiang, Wenting Cai, Yang Cao, Chaoyu Chen, Dajun Chen, Hongwei Chen, Liang Chen, Gang Fan, Jie Gong, Zi Gong, Wen Hu, Tingting Guo, Zhichao Lei, Ting Li, Zheng Li, Ming Liang, Cong Liao, Bingchang Liu, Jiachen Liu, Zhiwei Liu, Shaojun Lu, Min Shen, Guangpei Wang, Huan Wang, Zhi Wang, Zhaogui Xu, Jiawei Yang, Qing Ye, Gehao Zhang, Yu Zhang, Zelin Zhao, Xunjin Zheng, Hailian Zhou, Lifu Zhu, Xianying Zhu
2023-10-10
CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model
https://arxiv.org/abs/2310.06266
3.00
13000000000.00
3.09e+23
"CodeFuse-13B was trained using 512 Nvidia A100 GPU cards, with a Hardware FLOPs Utilization (HFU) of approximately 60%. The training process took approximately 40 days to complete." Later they state utilization of 56% 512 * 312 trillion * 40 * 24 * 3600 * 0.56 = 3.09e23 Using params*tokens, we have 13 billion * 1 trillion * 6 = 7.8e22. might be a sign of multiple epochs? 1T is the size of the dataset; they don't clearly state the number of training tokens
The Stack
GitHub
80% code, 10% English, 10% Chinese: "The pre-training data for CodeFuse consists of 196TB of code, 1.75TB of Chinese raw data, and 1.7TB of English raw data, totaling 200TB, that are tokenized into 800 billion tokens of code, 100 billion tokens of Chinese corpus, and 100 billion tokens of English corpus (see Section 3.1)." "We collected about 200+ TB of code-related data, and finally refined it to around 1.6TB (1T Token) of clean data suitable for pre-training."
1000000000000
1T tokens, mostly code but some Chinese/English
NVIDIA A100 SXM4 80 GB
Confident
Code Large Language Models (Code LLMs) have gained significant attention in the industry due to their wide applications in the full lifecycle of software engineering. However, the effectiveness of existing models in understanding non-English inputs for multi-lingual code-related tasks is still far from well studied. This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM. It is specifically designed for code-related tasks with both English and Chinese prompts and supports over 4
Open weights (unrestricted)
China
512
Unreleased
apache: https://github.com/codefuse-ai/codefuse-chatbot?tab=License-1-ov-file#readme
Industry
Gemma 1.1 7B Instruct
Language
Language modeling/generation
Question answering
Google
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot and et al.
2024-02-24
https://huggingface.co/google/gemma-1.1-7b-it
8540000000.00
Safetensors Model size 8.54B params
3.0744e+23
6ND = 6*6000000000000*8540000000=3.0744e+23
Unspecified unreleased
"These models were trained on a dataset of text data that includes a wide variety of sources, totaling 6 trillion tokens. Here are the key components: Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. Primarily English-language content. Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code or understand code-related
6000000000000
"These models were trained on a dataset of text data that includes a wide variety of sources, totaling 6 trillion tokens. "
Google TPU v5e
Confident
This is Gemma 1.1 7B (IT), an update over the original instruction-tuned Gemma release. Gemma 1.1 was trained using a novel RLHF method, leading to substantial gains on quality, coding capabilities, factuality, instruction following and multi-turn conversation quality. We also fixed a bug in multi-turn conversations, and made sure that model responses don't always start with "Sure,". We believe this release represents an improvement for most use cases, but we encourage users to test in their p
Open weights (restricted use)
United States of America
Unreleased
https://huggingface.co/google/gemma-1.1-7b-it "This repository is publicly accessible, but you have to accept the conditions to access its files and content."
Industry
Gemma 7B
Language
Language modeling/generation
Chat
Code generation
Question answering
Quantitative reasoning
Google DeepMind
Gemma Team, Google DeepMind
2024-02-21
Gemma: Open Models Based on Gemini Research and Technology
https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf
8538074112.00
Table 2, sum of embedding and non-embedding parameters
3.07e+23
6ND aproximation 6*8.54B*6T = 3.07e23 "Gemma 2B and 7B are trained on 2T and 6T tokens respectively of primarily-English data from web documents, mathematics, and code." As confirmation: "We estimate the carbon emissions from pretraining the Gemma models to be ∼ 131 𝑡𝐶𝑂2𝑒𝑞. " U.S. avg CO2 per kWh is ~0.87lbs 131 tCO2 * 2000 lb/t * (1 kWh/0.87lb) = 3.01e5 kWh Per SemiAnalysis TPU v5e uses ~ 5x less power than H100, so ~140 W TDP 3.01e5 kWh * 1000 W/kW * 1 TPUv5e/140 W = 2.15e6 TPUv5e-ho
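A sketch of the carbon-emission cross-check above; the 0.87 lb CO2/kWh grid intensity and the ~140 W per TPU v5e are the assumptions stated in the note:

# Rough TPU-hours implied by the reported ~131 tCO2e for Gemma pretraining.
tco2 = 131
lbs_co2 = tco2 * 2000                  # short tons to pounds
kwh = lbs_co2 / 0.87                   # assumed US-average 0.87 lb CO2 per kWh
tpu_v5e_watts = 140                    # assumed per-chip power draw
tpu_hours = kwh * 1000 / tpu_v5e_watts # ~2.15e6 TPUv5e-hours
print(f"{kwh:.2e} kWh, {tpu_hours:.2e} TPU-hours")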
Unspecified unreleased
"Gemma 2B and 7B are trained on 2T and 6T tokens respectively of primarily-English data from web documents, mathematics, and code."
6000000000000
"Gemma 2B and 7B are trained on 2T and 6T tokens respectively of primarily-English data from web documents, mathematics, and code." Not explicitly stated that this doesn't involve multiple epochs, but I expect it does not.
Google TPU v5e
Confident
Open weights (restricted use)
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
4096
Unreleased
https://ai.google.dev/gemma/terms no illegal use or abuse
Industry
Granite 20B
Language
Language modeling/generation
IBM Research
IBM Research
2024-05-31
Granite Foundation Models
https://www.ibm.com/downloads/documents/us-en/10a99803c92fdb35
20000000000.00
3.0000000000001e+23
6*2500000000000*20000000000=3e+23
Stack Exchange
Common Crawl
Wikimedia
"For English and code, we used Wikimedia, Stack Exchange, and commoncrawl. For multilingual data, we used portions of commoncrawl."
2500000000000
For pre-training, we used 0.5 trillion English, 0.4 trillion multilingual (es, fr, de, pt), and 1.6 trillion code tokens.
Confident
We introduce the Granite series of decoder-only foundation models for generative artificial intelligence (AI) tasks that are ready for enterprise use. We report on the architecture, capabilities, underlying data and data governance, training algorithms, compute infrastructure, energy and carbon footprint, testing and evaluation, socio-technical harms and mitigations, and usage policies.
Open weights (unrestricted)
United States of America
Multinational
Industry
Stable LM 2 12B
Language
Language modeling/generation
Translation
Stability AI
2024-04-08
Introducing Stable LM 2 12B
https://stability.ai/news/introducing-stable-lm-2-12b https://huggingface.co/stabilityai/stablelm-2-12b
12143605760.00
Precise number given in HF model card
2.91e+23
6ND = 6 * 12,143,605,760 params * 2T tokens * 2 epochs = 2.91e23. Trained on 384 H100s (AWS P5 instances).
RefinedWeb
RedPajama-Data
The Pile
StarCoder
CulturaX
The dataset is comprised of a filtered mixture of open-source large-scale datasets available on the HuggingFace Hub: Falcon RefinedWeb extract (Penedo et al., 2023), RedPajama-Data (Together Computer., 2023) and The Pile (Gao et al., 2020) both without the Books3 subset, and StarCoder (Li et al., 2023). We further supplement our training with multi-lingual data from CulturaX (Nguyen et al., 2023) and, in particular, from its OSCAR corpora, as well as restructured data in the style of Yuan & Liu
2000000000000
2T tokens
NVIDIA H100 SXM5 80GB
Confident
Introducing the latest additions to our Stable LM 2 language model series: a 12 billion parameter base model and an instruction-tuned variant, trained on 2 trillion tokens in seven languages: English, Spanish, German, Italian, French, Portuguese, and Dutch. This medium-sized model balances strong performance, efficiency, memory requirements, and speed, following our established Stable LM 2 1.6B framework as detailed in our previously released technical report. With this release, we’re extending
Open weights (restricted use)
Multinational
United Kingdom of Great Britain and Northern Ireland
Open source
Requires Stability AI Membership. Free for non-commercial use, $20/month for commercial use if less than $1M in annual revenue, $1M in institutional funding, and 1M monthly active users. Apache 2.0 license for repo, which includes detailed hyperparams and training details: https://github.com/Stability-AI/StableLM/blob/main/LICENSE
Industry
ST-MoE
Language
Language modeling/generation
Google
Google Brain
Google Research
Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus
2022-02-17
ST-MoE: Designing Stable and Transferable Sparse Expert Models
https://arxiv.org/abs/2202.08906v2
117.00
269000000000.00
269B. It is called ST-MoE-32B because its compute cost is comparable to that of a 32B dense model.
2.9000000000000005e+23
The paper claims "scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder". If this is true for training cost, then 6*32e9*1.5e12 = 2.9e23
C4
"The pre-training dataset used to train our Sparse 32B model is a mix of C4 (Raffel et al., 2019) and the dataset introduced in GLaM (Du et al., 2021)."
1500000000000
"We pre-train for 1.5T tokens on a mixture of English-only C4 dataset (Raffel et al., 2019) and the dataset from GLaM (Du et al., 2021) summarized in Appendix E"
Likely
Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a spars
Unreleased
United States of America
United States of America
Multinational
United States of America
Open source
Apache License 2.0 Code for our models is available at https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/moe.py
Industry
Industry
Industry
MegaScale (175B)
Language
Language modeling/generation
ByteDance
Peking University
Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu
2024-02-23
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
https://arxiv.org/abs/2402.15627
40.00
175000000000.00
Two models are trained for one epoch each to evaluate the MegaScale training system: one with 175B parameters and another with 530B. This entry reports the 175B model. There is a third production model mentioned, with fewer details.
2.7385671436e+23
Table 2 gives details for the 175B model. Looking at the largest 1-epoch run with 12288 GPUs: 2166.3 aggregate PFlops/sec * 1.75 days * 24 hours/day * 3600 seconds/hour = 3.275e23. This is consistent with the theoretical computation-counting estimate, if they factor the MFU rate into their 2166.3 figure: 2 × 175B params × 3 × 300B tokens × 1 epoch = 2.29e23. I use the geometric mean of these two: sqrt(3.275e23 * 2.29e23) = 2.74e23
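A sketch of the two estimates above and their geometric mean; the 2166.3 PFLOP/s aggregate throughput and the 1.75-day run come from Table 2 of the paper, and the counting estimate is taken as given in the note:

import math

# Hardware-throughput estimate for the 175B MegaScale run.
aggregate_flops = 2166.3e15           # aggregate FLOP/s across 12288 GPUs
seconds = 1.75 * 24 * 3600
hw_estimate = aggregate_flops * seconds          # ~3.3e23

# Theoretical counting estimate used in the note.
counting_estimate = 2.29e23

# Final figure: geometric mean of the two.
print(f"{math.sqrt(hw_estimate * counting_estimate):.3e}")   # ~2.74e23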
The 175B and 530B models trained for the paper use 300B tokens each.
225000000000
300B tokens * 0.75 words/token = 225B words
NVIDIA A100
Confident
We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pip
Unreleased
China
China
12288
Unreleased
repo, but no training code for the big model https://github.com/volcengine/vescale Model weights unreleased
Industry
Academia
LLaMA-33B
Language
Language modeling
Code generation
Meta AI
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample
2023-02-27
LLaMA: Open and Efficient Foundation Language Models
https://arxiv.org/abs/2302.13971
8872.00
32500000000.00
Table 2 in the paper
2.7300000000000996e+23
1.4T tokens * 32.5B params * 6 FLOP/token/param = 2.73e+23 FLOP
CCNet
GitHub
Wikipedia
books
arXiv
Stack Exchange
See Table 1
1340000000000
Table 1 indicates that the 1.4T training tokens involved sampling sub-datasets at more or less than one epoch. Correcting for this: (1.1 epochs * 3.3 TB) + (1.06 epochs * 0.783 TB) + ... = 5.24 epoch-TB of data sampled to produce the 1.4T tokens seen. At roughly 200M tokens/GB, 5.24 epoch-TB is about 1.05T epoch-tokens, i.e. the data was seen for slightly more than one epoch on average, so one full epoch is roughly 1.34T unique tokens.
Confident
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the resea
Open weights (non-commercial)
United States of America
Unreleased
"we are releasing our model under a noncommercial license focused on research use cases" https://ai.meta.com/blog/large-language-model-llama-meta-ai/
Industry
Whisper v3
Speech
Speech recognition
OpenAI
2023-11-06
https://huggingface.co/openai/whisper-large-v3
1550000000.00
2.7e+23
Could derive this in terms of Whisper v1, which according to the paper was trained for 680k hours for between 2-3 epochs. Whisper v3 was trained on 5 million hours for 2 epochs, or ~5-7x as much data, and has the same architecture. We have an estimate of 4.65e22 for Whisper 1. Assume Whisper v1 was trained on 2.5 epochs, or 2.5*680k = 1.7M hours. Whisper v3 was trained on 10M hours. 10/1.7 * 4.65e22 ~= 2.7e23
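A sketch of the scaling argument above, anchored on the Whisper v1 estimate; the 2.5-epoch assumption for v1 and the same-architecture assumption are taken from the note:

# Whisper v3 compute, scaled from the Whisper v1 estimate by audio hours seen.
whisper_v1_flop = 4.65e22
v1_hours_seen = 2.5 * 680e3        # assumed 2.5 epochs over 680k hours
v3_hours_seen = 2 * 5e6            # 2 epochs over 5M hours
whisper_v3_flop = whisper_v1_flop * v3_hours_seen / v1_hours_seen
print(f"{whisper_v3_flop:.1e}")    # ~2.7e23 FLOP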
Unspecified unreleased
"The Whisper large-v3 model is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using Whisper large-v2"
60000000000
English audio is roughly 228 wpm: https://docs.google.com/document/d/1G3vvQkn4x_W71MKg0GmHVtzfd9m0y3_Ofcoew0v902Q/edit#heading=h.sxcem9l5k3ce The dataset is multilingual and other languages seem to have lower wpms. So using 200 wpm, we have 200*60*5 million hours = 60,000,000,000 (60B) words
Likely
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning. Whisper was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. The original code repository can be found here.
Open weights (unrestricted)
United States of America
Unreleased
Apache 2.0: https://huggingface.co/openai/whisper-large-v3 this seems to be inference code not training: https://github.com/openai/whisper
Industry
Viking
Language
Language modeling/generation
Language generation
Translation
Silo AI
University of Turku
2024-04-04
Viking 33B is a 33B parameter decoder-only transformer pretrained on Finnish, English, Swedish, Danish, Norwegian, Icelandic and code. It is being trained on 2 trillion tokens (700 billion as of this release). Viking 33B is a fully open source model and is made available under the Apache 2.0 License.
https://huggingface.co/LumiOpen/Viking-33B
33000000000.00
2.574e+23
The plan is to train on 2 trillion tokens, but the most recent release is at 1.3T. 6 * 33B * 1.3 trillion = 2.574e23
2000000000000
Viking is being trained on a 2 trillion token mixed dataset of English, Finnish, Swedish, Danish, Norwegian, Icelandic and code.
AMD Radeon Instinct MI250X
Confident
Open weights (unrestricted)
Finland
Finland
1024
Open source
code here: https://github.com/LumiOpen/Megatron-DeepSpeed/blob/main/pretrain_viking_33B.sh
Industry
Academia
Qwen2.5-Coder (7B)
Language
Code generation
Code autocompletion
Quantitative reasoning
Question answering
Language modeling/generation
Alibaba
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, Junyang Lin
2024-09-18
Qwen2.5-Coder Technical Report
https://arxiv.org/abs/2409.12186
7610000000.00
Number of Parameters: 7.61B
2.5113e+23
6ND = 6*7610000000 parameters *5.5T tokens =2.5113e+23
GitHub
Common Crawl
"we constructed a dataset named Qwen2.5-Coder-Data. This dataset comprises five key data types: Source Code Data, Text-Code Grounding Data, Synthetic Data, Math Data, and Text Data."
5500000000000
Confident
In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes two models: Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities w
Open weights (unrestricted)
China
Apache 2.0 https://huggingface.co/Qwen/Qwen2.5-Coder-7B
Industry
Skywork-13B
Language
Language modeling
Language modeling/generation
Translation
Kunlun Inc.
Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Weiwei Lü, Rui Hu, Chenxia Li, Liu Yang, Xilin Luo, Xuejie Wu, Lunan Liu, Wenjun Cheng, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Lei Lin, Xiaokun Wang, Yutuan Ma, Chuanhai Dong, Yanqi Sun, Yifu Chen, Yongyi Peng, Xiaojuan Liang, Shuicheng Yan, Han Fang, Yahui Zhou
2023-10-30
Skywork: A More Open Bilingual Foundation Model
https://arxiv.org/abs/2310.19341
75.00
13000000000.00
13B
2.5e+23
"Our Skywork-13B is trained on a cluster of 64 NVIDIA-HGX-A800 nodes, a total of 512 A800-80G SXM GPUs... The training process of Skywork-13B spanned a total of 39 days." They note that "we achieved a token throughput of 1873 per GPU per second and a model flops utilization (MFU) of 56.5%... ". "MFU" was coined in the Palm paper (https://arxiv.org/pdf/2204.02311.pdf) and only counts operations used to train the model, not all operations observed on the hardware. MFU is lower than traditionall
SkyPile
"In order to train Skywork-13B, we build SkyPile, a vast, high quality corpus comprising more than 6 trillion tokens. A segment of the corpus, comprising over 150 billion tokens of web text, has been open sourced to facilitate research and training on Chinese LLMs"
3180000000000
The full SkyPile dataset is 6 trillion tokens, roughly half English and half Chinese: (https://huggingface.co/Skywork/Skywork-13B-base). The model is trained for the equivalent of 0.53 epochs on the full dataset, or 3.18 trillion unique tokens. This is around 2.78 trillion words, based on an average of 1 word/token for the Chinese portion and 0.75 word/token on the English portion.
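The word-count conversion above as a short calculation, assuming (as stated) roughly 1 word/token for the Chinese half and 0.75 words/token for the English half:

# Unique words in the ~3.18T unique training tokens, split half Chinese / half English.
unique_tokens = 3.18e12
chinese_words = (unique_tokens / 2) * 1.0    # ~1 word per token for Chinese
english_words = (unique_tokens / 2) * 0.75   # ~0.75 words per token for English
print(f"{chinese_words + english_words:.2e}")   # ~2.78e12 words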
NVIDIA A800
Confident
In this technical report, we present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts. This bilingual foundation model is the most extensively trained and openly published LLMs of comparable size to date. We introduce a two-stage training methodology using a segmented corpus, targeting general purpose training and then domain-specific enhancement training, respectively. We show that our model not only
Open weights (restricted use)
China
512
Open (restricted use)
commercial but restrictive license: https://github.com/SkyworkAI/Skywork/blob/main/LICENSE part of the training data is open, but only 2.5%: "In order to train Skywork-13B, we build SkyPile, a vast, high quality corpus comprising more than 6 trillion tokens. A segment of the corpus, comprising over 150 billion tokens of web text, has been open sourced to facilitate research and training on Chinese LLMs" training code: https://github.com/SkyworkAI/Skywork/blob/main/train/train.py
Industry
Qwen-14B
Language
Language modeling/generation
Alibaba
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zha
2023-09-28
Qwen Technical Report
https://arxiv.org/abs/2309.16609
169.00
14000000000.00
14B
2.5e+23
3T tokens per Table 1 14B*3T*6 = 2.5e23
"Our dataset is designed to meet these requirements and includes public web documents, encyclopedia, books, codes, etc. Additionally, our dataset is multilingual, with a significant portion of the data being in English and Chinese."
3000000000000
"We have pretrained the language models, namely QWEN, on massive datasets containing trillions of tokens" Table 1 indicates 3T tokens for Qwen-14B, and the above quote suggests the 3T aren't from multiple epochs on a smaller dataset.
Confident
Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignm
Open weights (restricted use)
China
Unreleased
commercial allowed, can't use to train models https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT
Industry
Granite 13B
Language
Chat
Language modeling/generation
Question answering
Text summarization
IBM
2023-11-30
Granite Foundation Models
https://www.ibm.com/downloads/cas/X9W4O6BM
13000000000.00
13 billion
2.44e+23
Estimate using hardware: "Granite.13b.v1 used 256 A100 GPUs for 1056 hours and 120 TFLOPs. Granite.13b.v2 was trained on the same infrastructure for an additional 1152 hours with 120 TFLOPS, bringing the total to 2208 hours" Seems like 120 TFLOPS is the output per GPU after utilization, though they don't explicitly explain that part. That's 38% utilization. 256 * 2208 * 3600 * 120 TFLOPS = 2.44e23 Using 6ND: "The second version of the granite.13b models leverages an updated base model train
Unspecified unreleased
Common Crawl
arXiv
OpenWebText
"To support the training of large enterprise-grade foundation models, including granite.13b, IBM curated a massive dataset of relevant unstructured language data from sources across academia, the internet, enterprise (e.g., financial, legal), and code." More breakdowns in paper, 20 sources in total https://www.ibm.com/docs/en/cloud-paks/cp-data/4.8.x?topic=models-granite-13b-v1-model-card
2500000000000
2.5T tokens, 1.875T words at 0.75 words/token https://www.ibm.com/docs/en/cloud-paks/cp-data/5.0.x?topic=models-granite-13b-chat-v2-model-card
NVIDIA A100
Likely
We introduce the Granite series of decoder-only foundation models for generative artificial intelligence (AI) tasks that are ready for enterprise use. We report on the architecture, capabilities, underlying data and data governance, training algorithms, compute infrastructure, energy and carbon footprint, testing and evaluation, socio-technical harms and mitigations, and usage policies.
API access
United States of America
Unreleased
Industry
Falcon-40B
Language
Language modeling
Technology Innovation Institute
2023-03-15
Abu Dhabi-based Technology Innovation Institute Introduces Falcon LLM: Foundational Large Language Model (LLM) outperforms GPT-3 with 40 Billion Parameters
https://arxiv.org/abs/2311.16867; https://www.tii.ae/news/abu-dhabi-based-technology-innovation-institute-introduces-falcon-llm-foundational-large
0.00
40000000000.00
Model comes in 7B and 40B variants.
2.4e+23
C = 6ND = 6 * 40B * 1000B = 2.4e+23 FLOP (assuming one epoch) Table 1 from https://arxiv.org/pdf/2311.16867 Falcon paper 2,800 petaflop-days * 1e15 * 24 * 3600 = 2.4192e+23 FLOPs
RefinedWeb
Falcon-40B was trained on 1,000B tokens of RefinedWeb, a high-quality filtered and deduplicated web dataset which we enhanced with curated corpora. Significant components from our curated copora were inspired by The Pile (Gao et al., 2020).
1000000000000
1000B tokens ~= 750B words
NVIDIA A100
Confident
Open weights (unrestricted)
United Arab Emirates
384
Unreleased
apache 2.0
Government
Nanbeige-16B
Language
Chat
Language modeling/generation
Code generation
Question answering
Nanbeige LLM Lab
2023-11-01
https://github.com/Nanbeige/Nanbeige/blob/main/README_EN.md
16000000000.00
16 billion
2.4e+23
"It uses 2.5T Tokens for pre-training". I think that's the number of tokens the model was trained on, not the dataset size, but I'm not sure. 16 billion * 2.5 trillion * 6 = 2.4e23
Unspecified unreleased
"The training data includes a large amount of high-quality internet corpus, various books, code, etc"
2500000000000
"It uses 2.5T Tokens for pre-training"
Likely
Nanbeige-16B is a 16 billion parameter language model developed by Nanbeige LLM Lab. It uses 2.5T Tokens for pre-training. The training data includes a large amount of high-quality internet corpus, various books, code, etc. It has achieved good results on various authoritative evaluation data sets. This release includes the Base, Chat, Base-32k and Chat-32k.
Open weights (unrestricted)
China
Open source
Apache 2.0 training code: https://github.com/Nanbeige/Nanbeige/blob/main/scripts/train.sh
Industry
LightOn Mini
Language
Language modeling/generation
Chat
LightOn
2023-03-21
LightOn's Large Language Model of 40 billion parameters: MINI
https://www.lighton.ai/blog/lighton-s-blog-4/lighton-s-large-language-model-of-40-billion-parameters-mini-19
40000000000.00
"Boasting an impressive 40 billion parameters, Mini is a formidable addition to the growing array of language models available in the market today."
2.4e+23
6ND aproximation: 6*40B*1T = 2.4e23
"The amount of data in Mini corpus is 1 trillion tokens. We mainly used data from the public web to pre-train our model, with strong filtering, toxicity reduction, and deduplication to ensure that only high-quality data is retained."
1000000000000
"The amount of data in Mini corpus is 1 trillion tokens. We mainly used data from the public web to pre-train our model, with strong filtering, toxicity reduction, and deduplication to ensure that only high-quality data is retained." assuming 0.75 words per token - 750000000000.0 words
Confident
Hosted access (no API)
France
Unreleased
Industry
BloombergGPT
Language
Language modeling
Bloomberg
Johns Hopkins University
Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, Gideon Mann
2023-03-30
BloombergGPT: A Large Language Model for Finance
https://arxiv.org/abs/2303.17564
556.00
50558868480.00
2.3599999999999997e+23
2.36e23 per Table 4 (using our usual hardware method, 512 A100s over 53 days would be 512 * 312 teraFLOP/s * 53 * 24 * 3600 * 0.3 = 2.19e23)
"To train BloombergGPT, we construct “FinPile”, a comprehensive dataset consisting of a range of English financial documents including news, filings, press releases, web-scraped financial documents, and social media drawn from the Bloomberg archives. These documents have been acquired through our business process over the past two decades. We augment FinPile with public data widely used to train LLMs. The result is a training corpus that is roughly half domain-specific text and half general-purp
532000000000
708.9 billion tokens. At 0.75 English words per token, that's 532B words
NVIDIA A100
Confident
The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion
Unreleased
United States of America
United States of America
512
Unreleased
Industry
Academia
YaLM
Language
Language modeling
Chat
Yandex
Mikhail Khrushchev, Ruslan Vasilev, Alexey Petrov, Nikolay Zinov
2022-06-23
Yandex Publishes YaLM 100B. It’s the Largest GPT-Like Neural Network in Open Source
https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6
100000000000.00
100B
2.2e+23
"It took us 65 days to train the model on a pool of 800 A100 graphics cards and 1.7 TB of online texts, books, and countless other sources."
The Pile
YaLM Russian Dataset
"25% The Pile — open English dataset by Eleuther AI team 75% Texts in Russian collected by our team (percentages of the whole dataset are given)" https://github.com/yandex/YaLM-100B?tab=readme-ov-file
300000000000
1.7TB of data, 300B tokens – from the GitHub repo https://github.com/yandex/YaLM-100B. I've assumed that 1 token corresponds to 1 word for Russian text.
NVIDIA A100
Likely
Open weights (unrestricted)
Russia
800
Unreleased
Apache 2.0 for weights. training details, but no code: https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6
Industry
Flamingo
Multimodal
Vision
Language
Video
Visual question answering
Image captioning
DeepMind
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan
2022-04-29
Flamingo: a Visual Language Model for Few-Shot Learning
https://arxiv.org/abs/2204.14198
2473.00
80000000000.00
"We obtain three models, Flamingo-3B, Flamingo-9B and Flamingo-80B" " The Flamingo-80B model builds on top of the frozen Chinchilla 70B language model [42]. Starting from the very first layer and before every seventh transformer blocks, we add a GATED XATTN-DENSE layer attending to the visual inputs; this accounts for 10B additional learned parameters. For simplicity, we refer to this model as simply Flamingo throughout the paper"
2.1897200000000104e+23
1536 TPU v4 chips for 15 days. Assuming 40% utilization: C = 1536 TPU * 275*10^12 FLOP/s/TPU * 15 day * 86400 s/day * 0.40 = 2.2*10^23 FLOP "All training and evaluation was performed on TPUv4 instances. The largest model containing 80 billion parameters is trained on 1536 chips for 15 days and sharded across 16 devices." "All trained parameters and optimizer accumulators are stored and updated in float32; all activations and gradients are computed in bfloat16 after downcasting of parameters fr
MultiModal MassiveWeb
LTIP
VTP
ALIGN
Flamingo was trained on a mixture of web-scraped datasets: 43M pages of text with interleaved images (MultiModal MassiveWeb dataset) 312M image-text pairs (LTIP dataset) 27M video-text pairs (VTP dataset) 1.8B image-alt text pairs (ALIGN dataset) Training dataset size is at least 2.1 billion.
Google TPU v4
Confident
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks t
Unreleased
United Kingdom of Great Britain and Northern Ireland
Chinchilla
1536
Unreleased
Industry
phi-3-small 7.4B
Language
Chat
Language modeling/generation
Microsoft
Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Dan Iter, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin
2024-04-23
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
https://arxiv.org/abs/2404.14219
7400000000.00
7.4B
2.1312000000000003e+23
6ND = 6*7.4B parameters * 4.8T tokens =2.1312e+23
"4.8T tokens total as for phi-3-small"
4800000000000
Confident
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data a
United States of America
Industry
Poro 34B
Language
Code generation
Language modeling/generation
High-Performance Language Technologies (HPLT)
University of Turku
Risto Luukkonen, Jonathan Burdge, Elaine Zosa, Aarne Talman, Ville Komulainen, Väinö Hatanpää, Peter Sarlin, Sampo Pyysalo
2023-12-14
Poro 34B and the Blessing of Multilinguality
https://arxiv.org/abs/2404.01856
34200000000.00
https://huggingface.co/LumiOpen/Poro-34B
2.052e+23
6ND = 6*1T*34.2B = 2.052e+23 "This allowed total training cycle throughput of 49618 TFLOPs and 174378 tokens/second." The training took around 18 months (https://hplt-project.org/deliverables): 49618*18*30*24*3600*10^12 = 2.3149774e+24
mC4
SlimPajama
StarCoder
Dolma
https://huggingface.co/LumiOpen/Poro-34B "The Finnish dataset is a combination of many Finnish resources: Finnish Internet Parsebank mC4 multilingual colossal, cleaned Common Crawl Common Crawl Finnish Finnish Wikipedia Lönnrot Projekti Lönnrot Suomi24 The Suomi 24 Corpus 2001-2020 Reddit r/Suomi submissions and comments STT Finnish News Agency Archive 1992-2018 Yle Finnish News Archive 2011-2018 Yle Finnish News Archive 2019-2020 Yle News Archive Ea
1000000000000
1T tokens, assuming 0.75 word per token "Poro is a 34B parameter decoder-only transformer pretrained on Finnish, English and code. It is being trained on 1 trillion tokens. Poro is a fully open source model and is made available under the Apache 2.0 License."
AMD Radeon Instinct MI250X
Confident
The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing and that it should be possi
Open weights (unrestricted)
Multinational
Finland
512
Apache 2.0 https://huggingface.co/LumiOpen/Poro-34B
Research collective
Academia
AlexaTM 20B
Language
Language modeling
Translation
Question answering
Amazon
Saleh Soltan, Shankar Ananthakrishnan, Jack FitzGerald, Rahul Gupta, Wael Hamza, Haidar Khan, Charith Peris, Stephen Rawls, Andy Rosenbaum, Anna Rumshisky, Chandana Satya Prakash, Mukund Sridhar, Fabian Triefenbach, Apurv Verma, Gokhan Tur, Prem Natarajan
2022-08-02
AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model
https://arxiv.org/abs/2208.01448
73.00
19750000000.00
See Table 1 on p.3 of the paper
2.04374016e+23
Training throughput is reported as 154 TFLOP/s - see p.5 of the paper. "We relied on an internal and optimized version of DeepSpeed that we have since open-sourced (Chiu & Zheng, 2022) to obtain training throughput of up to 154 TFLOPS/GPU on 16 AWS p4d.24xlarge compute instances." Accelerator compute days are reported as 15,360 days - see Table 17 on p.18 of the paper.
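The note above uses reported sustained throughput rather than peak FLOP/s with an assumed utilization: compute ≈ per-GPU throughput × accelerator-days × 86400. A minimal sketch of that arithmetic:

throughput = 154e12        # reported sustained FLOP/s per GPU (p.5 of the paper)
accelerator_days = 15360   # Table 17
compute = throughput * accelerator_days * 86400
print(f"{compute:.6e}")    # ~2.043740e+23 FLOP, matching the recorded value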
mC4
Wikipedia
1319000000000
See Table 2 on p.3 of the paper. 119B Wikipedia tokens + 1.2T mC4 tokens = 1319000000000 tokens
NVIDIA A100
Confident
In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming a much larger 540B P
API access
United States of America
128
https://aws.amazon.com/about-aws/whats-new/2022/11/alexatm-20b-model-available-sagemaker-jumpstart/?nc1=h_ls
Industry
Baichuan2-13B
Language
Chat
Baichuan
Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, JunTao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xi
2023-09-06
Baichuan 2: Open Large-scale Language Models
https://huggingface.co/baichuan-inc/Baichuan2-13B-Base, https://arxiv.org/abs/2309.10305
13000000000.00
2.03e+23
They describe the dataset as having 2.6T tokens, but the checkpoint graph makes it clear that's also the number of tokens the model was trained on. 13b * 2.6t * 6 = 2.03e23
2.6 trillion tokens, bilingual. paper/model card don't give breakdown between English and Chinese
2275000000000
2.6T tokens, or ~2.3T words assuming that the dataset is roughly even English (0.75 words/token) and Chinese (1 word/token) 1.3T Chinese tokens * (1 word/token) = 1.3T Chinese words 1.3T English tokens * (0.75 words/token) = 0.975T English words total: 2.275T, or ~2.3T
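A minimal sketch of the token-to-word conversion in the note above (the 0.75 words/token for English, 1 word/token for Chinese, and the even split are the note's assumptions):

chinese_tokens = 1.3e12
english_tokens = 1.3e12
words = chinese_tokens * 1.0 + english_tokens * 0.75
print(f"{words:.3e}")  # ~2.275e+12 words, i.e. ~2.3T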
Confident
Open weights (restricted use)
China
1024
Unreleased
Baichuan community license, restrictive commercial: https://huggingface.co/baichuan-inc/Baichuan2-13B-Base
Industry
AlphaGo Master
Games
Go
DeepMind
D Silver, J Schrittwieser, K Simonyan, I Antonoglou
2017-10-19
Mastering the game of Go without human knowledge
https://www.nature.com/articles/nature24270
8795.00
2.0001000000000102e+23
This is a guess. There was no single journal publication accompanying this model that gave information about architecture, training time, etc. All I could find was that it has the same architecture as AlphaGo Zero, and that it had roughly the same power consumption as AGZ. See for instance: https://deepmind.com/blog/article/alphago-zero-starting-scratch Since AGZ reaches the ELO of AlphaGo Master in about 25-30 days (60-75% of the total training time), I estimate the compute to be aro
Google TPU v1
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on rein
Unreleased
United Kingdom of Great Britain and Northern Ireland
Unreleased
Industry
ViT-22B
Vision
Object detection
Image classification
Google
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodk
2023-02-10
Scaling Vision Transformers to 22 Billion Parameters
https://arxiv.org/abs/2302.05442v1
428.00
21743000000.00
21.743B, Table 1
1.93248e+23
"ViT-22B was trained using 256 visual tokens per image, where each token represents a 14 × 14 patch extracted from 224 × 224 sized images. ViT-22B is trained for 177k steps with batch size of 65k: approximately 3 epochs" "ViT-22B was trained on 1024 TPU V4 chips for 177K steps" "Using these techniques, ViT-22B processes 1.15k tokens per second per core during training (forward and backward pass) on TPUv4 [...] ViT-22B’s model flops utilization (MFU) is 54.9%" 256 * 177k * 65k = 2.945T tokens
JFT-4B
"Dataset. ViT-22B is trained on a version of JFT (Sun et al., 2017), extended to around 4B images (Zhai et al., 2022a). These images have been semi-automatically annotated with a class-hierarchy of 30k labels"
4000000000
"Dataset. ViT-22B is trained on a version of JFT (Sun et al., 2017), extended to around 4B images (Zhai et al., 2022a). These images have been semi-automatically annotated with a class-hierarchy of 30k labels"
Google TPU v4
Confident
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-2
Unreleased
United States of America
1024
Unreleased
don't see it here: https://github.com/google-research/vision_transformer?tab=readme-ov-file#available-vit-models
Industry
MPT-30B
Language
Language generation
Code generation
MosaicML
2023-06-22
https://huggingface.co/mosaicml/mpt-30b
30000000000.00
30B
1.8900000000001e+23
According to their blog post, "MPT-30B FLOPs ~= 6 * 30e9 [params] * 1.05e12 [tokens] = 1.89e23 FLOPs"
mC4
C4
RedPajama
The Stack
https://www.databricks.com/sites/default/files/inline-images/open-source-foundations-models-1.png
1050000000000
~4T tokens across sources, but only trained on 1.05T of these
NVIDIA H100 SXM5 80GB
Confident
MPT-30B is a decoder-style transformer pretrained from scratch on 1T tokens of English text and code. This model was trained by MosaicML. MPT-30B is part of the family of Mosaic Pretrained Transformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference. MPT-30B comes with special features that differentiate it from other LLMs, including an 8k token context window (which can be further extended via finetuning; see MPT-7B-StoryWriter), suppo
Open weights (unrestricted)
United States of America
512
Open source
apache 2.0 for weights. pretrain code here: https://github.com/mosaicml/llm-foundry/tree/main/scripts/train/yamls/pretrain
Industry
Mamba2-Hybrid
Language
Language modeling/generation
Question answering
NVIDIA
Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, Bryan Catanzaro
2024-06-12
An Empirical Study of Mamba-based Language Models
https://arxiv.org/abs/2406.07887
8660000000.00
Table 6
1.8186e+23
6ND = 6*8660000000.00 parameters * 3500000000000 tokens = 1.8186 × 10^23
Unspecified unreleased
"We train the models discussed in this report on 1.1T and 3.5T token datasets. Both datasets are predecessors of the dataset used to train Nemotron-4 and are comprised of 70% English, 15% non-English, and 15% code."
3500000000000
NVIDIA H100 SXM5 80GB
Likely
Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic computational complexity with sequence length and large inference-time memory requirements from the key-value cache. Moreover, recent studies have shown that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative. In a controlled setting (e.g., same data), however, studies so far have only presented small scale experiments
Open weights (unrestricted)
United States of America
1024
Open source
https://github.com/NVIDIA/Megatron-LM/tree/ssm/examples/mamba Apache 2.0 train script: https://github.com/NVIDIA/Megatron-LM/blob/ssm/examples/mamba/train.sh
Industry
Nemotron-3-8B
Language
Chat
Language generation
Language modeling/generation
Translation
Code generation
Question answering
NVIDIA
2023-11-15
NVIDIA AI Foundation Models: Build Custom Enterprise Chatbots and Co-Pilots with Production-Ready LLMs
https://developer.nvidia.com/blog/nvidia-ai-foundation-models-build-custom-enterprise-chatbots-and-co-pilots-with-production-ready-llms/ https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/nemotron-3-8b-base-4k
8000000000.00
1.8e+23
https://huggingface.co/nvidia/nemotron-3-8b-base-4k "This model was trained on a dataset containing 3.8 Trillion tokens of text" 8 billion * 3.8 trillion * 6 = 1.8e23 Also, using the hardware method: "1,024 A100s were used for 19 days to train the model." 19*1024 * 312 trillion * 24 * 3600 * 0.3 = 1.57e23
Unspecified unreleased
Flan
P3 (Public Pool of Prompts)
"NVIDIA models are trained on a diverse set of public and proprietary datasets. This model was trained on a dataset containing 3.8 Trillion tokens of text. The dataset contains 53 different human languages (including English, German, Russian, Spanish, French, Japanese, Chinese, Italian, and Dutch) and 37 programming languages. The model also uses the training subsets of downstream academic benchmarks from sources like FLANv2, P3, and NaturalInstructions v2"
3800000000000
NVIDIA A100
Confident
Large language models (LLMs) are revolutionizing data science, enabling advanced capabilities in natural language understanding, AI, and machine learning. Custom LLMs, tailored for domain-specific insights, are finding increased traction in enterprise applications. The NVIDIA Nemotron-3 8B family of foundation models is a powerful new tool for building production-ready generative AI applications for the enterprise–fostering innovations ranging from customer service AI chatbots to cutting-edge A
Open weights (restricted use)
United States of America
1024
can't use to train other models: https://developer.download.nvidia.com/ai-foundation-models/nvidia-ai-foundation-models-license-10Nov2023.pdf
Industry
Granite 3.0 2B
Language
Language modeling/generation
Question answering
Translation
Text summarization
Text classification
Code generation
IBM
Granite Team IBM
2024-10-21
Granite 3.0 Language Models
https://github.com/ibm-granite/granite-3.0-language-models/tree/main
2500000000.00
2.5B
1.8e+23
6ND = 6*2.5*10^9*12*10^12 = 1.8e+23 "All our Granite 3.0 models are trained using a compute budget of 8.35 × 10^23 FLOPS." 8.35 × 10^23 * 174.6 (this model's power consumption) / (174.6+757.0+64.5+121.2) = 1.304851e+23 Hardware estimation: 192030*3600*989500000000000*0.3 = 2.0521478e+23
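A minimal sketch of the apportionment in the note above: the reported family-wide compute budget is split across the Granite 3.0 variants in proportion to their reported power consumption. The four figures come from the note; which variant each of the other three figures belongs to is not stated here:

total_budget = 8.35e23                    # reported compute budget for all Granite 3.0 models, FLOP
this_model = 174.6                        # power-consumption figure for this model (from the note)
all_models = [174.6, 757.0, 64.5, 121.2]  # figures for the four models (from the note)
compute = total_budget * this_model / sum(all_models)
print(f"{compute:.6e}")                   # ~1.304851e+23 FLOP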
Unspecified unreleased
Granite 3.0 language models are trained using data from various sources such as unstructured natural language text and code data from the Web curated by IBM, a collection of synthetic datasets generated by IBM, and publicly available high-quality datasets with permissible licenses.
12000000000000
12T tokens
NVIDIA H100 SXM5 80GB
Confident
This report presents Granite 3.0, a new set of lightweight, state-of-the-art, open foundation models ranging in scale from 400 million to 8 billion active parameters. Equipped with native support of multilingual, coding, function calling, and strong safety performance, these models target enterprise use cases, including on-premise and on-device settings. Evaluations on a comprehensive set of tasks demonstrate that our models consistently reach state-of-the-art performance for their size (as sho
Open weights (unrestricted)
United States of America
768
Unreleased
Apache 2.0 license https://huggingface.co/ibm-granite/granite-3.0-2b-instruct
Industry
OLMo 2 Furious 7B
Language
Language modeling/generation
Question answering
Allen Institute for AI
University of Washington
New York University (NYU)
Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg,
2024-12-31
2 OLMo 2 Furious
https://arxiv.org/abs/2501.00656
7000000000.00
7B
1.8e+23
1.8*10^23 FLOPs (Table 6 - developers calculated using 6ND formula)
OLMo-Mix-1124
Dolmino-Mix-1124
Tulu 3
4000000000000
Pretraining Stage 1 (OLMo-Mix-1124): 4 trillion tokens (= 1 epoch). Pretraining Stage 2 (Dolmino-Mix-1124): 50B tokens (3 runs, merged). Post-training (Tulu 3 SFT OLMo mix): SFT + DPO + PPO (preference mix).
NVIDIA H100 SXM5 80GB
Confident
We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes dense autoregressive models with improved architecture and training recipe, pretraining data mixtures, and instruction tuning recipes. Our modified model architecture and training recipe achieve both better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities
Open weights (unrestricted)
United States of America
United States of America
United States of America
Open source
apache 2 https://huggingface.co/allenai/OLMo-2-1124-7B https://github.com/allenai/OLMo
Research collective
Academia
Academia
Yuan 2.0
Language
Language modeling/generation
Translation
Code generation
Inspur
Shaohua Wu, Xudong Zhao, Shenling Wang, Jiangang Luo, Lingjun Li, Xi Chen, Bing Zhao, Wei Wang, Tong Yu, Rongguo Zhang, Jiahua Zhang, Chao Wang
2023-11-27
YUAN 2.0: A Large Language Model with Localized Filtering-based Attention
https://arxiv.org/abs/2311.15786v1
102600000000.00
102.6 billion
1.78e+23
Trained on 288B tokens 6*102.6b*288b = 1.78e23
"The pretraining corpus includes a mix of books, codes, and encyclopedia in both Chinese and English (Table 2)" with synthetic code data: "Code (CN). Considering the diversity of programming tasks, we also build a synthesized instruction dataset with 4 million code samples in Chinese. To cover the concepts involved in programming tasks as many as possible, we collect 15,000 words of programming, computer science, mathematics, and other relevant topics from the Sogou input dictionary. Two topic
288000000000
Most likely the 288B tokens do not represent multiple epochs. As a sense check, Table 2 appears to indicate that 5.73% of pre-training tokens come from synthetically generated text output by GPT-3.5. If the full training corpus is 288B tokens, this would imply ~$24k in API costs at $1.50/1M tokens to generate the data, which seems plausible.
Confident
In this work, the Localized Filtering-based Attention (LFA) is introduced to incorporate prior knowledge of local dependencies of natural language into Attention. Based on LFA, we develop and release Yuan 2.0, a large language model with parameters ranging from 2.1 billion to 102.6 billion. A data filtering and generation method is presented to build pretraining and fine-tuning dataset in high quality. A distributed training method with non-uniform pipeline parallel, data parallel, and optimizer
Open weights (restricted use)
China
commercial ok, but nothing that "may cause harm to the country and society, or for any services that have not undergone security assessment and filing" https://huggingface.co/IEITYuan/Yuan2-102B-hf
Industry
InternVL
Vision
Language
Visual question answering
Image classification
Image captioning
Shanghai AI Lab
Nanjing University
The University of Hong Kong
Tsinghua University
SenseTime
University of Science and Technology of China
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai
2024-01-15
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
https://arxiv.org/abs/2312.14238
14000000000.00
14B
1.744956e+23
Trainable / total parameters: stage 1: 13B / 13B; stage 2: 1B / 14B. Training tokens: stage 1: (28.7-0.5)*0.5*(196/16)^2 + 0.5*(224/16)^2 = 2213B; stage 2: 1.6*(224/16)^2 = 313.6B. Compute: 6*13B*2213B + 6*1B*313.6B = 174495.6*10^18 = 1.744956e+23
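A minimal Python sketch of the staged estimate in the note above: visual tokens per image follow the note's (resolution/16)^2 convention, stage 1 masks 50% of image tokens for most samples, and the two stages use different trainable parameter counts:

tok_196 = (196 / 16) ** 2   # ~150 visual tokens per image at 196x196 (note's convention)
tok_224 = (224 / 16) ** 2   # 196 visual tokens per image at 224x224
# Stage 1: 28.2B samples at 196px with 50% of tokens masked, then 0.5B samples at 224px unmasked
stage1_tokens = (28.7e9 - 0.5e9) * 0.5 * tok_196 + 0.5e9 * tok_224
# Stage 2: 1.6B samples at 224px
stage2_tokens = 1.6e9 * tok_224
# 13B trainable parameters in stage 1, 1B in stage 2
compute = 6 * 13e9 * stage1_tokens + 6 * 1e9 * stage2_tokens
print(f"{compute:.4e}")     # ~1.75e+23 FLOP (the note rounds stage-1 tokens to 2213B, giving 1.744956e+23)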
LAION-COCO
COYO-700M
SBU
Conceptual Captions (CC3M)
Conceptual Captions 12M (CC12M)
Wukong
LAION
2527000000000
Stage 1 " The training involves a total batch size of 164K across 640 A100 GPUs, extending over 175K iterations to process about 28.7 billion samples. To enhance efficiency, we initially train at a 196×196 resolution, masking 50% of image tokens [87], and later switch to 224×224 resolution without masking for the final 0.5 billion samples." Stage 2 1.6B samples "The input images are processed at a resolution of 224×224. For optimization, the AdamW optimizer [98] is employed with β1 = 0.9, β2
NVIDIA A100 SXM4 80 GB
Likely
The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data fr
Open weights (unrestricted)
China
China
Hong Kong
China
China
Hong Kong
China
China
InternViT-6B
LLaMA-7B
640
https://huggingface.co/OpenGVLab/InternVL-14B-224px MIT license
Academia
Academia
Academia
Academia
Industry
Academia
Llama 3.2 3B
Language
Language modeling/generation
Text summarization
Question answering
Quantitative reasoning
Translation
Meta AI
2024-09-24
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/
3210000000.00
https://huggingface.co/meta-llama/Llama-3.2-1B
1.7334e+23
6ND = 6*3210000000.00*9000000000000 = 1.7334e+23. Hardware method: 460000 hours * 3600 s * 133800000000000 FLOP/s * 0.3 = 6.647184e+22
Unspecified unreleased
9000000000000
"Llama 3.2 was pretrained on up to 9 trillion tokens of data from publicly available sources."
NVIDIA H100 SXM5 80GB
Confident
Today, we’re releasing Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B), and lightweight, text-only models (1B and 3B) that fit onto edge and mobile devices, including pre-trained and instruction-tuned versions. The Llama 3.2 1B and 3B models support context length of 128K tokens and are state-of-the-art in their class for on-device use cases like summarization, instruction following, and rewriting tasks running locally at the edge. These models are enabled on day one f
Open weights (restricted use)
United States of America
Unreleased
LLAMA 3.2 COMMUNITY LICENSE AGREEMENT https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE
Industry
Konan LLM 41B
Language
Vision
Language modeling/generation
Konan Technology
Yang Seung-hyun, Wiretin, Changmin, Kim Jong-tae
2023-12-15
Konan LLM: A Korean Large Language Model
https://en.konantech.com/en/llm/konanllm https://techfinch.kr/ai/konan-technology-unveils-konan-llm--its-own-ai-language-model https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE11610127
41000000000.00
1.722e+23
=41000000000*700000000000*6=1.722 × 10^23
Unspecified unreleased
700000000000
https://www.konantech.com/pr/press?number=2628&pn=1&stype2=&sfi=subj&sword= Since 2007, over 20.5 billion pieces of data have been independently collected via the real-time AI analysis service pulseK. Of these, only 2 billion high-quality, large-scale data points were used for training.
Likely
Konan LLM is a large language model developed in-house by Konan Technology. Optimized for super-large AI training, it leverages high-quality, large-scale data and over 20 years of expertise in natural language processing. Konan LLM supports all corporate documentation and creative tasks, leading the way in workplace innovation.
Hosted access (no API)
Korea (Republic of)
Unreleased
Industry
Ovis-7B
Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Han-Jia Ye
2024-06-17
Ovis: Structural Embedding Alignment for Multimodal Large Language Model
https://arxiv.org/abs/2405.20797
12.00
7000000000.00
1.7e+23
Fine-tune: 989500000000000*128*0.3*15*60*60 = 2.0518272e+21. Qwen1.5-7B pretraining FLOP: 1.68e+23. Total: 1.7e+23
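A minimal sketch of the note above: for a finetuned model, the recorded training compute is the base model's pretraining compute plus a hardware-based estimate of the finetune itself (the H100 peak FLOP/s, 128 GPUs, 0.3 utilization, and 15 hours are the note's figures):

finetune = 989.5e12 * 128 * 0.3 * 15 * 3600   # ~2.05e+21 FLOP for the finetune
base = 1.68e23                                # estimated Qwen1.5-7B pretraining compute
total = finetune + base
print(f"{total:.2e}")                         # ~1.70e+23 FLOP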
15M datapoints (image-text pairs)
NVIDIA H100 SXM5 80GB
Unverified
We propose Ovis, a novel MLLM architecture designed to structurally align visual and textual embeddings. Ovis integrates an additional learnable visual embedding table into the visual encoder’s process. To capture rich visual semantics, each image patch indexes the visual embedding table multiple times, resulting in a final visual embedding that is a probabilistic combination of the indexed embeddings.
Qwen1.5-7B
128
PaLI
Language
Vision
Multimodal
Visual question answering
Language modeling/generation
Google
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
2022-09-14
PaLI: A Jointly-Scaled Multilingual Language-Image Model
https://arxiv.org/abs/2209.06794v4
567.00
16900000000.00
3.9b Image Encoder, 14b Multimodal Encoder-Decoder
1.69e+23
Pre-training the ViT component involved 1.1 million steps (they train over 1M steps but run the last 100k twice and then average the two resulting models). Batch size is 16384 and the inputs are 224x224. Table 8 indicates a forward pass with ViT-e/14 on a 224 image takes 1980 GFLOPs, so total training compute for the ViT-e/14 model is: 1980e9 * 16384 * 1.1 million * 3 (account for backward passes) = 1.07e23 In the "overall model" section, they then say: "The largest model, PaLI-17B, is pretraine
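The ViT part of the note above estimates compute from a measured per-example forward-pass cost rather than from parameter counts: forward FLOPs × batch size × steps × 3, where the factor 3 approximates forward plus backward passes. A minimal sketch:

forward_flops = 1980e9   # FLOPs per ViT-e/14 forward pass on a 224x224 image (Table 8)
batch_size = 16384
steps = 1.1e6
compute = forward_flops * batch_size * steps * 3   # x3 ~ forward + backward
print(f"{compute:.3e}")                            # ~1.07e+23 FLOP for the ViT component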
WebLI
"we introduce WebLI, a multilingual imagelanguage dataset built from images and texts available on the public web... Due to the abundance of multilingual content on the internet, the collection process for the WebLI dataset can be scaled to cover 10 billion images and 12 billion alt-texts. In addition to annotation with web text, we use publicly available automatic service to extract OCR annotations on all images, resulting in 29 billion image-OCR pairs. To balance quality and retain scale, we f
1600000000
"During training, the model passes over 1.6B images, one epoch over the entire pretraining dataset"
Google TPU v4
Likely
Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs).
Unreleased
United States of America
1024
Unreleased
Industry
Qwen1.5-7B
Language
Chat
Language modeling/generation
Quantitative reasoning
Code generation
Translation
Alibaba
Qwen Team
2024-02-04
Introducing Qwen1.5
https://huggingface.co/Qwen/Qwen1.5-7B
7000000000.00
7B
1.68e+23
6*7*10^9*4*10^12 = 1.68e+23
Unspecified unreleased
"We pretrained the models with a large amount of data, and we post-trained the models with both supervised finetuning and direct preference optimization."
4000000000000
4 trillion tokens from this response https://github.com/QwenLM/Qwen2/issues/97
Confident
Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. In comparison with the previous released Qwen, the improvements include: 8 model sizes, including 0.5B, 1.8B, 4B, 7B, 14B, 32B and 72B dense models, and an MoE model of 14B with 2.7B activated; Significant performance improvement in human preference for chat models; Multilingual support of both base and chat models; Stable support of 32K context length for models of all si
Open weights (unrestricted)
China
Unreleased
https://huggingface.co/Qwen/Qwen1.5-7B
Industry
Jiutian
Language
Language modeling/generation
China Mobile
2023-10-12
https://www.globaltimes.cn/page/202310/1299716.shtml
13900000000.00
A 13.9B parameter model is mentioned prominently at https://jiutian.10086.cn/portal/#/home (accessed 2025-01-13).
1.668e+23
6*13.9e9*2e12=1.668e23
2000000000000
"Designed to enhance efficiency, the model has trained over 2 trillion tokens"
Likely
China Mobile, the largest telecom operator in the world by subscribers, unveiled its "Jiutian" artificial intelligence (AI) large-scale model on Thursday, which has reportedly won support from large enterprises including China Ocean Shipping (Group) Co and China Railway Construction Co.
China
Industry
Qwen2.5-1.5B
Language
Language modeling/generation
Question answering
Quantitative reasoning
Alibaba
Qwen Team
2024-09-19
Qwen2.5-LLM: Extending the boundary of LLMs
https://qwenlm.github.io/blog/qwen2.5-llm/
1540000000.00
1.54B
1.6632e+23
Training dataset size was 18 trillion tokens. 6ND = 6 * 1.54 billion parameters * 18 trillion tokens = 1.6632e+23
Unspecified unreleased
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens."
18000000000000
"In terms of Qwen2.5, the language models, all models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens"
Confident
In the past three months since Qwen2’s release, numerous developers have built new models on the Qwen2 language models, providing us with valuable feedback. During this period, we have focused on creating smarter and more knowledgeable language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5. We are announcing what might be the largest opensource release in history! Let’s get the party started!
Open weights (unrestricted)
China
Unreleased
Apache 2.0
Industry
AlphaCode
Language
Code generation
DeepMind
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, Oriol Vinyals
2022-02-02
Competition-Level Code Generation with AlphaCode
https://arxiv.org/abs/2203.07814
1013.00
41100000000.00
41.1B. Table 3
1.63944e+23
Figure 7 (a) shows a maximum training compute budget of approx 23000 TPU-days per model. 23000 days * 24 h/day * 3600 sec/h * 2.75e14 FLOP/s * 0.3 utilization = 1.64e23 FLOP
CodeContests
Unspecified unreleased
Looks like evaluation data is released but not pretraining data: "We use large transformer language models to generate code, pre-training them on selected GitHub code and fine-tuning on our curated set of competitive programming problems... A core part of developing our system was ensuring that submissions are rigorously evaluated and that evaluation problems are truly unseen during training, so difficult problems cannot be solved by copying from the training set. Towards this goal, we release
Appendix A of the paper gives details on the pretraining data.
Google TPU v4
Programming is a powerful and ubiquitous problem-solving tool. Developing systems that can assist programmers or even generate programs independently could make programming more productive and accessible, yet so far incorporating innovations in AI has proven challenging. Recent large-scale language models have demonstrated an impressive ability to generate code, and are now able to complete simple programming tasks. However, these models still perform poorly when evaluated on more complex, unsee
Unreleased
United Kingdom of Great Britain and Northern Ireland
3750
Unreleased
Industry
FinGPT-13B
Language
Named entity recognition
Sentiment classification
Language modeling/generation
University of California Los Angeles (UCLA)
Columbia University
New York University (NYU)
Neng Wang, Hongyang Yang, Christina Dan Wang
2023-10-07
FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets
https://arxiv.org/abs/2310.04793; https://github.com/AI4Finance-Foundation/FinGPT
33.00
13000000000.00
Finetunes using LoRA, so only trains 3.67 million parameters
1.6e+23
From Llama 2-13B
Financial sentiment data (for fine-tuning): https://huggingface.co/datasets/FinGPT/fingpt-sentiment-train
NVIDIA GeForce RTX 3090
Likely
In the swiftly expanding domain of Natural Language Processing (NLP), the potential of GPT-based models for the financial sector is increasingly evident. However, the integration of these models with financial datasets presents challenges, notably in determining their adeptness and relevance. This paper introduces a distinctive approach anchored in the Instruction Tuning paradigm for open-source large language models, specifically adapted for financial contexts. Through this methodology, we capi
Open weights (unrestricted)
United States of America
United States of America
United States of America
Llama 2-13B
653248800B
Fine-tuned Llama 2-13B on an RTX 3090 for 17 hours, at a cost of $17. 35.5 trillion FLOP/s * 17 * 3600 * 0.3 = 6.532488e+17
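A minimal sketch of the single-GPU LoRA finetune estimate above (the ~35.5 TFLOP/s RTX 3090 throughput and 0.3 utilization are the note's assumptions); the Llama 2-13B pretraining compute dominates the recorded total:

peak = 35.5e12           # assumed RTX 3090 FLOP/s from the note
hours = 17
utilization = 0.3
finetune = peak * hours * 3600 * utilization     # ~6.5e+17 FLOP
base = 1.6e23                                    # Llama 2-13B pretraining compute
print(f"{finetune:.2e}, {finetune / base:.1e}")  # finetune is ~4e-6 of the base model's compute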
1
Open source
MIT license (though probably subject to Llama 2 license too) https://github.com/AI4Finance-Foundation/FinGPT/blob/master/LICENSE train code: https://github.com/AI4Finance-Foundation/FinGPT/blob/master/fingpt/FinGPT_Benchmark/train.sh
Academia
Academia
Academia
Llama 2-13B
Language
Language modeling
Meta AI
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Mar
2023-07-18
Llama 2: Open Foundation and Fine-Tuned Chat Models
https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/ https://arxiv.org/abs/2307.09288
8056.00
13000000000.00
Llama 2 has been released in 7B, 13B, and 70B variants.
1.6e+23
13 billion * 2 trillion * 6 = 1.6e23
Llama 2 dataset
2 trillion tokens of publicly available text, with no text from Meta's products. "Our training corpus includes a new mix of data from publicly available sources, which does not include data from Meta’s products or services. We made an effort to remove data from certain sites known to contain a high volume of personal information about private individuals. We trained on 2 trillion tokens of data as this provides a good performance–cost trade-off, up-sampling the most factual sources in an effort
2000000000000
2 trillion tokens ~= 1.5 trillion words
NVIDIA A100 SXM4 80 GB
Confident
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closedsource models. We provide a detailed description of our approach to
Open weights (restricted use)
United States of America
Unreleased
Llama 2 license. can't use outputs to train models. https://github.com/meta-llama/llama/blob/main/LICENSE
Industry
Llama Guard
Language
Chat
Meta AI
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Davide Testuggine, Madian Khabsa
2023-12-07
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
https://arxiv.org/abs/2312.06674
201.00
7000000000.00
7B
1.6e+23
1.7e17 finetune compute, plus Llama 2-13B pretrain compute (1.6e+23)
Dataset of prompt-response pairs of human-AI conversations "We leverage the human preference data about harmlessness from Anthropic (Ganguli et al., 2022). From this dataset, we pick the first human prompt and discard the corresponding response from the assistant, as well as all the other turns to create an initial single-turn prompt dataset. Next, we use one of our internal Llama checkpoints to generate a mix of cooperating and refusing responses for these prompts. We employ our expert, in-hou
4096000
14k prompt-response pairs. Based on training details it's trained on ~4M tokens, which is stated to be ~1 epoch: 2 * 4096 * 500 = 4,096,000 (batch size) * (sequence length) * (steps)
NVIDIA A100 SXM4 80 GB
Confident
We introduce Llama Guard, an LLM-based input-output safeguard model geared towards Human-AI conversation use cases. Our model incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks found in LLM prompts (i.e., prompt classification). This taxonomy is also instrumental in classifying the responses generated by LLMs to these prompts, a process we refer to as response classification. For the purpose of both prompt and response classification, we have met
Open weights (restricted use)
United States of America
Llama 2-7B
170000000B
"We train on a single machine with 8xA100 80GB GPUs using a batch size of 2, with sequence length of 4096, using model parallelism of 1 and a learning rate of 2 × 10−6. We train for 500 steps, which corresponds to ∼1 epoch over our training set." 6 * 2*4096*500 * 7 billion = 1.7e17
Unreleased
Llama 2 license https://github.com/facebookresearch/PurpleLlama/tree/main/Llama-Guard
Industry
SparseOPT-175B
Language
Language modeling/generation
Question answering
Institute of Science and Technology Austria (ISTA)
Neural Magic
Elias Frantar, Dan Alistarh
2023-01-02
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
https://arxiv.org/abs/2301.00774
406.00
87500000000.00
1.58e+23
This is a one-shot pruned version of OPT (no retraining); see the OPT dataset.
NVIDIA A100 SXM4 80 GB
We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. We can execute SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, in under 4.5 hours, and can reach 60% unstructured spar
Unreleased
Austria
United States of America
OPT-175B
312000000000000 FLOP/sec [A100 with assumed bf16 precision] * 1 GPU * 4 hours * 3600 sec/hour * 0.3 [assumed utilization] = 1.34784e+18 FLOP OPT-175 estimated compute: 4.3e+23 FLOP
1
Open source
code is Apache 2.0 (but OPT, which you'd need to recreate this model, is non-commercial) https://github.com/IST-DASLab/sparsegpt/blob/master/opt.py
Academia
Industry
StarCoder 2 7B
Language
Code generation
Code autocompletion
Hugging Face
ServiceNow
NVIDIA
BigCode
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muenn
2024-02-29
StarCoder 2 and The Stack v2: The Next Generation
https://arxiv.org/abs/2402.19173
7000000000.00
7B
1.55e+23
estimation is given in Table 6
The Stack v2
See Table 4. The Stack V2 plus some extras. Created from repositories on GitHub with permissive licences.
658580000000
from Table 4
NVIDIA H100 SXM5 80GB
Confident
The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a
Open weights (restricted use)
Multinational
United States of America
United States of America
United States of America
https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement
Industry
Industry
Industry
Multi-Token Prediction 13B
Language
Code generation
Facebook AI Research
Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve
2024-04-30
Better & Faster Large Language Models via Multi-token Prediction
https://arxiv.org/abs/2404.19737
13000000000.00
13B (Figure 1)
1.5364368e+23
"training all models reported in the paper required around 500K GPU hours of computation on hardware of type A100-80GB and H100." A100-80 GB peak FLOP/s [assumed fp16 precision]: 77970000000000 H100 peak FLOP/s [assumed SXM5 TensorCore]: 989000000000000 assuming 50/50 usage: (77970000000000+989000000000000)*0.5*500000hours*3600s*0.3=2.880819e+23 for ALL models in the paper assuming this model has taken around 16% of all used compute (https://docs.google.com/spreadsheets/d/1Yc-HAdYgn6e9SUIliMaQ
209700000000
209.7B (Table S13)
NVIDIA A100 SXM4 80 GB
NVIDIA H100 SXM5 80GB
Likely
Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved do
Unreleased
United States of America
Unreleased
Industry
Hunyuan Video
Video
Video generation
Tencent
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, Weiyan Wang, Wenqing Yu, Xinchi Deng, Yang Li, Yi Chen, Yutao Cui, Yuanbo Peng, Zhentao Yu, Zhiyu He, Zhiyong Xu, Zixiang Zhou, Zunnan Xu, Yang
2024-12-03
HunyuanVideo: A Systematic Framework For Large Video Generative Models
https://www.arxiv.org/abs/2412.03603
13000000000.00
13b
1.4814814999999999e+23
from Figure 10: the optimal model has 13b parameters, 5.8e+07PF (image training) + 7.0e+07PF (video training) of compute and 740B (image tokens) + 928B (video tokens) 5.8e+07PF + 7.0e+07PF = 12.8e+07PF = 12.8*10^7*10^20/(24*3600) = 1.4814815e+23 FLOPs 6ND = 6*13*10^9*(740+928)*10^9 = 1.30104e+23
Unspecified unreleased
"We employ various filters for data filtering and progressively increase their thresholds to build 4 training datasets, i.e., 256p, 360p, 540p, and 720p, while the final SFT dataset is built through manual annotation."
Confident
Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models
Open weights (restricted use)
China
Unreleased
"THIS LICENSE AGREEMENT DOES NOT APPLY IN THE EUROPEAN UNION, UNITED KINGDOM AND SOUTH KOREA" also requires additional licensing in case of massive commercial use https://huggingface.co/tencent/HunyuanVideo/blob/main/LICENSE the code seems to be just inference code not training code
Industry
HyperCLOVA 82B
Language
Language modeling/generation
Chat
Translation
Text classification
NAVER
Search Solutions
Boseop Kim, HyoungSeok Kim, Sang-Woo Lee, Gichang Lee, Donghyun Kwak, Dong Hyeon Jeon, Sunghyun Park, Sungju Kim, Seonhoon Kim, Dongpil Seo, Heungsub Lee, Minyoung Jeong, Sungjae Lee, Minsub Kim, Suk Hyun Ko, Seokhun Kim, Taeyong Park, Jinuk Kim, Soyoung Kang, Na-Hyeon Ryu, Kang Min Yoo, Minsuk Chang, Soobin Suh, Sookyo In, Jinseong Park, Kyungduk Kim, Hiun Kim, Jisu Jeong, Yong Goo Yeo, Donghoon Ham, Dongju Park, Min Young Lee, Jaewook Kang, Inho Kang, Jung-Woo Ha, Woomyoung Park, Nako Sung
2021-09-10
What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers
https://arxiv.org/abs/2109.04650
92.00
82000000000.00
"We introduce a Korean in-context large-scale LM with 82B parameters, i.e., HyperCLOVA. This is the first discovery on near 100B-scale non-English LM." According to media reports, HyperCLOVA has 204B parameters (i.e. a different version than in the paper) https://m.koreaherald.com/view.php?ud=20210525000824
1.476e+23
"For experiments in Section 4, the model trained with 150B is used for fair comparison, because not all models are finished training at the same iteration. However, experiments in Section 5.2 use the model trained with 300B tokens, as HyperCLOVA Studio provided the 39B and 82B models trained with 300B tokens." 82e9 connections * 2 FLOP/connection * 300e9 tokens * 3 backward pass = 1.476e23 FLOP Calculation using GPU time corroborates this: - "Our model is based on megatron-LM (Shoeybi et al.,
Unspecified unreleased
Blog corpus: 273.6 billion tokens; Cafe corpus (online community): 83.3 billion tokens; News corpus: 73.8 billion tokens; Comments (crawled from various platforms): 41.1 billion tokens; KiN (Korean QnA website): 27.3 billion tokens; Modu (collection of five datasets): 6.0 billion tokens; WikiEn, WikiJp (foreign Wikipedia): 5.2 billion tokens; Other unspecified sources: 51.5 billion tokens
300000000000
"However, experiments in Section 5.2 use the model trained with 300B tokens, as HyperCLOVA Studio provided the 39B and 82B models trained with 300B tokens." "We introduce HyperCLOVA, a large-scale Korean in-context learning-based LM with nearly 100B parameters, by constructing a large Korean-centric corpus of 560B tokens." Based on tokenizing the Hyperclova article itself using OpenAI's tiktoken BPE tokenizer (https://github.com/openai/tiktoken), there are 3285 tokens for 1069 words - about 3
NVIDIA A100
Confident
GPT-3 shows remarkable in-context learning ability of large-scale language models (LMs) trained on hundreds of billion scale data. Here we address some remaining issues less reported by the GPT-3 paper, such as a non-English LM, the performances of different sized models, and the effect of recently introduced prompt optimization on in-context learning. To achieve this, we introduce HyperCLOVA, a Korean variant of 82B GPT-3 trained on a Korean-centric corpus of 560B tokens. Enhanced by our Korean
API access
Korea (Republic of)
Korea (Republic of)
1024
Unreleased
"We introduce HyperCLOVA Studio, an interactive prompt engineering interface which provides GUI and API interfaces like the OpenAI playground1"
Industry
Industry
Movie Gen Audio
Audio
Audio generation
Meta AI
Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Mitesh Kumar Singh, Peizhao Zhang, Peter Vajda, Quentin Duval, Rohit Girdhar, Roshan Sumbaly, Sai Saketh Rambhatla, Sam Tsai, Samaneh Azadi, Samyak Datta, Sanyuan Chen, Sea
2024-10-04
Movie Gen: A Cast of Media Foundation Models
https://ai.meta.com/static-resource/movie-gen-research-paper
13000000000.00
13B
1.4e+23
Pretrained for 14 days on 384 H100s, assuming a 0.3 utilization rate: 384 * 989.5e12 FLOP/s * 14 days * 86400 s/day * 0.3 ≈ 1.4e+23 FLOP
It was trained on O(1k) hours of audio
NVIDIA H100 SXM5 80GB
Confident
We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user’s image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generatio
Unreleased
United States of America
384
Industry
Yi 6B
Language
Chat
Language modeling/generation
Translation
Code generation
01.AI
Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai
2023-11-02
Yi: Open Foundation Models by 01.AI
https://arxiv.org/abs/2403.04652
6000000000.00
6B
1.26e+23
6*7*10^9*3*10^12 = 1.26e+23
Unspecified unreleased
3100000000000
"language models pretrained from scratch on 3.1T highly-engineered large amount of data, and finetuned on a small but meticulously polished alignment data."
Confident
The Yi series models are large language models trained from scratch by developers at 01.AI.
Open weights (restricted use)
China
Unreleased
llama license https://huggingface.co/01-ai/Yi-6B no training code
Industry
VARCO LLM 2.0 base
Language
Language modeling/generation
Chat
Translation
Question answering
NCSOFT
2023-08-16
VARCO LLM 2.0 is NCSOFT's large language model that can be applied to the development of natural language processing-based AI services.
https://ncsoft.github.io/ncresearch/varco-llm-details/ https://aws.amazon.com/marketplace/pp/prodview-d7amr4yxpibew?sr=0-3&ref_=beagle&applicationId=AWSMPContessa
13000000000.00
1.248e+23
=1600000000000*6*13000000000=1.248×10^23
"Our LLM is trained with datasets that are either publicly available for pretraining, collected from the Internet or internally constructed,” Jehee Lee, CRO of NCSOFT, told Engadget via email.
1600000000000
https://ncsoft.github.io/ncresearch/varco-llm-details/
Likely
VARCO LLM 2.0 is NCSOFT's large language model that can be applied to the development of various natural language processing-based AI services such as text generation, question answering, chatbots, summarization, and information extraction. NCSOFT's VARCO LLM 2.0 was developed with our own technology, including data construction, pre-training, instruction tuning and alignment tuning. We evaluated VARCO LLM 2.0 on various NLP tasks and its performance has significantly improved compared to VARCO
API access
Korea (Republic of)
Industry
Phi-4-Multimodal
Multimodal
Language
Vision
Speech
Language modeling/generation
Question answering
Visual question answering
Speech recognition
Translation
Audio question answering
Character recognition
Microsoft
Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mahmoud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee, Jinyu Li, Yunsheng Li, Chen Liang, Xihui Lin, Zeqi Lin, Mengchen Liu, Yang Liu, Gilsin
2025-03-03
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
https://arxiv.org/abs/2503.01743
5600000000.00
5.6B 1. base: Phi-4 Mini (3.8b parameters) 2. "The audio encoder and projector introduce 460M parameters while LoRA_A consumes another 460M parameters." 3. "The image encoder and projector introduce 440M model parameters while the vision adapter LoRA_V consumes another 370M model parameters."
1.2117239999999998e+23
1.14e+23 (base model training compute) + 7.1724e+21 (finetune compute) = 1.211724e+23
Unspecified unreleased
"The Phi-4-Multimodal model’s pre-training phase involves a rich and varied dataset, encompassing interleaved image-text documents, image-text pairs, image grounding data, synthetic datasets from OCR of PDFs and realistic images, and synthesized datasets for chart comprehension" "For vision-speech data, Phi-4-Multimodal model is trained on a diverse set of synthetic vision-speech data, covering single-frame and multi-frame scenarios. "
1100000000000
"The pre-training process involves a total of 0.5T tokens, combining both visual and textual elements." "To pre-train the adapter and reduce the modality gap between the speech and text sequences, we curate a dataset of approximately 2M hours of anonymized in-house speech-text pairs with strong/weak ASR supervisions, covering the eight supported languages" "Note that the speech token rate is 80ms, indicating 750 tokens for 1-minute audio." 2*10^6 hours * 60 min / hour * 750 tokens = 90B token
Likely
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding dat
API access
United States of America
Phi-4 Mini
7172400000000B
3.8B frozen parameters (Base LM) 1. Vision-Language Training (0.5T tokens) 810M (440M Image Encoder + Projector + 370M LoRA_V) 6ND = 6*0.5*10^12*810*10^6 = 2.43e+21 2. Multimodal SFT (0.3T tokens) 810M 6ND = 6*0.3*10^12*810*10^6 = 1.458e+21 3. Speech Pre-training (2M hours = 90B tokens, see dataset size notes) 460M (Audio Encoder + Projector) 6ND = 6*90*10^9*460*10^6 = 2.484e+20 4. Speech Post-training (100M samples ~ 1.1T tokens, see dataset size notes) 460M (LoRA_A) 6ND = 6*1.1*1
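A minimal sketch of the staged finetune sum in the note above. Stages 1-3 use the figures given in the note; the stage 4 figures (roughly 1.1T speech post-training tokens against the 460M LoRA_A parameters) are inferred from the truncated text and the recorded 7.1724e+21 total, so treat them as an assumption:

stage1 = 6 * 0.5e12 * 810e6    # vision-language training: 0.5T tokens, 810M trainable params
stage2 = 6 * 0.3e12 * 810e6    # multimodal SFT: 0.3T tokens, 810M trainable params
stage3 = 6 * 90e9 * 460e6      # speech pre-training: 90B tokens, 460M trainable params
stage4 = 6 * 1.1e12 * 460e6    # speech post-training: ~1.1T tokens, 460M params (inferred)
total = stage1 + stage2 + stage3 + stage4
print(f"{total:.4e}")          # ~7.1724e+21 FLOP, matching the recorded finetune compute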
Industry
UL2
Language
Language modeling/generation
Google Research
Google Brain
Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler
2022-05-10
Unifying Language Learning Paradigms
https://arxiv.org/abs/2205.05131v1
253.00
20000000000.00
Taken from Directory of LLMs
1.2e+23
Trained on 1T tokens 20B * 1T * 6 = 1.2e23 Second source: Section 5.1 says model was trained on 512 TPUv4 chips, and took slightly over 1 month 512 * 2.75e14 * 31 * 24 * 3600 * 0.3 = 1.13e23
C4
'The model is trained on a total of 1 trillion tokens on C4 (2 million steps).'
1000000000000
1T tokens
Google TPU v4
Confident
Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspecti
Open weights (unrestricted)
Multinational
United States of America
United States of America
512
Apache 2.0
Industry
Industry
PLaMo-13B
Language
Language modeling/generation
Chat
Question answering
Preferred Networks Inc
Preferred Networks, Inc
2023-09-28
PLaMo-13B
https://huggingface.co/pfnet/plamo-13b
13000000000.00
1.17e+23
6ND = 6*13e9*1.5e12 = 1.17e+23 from https://huggingface.co/pfnet/plamo-13b#model-details 480 GPUs * 30 days [assumed, likely less] * 24 hours * 3600 s * 77970000000000 FLOP/s * 0.41 [reported 41% utilization] = 3.9772934e+22
C4
Project Gutenberg
RedPajama
mC4
Wikipedia (ja)
from https://huggingface.co/pfnet/plamo-13b#training-dataset
1500000000000
Trained tokens: 1.5T tokens (English: 1.32T tokens, Japanese: 0.18T tokens) from https://huggingface.co/pfnet/plamo-13b#model-details 0.75*1.32T + 0.18T = 1170000000000 (0.75 words per token for English, 1 for Japanese)
NVIDIA A100 SXM4 40 GB
Confident
Open weights (unrestricted)
Japan
480
Unreleased
Apache 2.0 for weights. Open data
Industry
IDEFICS-80B
Multimodal
Language
Vision
Language modeling
Image captioning
Visual question answering
Hugging Face
Hugo Laurencon, Daniel van Strien, Stas Bekman, Leo Tronchon, Lucile Saulnier, Thomas Wang, Siddharth Karamcheti, Amanpreet Singh, Giada Pistilli, Yacine Jernite, Victor Sanh
2023-08-22
Introducing IDEFICS: An Open Reproduction of State-of-the-Art Visual Language Model
https://huggingface.co/blog/idefics
80000000000.00
IDEFICS... comes in two variants—the base version and the instructed version. Each variant is available at the 9 billion and 80 billion parameter sizes.
1.1593580544e+23
FLOP = 512 * 312e12 * 28*24*3600 * 0.3 = 1.159e23, i.e. (num GPUs) * (peak performance) * (time in seconds) * (assumed utilization rate). "The IDEFICS models were trained on an AWS SageMaker cluster with 8x80GB A100 GPUs nodes and EFA network. IDEFICS-80B took ~28 days of training on 64 nodes (512 GPUs)." https://huggingface.co/HuggingFaceM4/idefics-80b-instruct Trained on 150B text tokens + images; alternatively, 6ND = 6*734000000000*80*10^9 = 3.5232e+23
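A minimal Python sketch of the hardware-time estimate and the alternative 6ND figure quoted above (0.3 utilization is the note's assumption):
# IDEFICS-80B: hardware-time estimate vs. the alternative 6ND figure (sketch).
c_hw = 512 * 312e12 * 28 * 24 * 3600 * 0.3   # ~1.16e23 FLOP (the recorded estimate)
c_6nd = 6 * 734e9 * 80e9                     # ~3.52e23 FLOP (alternative, not used)
print(f"{c_hw:.2e} {c_6nd:.2e}")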
Wikipedia
Public Multimodal Dataset (PMD)
LAION
OBELICS
IDEFICS was trained on a mixture of openly available datasets: Wikipedia, Public Multimodal Dataset, and LAION, as well as a new 115B token dataset called OBELICS that we created. OBELICS consists of 141 million interleaved image-text documents scraped from the web and contains 353 million images.
734000000000
IDEFICS was trained on a mixture of openly available datasets: Wikipedia, Public Multimodal Dataset, and LAION, as well as a new 115B token dataset called OBELICS that we created. OBELICS consists of 141 million interleaved image-text documents scraped from the web and contains 353 million images. See https://huggingface.co/HuggingFaceM4/idefics-80b-instruct 149.6B tokens and 1.582B images in total. Effective batch size (# of tokens): 3.67M; max training steps: 200K. 3.67*10^6 * 200000 = 734000000000
NVIDIA A100
Confident
We are excited to release IDEFICS (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS), an open-access visual language model. IDEFICS is based on Flamingo, a state-of-the-art visual language model initially developed by DeepMind, which has not been released publicly. Similarly to GPT-4, the model accepts arbitrary sequences of image and text inputs and produces text outputs.
Open weights (non-commercial)
Multinational
United States of America
LLaMA-65B
CLIP ViT-H/14 - LAION-2B
512
Llama license (non commercial)
Industry
Phi-4 Mini
Language
Language modeling/generation
Visual question answering
Code generation
Quantitative reasoning
Translation
Microsoft
Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mahmoud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee, Jinyu Li, Yunsheng Li, Chen Liang, Xihui Lin, Zeqi Lin, Mengchen Liu, Yang Liu, Gilsin
2025-03-03
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
https://arxiv.org/abs/2503.01743
3800000000.00
3.8-billion
1.14e+23
6ND = 6 * 3800000000 parameters * 5000000000000 tokens = 1.14e+23
Unspecified unreleased
5000000000000
"With these techniques, we built the 5 trillion pre-training data corpus"
Confident
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding dat
API access
United States of America
Industry
Meena
Language
Text autocompletion
Google Brain
Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, Quoc V. Le
2020-01-28
Towards a Human-like Open-Domain Chatbot
https://arxiv.org/abs/2001.09977
879.00
2600000000.00
"We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token."
1.12e+23
https://arxiv.org/ftp/arxiv/papers/2104/2104.10350.pdf Table 4 In the paper: "We trained our best model for 30 days on a TPUv3 Pod (2,048 TPU cores) on the Meena dataset containing 40B words (or 61B BPE tokens) [...] by the end of training, the model had traversed the full training set 164 times (or epochs) and observed a total of about 10T tokens" Hardware: 30 * 24 * 3600 * (2048/2) * 1.23e14 * 0.3 = 9.794e22 Ops counting: 6 * 10T * 2.6B = 1.56e23 Geometric mean: sqrt(9.79e22*1.56e23) = 1.24e23
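A minimal Python sketch of the hardware-based and ops-counting estimates and their geometric mean, as in the note:
# Meena: hardware-time and 6ND estimates combined via geometric mean (sketch; 0.3 utilization assumed in the note).
import math
c_hw = 30 * 24 * 3600 * (2048 / 2) * 1.23e14 * 0.3   # ~9.79e22 FLOP
c_ops = 6 * 10e12 * 2.6e9                            # ~1.56e23 FLOP
print(f"{math.sqrt(c_hw * c_ops):.2e}")              # ~1.24e23 FLOP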
40000000000
"The final Meena dataset contains 341GB of text (40B words)" Converting from GB to words yields 6.8e10, which is in the same OOM
Google TPU v3
Confident
We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. We also propose a human evaluation metric called Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation. Our experiments show strong correlation between perplexity and SSA. The fact that the best perplexi
Unreleased
United States of America
1024
Unreleased
Industry
WizardCoder-15.5B
Language
Code generation
Microsoft
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Daxin Jiang
2023-06-14
WizardCoder: Empowering Code Large Language Models with Evol-Instruct
https://arxiv.org/abs/2306.08568
449.00
15500000000.00
15.5B
1.12e+23
1.12e23 base compute (StarCoder estimate) + 1.95e19 finetune compute (see below) ~= 1.12e23
synthetic code data: "To construct the training dataset, we initialized it with the 20K instruction-following dataset called Code Alpaca5. We iteratively employ the Evol-Instruct technique on this dataset consisting of 20,000 samples to produce evolved data"
"The evolved dataset consists of approximately 78k samples" Not sure how big the samples are.
Likely
Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction finetuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, Hu
Open weights (restricted use)
United States of America
StarCoder
19503513600B
"The StarCoder [11] serves as our basic foundation model. The evolved dataset consists of approximately 78k samples. To fine-tune the basic models, we employ specific configurations, including a batch size of 512, a sequence length of 2048, 200 fine-tuning steps, 30 warmup steps, a learning rate of 2e-5, a Cosine learning rate scheduler, and fp16 mixed precision." 512*2048*200 = 209,715,200 training tokens 209715200 * 15.5B * 6 = 1.95e19
Open source
commercial, responsible use restrictions: https://github.com/nlpxucan/WizardLM/blob/main/WizardCoder/MODEL_WEIGHTS_LICENSE code is apache: https://github.com/nlpxucan/WizardLM/blob/main/WizardCoder/CODE_LICENSE training code here: https://github.com/nlpxucan/WizardLM/blob/main/WizardCoder/src/train_wizardcoder.py data non-commercial: https://github.com/nlpxucan/WizardLM/blob/main/WizardCoder/DATA_LICENSE
Industry
OPT-66B
Language
Language modeling
Chat
Language modeling/generation
Question answering
Meta AI
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer
2022-06-21
OPT: Open Pre-trained Transformer Language Models
https://arxiv.org/abs/2205.01068
2932.00
66000000000.00
1.100000000001e+23
OPT-66B was trained for 140k steps, using a batch size of 2M tokens (see the OPT baselines logbook and Table 1 in Zhang et al. (2022), respectively), so training took 140e3 * 2e6 * 66e9 * 6 = 1.1e23 FLOP
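A minimal Python sketch of the calculation above:
# OPT-66B: total tokens from steps x batch size, then 6ND (sketch).
tokens = 140e3 * 2e6                 # 2.8e11 tokens
print(f"{6 * 66e9 * tokens:.2e}")    # ~1.1e23 FLOP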
The Pile
BookCorpus (BooksCorpus, Toronto Book Corpus)
CC-Stories
Pushshift Reddit
C.2 Composition section: – BookCorpus (Zhu et al., 2015) consists of more than 10K unpublished books – CC-Stories (Trinh and Le, 2018) contains a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas – The Pile (Gao et al., 2021a) from which the following was included: * Pile-CC * OpenWebText2 * USPTO * Project Gutenberg * OpenSubtitles * Wikipedia * DM Mathematics * HackerNews – Pushshift.io Reddit dataset that was developed in Baumgartner et al. (2020) and proce
180000000000
"Our final corpus contains roughly 180B tokens."
NVIDIA A100 SXM4 80 GB
Confident
Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M t
Open weights (non-commercial)
United States of America
Open source
non-commercial for weights: https://ai.meta.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/ training code (MIT) https://github.com/facebookresearch/metaseq/blob/main/docs/training.md
Industry
Whisper v2
Speech
Speech recognition
OpenAI
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever
2022-12-05
Robust Speech Recognition via Large-Scale Weak Supervision
https://huggingface.co/openai/whisper-large-v2 https://arxiv.org/abs/2212.04356
2240.00
1550000000.00
1550M
1.1e+23
"Compared to the Whisper large model, the large-v2 model is trained for 2.5x more epochs with added regularization for improved performance." We (roughly) estimated Whisper v1 as 4.65e22. 2.5x that is 1.16e23 or ~1.1e23
Unspecified unreleased
"The models are trained on 680,000 hours of audio and the corresponding transcripts collected from the internet. 65% of this data (or 438,000 hours) represents English-language audio and matched English transcripts, roughly 18% (or 126,000 hours) represents non-English audio and English transcripts, while the final 17% (or 117,000 hours) represents non-English audio and the corresponding transcript. This non-English data represents 98 different languages."
9302400000
"When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zeroshot transfer setting without the need for any finetuning." 13,680 words/h (estimate) * 680,000h = 9,302,400,000 words
Likely
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning. Whisper was proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. The original code repository can be found here. Compared to the Whisper large model, the large-v2 model is trained
Open weights (unrestricted)
United States of America
Unreleased
Apache 2.0 for weights code for v1 is MIT: https://github.com/openai/whisper
Industry
Code Llama-7B
Language
Code generation
Meta AI
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Ellen Tan, Yossef (Yossi) Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Defossez, Jade Copet, Faisal Azhar, Hugo Touvron, Gabriel Synnaeve, Louis Martin, Nicolas Usunier, Thomas Scialom
2023-08-14
Code Llama: Open Foundation Models for Code
https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/ https://arxiv.org/abs/2308.12950
1297.00
7000000000.00
7B
1.1e+23
2.5e22 finetune compute + 8.4e22 base compute for Llama 2-7B, for ~1.1e23 compute overall Table 26: "In aggregate, training all 12 Code Llama models required 1400K GPU hours of computation on hardware of type A100-80GB" Suggests all versions took a combined 4.7e23 FLOPs: 3.12e14 * 1400000 * 3600 * 0.3 = 4.7e23 Assuming this refers to the finetune compute only, agrees with our finetune estimate if compute is proportional to parameter count: 7 / (7+13+34+70) = 0.056 0.056 * 4.7e23 = 2.65e22
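A minimal Python sketch of the apportionment above (0.3 utilization and compute proportional to parameter count are the note's assumptions):
# Code Llama-7B: apportioning the aggregate 1400K A100-hours across model sizes (sketch).
total = 3.12e14 * 1_400_000 * 3600 * 0.3   # ~4.7e23 FLOP for all 12 Code Llama models
share_7b = 7 / (7 + 13 + 34 + 70)          # ~0.056, assuming compute ~ parameter count
finetune_7b = share_7b * total             # ~2.65e22 FLOP
print(f"{finetune_7b + 8.4e22:.2e}")       # plus Llama 2-7B base compute ~= 1.1e23 FLOP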
Unspecified unreleased
"As shown in Table 1, Code Llama is trained predominantly on a near-deduplicated dataset of publicly available code. We also source 8% of our samples data from natural language datasets related to code. This dataset contains many discussions about code and code snippets included in natural language questions or answers. To help the model retain natural language understanding skills, we also sample a small proportion of our batches from a natural language dataset"
600000000000
Llama 2 used 2T tokens, and "We train Code Llama on 500B additional tokens and Code Llama - Python further on 100B tokens" 2T + 500B + 100B = 2600000000000
NVIDIA A100 SXM4 80 GB
Confident
We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters
Open weights (restricted use)
United States of America
Llama 2-7B
25000000000000B
Code Llama-base is trained from Llama 2 with 500B tokens: "We train Code Llama on 500B tokens during the initial phase, starting from the 7B, 13B, and 34B versions of Llama 2" Code Llama-Python required an additional 100B tokens in fine-tuning: "We train Code Llama on 500B additional tokens and Code Llama - Python further on 100B tokens." Code Llama-Instruct is fine-tuned on 5B tokens: "For Code Llama - Instruct, we train with a batch size of 524,288 tokens and on approx. 5B tokens in total."
Unreleased
Llama 2 license. can't use outputs to train models. https://github.com/meta-llama/llama/blob/main/LICENSE
Industry
OpenVLA
Robotics
Vision
Language
Robotic manipulation
Stanford University
University of California (UC) Berkeley
Toyota Research Institute
Google DeepMind
Massachusetts Institute of Technology (MIT)
Physical Intelligence
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, Chelsea Finn
2024-06-13
OpenVLA: An Open-Source Vision-Language-Action Model
https://openvla.github.io/ https://arxiv.org/abs/2406.09246
73.00
7188100000.00
Based on a Prismatic-7B VLM backbone, which itself is comprised of 600M parameter vision encoder (DinoV2 + SigLIP) plus Llama-2 7B. Table 1 indicates 7.1881 billion trainable parameters
1.1e+23
The majority of the compute is from the pre-training embedded in Prismatic-7B and its constituent models. The fine-tuning compute used in this paper is "64 A100 GPUs for 14 days, or a total of 21,500 A100-hours": 21500 * 3600 * 3.12e14 * 0.4 = 9.66e21 Prismatic-7B training took "less than 9 hours" on 8 A100s: 9 * 3600 * 8 * 3.12e14 * 0.4 = 3.23e19 Add in the pre-trained components: - DinoV2 = 7.42e21, per our database - The SigLIP model in question is SoViT-400m/14 from the cited Alabdulmohsin et al
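A minimal Python sketch of the two GPU-time estimates quoted above (0.4 utilization is the note's assumption; pre-trained component compute is added on top as described):
# OpenVLA: GPU-time estimates for the two training runs quoted in the note (sketch).
vla_finetune = 21500 * 3600 * 3.12e14 * 0.4   # ~9.66e21 FLOP (64 A100s, 14 days)
prismatic = 9 * 3600 * 8 * 3.12e14 * 0.4      # ~3.2e19 FLOP (8 A100s, under 9 hours)
print(f"{vla_finetune:.2e} {prismatic:.2e}")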
Open X-Embodiment
"The full OpenX dataset, at the time of writing, consists of more than 70 individual robot datasets, with more than 2M robot trajectories [...] we apply multiple steps of data curation to the raw dataset."
970000
"OpenVLA consists of a pretrained visually-conditioned language model backbone that captures visual features at multiple granularities, fine-tuned on a large, diverse dataset of 970k robot manipulation trajectories from the Open-X Embodiment [1] dataset" Filtered from 2M total in OpenX.
NVIDIA A100
Confident
Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control. Yet, widespread adoption of VLAs for robotics has been challenging as 1) existing VLAs are largely closed and inaccessible to the public, and 2) prior w
Open weights (unrestricted)
United States of America
United States of America
United States of America
United States of America
United Kingdom of Great Britain and Northern Ireland
Multinational
United States of America
United States of America
Llama 2-7B
9660000000000B
"64 A100 GPUs for 14 days, or a total of 21,500 A100-hours" 21500 * 3600 * 3.12e14 * 0.4 = 9.66e21
64
Open source
"OpenVLA uses multiple pretrained model components: SigLIP [9] and DinoV2 [25] vision encoders and a Llama 2 [10] language model backbone. For all three models, weights are open, but not their training data or code. We release training data, code and model weights for reproducing OpenVLA on top of these components." All published material is on an MIT license. train code: https://github.com/openvla/openvla/blob/main/scripts/pretrain.py
Academia
Academia
Industry
Industry
Academia
Industry
BlueLM 7B
Language
Chat
Translation
Language modeling/generation
Question answering
Code generation
vivo AI lab
2023-10-31
BlueLM: An Open Multilingual 7B Language Model
https://github.com/vivo-ai-lab/BlueLM/blob/main/BlueLM_technical_report.pdf
7000000000.00
"BlueLM is a large-scale pre-trained language model independently developed by vivo AI Global Research Institute. This release includes 7B base (base) model and 7B conversation (chat) model. At the same time, we have open sourced the long text base (base) model that supports 32K and conversation (chat) model." from GitHub https://github.com/vivo-ai-lab/BlueLM
1.0920000000001e+23
C = 6DN = 6 * 2.6T * 7B = 1.092*10^23 FLOP https://www.wolframalpha.com/input?i=6*7+billion+*+2.6+trillion (assuming 1 epoch) Figure 1 gives compute of 10^12 FLOPs but this seems improbable Training over 2.59T tokens took approximately 26 days using the vivolm system, with a throughput of 3150 tokens/sec/GPU.
Unspecified unreleased
2592000000000
"Larger amounts of high-quality data : high-quality corpus for training, reaching a scale of 2.6 trillion tokens. The corpus includes Chinese, English and a small amount of Japanese and Korean data" from GitHub see 2.1 https://github.com/vivo-ai-lab/BlueLM/blob/main/BlueLM_technical_report.pdf
Confident
BlueLM is a large-scale open-source language model independently developed by the vivo AI Lab. This release includes 2K and 32K context length versions for both Base and Chat models. High-quality Data: BlueLM is trained on a high-quality data with 2.6 trillion tokens. Our train corpus mainly consists of Chinese and English data, with a small amount of Japanese and Korean data. Stronger Performance: BlueLM-7B-Chat achieves a strong competitive performance in C-Eval and CMMLU benchmarks of the sa
Open weights (restricted use)
China
Unreleased
https://github.com/vivo-ai-lab/BlueLM/blob/main/MODEL_LICENSE_EN.pdf Our code is licensed under the Apache-2.0 and Community License for BlueLM Model. The BlueLM weights are completely open for academic research, and free commercial use is allowed after completing the questionnaire. "BlueLM weights are open for academic research and commercial use."
Industry
Baichuan 2-7B
Baichuan
Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xiaoxi Chen
2023-09-20
Baichuan 2: Open Large-scale Language Models
https://arxiv.org/pdf/2309.10305
405.00
7000000000.00
1.0919999999999998e+23
7B * 2.6T * 6 = 1.092e23. Also mentions 1,024 NVIDIA A800 GPUs at 180 TFLOPS per GPU
2600000000000
NVIDIA A800
Unverified
In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch, on 2.6 trillion tokens.
China
1024
Industry
ruGPT-3.5 13B
Language
Chat
Language modeling/generation
Sber
2023-04-24
ruGPT-3.5 13B
https://huggingface.co/ai-forever/ruGPT-3.5-13B
13000000000.00
1.0699776e+23
"Model was trained using Deepspeed and Megatron libraries, on 300B tokens dataset for 3 epochs, around 45 days on 512 V100. After that model was finetuned 1 epoch with sequence length 2048 around 20 days on 200 GPU A100 on additional data" 512 GPUs * 125000000000000 FLOPs/s [peak] * 45 days * 24 hours * 3600 s * 0.3 + 200 GPUs * 312000000000000 FLOPs/s [peak for fp16] * 20 days * 24 hours * 3600 s * 0.3 = 1.0699776e+23 they probably used fp16 as in their similar project: https://habr.com/ru/co
300000000000
NVIDIA A100
NVIDIA Tesla V100 SXM2
Confident
Open weights (unrestricted)
Russia
512
Unreleased
MIT license https://huggingface.co/ai-forever/ruGPT-3.5-13B/discussions
Industry
Government
OLMo-7B
Language
Language modeling/generation
Chat
Allen Institute for AI
University of Washington
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, N
2024-02-01
OLMo: Accelerating the Science of Language Models
https://arxiv.org/abs/2402.00838v1
7000000000.00
1.0332e+23
Direct calculation: 6 * 7B * 2.46 trillion = 1.0332 × 10^23 (calculation also reproduced by the developers in https://arxiv.org/pdf/2501.00656)
Dolma
2000000000000
"We built our training dataset out of a 2T-token sample from our open dataset, Dolma [...] All of our released models have been trained to at least 2T tokens (a single epoch over our training data), and some have been trained beyond that by starting a second epoch over the data with a different shuffling order" Table 1 indicates total tokens seen are 2.46T for the 7B parameter model, though note that a later release in July 2024 has been trained to 2.75T tokens: https://github.com/allenai/OLMo
AMD Radeon Instinct MI250X
NVIDIA A100
Confident
Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community
Open weights (unrestricted)
United States of America
United States of America
Open source
License: The code and model are released under Apache 2.0. Weights https://huggingface.co/allenai/OLMo-7B Code https://github.com/allenai/OLMo
Research collective
Academia
DeepSeekMath 7B
Language
Quantitative reasoning
DeepSeek
Tsinghua University
Peking University
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo
2024-02-05
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
https://arxiv.org/abs/2402.03300
7000000000.00
"Our model is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024) and trained for 500B tokens."
1.014e+23
8.04e+22 (base model) + 2.1e+22 (fine-tune) = 1.014e+23
DeepSeekMath Corpus
arXiv
GitHub
Common Crawl
"By implementing a meticulously designed data selection pipeline, we successfully construct the DeepSeekMath Corpus, a high-quality dataset of 120B tokens from web pages filtered for mathematical content, which is almost 7 times the size of the math web pages used by Minerva (Lewkowycz et al., 2022a) and 9 times the size of the recently released OpenWebMath (Paster et al., 2023)." "The distribution of the data is as follows: 56% is from the DeepSeekMath Corpus, 4% from AlgebraicStack, 10% from
500000000000
"Our model is initialized with DeepSeek-Coder-Base-v1.5 7B (Guo et al., 2024) and trained for 500B tokens."
Confident
Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the perfor
Open weights (restricted use)
China
China
China
DeepSeek Coder 6.7B
21000000000000B
6*7B*500B = 2.1e+22
Unreleased
deepseek license https://huggingface.co/deepseek-ai/deepseek-math-7b-base
Industry
Academia
Academia
Qwen-7B
Language
Language modeling/generation
Translation
Alibaba
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zha
2023-09-28
Qwen Technical Report
https://arxiv.org/abs/2309.16609, https://huggingface.co/Qwen/Qwen-7B
169.00
7000000000.00
7B
1.0099999999999998e+23
2.4T tokens per Table 1. 7B * 2.4T * 6 = 1.01e23
Unspecified unreleased
"Our dataset is designed to meet these requirements and includes public web documents, encyclopedia, books, codes, etc. Additionally, our dataset is multilingual, with a significant portion of the data being in English and Chinese."
2400000000000
"We have pretrained the language models, namely QWEN, on massive datasets containing trillions of tokens" Table 1 indicates 2.4T tokens for Qwen-7B, and the above quote suggests the 2.4T aren't from multiple epochs on a smaller dataset.
Confident
Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignm
Open weights (restricted use)
China
Unreleased
commercial allowed, can't use to train models https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT
Industry
Evo 2 7B
Biology
Protein or nucleotide language model (pLM/nLM)
Arc Institute
Stanford University
NVIDIA
Liquid
University of California (UC) Berkeley
Goodfire
Columbia University
University of California San Francisco
Garyk Brixi, Matthew G. Durrant, Jerome Ku, Michael Poli, Greg Brockman, Daniel Chang, Gabriel A. Gonzalez, Samuel H. King, David B. Li, Aditi T. Merchant, Mohsen Naghipourfar, Eric Nguyen, Chiara Ricci-Tam, David W. Romero, Gwanggyu Sun, Ali Taghibakshi, Anton Vorontsov, Brandon Yang, Myra Deng, Liv Gorton, Nam Nguyen, Nicholas K. Wang, Etowah Adams, Stephen A. Baccus, Steven Dillmann, Stefano Ermon, Daniel Guo, Rajesh Ilango, Ken Janik, Amy X. Lu, Reshma Mehta, Mohammad R.K. Mofrad, Madelena Y
2025-02-19
Genome modeling and design across all domains of life with Evo 2
https://arcinstitute.org/manuscripts/Evo2
7000000000.00
1.008e+23
7e9 parameters * 2.4e12 training datapoints * 6 = 1.008e23
OpenGenome 2
2400000000000
"We trained two versions of Evo 2: a smaller version at 7B parameters trained on 2.4 trillion tokens and a full version at 40B parameters trained on 9.3 trillion tokens."
Unverified
All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes. We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedente
Open weights (unrestricted)
United States of America
United States of America
United States of America
United States of America
United States of America
United States of America
Academia
Industry
Industry
Academia
Academia
Academia
Luminous-extended
Language
Language modeling/generation
Aleph Alpha
2022-08-15
https://docs.aleph-alpha.com/docs/Deprecated%20Luminous/Deprecated-Luminous/model-card/
30000000000.00
~30B (~42B with multi-modality)
1.0019457e+23
311840000000000*360000*3600*0.3 = 1.2124339e+23 6ND = 6*30*10^9*460000000000 = 8.28e+22 sqrt(8.28e+22*1.2124339e+23) = 1.0019457e+23
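A minimal Python sketch of the geometric-mean estimate above (the 360,000 factor, apparently GPU-hours, and the 0.3 utilization are taken verbatim from the note):
# Luminous-extended: geometric mean of hardware-time and 6ND estimates (sketch).
import math
c_hw = 3.1184e14 * 360000 * 3600 * 0.3   # ~1.21e23 FLOP
c_6nd = 6 * 30e9 * 460e9                 # ~8.28e22 FLOP
print(f"{math.sqrt(c_hw * c_6nd):.2e}")  # ~1.0e23 FLOP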
"""The Luminous family has been trained on a dataset compiled of sources in English, German, French, Spanish and Italian..."" more details in model card https://docs.aleph-alpha.com/docs/introduction/model-card/"
460000000000
~460B tokens; 230,000 iterations
NVIDIA A100 SXM4 40 GB
Confident
Aleph Alpha luminous-extended is the second-largest model, faster and cheaper than Luminous-supreme. The model can perform information extraction and language simplification, and has multimodal image description capability. You can try Aleph Alpha models with predefined examples for free. Go to the Jumpstart page on their site and click through the examples on Classification and Labelling, Generation, Information Extraction, Translation and Conversion, and Multimodal. Aleph Alpha are ba
API access
Germany
512
Unreleased
Industry
225 records
