Issue Brief: Measuring Training Compute

Recent AI safety and governance proposals have used total training compute as a proxy for the capability level of general-purpose models. By tying specific safety measures to particular compute thresholds, such proposals aim to ensure that safety mechanisms are both proportionate to model capabilities and applied in a consistent, objective manner. While it would be preferable to move toward thresholds based more directly on models’ capabilities, training compute remains a useful proxy until such alternatives mature, and it can serve as one of several triggers for more direct evaluation of model capabilities.

For compute-based threshold proposals to achieve their aim, there must be clear standards for how model developers measure and report training compute. Principled measurement standards are essential for ensuring that safety measures are applied in a proportionate and appropriate way.

Based on expert consensus, the Frontier Model Forum recommends the following principles for how model developers can measure training compute: 

  • All operations should be treated equally. Models are trained using floating-point or integer operations at different levels of precision, such as FP16 or FP8. Because the primary driver of capabilities is the total number of operations, we recommend that every numerical operation be counted equally regardless of precision, while acknowledging that the underlying computational workload may differ across precisions.
  • Approaches to calculating total compute should be context dependent. One approach is to plug architectural details and hyperparameters into a formula. Another is to approximate total compute by multiplying the total chip-time used (e.g., chip-days) by the peak operation rate of the chips involved. Although the first approach is more precise and often preferable, both are valid and carry distinct advantages and tradeoffs. Developers may use coarser methods (e.g., the hardware-based approach) for initial assessments and to narrow the scope of models under consideration; the results of those preliminary estimates, together with the relevant compute thresholds, can then inform whether a more precise calculation is warranted. A minimal sketch of both approaches appears after this list.
  • Recomputations should not be included. To reduce the amount of memory needed to train a model, developers often recompute intermediate activations during backpropagation. This process, known as gradient checkpointing, trades additional computation for reduced memory: the added operations merely regenerate values that would otherwise be calculated once and stored in memory. Since these operations serve only to save memory and do not increase the capability of the resulting model, they should not be counted toward training compute (see the second sketch after this list).
  • Discarded versions or branches should not be included. Model developers will often experiment with different branches and versions of a given model that are ultimately discarded. Since information from such branches is not explicitly included in the final model, the operations used to train discarded branches should not be included in measures of training compute. 
  • Models should be defined only as systems that have been trained (at least partially) end-to-end. Some AI models are produced by taking multiple independently trained models, combining them, and then further training the result to integrate them; such a combination should be treated as a single model, and all of the resulting computation should be counted. Other composite AI systems operate by sampling from a variety of underlying models that were trained separately but never jointly (such as a chat model and a separate safety-filtering model). Although these components are part of one system, they should not be considered parts of the same “model,” and only the compute of each individual model should be reported (where required).
  • Any set of approximations that cumulatively changes the estimated total compute by less than 5% should be valid. Depending on the model, acceptable approximations may include ignoring embedding lookups, bias additions, layer normalization, residual connections, and the gating networks of mixture-of-experts (MoE) models.
  • If multiple variants of the same foundation model each exceed a compute threshold, only the best of these variants needs to be reported within a fixed period (e.g., every six months). AI developers often create numerous post-trained models based on a single pretrained foundation model, and these variants may possess qualitatively different capabilities. Comprehensively assessing and reporting on every variant would be extremely expensive and would provide limited additional information. AI developers should instead be required to report only on the best variant derived from each foundation model in a given six-month period, where “best” may be determined by public or private benchmarks, the developer’s holistic evaluation, or other specified criteria. This should not preclude additional reporting on specific variants at the discretion of the AI developer, especially if they are assessed to be of interest for other reasons (such as being widely deployed).
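
To make the calculation approaches above concrete, the following is a minimal, illustrative sketch rather than an official methodology. It assumes a dense decoder-only transformer and uses the widely cited rule of thumb of roughly 6 operations per parameter per training token for the formula-based method; the hardware-based method simply multiplies chip-time by the chip's peak operation rate. All function names and example values are hypothetical. Consistent with the first principle above, operations at every precision are summed into a single total.

```python
# Illustrative sketch only: two common ways to estimate total training compute.
# Assumes a dense decoder-only transformer; all names and values are hypothetical.

SECONDS_PER_DAY = 86_400


def formula_based_estimate(num_parameters: float, training_tokens: float) -> float:
    """Estimate compute from architectural details and hyperparameters.

    Uses the common ~6 * N * D approximation (about 2 ops per parameter per
    token for the forward pass plus ~4 for the backward pass). This count
    excludes gradient-checkpointing recomputation and ignores small terms
    (embedding lookups, biases, layer norms, residual additions), which
    typically shift the total by well under 5%.
    """
    return 6.0 * num_parameters * training_tokens


def hardware_based_estimate(chip_days: float, peak_ops_per_second: float) -> float:
    """Estimate compute from hardware usage: chip-time times peak operation rate.

    Because real utilization falls below peak, this is a cheap, upper-bound-style
    estimate that is useful for initial scoping before a precise calculation.
    """
    return chip_days * SECONDS_PER_DAY * peak_ops_per_second


if __name__ == "__main__":
    # Hypothetical 70B-parameter model trained on 2 trillion tokens.
    formula = formula_based_estimate(num_parameters=70e9, training_tokens=2e12)

    # Hypothetical run: 1,000 chips for 25 days at a peak of 1e15 ops/second.
    # Every operation is counted equally regardless of precision (FP16, FP8, ...),
    # so a mixed-precision run still yields a single total.
    hardware = hardware_based_estimate(chip_days=1_000 * 25,
                                       peak_ops_per_second=1e15)

    print(f"Formula-based estimate:  {formula:.2e} operations")   # ~8.4e23
    print(f"Hardware-based estimate: {hardware:.2e} operations")  # ~2.2e24
```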
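
The recomputation principle can be illustrated in the same hypothetical style. With full activation recomputation (an aggressive form of gradient checkpointing), the hardware performs roughly one extra forward pass per training step, but those operations only regenerate stored activations and are excluded from the reported total. The per-token breakdown below (about 2 operations per parameter for the forward pass and 4 for the backward pass) is a common rule of thumb, not a prescribed standard.

```python
# Hypothetical illustration: gradient-checkpointing recomputation is excluded
# from reported training compute, even though the hardware performs the extra
# operations. Rule-of-thumb per-token costs for a dense transformer:
# ~2 ops/parameter forward, ~4 ops/parameter backward.

FORWARD_OPS_PER_PARAM_TOKEN = 2.0
BACKWARD_OPS_PER_PARAM_TOKEN = 4.0


def reported_training_compute(num_parameters: float, training_tokens: float) -> float:
    """Operations counted toward training compute: forward and backward passes only."""
    per_token = FORWARD_OPS_PER_PARAM_TOKEN + BACKWARD_OPS_PER_PARAM_TOKEN
    return per_token * num_parameters * training_tokens


def executed_operations(num_parameters: float, training_tokens: float,
                        full_recomputation: bool) -> float:
    """Operations the hardware actually performs.

    With full activation recomputation, roughly one extra forward pass is
    repeated during the backward pass to regenerate activations that were not
    stored; those operations do not change the reported total.
    """
    total = reported_training_compute(num_parameters, training_tokens)
    if full_recomputation:
        total += FORWARD_OPS_PER_PARAM_TOKEN * num_parameters * training_tokens
    return total


if __name__ == "__main__":
    n, d = 70e9, 2e12  # hypothetical parameter count and training tokens
    print(f"Reported compute:          {reported_training_compute(n, d):.2e}")
    print(f"Executed, with recompute:  {executed_operations(n, d, True):.2e}")
    print(f"Executed, no recompute:    {executed_operations(n, d, False):.2e}")
```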

If standards for measuring training compute are based on the principles and criteria above, the safety measures and governance mechanisms they inform are far more likely to be applied in an effective and proportionate way.