Amid today's urgent need for decarbonization, the growing use of AI and LLMs has added a new challenge to the already behemoth task of reaching net zero. HuggingFace hosted a timely competition, the Frugal AI Challenge, which asked participants to develop the most environmentally efficient models, measured by emissions, across text, vision, and audio classification tasks. It is refreshing to see an influential company promote frugality in AI and draw the interest of such a wide range of participants. Participating in the competition taught me many new techniques and approaches for training more energy-efficient language models. This blog shares some of the analyses I did, which I hope will help fellow ML practitioners integrate frugality into their work.
What is frugality?
Being frugal is something I have personally tried to practice through buying renewed electronics and thrifting where possible. Being frugal is different from being cheap. Frugality means spending your resources mindfully, so that you get the most value for your buck. Frugality in AI means choosing the right model for the task, not the flashiest and largest simply because it's available. Like in life, it's the difference between serving your guests a thoughtful, home-cooked meal and either splashing out on the most luxurious ingredients or treating them to beans on toast (unless it's a brunch, of course). Frugality is about understanding the situation and making intentional decisions to prevent waste.
Interestingly, as mentioned by the HuggingFace team during the competition, the term Frugal AI (l'intelligence artificielle frugale) is only found in French on Wikipedia. In fact, France's standardization organization (AFNOR) has already responded to the EU AI Act's requirement for standardized approaches to assessing AI energy by releasing AFNOR SPEC 2314. Frugal AI is a newcomer among recent terms promoting AI efficiency, such as green AI and sustainable AI, yet it differs from those two.

Green and Sustainable AI focus on efficiency, that is, maximizing units of result (e.g., accuracy, precision) for each unit of compute (i.e., cost or environmental impact). Frugality encourages using only enough compute to achieve the necessary results, either by readjusting the success criteria to favour a less resource-intensive approach or by using AI efficiently once the need has been demonstrated. Put differently, being frugal with AI means deploying AI only when necessary, and doing so efficiently with resources.
This definition made me curious about the resource implications of available inference approaches other than AI (in the sense of decoder LLMs). The rest of this article presents the approach and results of an experiment that illustrates the resource and result implications of various methods for a text classification task.
Experimental Setup
The task was a multi-class text classification problem for which several approaches are available, from basic machine learning to large language models (LLMs). As I was curious about the energy implications of methods all the way from "traditional" data science approaches to the newest LLMs, I included everything from the trusty TF-IDF vectorization with XGBoost to the latest LLMs using zero- and few-shot classification. I will briefly go into some details about each of the models, but plenty of excellent resources explain each in greater depth. All the code for this experiment can be found on my GitHub.
The data
I worked with a dataset containing 6,091 quotes on climate change by various organizations and public figures, put together by Desmog and FLICC, where each quote was labelled with the type of misinformation it contained (or didn't). The labels come from the seven top-level labels of the CARDS climate misinformation taxonomy project by Coan, T.G. and colleagues, and consisted of the following:
This is a rather unbalanced dataset, meaning we must evaluate the performance of the models carefully. That is why I have chosen to report the F1 score in this instance instead of simple accuracy. The dataset was split using 80% for fine-tuning and 20% for validation.
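As a minimal sketch of that split, a stratified 80/20 partition keeps the class imbalance consistent across both sets (the column names and the use of stratification are my assumptions for illustration, not necessarily the exact setup used in the experiment):

```python
from sklearn.model_selection import train_test_split

# Assumed column names for illustration: "quote" (text) and "label" (CARDS category).
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df["quote"],
    df["label"],
    test_size=0.2,          # 80% fine-tuning / 20% validation
    stratify=df["label"],   # preserve the class imbalance in both splits
    random_state=42,
)
```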
Hardware Setup
I ran the experiments using Google Colab Pro, which allowed me to choose the same compute resources to ensure consistency across the models. The hardware used was the following:
The energy consumption was measured using the CodeCarbon package. It was simple to set up and use out of the box on Colab. I used the tracker object approach and tracked the portion of the code used for training and inference.
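For readers who haven't used CodeCarbon, the tracker-object pattern looks roughly like this (a minimal sketch; `train_model()` is a placeholder for the actual training call, and the attribute for retrieving energy in kWh may differ between CodeCarbon versions):

```python
from codecarbon import EmissionsTracker

# Wrap the training portion of the run; the same pattern wraps inference.
tracker = EmissionsTracker(project_name="frugal-text-classification")
tracker.start()
train_model()                  # placeholder for the fine-tuning / training code
emissions_kg = tracker.stop()  # estimated emissions in kg CO2eq

print(f"Emissions: {emissions_kg:.6f} kg CO2eq")
# Energy in kWh is also reported on the final measurement in recent releases.
print(f"Energy: {tracker.final_emissions_data.energy_consumed:.6f} kWh")
```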
The inclusion of inference energy reflects the recent shift towards LLMs consuming more compute at the inference stage, due to longer prompts and frequent requests, as discussed in this Artfish.ai blog. Training is often assumed to consume the bulk of the energy because its headline figures are large, but it typically happens only once or infrequently. Inference, on the other hand, is estimated to account for 80-90% of an AI model's energy use because it happens continuously. Understanding inference energy consumption therefore helps guide model design choices that deliver more significant energy, cost, and carbon reductions in production.
The models
The above table summarizes the models I experimented with in this blog. It roughly represents the three most recent eras of NLP, progressing from the classical NLP and ML pipeline to transformer-based encoder and encoder-decoder models.
Fine-tuning Setup
The encoder models in the BERT family were loaded from HuggingFace and trained using a custom PyTorch class with a custom classifier head. For the encoder-decoder models, I used the Transformers library's AutoModelForSeq2SeqLM and the Trainer class to fine-tune the model. The hyperparameters for each run were kept constant to ensure comparability of results, and all runs were executed in one session to prevent variability in the cloud resources.
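As a rough sketch of the encoder-plus-custom-head pattern (the checkpoint, label count, dropout, and head design here are illustrative assumptions rather than the exact configuration from my runs):

```python
import torch.nn as nn
from transformers import AutoModel

class EncoderClassifier(nn.Module):
    """Pre-trained encoder with a custom classification head (sketch)."""

    def __init__(self, model_name: str = "distilbert-base-uncased", num_labels: int = 8):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Simple head: dropout + linear projection to the label space.
        self.classifier = nn.Sequential(
            nn.Dropout(0.1),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # first-token representation
        return self.classifier(cls)
```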
But wait, what about LLMs?
If you are referring to the GPTs and LLaMAs, well, for a classification task with reasonably sized training data, the decoder architecture of most LLMs is simply not as efficient. Yes, decoder-based LLMs are highly capable, even on classification tasks via few- or zero-shot prompting, but their expensive architecture is hardly frugal. Their autoregressive generation and the longer inputs required for few-shot prompting increase the computational cost. In terms of classification performance, encoder-only models have also been observed to outperform decoder-only models.
Efficiency Measures for Training
I experimented mostly with reducing training energy for this post, as it affects me directly: I am paying for the compute on Colab and making few inferences. I first built a frugal non-neural baseline using a classical NLP and ML pipeline: TF-IDF vectorization followed by an XGBoost classifier, sketched below. The remaining models were all neural networks, ranging from a custom network to pre-trained encoder and encoder-decoder models.
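The baseline looks roughly like this (the hyperparameters are illustrative assumptions, not the exact values from my runs; labels are assumed to be integer-encoded as 0 to n_classes-1, as XGBoost expects for multi-class problems):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# Classical NLP/ML pipeline: sparse TF-IDF features into a gradient-boosted classifier.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=20_000, ngram_range=(1, 2))),
    ("clf", XGBClassifier(
        objective="multi:softprob",
        n_estimators=300,
        max_depth=6,
        eval_metric="mlogloss",
    )),
])

baseline.fit(train_texts, train_labels)
preds = baseline.predict(val_texts)
```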
I highlight the key results from techniques applied to the NN and transformer models, such as layer freezing, distillation, and using a custom architecture to reduce the training footprint.
Layer freezing means preventing the calculation of gradients on some tuneable layers so that fewer parameters require updating. Fewer gradient calculations mean lower fine-tuning energy consumption, so the more frozen layers, the less training energy. The tradeoff is that the model's performance can drop because it has fewer learnable parameters.
The first layer of any encoder model is the embedding layer, which embeds each token and its position. It is often frozen when fine-tuning on small datasets to prevent corrupting the pre-trained embedding space for certain tokens. Deciding which transformer layers to freeze requires a more empirical approach that balances energy savings against performance loss.
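In the Transformers library, freezing comes down to switching off gradients for the relevant parameters. A minimal sketch for a BERT-style model (the label count is an illustrative assumption, and the attribute path differs by architecture, e.g. `model.deberta` instead of `model.bert` for DeBERTa):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=8  # label count assumed for illustration
)

# Freeze the embedding layer (token + position embeddings).
for param in model.bert.embeddings.parameters():
    param.requires_grad = False

# Freeze the first N transformer layers; N=8 matches the best tradeoff observed below.
N_FROZEN = 8
for layer in model.bert.encoder.layer[:N_FROZEN]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```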
Model distillation transfers, or "distils", the learned behaviour of a much larger model into a smaller and usually more efficient one. Distilled models tend to closely preserve the performance of the full models while significantly reducing inference resources. While I didn't distill any models myself and instead used pre-distilled models, distillation is a possible efficiency measure for reducing both training and inference energy.
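Using a pre-distilled checkpoint is a drop-in swap: only the model name changes (again, the label count below is an assumption for illustration):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Swap "bert-base-uncased" for its distilled counterpart; the rest of the
# fine-tuning code stays the same.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=8
)
```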
How did the models stack up?
The above plot shows that the models tested span a wide range of training energy consumption. The TF-IDF + XGBoost model consumed 0.00056 kWh of energy during training, while DeBERTa-v3-base consumed 71x that, at 0.04 kWh. The increase in energy is coupled with better predictive performance, as F1 rises beyond 0.65 for all the pre-trained encoder and encoder-decoder models.
Interestingly, across the pre-trained transformers from BART to DeBERTa-v3-base, F1 increases only marginally from ~0.71 to ~0.73 while the energy nearly doubles. Only BERT-base-uncased defies this trend, with a significantly lower F1 score. We can confirm this tradeoff behaviour in the scatterplot below.

The plateauing F1 performance becomes much clearer in the scatterplot of training energy versus F1 score above. There is clearly an efficiency frontier, which I am calling the frugal frontier, where F1 plateaus despite increasing energy consumption.
Visualising this frontier can help with model selection by making the tradeoff between energy and predictive performance explicit.
Layer freezing has rapid tradeoffs

Freezing tuneable layers showed an interesting tradeoff pattern between energy and F1 score. Freezing the first eight layers gives a large reduction in energy (30-37%) with only a minimal drop in F1 (1.6-4.0%). Freezing two more layers, to ten in total, drastically decreases the model's predictive performance (38-85%) with only a marginal energy saving (15-21%). Freezing too many layers, especially the higher ones, which have been shown to process semantic information, can prevent the network from learning from the data. Empirically, this shows that freezing up to eight layers strikes a good balance between energy and predictive performance for smaller datasets such as ours.
Training vs. Inference
If there aren't already enough tradeoffs to consider, one more, arguably the most important, is the tradeoff between training and inference energy. As mentioned earlier, inference will make up the bulk of the energy footprint over a model's life cycle. The analysis so far has concentrated on training energy, as that was the focus of this blog, but here we look at the training-versus-inference tradeoff.
The above plot does not show a clear relationship between training and inference energy. TF-IDF + XGBoost remains close to zero compared to the neural networks. Models that are the most energy-intensive during training can have comparatively lower inference energy, as is the case with DeBERTa-v3 and FLAN-T5. Conversely, distilled models such as DistilBERT and DistilRoBERTa show significant savings in both training and inference energy compared to their full-sized counterparts. Layer freezing, on the other hand, is effective at reducing training energy, as shown previously, but has practically no impact on inference energy.
So which model?
The diminishing returns of performance against energy for larger models, and the diverging training and inference consumption, bring us back to the goal of Frugal AI: which model is sufficient for our needs?
This is a rather multi-faceted decision with the unsatisfying, inconclusive answer: it depends. The guiding questions below translate frugality more concretely into a practical decision-making process:
What is the performance floor?
What is the lowest acceptable level of F1/accuracy before the model is unusable?
Are the predictions mission-critical, requiring high performance, or is there a human-in-the-loop process where lower performance is acceptable?
How often will the model retrain? What is the volume of inference?
If the model is retrained daily or weekly on fresh and growing data, then training efficiency should be prioritized
If retraining is infrequent, then inference efficiency should be prioritized
What is the source of energy?
If the source is renewable by nature, the associated CO₂e of the energy will be low
If electricity is purchased from a non-renewable grid and training is on-site, model training and inference will likely contribute significantly to the organization's Scope 2 emissions.
One way to make these answers more empirical is to rank the models using a weighted score. Based on your answers to the questions above, you can assign weights α (F1), β (training energy), and γ (inference energy) to the relevant metrics and calculate a frugal ranking, as sketched below.
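A minimal sketch of that scoring, assuming min-max normalisation and a simple linear combination (the table values and weights below are toy placeholders, not my measured results):

```python
import pandas as pd

# Hypothetical results table; replace with your own model measurements.
results = pd.DataFrame({
    "model": ["model_a", "model_b", "model_c"],
    "f1": [0.48, 0.71, 0.73],
    "train_kwh": [0.001, 0.02, 0.04],
    "inference_kwh": [0.0001, 0.002, 0.004],
}).set_index("model")

# Min-max normalise each metric to [0, 1] so the weights are comparable.
norm = (results - results.min()) / (results.max() - results.min())

# Weights chosen from the guiding questions above (illustrative values).
alpha, beta, gamma = 0.5, 0.2, 0.3

# Higher is better: reward F1, penalise training and inference energy.
frugal_score = (alpha * norm["f1"]
                - beta * norm["train_kwh"]
                - gamma * norm["inference_kwh"])
print(frugal_score.sort_values(ascending=False))
```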
I have made a simple Streamlit tool where you can upload your model results, play around with weights and visualize the results.
Conclusion
Frugality in AI is not just about being efficient; it is a mindset shift towards building models that do enough with less. As this experiment and analysis have shown, being frugal is neither about chasing the most efficient model nor about solely maximizing accuracy. Frugal AI is about balancing what's good enough for your task while being mindful of resource constraints. So I guess, in some ways, Frugal AI could even mean not using AI at all! Hopefully, integrating frugality into your model design will not just be another optimization exercise but a new value that guides future model development.