Efficient Deep Learning: A Practical Guide - Part 2
Sparsification
Why Sparsification?
In the previous article, we explored Knowledge Distillation, where a smaller student model learns from a larger teacher. In this article, we take a fundamentally different approach: instead of training a new, smaller model, we remove unnecessary parts from an existing model, making it sparse.
Neural networks are typically dense: every neuron in one layer is connected to every neuron in the next. However, research has consistently shown that not all of these connections are equally important. The weight distribution of a trained network tells the story: most weights cluster near zero and contribute very little to the model's output.
Sparsification is the process of setting unimportant weights to zero. The model architecture and tensor shapes remain unchanged, but the weight matrices become sparse, containing a large proportion of exact zeros.
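To make this concrete, a model's sparsity level is simply the fraction of exact zeros in its weight tensors. A minimal sketch (using NumPy purely for illustration; the helper name is ours, not a library function):

```python
import numpy as np

def sparsity(w):
    """Fraction of exactly-zero entries in a weight tensor."""
    return float((w == 0).mean())

# A toy 2x3 weight matrix with three exact zeros
w = np.array([[0.8, 0.0, -0.3],
              [0.0, 0.0, 1.2]])
print(sparsity(w))  # 0.5 -> the matrix is 50% sparse
```

Note that the tensor keeps its shape and dtype; only the values change.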
One of the most influential ideas behind this approach comes from Frankle & Carbin (2019), who proposed the Lottery Ticket Hypothesis:
A randomly initialized dense network contains a subnetwork (a "winning ticket") that, when trained in isolation, can match the performance of the full network.
Within any large neural network, there exists a much smaller network that is just as capable; we just need to find it. Sparsification is essentially the search for this winning ticket.
The implications are profound: overparameterization helps training (more possible winning tickets to find), but most parameters are redundant at inference time, and sparse networks can be just as accurate as their dense counterparts.
In this series, we distinguish between two levels of compression: sparsification, which zeros out weights while keeping the model structure intact, and pruning, which physically removes entire components (filters, channels) to produce a smaller model. While the literature often uses "pruning" for both, we find this distinction useful. This post focuses on sparsification; pruning will be covered separately.
In FasterAI, the sparsification process is built around four orthogonal design parameters, each answering a fundamental question:
| Parameter | Question | Examples |
|---|---|---|
| Granularity | How to sparsify? | Weight, filter, channel, kernel... |
| Context | Where to sparsify? | Local (per-layer) vs. Global (whole network) |
| Criteria | What to sparsify? | Magnitude, movement, random... |
| Schedule | When to sparsify? | One-shot, linear, cosine, AGP... |
These four dimensions are independent: you can combine any granularity with any context, any criteria, and any schedule. This combinatorial flexibility is what makes FasterAI's approach so powerful. We cover the main options below, but FasterAI provides many more built-in choices for each parameter and makes it easy to define your own.
1. Granularity — How to Sparsify
The granularity determines the structural level at which weights are removed. This is perhaps the most impactful design choice, as it directly affects whether you get real-world speedups.
Granularity is best understood as a spectrum from fine-grained to coarse-grained:
$$ \text{weight} \longrightarrow \text{row/column} \longrightarrow \text{kernel} \longrightarrow \text{channel/filter} \longrightarrow \text{layer} $$
Regardless of the granularity, sparsification applies a binary mask $M$ to the weight tensor $W$:
$$ W_{\text{sparse}} = W \odot M $$
What changes across granularities is the structure of the mask. Each granularity defines a grouping of the weight indices: all indices within the same group share the same mask value (0 or 1). For a Conv2d tensor $W \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times K_H \times K_W}$:
- At weight granularity, each index $(i,j,k,l)$ is its own group, so every entry of $M$ can be set independently.
- At filter granularity, all indices sharing the same $i$ form a group: $M_{i,:,:,:}$ is either all zeros or all ones.
The finer the grouping, the more flexibility in choosing what to zero out. The coarser the grouping, the more structure the zeros have, which can later be exploited for pruning. FasterAI supports the following granularities:
| Granularity | What is removed | Effect |
|---|---|---|
| `weight` | Individual weights | Sparse tensors, maximum flexibility |
| `row` / `column` | Rows or columns of weights | Partial zero patterns |
| `kernel` | Spatial kernel patterns | Zeroed input-output connections |
| `channel` | Entire input channels | Zeroed input channels (prunable) |
| `filter` | Entire output filters | Zeroed output filters (prunable) |
Moving along the spectrum: finer granularities offer more flexibility in choosing what to zero out, and typically better accuracy retention at the same sparsity level. Coarser granularities are more constrained, but the zeroed-out components can later be pruned (physically removed), producing a smaller dense model that runs faster on any hardware.
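The difference in mask structure can be sketched in a few lines of NumPy. This is a toy magnitude-based example, not FasterAI's implementation; the L1 filter score is one common choice among several:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3, 3, 3))  # Conv2d-like tensor: (C_out, C_in, K_H, K_W)

# Weight granularity: every entry of M is set independently (keep top 50% by |w|)
scores = np.abs(W)
thresh = np.quantile(scores, 0.5)
M_weight = (scores >= thresh).astype(W.dtype)

# Filter granularity: one score per output filter i; M[i,:,:,:] is all 0 or all 1
filter_scores = np.abs(W).reshape(W.shape[0], -1).sum(axis=1)  # L1 norm per filter
keep = filter_scores >= np.median(filter_scores)
M_filter = np.zeros_like(W)
M_filter[keep] = 1.0

W_sparse = W * M_filter  # W ⊙ M: the lowest-scoring filters are zeroed entirely
```

Both masks have the same shape as `W`; only the grouping of zeros differs.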
2. Context — Where to Sparsify
The context determines the scope over which importance scores are compared to decide which weights to zero out. The sparsification mechanism is always the same: compute importance scores, rank them, and zero out the lowest-scoring ones. What changes is the scope of that ranking.
| Context | Scope of ranking | Effect |
|---|---|---|
| `local` | Per layer | Every layer is ranked independently and gets exactly $s_{\text{target}}$ sparsity |
| `global` | Entire network | All weights are ranked together; per-layer sparsity emerges from the global ranking |
With local context, every layer loses the same fraction of weights. This is simple and predictable, but treats all layers as equally compressible, which is rarely true.
With global context, layers with many low-importance weights naturally end up sparser, while layers with more important weights are preserved. In practice, this often keeps more weights in the early layers (which learn fundamental features like edges and textures) and sparsifies more aggressively in later, more redundant layers. This adaptive behavior typically leads to better accuracy at the same overall compression ratio.
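The two contexts can be contrasted with a small magnitude-based sketch. The helper names below are ours, not FasterAI's API:

```python
import numpy as np

def sparsify_local(layers, target):
    """Rank each layer independently: every layer hits exactly `target` sparsity."""
    out = []
    for w in layers:
        t = np.quantile(np.abs(w), target)
        out.append(np.where(np.abs(w) >= t, w, 0.0))
    return out

def sparsify_global(layers, target):
    """Rank all weights together: per-layer sparsity emerges from one threshold."""
    all_scores = np.concatenate([np.abs(w).ravel() for w in layers])
    t = np.quantile(all_scores, target)
    return [np.where(np.abs(w) >= t, w, 0.0) for w in layers]

rng = np.random.default_rng(0)
# Layer 0 has large weights, layer 1 has small ones
layers = [rng.normal(scale=1.0, size=100), rng.normal(scale=0.1, size=100)]
glob = sparsify_global(layers, 0.5)
# Under the global ranking, the small-magnitude layer absorbs most of the sparsity
```

With `sparsify_local`, both layers end up exactly 50% sparse; with `sparsify_global`, the second layer is far sparser than the first.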
3. Criteria — What to Sparsify
The criteria determines how we score the importance of each weight (or group of weights). This is the core question: which weights matter?
In FasterAI, a criterion is a scoring function $f$ that assigns an importance score to each weight (or group of weights):
$$ \text{importance}(w) = f(w, w_0, \nabla w, \ldots) $$
The sparsification pipeline always works the same way: compute scores, rank them, zero out the lowest. Only $f$ changes, which means implementing a new criterion is as simple as defining a new function. FasterAI provides several built-in ones:
| Criterion | Scoring function $f$ | Intuition |
|---|---|---|
| `large_final` | $f(w) = \lvert w \rvert$ | Keep the largest weights at the end of training |
| `large_init` | $f(w_0) = \lvert w_0 \rvert$ | Keep the largest weights at initialization |
| `movement` | $f(w, w_0) = \lvert w - w_0 \rvert$ | Keep weights that moved the most from initialization (Zhou et al., 2019) |
| `random` | $f = \text{rand}()$ | Baseline to validate your criterion against |
Movement is particularly effective when fine-tuning pre-trained models, where the initial magnitudes may not reflect importance for the new task: a weight that changed significantly during training is likely important, while one that barely moved can be safely removed.
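A toy example makes the contrast visible. These are illustrative re-implementations of the scoring idea, not FasterAI's actual code:

```python
import numpy as np

# A criterion is just a scoring function f; the lowest-scoring weights get zeroed.
def large_final(w, w0): return np.abs(w)       # f(w) = |w|
def movement(w, w0):    return np.abs(w - w0)  # f(w, w0) = |w - w0|

def sparsify(w, scores, target):
    """Zero out the weights whose score falls below the target quantile."""
    t = np.quantile(scores, target)
    return np.where(scores >= t, w, 0.0)

w0 = np.array([1.0, -0.9, 0.05, -0.05])  # weights at initialization
w  = np.array([1.0, -0.9, 0.45, -0.06])  # after fine-tuning: third weight moved a lot

mag = sparsify(w, large_final(w, w0), 0.5)  # keeps the two largest weights
mov = sparsify(w, movement(w, w0), 0.5)     # keeps the two weights that moved
```

Here magnitude keeps the large-but-frozen weights, while movement keeps the small weight that the new task actually trained.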
4. Schedule — When to Sparsify
The schedule controls how sparsity evolves over the course of training. Like a learning rate schedule maps training steps to learning rates, a sparsity schedule is a function $s$ that maps a training step $t$ to a sparsity level:
$$ s: [0, T] \longrightarrow [0, s_{\text{target}}] $$
At each step, the framework queries $s(t)$ to determine the current target sparsity. The mechanism is always the same; what changes is the shape of $s$. Implementing a new schedule is as simple as defining a new function $s(t)$.
| Schedule | Function $s(t)$ | Intuition |
|---|---|---|
| `one_shot` | $s(t) = s_{\text{target}}$ at $t = 0$ | All at once, simple but sharp accuracy drop |
| `lin` | $s(t) = s_{\text{target}} \cdot \frac{t}{T}$ | Linear ramp, general-purpose default |
| `cos` | $s(t) = \frac{s_{\text{target}}}{2} \left(1 - \cos\left(\pi \cdot \frac{t}{T}\right)\right)$ | Smooth cosine annealing |
| `agp` | $s(t) = s_{\text{target}} \left(1 - \left(1 - \frac{t}{T}\right)^3\right)$ | Cubic: aggressive early, careful later (Zhu & Gupta, 2017) |
| `one_cycle` | $s(t) = \frac{1 + e^{-\alpha + \beta}}{1 + e^{-\alpha t/T + \beta}} \cdot s_{\text{target}}$ | Logistic curve, smooth acceleration then plateau |
5. Putting It All Together
The beauty of FasterAI's design is that these four parameters are fully composable. Here's how to combine them in practice:
```python
from fastai.vision.all import *
from fasterai.sparse.all import *

path = untar_data(URLs.PETS)
files = get_image_files(path/"images")

def label_func(f): return f[0].isupper()

dls = ImageDataLoaders.from_name_func(path, files, label_func, item_tfms=Resize(64))
learn = vision_learner(dls, resnet18, metrics=accuracy)
learn.unfreeze()

sp_cb = SparsifyCallback(
    compression_ratio=0.5,   # How much to zero out
    granularity='filter',    # How: zero out entire filters
    context='global',        # Where: rank globally across all layers
    criteria=large_final,    # What: keep largest weights
    schedule=cos,            # When: cosine schedule
)

learn.fit_one_cycle(5, 1e-3, cbs=sp_cb)
```
Each parameter maps directly to a design choice:
- `compression_ratio`: How sparse the final model should be
- `granularity`: What structural level to sparsify (`'weight'`, `'filter'`, `'kernel'`, etc.)
- `context`: `'global'` (adaptive per-layer) or `'local'` (uniform)
- `criteria`: Which importance function to use
- `schedule`: How sparsity evolves during training
Want to try movement with an AGP schedule on a local context? Just swap the parameters:
```python
sp_cb = SparsifyCallback(
    compression_ratio=0.7,
    granularity='weight',
    context='local',
    criteria=movement,
    schedule=agp,
)
```
Conclusion
Sparsification is a powerful and flexible approach to model compression. By decomposing the problem into four independent design choices (granularity, context, criteria, and schedule), FasterAI provides a systematic framework for exploring the vast space of possible sparsification strategies.
These four building blocks are not specific to sparsification: FasterAI reuses the same criteria, granularity, and schedule abstractions across other techniques like regularization, making them a shared language for model compression.
Key takeaways:
- Granularity ranges from individual weights to entire filters, trading flexibility for hardware efficiency (coarser granularities enable pruning)
- Context can be local (uniform) or global (adaptive), with global typically yielding better results
- Criteria determine importance: magnitude is simple and effective, movement works better for fine-tuning
- Schedule controls the sparsification dynamics. Gradual approaches generally outperform one-shot
The combinatorial nature of these four parameters means there's always room to find a better configuration for your specific model and use case. FasterAI makes this exploration easy with just a few lines of code.
In the next blog post, we will explore Quantization, another fundamental compression technique that reduces the numerical precision of weights to shrink models and accelerate inference.
Join us on Discord to stay tuned!