Efficient Deep Learning: A Practical Guide - Part 2
Sparsification
Why Sparsification?
In the previous article, we explored Knowledge Distillation, where a smaller student model learns from a larger teacher. In this article, we take a fundamentally different approach: instead of training a new, smaller model, we remove unnecessary parts from an existing model, making it sparse.
Neural networks are typically dense: every neuron in one layer is connected to every neuron in the next. However, research has consistently shown that not all of these connections are equally important. The weight distribution of a trained network tells the story: most weights cluster near zero and contribute very little to the model's output.
Sparsification is the process of setting unimportant weights to zero. The model architecture and tensor shapes remain unchanged, but the weight matrices become sparse, containing a large proportion of exact zeros.
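To make this concrete, a model's sparsity level is simply the fraction of exact zeros in its weight tensors. A minimal sketch (using NumPy purely for illustration; the helper name is ours, not a library function):

```python
import numpy as np

def sparsity(w):
    """Fraction of exactly-zero entries in a weight tensor."""
    return float((w == 0).mean())

# A toy 2x3 weight matrix with three exact zeros
w = np.array([[0.8, 0.0, -0.3],
              [0.0, 0.0, 1.2]])
print(sparsity(w))  # 0.5 -> the matrix is 50% sparse
```

Note that the tensor keeps its shape and dtype; only the values change.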
One of the most influential ideas behind this approach comes from Frankle & Carbin (2019), who proposed the Lottery Ticket Hypothesis:
A randomly initialized dense network contains a subnetwork (a "winning ticket") that, when trained in isolation, can match the performance of the full network.
Within any large neural network, there exists a much smaller network that is just as capable; we just need to find it. Sparsification is essentially the search for this winning ticket.
The implications are profound: overparameterization helps training (more possible winning tickets to find), but most parameters are redundant at inference time, and sparse networks can be just as accurate as their dense counterparts.
In this series, we distinguish between two levels of compression: sparsification, which zeros out weights while keeping the model structure intact, and pruning, which physically removes entire components (filters, channels) to produce a smaller model. While the literature often uses "pruning" for both, we find this distinction useful. This post focuses on sparsification; pruning will be covered separately.
In FasterAI, the sparsification process is built around four orthogonal design parameters, each answering a fundamental question:
| Parameter | Question | Examples |
|---|---|---|
| Granularity | How to sparsify? | Weight, filter, channel, kernel... |
| Context | Where to sparsify? | Local (per-layer) vs. Global (whole network) |
| Criteria | What to sparsify? | Magnitude, movement, random... |
| Schedule | When to sparsify? | One-shot, linear, cosine, AGP... |
These four dimensions are independent: you can combine any granularity with any context, any criteria, and any schedule. This combinatorial flexibility is what makes FasterAI's approach so powerful. We cover the main options below, but FasterAI provides many more built-in choices for each parameter and makes it easy to define your own.
1. Granularity — How to Sparsify
The granularity determines the structural level at which weights are removed. This is perhaps the most impactful design choice, as it directly affects whether you get real-world speedups.
Granularity is best understood as a spectrum from fine-grained to coarse-grained:
$$ \text{weight} \longrightarrow \text{row/column} \longrightarrow \text{kernel} \longrightarrow \text{channel/filter} \longrightarrow \text{layer} $$
Regardless of the granularity, sparsification applies a binary mask $M$ to the weight tensor $W$:
$$ W_{\text{sparse}} = W \odot M $$
What changes across granularities is the structure of the mask. Each granularity defines a grouping of the weight indices: all indices within the same group share the same mask value (0 or 1). For a Conv2d tensor $W \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times K_H \times K_W}$:
- At weight granularity, each index $(i,j,k,l)$ is its own group, so every entry of $M$ can be set independently.
- At filter granularity, all indices sharing the same $i$ form a group: $M_{i,:,:,:}$ is either all zeros or all ones.
The finer the grouping, the more flexibility in choosing what to zero out. The coarser the grouping, the more structure the zeros have, which can later be exploited for pruning. FasterAI supports the following granularities:
| Granularity | What is removed | Effect |
|---|---|---|
| `weight` | Individual weights | Sparse tensors, maximum flexibility |
| `row` / `column` | Rows or columns of weights | Partial zero patterns |
| `kernel` | Spatial kernel patterns | Zeroed input-output connections |
| `channel` | Entire input channels | Zeroed input channels (prunable) |
| `filter` | Entire output filters | Zeroed output filters (prunable) |
Moving along the spectrum: finer granularities offer more flexibility in choosing what to zero out, and typically better accuracy retention at the same sparsity level. Coarser granularities are more constrained, but the zeroed-out components can later be pruned (physically removed), producing a smaller dense model that runs faster on any hardware.
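The difference in mask structure can be sketched in a few lines of NumPy. This is a toy magnitude-based example, not FasterAI's implementation; the L1 filter score is one common choice among several:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3, 3, 3))  # Conv2d-like tensor: (C_out, C_in, K_H, K_W)

# Weight granularity: every entry of M is set independently (keep top 50% by |w|)
scores = np.abs(W)
thresh = np.quantile(scores, 0.5)
M_weight = (scores >= thresh).astype(W.dtype)

# Filter granularity: one score per output filter i; M[i,:,:,:] is all 0 or all 1
filter_scores = np.abs(W).reshape(W.shape[0], -1).sum(axis=1)  # L1 norm per filter
keep = filter_scores >= np.median(filter_scores)
M_filter = np.zeros_like(W)
M_filter[keep] = 1.0

W_sparse = W * M_filter  # W ⊙ M: the lowest-scoring filters are zeroed entirely
```

Both masks have the same shape as `W`; only the grouping of zeros differs.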
2. Context — Where to Sparsify
The context determines the scope over which importance scores are compared to decide which weights to zero out. The sparsification mechanism is always the same: compute importance scores, rank them, and zero out the lowest-scoring ones. What changes is the scope of that ranking.
| Context | Scope of ranking | Effect |
|---|---|---|
| `local` | Per layer | Every layer is ranked independently and gets exactly $s_{\text{target}}$ sparsity |
| `global` | Entire network | All weights are ranked together; per-layer sparsity emerges from the global ranking |
With local context, every layer loses the same fraction of weights. This is simple and predictable, but treats all layers as equally compressible, which is rarely true.
With global context, layers with many low-importance weights naturally end up sparser, while layers with more important weights are preserved. In practice, this often keeps more weights in the early layers (which learn fundamental features like edges and textures) and sparsifies more aggressively in later, more redundant layers. This adaptive behavior typically leads to better accuracy at the same overall compression ratio.
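The two contexts can be contrasted with a small magnitude-based sketch. The helper names below are ours, not FasterAI's API:

```python
import numpy as np

def sparsify_local(layers, target):
    """Rank each layer independently: every layer hits exactly `target` sparsity."""
    out = []
    for w in layers:
        t = np.quantile(np.abs(w), target)
        out.append(np.where(np.abs(w) >= t, w, 0.0))
    return out

def sparsify_global(layers, target):
    """Rank all weights together: per-layer sparsity emerges from one threshold."""
    all_scores = np.concatenate([np.abs(w).ravel() for w in layers])
    t = np.quantile(all_scores, target)
    return [np.where(np.abs(w) >= t, w, 0.0) for w in layers]

rng = np.random.default_rng(0)
# Layer 0 has large weights, layer 1 has small ones
layers = [rng.normal(scale=1.0, size=100), rng.normal(scale=0.1, size=100)]
glob = sparsify_global(layers, 0.5)
# Under the global ranking, the small-magnitude layer absorbs most of the sparsity
```

With `sparsify_local`, both layers end up exactly 50% sparse; with `sparsify_global`, the second layer is far sparser than the first.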
3. Criteria — What to Sparsify
The criteria determines how we score the importance of each weight (or group of weights). This is the core question: which weights matter?
In FasterAI, a criterion is a scoring function $f$ that assigns an importance score to each weight (or group of weights):
$$ \text{importance}(w) = f(w, w_0, \nabla w, \ldots) $$
The sparsification pipeline always works the same way: compute scores, rank them, zero out the lowest. Only $f$ changes, which means implementing a new criterion is as simple as defining a new function. FasterAI provides several built-in ones:
| Criterion | Scoring function $f$ | Intuition |
|---|---|---|
| `large_final` | $f(w) = \lvert w \rvert$ | Keep the largest weights at the end of training |
| `large_init` | $f(w_0) = \lvert w_0 \rvert$ | Keep the largest weights at initialization |
| `movement` | $f(w, w_0) = \lvert w - w_0 \rvert$ | Keep weights that moved the most from initialization (Zhou et al., 2019) |
| `random` | $f = \text{rand}()$ | Baseline to validate your criterion against |
Movement is particularly effective when fine-tuning pre-trained models, where the initial magnitudes may not reflect importance for the new task: a weight that changed significantly during training is likely important, while one that barely moved can be safely removed.
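A toy example makes the contrast visible. These are illustrative re-implementations of the scoring idea, not FasterAI's actual code:

```python
import numpy as np

# A criterion is just a scoring function f; the lowest-scoring weights get zeroed.
def large_final(w, w0): return np.abs(w)       # f(w) = |w|
def movement(w, w0):    return np.abs(w - w0)  # f(w, w0) = |w - w0|

def sparsify(w, scores, target):
    """Zero out the weights whose score falls below the target quantile."""
    t = np.quantile(scores, target)
    return np.where(scores >= t, w, 0.0)

w0 = np.array([1.0, -0.9, 0.05, -0.05])  # weights at initialization
w  = np.array([1.0, -0.9, 0.45, -0.06])  # after fine-tuning: third weight moved a lot

mag = sparsify(w, large_final(w, w0), 0.5)  # keeps the two largest weights
mov = sparsify(w, movement(w, w0), 0.5)     # keeps the two weights that moved
```

Here magnitude keeps the large-but-frozen weights, while movement keeps the small weight that the new task actually trained.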
4. Schedule — When to Sparsify
The schedule controls how sparsity evolves over the course of training. Like a learning rate schedule maps training steps to learning rates, a sparsity schedule is a function $s$ that maps a training step $t$ to a sparsity level:
$$ s: [0, T] \longrightarrow [0, s_{\text{target}}] $$
At each step, the framework queries $s(t)$ to determine the current target sparsity. The mechanism is always the same; what changes is the shape of $s$. Implementing a new schedule is as simple as defining a new function $s(t)$.
| Schedule | Function $s(t)$ | Intuition |
|---|---|---|
| `one_shot` | $s(t) = s_{\text{target}}$ at $t = 0$ | All at once, simple but sharp accuracy drop |
| `lin` | $s(t) = s_{\text{target}} \cdot \frac{t}{T}$ | Linear ramp, general-purpose default |
| `cos` | $s(t) = \frac{s_{\text{target}}}{2} \left(1 - \cos\left(\pi \cdot \frac{t}{T}\right)\right)$ | Smooth cosine annealing |
| `agp` | $s(t) = s_{\text{target}} \left(1 - \left(1 - \frac{t}{T}\right)^3\right)$ | Cubic: aggressive early, careful later (Zhu & Gupta, 2017) |
| `one_cycle` | $s(t) = \frac{1 + e^{-\alpha + \beta}}{1 + e^{-\alpha t/T + \beta}} \cdot s_{\text{target}}$ | Logistic curve, smooth acceleration then plateau |
5. Putting It All Together
The beauty of FasterAI's design is that these four parameters are fully composable. Here's how to combine them in practice:
```python
from fastai.vision.all import *
from fasterai.sparse.all import *

path = untar_data(URLs.PETS)
files = get_image_files(path/"images")

def label_func(f): return f[0].isupper()

dls = ImageDataLoaders.from_name_func(path, files, label_func, item_tfms=Resize(64))
learn = vision_learner(dls, resnet18, metrics=accuracy)
learn.unfreeze()

sp_cb = SparsifyCallback(
    compression_ratio=0.5,   # How much to zero out
    granularity='filter',    # How: zero out entire filters
    context='global',        # Where: rank globally across all layers
    criteria=large_final,    # What: keep largest weights
    schedule=cos,            # When: cosine schedule
)

learn.fit_one_cycle(5, 1e-3, cbs=sp_cb)
```
Each parameter maps directly to a design choice:
- `compression_ratio`: How sparse the final model should be
- `granularity`: What structural level to sparsify (`'weight'`, `'filter'`, `'kernel'`, etc.)
- `context`: `'global'` (adaptive per-layer) or `'local'` (uniform)
- `criteria`: Which importance function to use
- `schedule`: How sparsity evolves during training
Want to try movement with an AGP schedule on a local context? Just swap the parameters:
```python
sp_cb = SparsifyCallback(
    compression_ratio=0.7,
    granularity='weight',
    context='local',
    criteria=movement,
    schedule=agp,
)
```
Conclusion
Sparsification is a powerful and flexible approach to model compression. By decomposing the problem into four independent design choices (granularity, context, criteria, and schedule), FasterAI provides a systematic framework for exploring the vast space of possible sparsification strategies.
These four building blocks are not specific to sparsification: FasterAI reuses the same criteria, granularity, and schedule abstractions across other techniques like regularization, making them a shared language for model compression.
Key takeaways:
- Granularity ranges from individual weights to entire filters, trading flexibility for hardware efficiency (coarser granularities enable pruning)
- Context can be local (uniform) or global (adaptive), with global typically yielding better results
- Criteria determine importance: magnitude is simple and effective, movement works better for fine-tuning
- Schedule controls the sparsification dynamics. Gradual approaches generally outperform one-shot
The combinatorial nature of these four parameters means there's always room to find a better configuration for your specific model and use case. FasterAI makes this exploration easy with just a few lines of code.
In the next blog post, we will explore Quantization, another fundamental compression technique that reduces the numerical precision of weights to shrink models and accelerate inference.
Join us on Discord to stay tuned!