Depth vs. Width in Neural Networks under a Fixed Parameter Budget

April 2, 2026 · 8 min read

Abstract

This work investigates how neural network architecture—specifically depth versus width—affects learning behavior under an approximately matched parameter budget with non-trivial variation (~14%). Using multilayer perceptrons (MLPs) constrained to approximately 100,000 trainable parameters, we evaluate performance on the MNIST classification task across five architectural regimes ranging from shallow-wide to deep-narrow. Results show that final test accuracy remains tightly bounded across architectures, while differences emerge in training efficiency, convergence dynamics, and generalization gap. These findings suggest that, in low-complexity settings, parameter allocation influences training behavior more than final representational capacity.


1. Introduction

Model capacity is often treated as a primary determinant of neural network performance, with parameter count serving as a common proxy. However, parameter count alone does not uniquely define a model; architectural structure—specifically the distribution of parameters across layers—introduces inductive biases that influence optimization and generalization.

This study examines how architectural variation affects model behavior when total parameter count is held approximately constant, acknowledging a non-trivial variation (~14%) due to discrete layer dimensions. We restrict attention to fully connected networks and evaluate performance on MNIST to isolate architectural effects under controlled conditions.

2. Experimental Setup

2.1 Dataset

Experiments were conducted on the MNIST dataset, consisting of 60,000 training samples and 10,000 test samples of 28×28 grayscale images. Inputs were flattened into 784-dimensional vectors and normalized using the dataset mean (0.1307) and standard deviation (0.3081). No data augmentation was applied.
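
A minimal sketch of this preprocessing, assuming a PyTorch/torchvision pipeline (the flattening step could equally be performed inside the model's forward pass):

from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                       # 28x28 grayscale image -> tensor in [0, 1]
    transforms.Normalize((0.1307,), (0.3081,)),  # dataset mean and standard deviation
    transforms.Lambda(lambda x: x.view(-1)),     # flatten to a 784-dimensional vector
])

train_set = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_set = datasets.MNIST("./data", train=False, download=True, transform=transform)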

2.2 Training Configuration

All models were trained under identical conditions. Weights and biases for each linear layer were initialized from a uniform distribution:

\[ \mathcal{U}\left(-\sqrt{\frac{6}{\text{fan\_in}}}, \sqrt{\frac{6}{\text{fan\_in}}}\right) \]

No learning rate scheduling or gradient clipping was applied.
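
A minimal sketch of this initialization in PyTorch; the init_linear helper and the example model are illustrative, not taken from the study's code:

import math
import torch.nn as nn

def init_linear(module):
    # Uniform bound of sqrt(6 / fan_in), applied to both weights and biases,
    # following the formula above.
    if isinstance(module, nn.Linear):
        bound = math.sqrt(6.0 / module.in_features)
        nn.init.uniform_(module.weight, -bound, bound)
        nn.init.uniform_(module.bias, -bound, bound)

# Example model (hidden size illustrative); initialization is applied once before
# training, with no learning rate scheduler or gradient clipping afterwards.
model = nn.Sequential(nn.Linear(784, 127), nn.ReLU(), nn.Linear(127, 10))
model.apply(init_linear)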

2.3 Parameter Constraint

Each model was constrained to approximately 100,000 trainable parameters. Counts were computed programmatically by summing the number of elements across all parameters requiring gradients:

def count_parameters(model):
    # Sum element counts over all parameters that require gradients (weights and biases).
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

For a single fully connected layer with \(N\) input units and \(M\) output units, the parameter count is:

\[ P = (N \times M) + M \]

where \(N \times M\) corresponds to the weight matrix and \(M\) corresponds to the bias vector.

Example: For a two-layer MLP with architecture [784, 127, 10], the first layer contributes (784 × 127) + 127 = 99,695 parameters and the second contributes (127 × 10) + 10 = 1,280, for a total of 100,975.
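
This total can be checked against count_parameters directly; the sketch below assumes PyTorch, and the ReLU activation is an illustrative choice that contributes no parameters:

import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(784, 127),
    nn.ReLU(),
    nn.Linear(127, 10),
)
print(count_parameters(mlp))  # 100,975 = (784*127 + 127) + (127*10 + 10)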

Final parameter counts across all architectures fall within a spread of approximately 14% between the largest and smallest models. Consequently, wider models systematically contain more parameters than deeper models, a confounding factor acknowledged in the analysis.

3. Model Architectures

Five MLP architectures were evaluated, ranging from shallow-wide to deep-narrow, with the input (784) and output (10) dimensions held fixed: Ultra Wide, Wide, Normal, Deep, and Ultra Deep.

These configurations span a range of parameter distributions from width-dominant to depth-dominant structures.
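
A sketch of how such configurations can be assembled from a list of hidden-layer sizes; the make_mlp helper and the ReLU activation are assumptions for illustration, not the study's code:

import torch.nn as nn

def make_mlp(hidden_sizes, in_dim=784, out_dim=10):
    # Build a fully connected stack; no activation after the output layer.
    dims = [in_dim, *hidden_sizes, out_dim]
    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

# Example: a width-dominant model with a single 127-unit hidden layer,
# matching the two-layer worked example in Section 2.3 (~100k parameters).
width_dominant = make_mlp([127])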

4. Results

Table 1. Performance comparison across architectures (averaged over three seeds).

Model        Test Acc (%)   Std Dev (%)   Train Acc (%)   Gap (%)   Epochs to 90% Train Acc
Ultra Wide      97.867         0.076          99.704        1.838              1
Wide            97.550         0.182          99.534        1.984              1
Normal          97.690         0.238          99.495        1.805              1
Deep            97.830         0.078          99.495        1.665              2
Ultra Deep      97.523         0.138          99.379        1.856              2

5. Analysis

5.1 Final Performance

All architectures achieved high test accuracy within a narrow range (97.52%–97.87%), with a total spread of approximately 0.35%. This indicates that MNIST is largely saturated for MLPs at this scale, and architectural differences have limited impact on final accuracy.

5.2 Optimization Dynamics

Shallow and wide architectures reached 90% training accuracy within one epoch, whereas deeper architectures required approximately two epochs. This suggests that width improves early training efficiency, potentially due to simpler gradient flow and reduced compositional depth.
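
For reference, the "Epochs to 90% Train Acc" column in Table 1 can be computed from a per-epoch training-accuracy history along these lines (the helper name and example values are hypothetical):

def epochs_to_threshold(train_acc_history, threshold=90.0):
    # train_acc_history: training accuracy (in %) recorded at the end of each epoch.
    for epoch, acc in enumerate(train_acc_history, start=1):
        if acc >= threshold:
            return epoch
    return None  # threshold never reached within the training budget

# Example: a run that first crosses 90% during the second epoch.
print(epochs_to_threshold([85.2, 93.1, 97.0]))  # -> 2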

5.3 Stability

Variance across seeds was low for all models, with standard deviations below 0.25%. The Ultra Wide and Deep configurations exhibited the lowest variance, indicating stable optimization across runs.

5.4 Generalization

The overfitting gap ranged from 1.665% to 1.984%. The Deep architecture showed the smallest gap, suggesting slightly improved generalization under moderate depth. However, the magnitude of this difference is small.

6. Discussion

The primary result is that architectural variation affects training dynamics more strongly than final performance on MNIST. Specifically, wider architectures reached 90% training accuracy within a single epoch while deeper ones required roughly two, the Deep configuration showed the smallest train-test gap, and seed-to-seed variance remained low for all models.

These differences occurred within a narrow performance band (~0.35%), limiting the strength of architectural conclusions under this experimental setup. Given the observed parameter imbalance, results should be interpreted as an approximate comparison of architectural trends rather than a strictly controlled capacity study. In particular, wider models systematically contained higher parameter counts, which may partially contribute to their observed performance. Nevertheless, the results suggest that for low-complexity tasks, architectural choices primarily influence optimization behavior rather than representational limits.

7. Limitations

Several limitations constrain interpretation: the ~14% spread in parameter counts confounds capacity with architecture; MNIST is largely saturated for MLPs at this scale, compressing differences in final accuracy; only fully connected networks were evaluated; and results are averaged over just three random seeds.

8. Future Work

Future work will extend this study by evaluating more complex tasks and architectures and by tightening the parameter-matching procedure to remove the residual ~14% capacity imbalance.

9. Conclusion

Under an approximately matched parameter budget with non-trivial variation, neural network architecture influences training efficiency and optimization dynamics, even when final performance remains similar. On MNIST, these effects manifest primarily in convergence behavior rather than accuracy.

These findings highlight the importance of parameter allocation as a design factor, particularly in settings where efficiency and stability are critical. Further investigation is required to determine how these effects scale to more complex tasks and architectures.