Depth vs. Width in Neural Networks under a Fixed Parameter Budget

April 2, 2026 · 8 min read

Abstract

This work investigates how neural network architecture—specifically depth versus width—affects learning behavior under an approximately matched parameter budget with non-trivial variation (~14%). Using multilayer perceptrons (MLPs) constrained to approximately 100,000 trainable parameters, we evaluate performance on the MNIST classification task across five architectural regimes ranging from shallow-wide to deep-narrow. Results show that final test accuracy remains tightly bounded across architectures, while differences emerge in training efficiency, convergence dynamics, and generalization gap. These findings suggest that, in low-complexity settings, parameter allocation influences training behavior more than final representational capacity.


1. Introduction

Model capacity is often treated as a primary determinant of neural network performance, with parameter count serving as a common proxy. However, parameter count alone does not uniquely define a model; architectural structure—specifically the distribution of parameters across layers—introduces inductive biases that influence optimization and generalization.

This study examines how architectural variation affects model behavior when total parameter count is held approximately constant, acknowledging a non-trivial variation (~14%) due to discrete layer dimensions. We restrict attention to fully connected networks and evaluate performance on MNIST to isolate architectural effects under controlled conditions.

2. Experimental Setup

2.1 Dataset

Experiments were conducted on the MNIST dataset, consisting of 60,000 training samples and 10,000 test samples of 28×28 grayscale images. Inputs were flattened into 784-dimensional vectors and normalized using the dataset mean (0.1307) and standard deviation (0.3081). No data augmentation was applied.
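
A minimal sketch of this preprocessing, assuming a PyTorch/torchvision pipeline (the flattening step could equally be performed inside the model's forward pass):

from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                       # 28x28 grayscale image -> tensor in [0, 1]
    transforms.Normalize((0.1307,), (0.3081,)),  # dataset mean and standard deviation
    transforms.Lambda(lambda x: x.view(-1)),     # flatten to a 784-dimensional vector
])

train_set = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_set = datasets.MNIST("./data", train=False, download=True, transform=transform)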

2.2 Training Configuration

All models were trained under identical conditions. Weights and biases for each linear layer were initialized from a uniform distribution:

\[ \mathcal{U}\left(-\sqrt{\frac{6}{\text{fan\_in}}}, \sqrt{\frac{6}{\text{fan\_in}}}\right) \]

No learning rate scheduling or gradient clipping was applied.
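
A minimal sketch of this initialization in PyTorch; the init_linear helper and the example model are illustrative, not taken from the study's code:

import math
import torch.nn as nn

def init_linear(module):
    # Uniform bound of sqrt(6 / fan_in), applied to both weights and biases,
    # following the formula above.
    if isinstance(module, nn.Linear):
        bound = math.sqrt(6.0 / module.in_features)
        nn.init.uniform_(module.weight, -bound, bound)
        nn.init.uniform_(module.bias, -bound, bound)

# Example model (hidden size illustrative); initialization is applied once before
# training, with no learning rate scheduler or gradient clipping afterwards.
model = nn.Sequential(nn.Linear(784, 127), nn.ReLU(), nn.Linear(127, 10))
model.apply(init_linear)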

2.3 Parameter Constraint

Each model was constrained to approximately 100,000 trainable parameters. Counts were computed programmatically by summing the number of elements across all parameters requiring gradients:

def count_parameters(model):
    # Sum element counts over all parameters that require gradients (weights and biases).
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

For a single fully connected layer with \(N\) input units and \(M\) output units, the parameter count is:

\[ P = (N \times M) + M \]

where \(N \times M\) corresponds to the weight matrix and \(M\) corresponds to the bias vector.

Example: For a two-layer MLP with architecture [784, 127, 10], the first layer contributes (784 × 127) + 127 = 99,695 parameters and the second contributes (127 × 10) + 10 = 1,280, for a total of 100,975.
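
This total can be checked against count_parameters directly; the sketch below assumes PyTorch, and the ReLU activation is an illustrative choice that contributes no parameters:

import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(784, 127),
    nn.ReLU(),
    nn.Linear(127, 10),
)
print(count_parameters(mlp))  # 100,975 = (784*127 + 127) + (127*10 + 10)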

Final parameter counts across all architectures fall within a spread of approximately 14% between the largest and smallest models. Consequently, wider models systematically contain more parameters than deeper models, a confounding factor acknowledged in the analysis.

3. Model Architectures

Five MLP architectures were evaluated, ranging from shallow-wide to deep-narrow, with the input (784) and output (10) dimensions held fixed: Ultra Wide, Wide, Normal, Deep, and Ultra Deep.

These configurations span a range of parameter distributions from width-dominant to depth-dominant structures.
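
A sketch of how such configurations can be assembled from a list of hidden-layer sizes; the make_mlp helper and the ReLU activation are assumptions for illustration, not the study's code:

import torch.nn as nn

def make_mlp(hidden_sizes, in_dim=784, out_dim=10):
    # Build a fully connected stack; no activation after the output layer.
    dims = [in_dim, *hidden_sizes, out_dim]
    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

# Example: a width-dominant model with a single 127-unit hidden layer,
# matching the two-layer worked example in Section 2.3 (~100k parameters).
width_dominant = make_mlp([127])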

4. Results

Table 1. Performance comparison across architectures (averaged over three seeds).

Model        Test Acc (%)   Std Dev (%)   Train Acc (%)   Gap (%)   Epochs to 90% Train Acc
Ultra Wide      97.867         0.076          99.704        1.838              1
Wide            97.550         0.182          99.534        1.984              1
Normal          97.690         0.238          99.495        1.805              1
Deep            97.830         0.078          99.495        1.665              2
Ultra Deep      97.523         0.138          99.379        1.856              2

5. Analysis

5.1 Final Performance

All architectures achieved high test accuracy within a narrow range (97.52%–97.87%), with a total spread of approximately 0.35%. This indicates that MNIST is largely saturated for MLPs at this scale, and architectural differences have limited impact on final accuracy.

5.2 Optimization Dynamics

Shallow and wide architectures reached 90% training accuracy within one epoch, whereas deeper architectures required approximately two epochs. This suggests that width improves early training efficiency, potentially due to simpler gradient flow and reduced compositional depth.
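
For reference, the "Epochs to 90% Train Acc" column in Table 1 can be computed from a per-epoch training-accuracy history along these lines (the helper name and example values are hypothetical):

def epochs_to_threshold(train_acc_history, threshold=90.0):
    # train_acc_history: training accuracy (in %) recorded at the end of each epoch.
    for epoch, acc in enumerate(train_acc_history, start=1):
        if acc >= threshold:
            return epoch
    return None  # threshold never reached within the training budget

# Example: a run that first crosses 90% during the second epoch.
print(epochs_to_threshold([85.2, 93.1, 97.0]))  # -> 2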

5.3 Stability

Variance across seeds was low for all models, with standard deviations below 0.25%. The Ultra Wide and Deep configurations exhibited the lowest variance, indicating stable optimization across runs.

5.4 Generalization

The overfitting gap ranged from 1.665% to 1.984%. The Deep architecture showed the smallest gap, suggesting slightly improved generalization under moderate depth. However, the magnitude of this difference is small.

6. Discussion

The primary result is that architectural variation affects training dynamics more strongly than final performance on MNIST. Specifically, wider architectures reached 90% training accuracy within a single epoch while deeper ones required roughly two, the Deep configuration showed the smallest train-test gap, and seed-to-seed variance remained low for all models.

These differences occurred within a narrow performance band (~0.35%), limiting the strength of architectural conclusions under this experimental setup. Given the observed parameter imbalance, results should be interpreted as an approximate comparison of architectural trends rather than a strictly controlled capacity study. In particular, wider models systematically contained higher parameter counts, which may partially contribute to their observed performance. Nevertheless, the results suggest that for low-complexity tasks, architectural choices primarily influence optimization behavior rather than representational limits.

7. Limitations

Several limitations constrain interpretation: the ~14% spread in parameter counts confounds capacity with architecture; MNIST is largely saturated for MLPs at this scale, compressing differences in final accuracy; only fully connected networks were evaluated; and results are averaged over just three random seeds.

8. Future Work

Future work will extend this study by evaluating more complex tasks and architectures and by tightening the parameter-matching procedure to remove the residual ~14% capacity imbalance.

9. Conclusion

Under an approximately matched parameter budget with non-trivial variation, neural network architecture influences training efficiency and optimization dynamics, even when final performance remains similar. On MNIST, these effects manifest primarily in convergence behavior rather than accuracy.

These findings highlight the importance of parameter allocation as a design factor, particularly in settings where efficiency and stability are critical. Further investigation is required to determine how these effects scale to more complex tasks and architectures.