Per-Layer Embeddings (PLE)

concept
neural-architecture · efficiency · embeddings · on-device

An architectural technique where each decoder layer in a transformer has its own small embedding table over the vocabulary, rather than relying solely on a single shared global embedding. This improves parameter efficiency for on-device deployment by decoupling total parameter count (storage) from effective parameters (those active in compute during inference).

How it works

Traditional transformers share a single embedding table across all layers. PLE instead gives each decoder layer its own embedding table over the full vocabulary. In aggregate these tables are large, but they are consulted only through cheap per-token lookups — they add to total parameter count without adding to the matrix-multiply work performed during inference.
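The idea can be sketched in a few lines. This is a minimal illustration, not the actual Gemma wiring: the dimensions are made up, and the projection that folds each layer's embedding into the hidden state is one plausible integration among several.

```python
import numpy as np

# Hypothetical dimensions for illustration (not from any real model config).
vocab_size, d_model, d_ple, num_layers = 1000, 64, 16, 4
rng = np.random.default_rng(0)

# Standard shared embedding: one table used at the input.
shared_emb = rng.normal(size=(vocab_size, d_model))

# PLE: each decoder layer additionally gets its own small embedding table.
per_layer_emb = [rng.normal(size=(vocab_size, d_ple)) for _ in range(num_layers)]
# Small projection mixing the per-layer embedding into the hidden state
# (an assumed wiring for this sketch).
proj = [rng.normal(size=(d_ple, d_model)) * 0.01 for _ in range(num_layers)]

def forward(token_ids):
    h = shared_emb[token_ids]                  # (seq, d_model) initial lookup
    for layer in range(num_layers):
        ple = per_layer_emb[layer][token_ids]  # cheap lookup, no matmul over vocab
        h = h + ple @ proj[layer]              # fold layer-specific embedding in
        # ... attention / MLP blocks would run here ...
    return h

print(forward(np.array([1, 42, 7])).shape)  # (3, 64)
```

Note that the per-layer tables appear only inside indexing expressions: they are read, never multiplied, which is what keeps them out of the effective parameter count.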

Effective vs. total parameters

The “effective” parameter count excludes these per-layer embedding tables because they don’t participate in matrix multiplications — they’re just lookup tables. This is why Gemma 4 E2B has 5.1B total parameters but only 2.3B effective parameters.

| Model | Total parameters | Effective parameters | Embedding overhead |
| --- | --- | --- | --- |
| E2B | 5.1B | 2.3B | 2.8B (55%) |
| E4B | 8.0B | 4.5B | 3.5B (44%) |
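The accounting behind the table is simple subtraction: overhead is whatever does not participate in matrix multiplications. A quick check of the figures above:

```python
# Parameter accounting for the table above (figures in billions).
def split_params(total_b, effective_b):
    """Return (embedding overhead, overhead as fraction of total)."""
    overhead_b = total_b - effective_b
    return overhead_b, overhead_b / total_b

for name, total, effective in [("E2B", 5.1, 2.3), ("E4B", 8.0, 4.5)]:
    overhead, frac = split_params(total, effective)
    print(f"{name}: {overhead:.1f}B embedding overhead ({frac:.0%})")
# E2B: 2.8B embedding overhead (55%)
# E4B: 3.5B embedding overhead (44%)
```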

Why this matters for on-device deployment

Lookup operations (embeddings) are memory bandwidth-bound, not compute-bound. Modern mobile and edge processors handle memory lookups efficiently. By shifting parameter budget from expensive matrix multiplications to cheap lookups, PLE enables larger effective model capacity within the same compute budget.

This trades increased model size (storage) for reduced active computation (energy and latency). On mobile devices with limited compute but adequate storage, this is a favorable tradeoff.
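A back-of-envelope comparison makes the asymmetry concrete. The numbers below are illustrative choices, not measurements from any particular model or device:

```python
# Per-token cost of a lookup vs. a dense projection (illustrative numbers).
d_model = 2048

# An embedding lookup reads one table row: d_model values per token.
lookup_bytes_per_token = d_model * 2           # fp16: 2 bytes per value
# A square dense projection of the same width performs a matmul:
matmul_flops_per_token = 2 * d_model * d_model  # multiply-accumulate count

print(f"lookup: {lookup_bytes_per_token:,} bytes read per token")
print(f"matmul: {matmul_flops_per_token:,} FLOPs per token")
```

The lookup costs a few kilobytes of sequential memory traffic, which mobile memory systems serve cheaply; the matmul costs millions of FLOPs, which is exactly where edge processors are constrained.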

Design rationale

Rather than adding more layers or increasing layer width (which increases compute proportionally), PLE increases representational capacity through per-layer contextualized embeddings. Each layer can tailor its token representations to the specific abstraction level it operates at.

The approach is specific to on-device models where inference latency and energy consumption are the primary constraints, not model size.

Contrast with standard architectures

| Approach | Embedding strategy | Parameter type | Best for |
| --- | --- | --- | --- |
| Standard transformer | Single shared embedding | Effective | Server deployment, batch inference |
| Per-Layer Embeddings | Per-layer embeddings | Total (storage-heavy) | On-device, latency-sensitive |
| Mixture-of-Experts | Sparse activation | Active (subset of total) | Fast inference with high capacity |

PLE and MoE both achieve efficiency through separation of total vs. active parameters, but via different mechanisms. PLE uses cheap lookups; MoE uses conditional computation.
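The parallel can be expressed as two different ways of discounting total parameters down to active ones. The PLE figures below come from the table above; the MoE configuration is an invented toy example, not a real model:

```python
# Total vs. active parameter accounting for the two mechanisms (billions).
# PLE: subtract the lookup-only embedding tables (E2B figures from above).
ple_total, ple_embedding_tables = 5.1, 2.8
ple_active = ple_total - ple_embedding_tables

# MoE: only a routed subset of experts runs per token (toy configuration).
moe_total, num_experts, experts_per_token = 8.0, 8, 2
moe_shared = 2.0  # attention, norms, etc. — always active
moe_active = moe_shared + (moe_total - moe_shared) * experts_per_token / num_experts

print(f"PLE active: {ple_active:.1f}B of {ple_total}B")
print(f"MoE active: {moe_active:.1f}B of {moe_total}B")
```

In PLE the discount is static (the same tables are skipped from compute for every token); in MoE it is dynamic (which experts run depends on the router's per-token decision).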

Observed in practice

Gemma 4 E2B and E4B are the first widely deployed models to use PLE at scale. They target deployment on laptops and high-end phones — environments where storage is less constrained than compute.

The E2B model (2.3B effective) achieves 60.0% on MMLU Pro, comparable to prior-generation 3B dense models, while running faster on mobile hardware.

Connections