Article · Wikipedia archive · Last revised May 28, 2026

Smooth maximum

In mathematics, a smooth maximum of an indexed family x1, ..., xn of numbers is a smooth approximation to the maximum function meaning a parametric family of functions such that for every ⁠⁠, the function ⁠⁠ is smooth, and the family converges to the maximum function ⁠⁠ as ⁠⁠. The concept of smooth minimum is similarly defined. In many cases, a single family approximates both: maximum as the parameter goes to positive infinity, minimum as the parameter goes to negative infinity; in symbols, ⁠⁠ as ⁠⁠ and ⁠⁠ as ⁠⁠. The term can also be used loosely for a specific smooth function that behaves similarly to a maximum, without necessarily being part of a parametrized family.

Last revised
May 28, 2026
Read time
≈ 6 min
Length
1,302 w
Citations
4
Source

In mathematics, a smooth maximum of an indexed family x1, ..., xn of numbers is a smooth approximation to the maximum function max ( x 1 , , x n ) , {\displaystyle \max(x_{1},\ldots ,x_{n}),} meaning a parametric family of functions m α ( x 1 , , x n ) {\displaystyle m_{\alpha }(x_{1},\ldots ,x_{n})} such that for every α {\displaystyle \alpha } , the function m α {\displaystyle m_{\alpha }} is smooth, and the family converges to the maximum function m α max {\displaystyle m_{\alpha }\to \max } as α {\displaystyle \alpha \to \infty } . The concept of smooth minimum is similarly defined. In many cases, a single family approximates both: maximum as the parameter goes to positive infinity, minimum as the parameter goes to negative infinity; in symbols, m α max {\displaystyle m_{\alpha }\to \max } as α {\displaystyle \alpha \to \infty } and m α min {\displaystyle m_{\alpha }\to \min } as α {\displaystyle \alpha \to -\infty } . The term can also be used loosely for a specific smooth function that behaves similarly to a maximum, without necessarily being part of a parametrized family.

Examples

Boltzmann operator

Smoothmax of (−x, x) versus x for various parameter values. Very smooth for α {\displaystyle \alpha } =0.5, and more sharp for α {\displaystyle \alpha } =8. source ↗

For large positive values of the parameter α > 0 {\displaystyle \alpha >0} , the following formulation is a smooth, differentiable approximation of the maximum function. For negative values of the parameter that are large in absolute value, it approximates the minimum.

S α ( x 1 , , x n ) = i = 1 n x i e α x i i = 1 n e α x i {\displaystyle {\mathcal {S}}_{\alpha }(x_{1},\ldots ,x_{n})={\frac {\sum _{i=1}^{n}x_{i}e^{\alpha x_{i}}}{\sum _{i=1}^{n}e^{\alpha x_{i}}}}}

S α {\displaystyle {\mathcal {S}}_{\alpha }} has the following properties:

  1. S α max {\displaystyle {\mathcal {S}}_{\alpha }\to \max } as α {\displaystyle \alpha \to \infty }
  2. S 0 {\displaystyle {\mathcal {S}}_{0}} is the arithmetic mean of its inputs
  3. S α min {\displaystyle {\mathcal {S}}_{\alpha }\to \min } as α {\displaystyle \alpha \to -\infty }

The gradient of S α {\displaystyle {\mathcal {S}}_{\alpha }} is closely related to softmax and is given by

x i S α ( x 1 , , x n ) = e α x i j = 1 n e α x j [ 1 + α ( x i S α ( x 1 , , x n ) ) ] . {\displaystyle \nabla _{x_{i}}{\mathcal {S}}_{\alpha }(x_{1},\ldots ,x_{n})={\frac {e^{\alpha x_{i}}}{\sum _{j=1}^{n}e^{\alpha x_{j}}}}[1+\alpha (x_{i}-{\mathcal {S}}_{\alpha }(x_{1},\ldots ,x_{n}))].}

This makes the softmax function useful for optimization techniques that use gradient descent.

This operator is sometimes called the Boltzmann operator,1 after the Boltzmann distribution.

LogSumExp

Another smooth maximum is LogSumExp:

L S E α ( x 1 , , x n ) = 1 α log i = 1 n exp α x i {\displaystyle \mathrm {LSE} _{\alpha }(x_{1},\ldots ,x_{n})={\frac {1}{\alpha }}\log \sum _{i=1}^{n}\exp \alpha x_{i}}

This can also be normalized if the x i {\displaystyle x_{i}} are all non-negative, yielding a function with domain [ 0 , ) n {\displaystyle [0,\infty )^{n}} and range [ 0 , ) {\displaystyle [0,\infty )} :

g ( x 1 , , x n ) = log ( i = 1 n exp x i ( n 1 ) ) {\displaystyle g(x_{1},\ldots ,x_{n})=\log \left(\sum _{i=1}^{n}\exp x_{i}-(n-1)\right)}

The ( n 1 ) {\displaystyle (n-1)} term corrects for the fact that exp ( 0 ) = 1 {\displaystyle \exp(0)=1} by canceling out all but one zero exponential, and log 1 = 0 {\displaystyle \log 1=0} if all x i {\displaystyle x_{i}} are zero.

Mellowmax

The mellowmax operator1 is defined as follows:

m m α ( x ) = 1 α log 1 n i = 1 n exp α x i {\displaystyle \mathrm {mm} _{\alpha }(x)={\frac {1}{\alpha }}\log {\frac {1}{n}}\sum _{i=1}^{n}\exp \alpha x_{i}}

It is a non-expansive operator. As α {\displaystyle \alpha \to \infty } , it acts like a maximum. As α 0 {\displaystyle \alpha \to 0} , it acts like an arithmetic mean. As α {\displaystyle \alpha \to -\infty } , it acts like a minimum. This operator can be viewed as a particular instantiation of the quasi-arithmetic mean. It can also be derived from information theoretical principles as a way of regularizing policies with a cost function defined by KL divergence. The operator has previously been utilized in other areas, such as power engineering.2

Connection between LogSumExp and Mellowmax

LogSumExp and Mellowmax are the same function differing by a constant log n α {\displaystyle {\frac {\log {n}}{\alpha }}} . LogSumExp is always larger than the true max, differing at most from the true max by log n α {\displaystyle {\frac {\log {n}}{\alpha }}} in the case where all n arguments are equal and being exactly equal to the true max when all but one argument is {\displaystyle -\infty } . Similarly, Mellowmax is always less than the true max, differing at most from the true max by log n α {\displaystyle {\frac {\log {n}}{\alpha }}} in the case where all but one argument is {\displaystyle -\infty } and being exactly equal to the true max when all n arguments are equal.

p-Norm

Another smooth maximum is the p-norm:

( x 1 , , x n ) p = ( i = 1 n | x i | p ) 1 p {\displaystyle \|(x_{1},\ldots ,x_{n})\|_{p}=\left(\sum _{i=1}^{n}|x_{i}|^{p}\right)^{\frac {1}{p}}}

which converges to ( x 1 , , x n ) = max 1 i n | x i | {\displaystyle \|(x_{1},\ldots ,x_{n})\|_{\infty }=\max _{1\leq i\leq n}|x_{i}|} as p {\displaystyle p\to \infty } .

An advantage of the p-norm is that it is a norm. As such it is scale invariant (homogeneous): ( λ x 1 , , λ x n ) p = | λ | ( x 1 , , x n ) p {\displaystyle \|(\lambda x_{1},\ldots ,\lambda x_{n})\|_{p}=|\lambda |\cdot \|(x_{1},\ldots ,x_{n})\|_{p}} , and it satisfies the triangle inequality.

Smooth maximum unit

The following binary operator is called the Smooth Maximum Unit (SMU):3

max ε ( a , b ) = a + b + | a b | ε 2 = a + b + ( a b ) 2 + ε 2 {\displaystyle {\begin{aligned}\textstyle \max _{\varepsilon }(a,b)&={\frac {a+b+|a-b|_{\varepsilon }}{2}}\\&={\frac {a+b+{\sqrt {(a-b)^{2}+\varepsilon }}}{2}}\end{aligned}}}

where ε 0 {\displaystyle \varepsilon \geq 0} is a parameter. As ε 0 {\displaystyle \varepsilon \to 0} , | | ε | | {\displaystyle |\cdot |_{\varepsilon }\to |\cdot |} and thus max ε max {\displaystyle \textstyle \max _{\varepsilon }\to \max } .

See also

See also

References

References

  1. Asadi, Kavosh; Littman, Michael L. (2017). "An Alternative Softmax Operator for Reinforcement Learning". PMLR. 70: 243–252. arXiv:1612.05628. Retrieved January 6, 2023.
  2. Safak, Aysel (February 1993). "Statistical analysis of the power sum of multiple correlated log-normal components". IEEE Transactions on Vehicular Technology. 42 (1): {58–61. doi:10.1109/25.192387.
  3. Biswas, Koushik; Kumar, Sandeep; Banerjee, Shilpak; Ashish Kumar Pandey (2021). "SMU: Smooth activation function for deep networks using smoothing maximum technique". arXiv:2111.04682 [cs.LG].

https://www.johndcook.com/soft_maximum.pdf

M. Lange, D. Zühlke, O. Holz, and T. Villmann, "Applications of lp-norms and their smooth approximations for gradient based learning vector quantization," in Proc. ESANN, Apr. 2014, pp. 271-276. (https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2014-153.pdf)