What is the Softplus activation function?

Deep learning models rely on activation functions to provide non-linearity and let networks learn complicated patterns. This article covers the Softplus activation function: what it is and how to use it in PyTorch. Softplus is a smooth form of the popular ReLU activation that mitigates some of ReLU's drawbacks but introduces trade-offs of its own. We'll discuss what Softplus is, its math formula, how it compares to ReLU, its advantages and limitations, and walk through some PyTorch code that uses it.

What is the Softplus activation function?

The Softplus activation function is a nonlinear function used in neural networks, best described as a smooth approximation of the ReLU function. In simpler terms, Softplus behaves like ReLU when the input is very large (positive or negative), but it has no sharp corner at zero. Instead, it rises smoothly and gives negative inputs a marginally positive output instead of a fixed zero. As a result, Softplus is continuous and differentiable everywhere, unlike ReLU, whose slope changes abruptly at x = 0 and which is not differentiable at that point.

Why is Softplus used?

Developers choose Softplus when they want an activation that keeps non-zero gradients even where ReLU would be inactive. With gradient-based optimization, the smoothness of Softplus helps: the gradient changes gradually instead of jumping at zero. It also squashes negative inputs toward small values, much as ReLU clips them, except that the output never becomes exactly zero. In short, Softplus is a softer version of ReLU: it behaves like ReLU for large values, but it is smooth and better behaved around zero.

Softplus math formula

Softplus is mathematically defined as:

f(x) = ln(1 + e^x)

When x is large and positive, e^x is very large, so ln(1 + e^x) is very close to ln(e^x), which equals x. This means that Softplus is almost linear for large inputs, just like ReLU.

When x is large and negative, e^x is very small, so ln(1 + e^x) is almost ln(1), which is 0. The values produced by Softplus are close to zero but never exactly zero; the output only reaches zero as x approaches negative infinity.
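These limits are easy to check numerically. Here is a quick sketch using plain Python math (the sample values are arbitrary):

import math

def softplus(x):
    # Plain-Python Softplus: ln(1 + e^x)
    return math.log(1.0 + math.exp(x))

print(softplus(10.0))    # ~10.0000454 -> almost exactly x for large positive x
print(softplus(-10.0))   # ~0.0000454  -> close to zero, but never exactly zero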

Another useful property is that the derivative of Softplus is the sigmoid function. The derivative of ln(1 + e^x) is:

e^x / (1 + e^x)

which is exactly sigmoid(x). This means the slope of Softplus at any point is sigmoid(x), so it is non-zero everywhere and smooth. This makes Softplus well suited to gradient-based learning because it has no flat regions where gradients vanish.
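This identity can be verified with autograd; here is a minimal sketch (the input values are arbitrary):

import torch

x = torch.tensor([-2.0, 0.0, 3.0], requires_grad=True)
torch.nn.functional.softplus(x).sum().backward()

print(x.grad)                      # gradient of Softplus at each input
print(torch.sigmoid(x.detach()))   # identical values: d/dx softplus(x) = sigmoid(x)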

Using Softplus in PyTorch

PyTorch ships Softplus as a built-in activation, so it can be used just as easily as ReLU or any other activation. Below are two simple examples: the first applies Softplus to a handful of test values, and the second shows how to feed Softplus into a small neural network.

Softplus on sample inputs

The excerpt below applies nn.Softplus to a small tensor, so you can see how it behaves with negative, zero, and positive inputs.

import torch
import torch.nn as nn

# Create the Softplus activation
softplus = nn.Softplus()  # default beta=1, threshold=20

# Sample inputs
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
y = softplus(x)

print("Input:", x.tolist())
print("Softplus output:", y.tolist())

What it shows:

  • At x = -2 and x = -1, the Softplus values are small positive numbers rather than 0.
  • At x = 0, the output is approximately 0.6931, i.e. ln(2).
  • For positive inputs such as 1 or 2, the outputs are slightly larger than the inputs because of the smooth curve; as x grows, Softplus approaches x.

PyTorch implements Softplus with the formula (1/beta) * ln(1 + exp(beta * x)). The threshold argument (default 20) exists for numerical stability: when beta * x exceeds the threshold, Softplus is essentially linear, so PyTorch simply returns x instead of computing the exponential.
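A quick sketch of these parameters (the beta value and the inputs below are arbitrary choices for illustration):

import torch
import torch.nn as nn

x = torch.tensor([-1.0, 0.0, 1.0, 30.0])

default_sp = nn.Softplus()          # beta=1: ln(1 + exp(x))
sharp_sp = nn.Softplus(beta=2)      # (1/2) * ln(1 + exp(2 * x)), closer to ReLU

print(default_sp(x))  # for x = 30, beta * x > 20, so x is returned unchanged
print(sharp_sp(x))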

Using Softplus in a neural network

Here is a simple PyTorch network that uses Softplus as activation for its hidden layer.

import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.activation = nn.Softplus()
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.fc1(x)
        x = self.activation(x)  # apply Softplus
        x = self.fc2(x)
        return x

# Create the model
model = SimpleNet(input_size=4, hidden_size=3, output_size=1)
print(model)

Passing input through the model works as usual:

x_input = torch.randn(2, 4)  # batch of 2 samples
y_output = model(x_input)

print("Input:\n", x_input)
print("Output:\n", y_output)

In this setup, the Softplus activation ensures that the values passed from the first layer to the second are strictly positive. Swapping Softplus into an existing model in place of ReLU usually requires no further structural changes. Just keep in mind that Softplus involves more computation than ReLU, so training can be a bit slower.

Softplus can also be used on the final layer when the model must produce strictly positive outputs, e.g. scale parameters or positive regression targets.
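As a minimal sketch of that idea (the layer sizes and the scale-parameter interpretation are illustrative assumptions, not part of the model above):

import torch
import torch.nn as nn

# Hypothetical head that must predict a strictly positive scale parameter
scale_head = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Softplus(),  # final activation guarantees the output is > 0
)

scale = scale_head(torch.randn(2, 4))
print(scale)  # always strictly positive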

Softplus vs ReLU: Comparison Table

Aspect | Softplus | ReLU
Definition | f(x) = ln(1 + e^x) | f(x) = max(0, x)
Shape | Smooth curve over all x | Sharp bend at x = 0
Behavior for x < 0 | Small positive output; never reaches zero | Output is exactly zero
Example at x = -2 | Softplus ≈ 0.13 | ReLU = 0
Near x = 0 | Smooth and differentiable; value ≈ 0.693 | Not differentiable at 0
Behavior for x > 0 | Almost linear, closely matching ReLU | Linear with slope 1
Example at x = 5 | Softplus ≈ 5.0067 | ReLU = 5
Gradient | Always non-zero; the derivative is sigmoid(x) | Zero for x < 0, undefined at 0
Risk of dead neurons | No | Possible for constantly negative inputs
Sparsity | Does not produce exact zeros | Produces true zeros
Training effect | Stable gradient flow, smoother updates | Simple, but some neurons can stop learning

Softplus is best thought of as a smoothed ReLU. It matches ReLU for very large positive or negative inputs, but the sharp corner at zero is removed. This prevents dead neurons because the gradient never goes to zero. The cost is that Softplus does not produce true zeros, so its activations are not sparse like ReLU's. In practice, Softplus gives gentler training dynamics, but ReLU remains the default because it is faster and simpler.
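The example values in the table are easy to reproduce directly:

import torch
import torch.nn as nn

x = torch.tensor([-2.0, 0.0, 5.0])
relu, softplus = nn.ReLU(), nn.Softplus()

print("ReLU:    ", relu(x).tolist())      # [0.0, 0.0, 5.0]
print("Softplus:", softplus(x).tolist())  # approximately [0.1269, 0.6931, 5.0067]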

Advantages of using Softplus

Softplus has several practical advantages that make it useful in certain models.

  1. Smooth and differentiable everywhere

Softplus has no sharp corners and is fully differentiable at every input. This keeps gradients well defined everywhere, which can make optimization a bit easier because the loss surface changes smoothly.

  2. Avoids dead neurons

A ReLU neuron can stop updating when it constantly receives negative input, because the gradient there is zero. Softplus never outputs an exact zero for negative numbers, so every neuron stays slightly active and keeps receiving gradient updates (a quick gradient check follows this list).

  3. Handles negative inputs more gracefully

Instead of discarding negative inputs by mapping them to zero as ReLU does, Softplus maps them to a small positive value. This lets the model retain some information from negative signals rather than losing it entirely.

In short, Softplus maintains smooth, non-vanishing gradients, prevents dead neurons, and suits architectures or tasks where continuity is important.
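A quick autograd check (a minimal sketch with arbitrary negative inputs) illustrates the dead-neuron point above:

import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -1.0], requires_grad=True)

F.relu(x).sum().backward()
print("ReLU grad:    ", x.grad.tolist())      # [0.0, 0.0] -> no update signal

x.grad = None  # reset before the second backward pass
F.softplus(x).sum().backward()
print("Softplus grad:", x.grad.tolist())      # sigmoid(x): small but non-zero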

Softplus limitations and compromises

Softplus also has drawbacks that limit how often it is used.

  1. More expensive to compute

Softplus uses exponential and logarithmic operations, which are slower than ReLU's simple max(0, x). The extra overhead becomes noticeable in large models, since ReLU is heavily optimized on most hardware.

  2. No true sparsity

ReLU produces exact zeros for negative inputs, which can save computation and occasionally help with regularization. Softplus never outputs a true zero, so no neuron is ever fully inactive. This removes the risk of dead neurons, but it also removes the efficiency benefits of sparse activations (see the sketch at the end of this section).

  3. Can slow convergence in deep networks

ReLU is the usual choice for training deep models: its sharp cutoff and linear positive region tend to drive learning quickly. Softplus is smoother and can produce slower updates, especially in very deep networks where the gradients passed between layers are already small.

To summarize, Softplus has nice mathematical properties and avoids problems like dead neurons, but these advantages do not always translate into better results in deep networks. It is best used in cases where smoothness or positive outputs are important, rather than as a one-size-fits-all replacement for ReLU.
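To make the sparsity point concrete, here is a quick sketch counting exact zeros in the activations of a random batch (the tensor size is an arbitrary choice):

import torch
import torch.nn.functional as F

h = torch.randn(1000)  # stand-in for pre-activations from some layer

print("ReLU exact zeros:    ", (F.relu(h) == 0).sum().item())      # roughly half the entries
print("Softplus exact zeros:", (F.softplus(h) == 0).sum().item())  # 0 -> no true sparsity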

Conclusion

Softplus offers a smooth, softened alternative to ReLU for neural networks. It keeps gradients flowing, does not kill neurons, and is fully differentiable for every input. For large values it behaves like ReLU, but around zero it differs: it produces a non-zero output and a non-zero slope. It does come with trade-offs: it is slower to compute, does not produce true zeros, and may not train deep networks as quickly as ReLU. Softplus works best in models where smooth gradients or strictly positive outputs are required. In most other scenarios, it is a useful alternative to ReLU rather than a default replacement.

Frequently Asked Questions

Q1. What problem does the activation function of Softplus solve compared to ReLU?

A. Softplus prevents dead neurons by keeping gradients non-zero for all inputs, thus offering a smooth alternative to ReLU while still behaving similarly for large positive values.

Q2. When should I choose Softplus over ReLU in a neural network?

A. Softplus is a good choice when your model benefits from smooth transitions or must output strictly positive values, such as scale parameters or certain regression targets.

Q3. What are the main limitations of using Softplus?

A. It is slower to compute than ReLU, does not produce sparse activations, and can lead to slightly slower convergence in deep networks.

Janvi Kumari

Hi, I’m Janvi and I’m a passionate data science enthusiast currently working at Analytics Vidhya. My journey into the world of data began with a deep curiosity about how we can extract meaningful insights from complex datasets.
