Differentiable Perceptrons
The basic version of a perceptron works by taking inputs (x_i) and assigning them weights (w_i). The weighted sum of the products (x_i * w_i) is passed into an activation function σ, which produces an output ŷ for classification; this final output is either 0 or 1. The perceptron is trained over each epoch by adding the input to the weight vector when ŷ is 0 but the actual value is 1, and subtracting the input from the weight vector in the opposite scenario.
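To make this concrete, here is a minimal sketch of that update rule (assuming NumPy, 0/1 labels, and a bias folded into the weight vector as an extra constant feature; the name train_perceptron is just illustrative):

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Basic perceptron. X has shape (n_samples, n_features); y holds 0/1 labels."""
    w = np.zeros(X.shape[1])                 # weight vector (bias folded in as an extra feature)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if w @ x_i >= 0 else 0 # step activation producing 0 or 1
            if y_hat == 0 and y_i == 1:      # predicted 0 but actual value is 1: add the input
                w += x_i
            elif y_hat == 1 and y_i == 0:    # predicted 1 but actual value is 0: subtract the input
                w -= x_i
    return w
```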
While this version of the perceptron can converge to a solution when the classes are linearly separable, it has certain disadvantages and limitations. It cannot produce a solution when the classes are not separable, i.e., when no linear boundary divides them. Additionally, it cannot be trained with gradient descent because the model is not differentiable. However, we can tweak certain ideas of this basic perceptron to make it differentiable and usable for the same classification problem between two classes.
The decision function for the basic perceptron is simply:
y = sign(w^{T}*x)
This decision function is not differentiable. Therefore, we can replace it with a sigmoid function, which is differentiable but behaves similarly to the sign function: its output rises steeply from y = 0 to y = 1 around w^{T}*x = 0.
The differentiable perceptron’s decision function would be:
y = 1 / (1 + e ^ {-w^{T}*x})
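In code, the only change from the basic perceptron's step activation is swapping in a sigmoid (a minimal sketch, assuming NumPy; the names sigmoid and decision are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Differentiable replacement for the sign/step function."""
    return 1.0 / (1.0 + np.exp(-z))

def decision(w, x):
    """Differentiable perceptron output: a value in (0, 1) instead of a hard 0 or 1."""
    return sigmoid(w @ x)
```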
Moving on to the data, while the set of classifications Y contains only 0s and 1s, the set of inputs X contains continuous values. Each input vector lies in a real coordinate space with as many dimensions as there are features, and its associated output is either 0 or 1.
We can rewrite the decision function as a probability:
p(y = 1| x) = 1 / (1 + e ^ {-w^{T}*x})
While this probability ranges from 0 to 1, 0.5 effectively acts as a threshold: we predict y = 1 if p > 0.5 and y = 0 if p < 0.5.
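For instance, a small self-contained sketch of this thresholding (the helper name classify is an assumption for illustration):

```python
import numpy as np

def classify(w, x, threshold=0.5):
    """Predict class 1 if p(y = 1 | x) exceeds the threshold, otherwise class 0."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))   # p(y = 1 | x) from the sigmoid decision function
    return 1 if p > threshold else 0
```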
Now that the decision function has been altered to make the perceptron model differentiable, we must adjust the loss function accordingly. This loss takes the form of a negative log likelihood, which we can derive as follows:
if y^{i} = 1, l(x^{i}, y^{i}) = -ln(p^{i})
if y^{i} = 0, l(x^{i}, y^{i}) = -ln(1 - p^{i})
In both cases, a probability closer to y^{i} results in a loss approaching 0, while a probability towards the opposite class results in a loss approaching positive infinity. Combining these two statements, in relation to each class, we get:
l(x^{i}, y^{i}) = - y^{i} * ln(p^{i}) - (1 - y^{i}) * ln(1 - p^{i})
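A small sketch of this per-example loss (assuming p is the sigmoid output for the example and y is its 0/1 label; the name nll_loss is illustrative):

```python
import numpy as np

def nll_loss(p, y):
    """Negative log likelihood for a single example.
    p: predicted probability that y = 1; y: true label, 0 or 1."""
    # Equivalent to -ln(p) when y = 1 and -ln(1 - p) when y = 0.
    return -y * np.log(p) - (1 - y) * np.log(1 - p)
```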
With minimizing this loss function as our objective, we can run gradient descent and reach an optimal arrangement of weights for our perceptron model, even when the classes are not perfectly linearly separable and the basic perceptron rule would never converge.
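As an illustration, here is a minimal batch gradient descent sketch (assuming NumPy, a learning rate lr, and the standard result that the gradient of this loss with respect to w is (p^{i} - y^{i}) * x^{i}):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_differentiable_perceptron(X, y, lr=0.1, epochs=100):
    """Gradient descent on the negative log likelihood.
    X: (n_samples, n_features), y: 0/1 labels of shape (n_samples,)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)             # predicted probabilities for every sample
        grad = X.T @ (p - y) / len(y)  # gradient of the mean loss: average of (p - y) * x
        w -= lr * grad                 # step against the gradient to reduce the loss
    return w
```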
This differentiable perceptron concept can be used in classification problems that involve various features. For example, if we were to take greyscale images of handwritten 0s and 1s and develop an ML model to classify them, we could use a differentiable perceptron with the sigmoid decision function as a layer in a neural network to do so.
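For example, a hypothetical sketch of how one such sigmoid unit could score a flattened 28x28 greyscale image (the image size, random pixel values, and placeholder weights are assumptions for illustration only):

```python
import numpy as np

image = np.random.rand(28, 28)      # stand-in for a real greyscale digit image
x = image.flatten()                 # one feature per pixel (784 features)
w = np.zeros(x.shape[0])            # placeholder weights; in practice, learned by gradient descent as above

p = 1.0 / (1.0 + np.exp(-(w @ x)))  # p(image shows a 1 | pixels)
prediction = 1 if p > 0.5 else 0
```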
(Credit: Prof Lerrel Pinto)