The basic principles required to solve classification tasks with neural networks are used as building blocks in more complicated deep learning problems such as object detection and instance segmentation. Thus, it is important to understand the reasoning behind choosing one or another activation and loss function. This post answers the question "What activation and loss functions do you need to use to solve a binary classification task?" with an example of a PyTorch implementation that you can run yourself in Google Colab. In the following articles, I'll extend the classification problem to multi-class and multi-label classification and show that you need very few modifications to your code to switch between the three classification problems.

1 Why is it important to understand the activation function and loss used for binary classification?

Traditionally, binary classification models use sigmoid activation and binary cross-entropy loss (BCE). These two functions are broadly used in more complicated neural networks, such as object detection CNN models and recurrent neural networks. The YOLOX object detection model, for example, uses sigmoid activation and BCE in two of its branches, as you can see in the figure below. Recurrent neural networks with gated units, such as LSTM, use sigmoid to help the recurrent NN decide whether to update or forget the data. If you know the logic behind applying sigmoid activation and BCE loss, you are one step closer to understanding and building more complicated NN models.

2 Classification problem formulation

In supervised machine learning, a classification problem can be represented as a set of samples {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where x_i is an m-dimensional vector that contains the features of sample i and y_i is the class to which x_i belongs. The goal is to build a model that predicts the label y_i for each input sample x_i. There are three types of classification problems:

- binary classification - the label y_i can assume one of two values (0 - negative class, 1 - positive class)
- multi-class classification - the label y_i can assume one of k values, where k is the number of classes (higher than 2)
- multi-label classification - the label y_i can assume zero or more of the k values, where k is the number of class labels

Moreover, there are two main types of classifiers:

- probabilistic classifiers - output the probability of each class, and the class label is assigned based on the highest class probability. Examples: Naive Bayes, logistic regression, neural networks
- deterministic classifiers - output the class label without probability estimates. Examples of such classifiers are k-nearest neighbors and SVM

Examples of binary classification tasks

Binary classification can be applied to real-life problems:

- Classifying emails as spam or not spam
- Classifying objects in an image between two classes (dog or cat)
- Classifying a patient as having a certain disease or not
- Identifying a customer as a returning or new customer
- Determining whether a loan application will default or be repaid

3 Activation and loss functions for binary classification

As discussed before, in binary classification you are given:

- a set of samples {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}
- x_i is an m-dimensional vector that contains the features of sample i
- y_i is the class to which x_i belongs
- y_i can assume one of two values (0 - negative class, 1 - positive class)
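As a concrete illustration of this setup, here is a minimal sketch (with made-up values, separate from the dataset generated later in this post) of what X and y look like as arrays:

```python
import numpy as np

# X: n samples, each an m-dimensional feature vector (here n=4, m=3; values are made up)
X = np.array([[0.2, 1.5, -0.3],
              [1.1, 0.4, 0.9],
              [-0.7, 2.0, 0.1],
              [0.5, -1.2, 0.8]])

# y: one label per sample, 0 = negative class, 1 = positive class
y = np.array([0, 1, 0, 1])

print(X.shape)  # (4, 3)
print(y.shape)  # (4,)
```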
To build a binary classification neural network as a probabilistic classifier we need:

- an output linear layer with a size of 1
- output values should be in the range [0,1]. The model outputs the probability p that the input sample belongs to the positive class. Note that if p is the probability that the input sample belongs to class 1 (positive class), then (1-p) is the probability that the input belongs to class 0 (negative class)
- a loss function that has the lowest values when the prediction and the ground truth are the same: (0,0) and (1,1)

3.1 The Sigmoid Activation Function

The final linear layer of a neural network outputs a vector of "raw output values". In the case of classification, the output values represent the model's confidence that the input belongs to one of the classes. As discussed before, the output layer needs to have a size of 1, and the output value should be converted into a probability p. To obtain the probability you can use the sigmoid activation function, which maps the input to an output between 0 and 1. The sigmoid function is defined as

sigmoid(x) = 1 / (1 + e^(-x))

An example of input-output values for sigmoid is provided in the table below.

| Input  | -5    | -4    | -3    | -2    | -1    | 0   | 1     | 2     | 3     | 4     | 5     |
|--------|-------|-------|-------|-------|-------|-----|-------|-------|-------|-------|-------|
| Output | 0.007 | 0.018 | 0.047 | 0.119 | 0.269 | 0.5 | 0.731 | 0.881 | 0.953 | 0.982 | 0.993 |

Let's plot this table with the input values as the x-axis and the output values as the y-axis to visualize the sigmoid function. As you can see, sigmoid is a function that maps all input values into the range from 0 to 1, so we can use it for the binary classification task with an output layer of size 1.

3.2 Binary Cross-Entropy Loss

The most common loss function for probabilistic binary classifiers is the binary cross-entropy loss, which is defined as

BCE = -(1/N) * sum_{i=1..N} [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]

where N is the number of input samples, y is the ground truth, and p is the predicted probability.

The table below shows loss values when the ground truth is 1 and the prediction ranges from 0 to 1. From the table we can make several observations:

- BCE loss has a very large value when the prediction is the opposite of the ground truth value
- if the ground truth and the prediction have the same value, the loss is 0
- log(0) is undefined; to fix it we can add a very small value of 0.0000001 (called epsilon) to 0

| ground truth | prediction | BCE loss |
|--------------|------------|----------|
| 1            | 0          | inf      |
| 1            | 0.2        | 1.609    |
| 1            | 0.4        | 0.916    |
| 1            | 0.6        | 0.511    |
| 1            | 0.8        | 0.223    |
| 1            | 1          | 0        |

Let's remove the sum from the equation and analyze the term inside:

-[ y * log(p) + (1 - y) * log(1 - p) ]

The plot of -log(x) below shows that the function has its minimum value at x=1. There are two things that can be observed from the plot and the formula:

- if y=0 then the loss function is reduced to -log(1-p), and -log(1-p) has its minimum value when p=0 (the same value as the ground truth)
- if y=1 then the loss function is reduced to -log(p), and -log(p) has its minimum value when p=1 (the same value as the ground truth)

The observed properties make BCE a perfect loss function for binary classification problems.
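As a quick sanity check of the table above, we can reproduce its loss column with a few lines of PyTorch (a minimal sketch; since the ground truth is always 1, the per-sample BCE term reduces to -log(p)):

```python
import torch

# predictions from the table above; the ground truth is always 1,
# so the per-sample BCE reduces to -log(p)
p = torch.tensor([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
per_sample_bce = -torch.log(p)
print(per_sample_bce)  # approximately: [inf, 1.609, 0.916, 0.511, 0.223, 0.000]
```

Note that PyTorch also ships ready-made versions of this loss (torch.nn.BCELoss, and the numerically more stable torch.nn.BCEWithLogitsLoss, which fuses the sigmoid with the loss); this post implements sigmoid and BCE explicitly to make the mechanics visible.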
4 Binary Classification NN example with PyTorch

Before heading to the code, let's summarize what we need to implement a probabilistic binary classification NN:

- ground truth and predictions should have dimensions [N,1], where N is the number of input samples
- the final linear layer size should be 1
- outputs from the final layer should be processed with sigmoid activation to obtain the class probability
- BCE loss should be applied to the predicted class probabilities and ground truth values

Let's code a neural network for binary classification with the PyTorch framework.

First, install torchmetrics - this package will be used later to compute the classification accuracy and confusion matrix.

```python
# used for accuracy metric and confusion matrix
!pip install torchmetrics
```

Import the packages that will be used later in the code.

```python
from sklearn.datasets import make_classification
import numpy as np
import torch
import torchmetrics
import matplotlib.pyplot as plt
import seaborn as sn
import pandas as pd
from sklearn.decomposition import PCA
```

4.1 Dataset functions

Set a global variable with the number of classes.

```python
number_of_classes = 2
```

I will use sklearn.datasets.make_classification to generate a binary classification dataset:

- n_samples - the number of generated samples
- n_features - sets the number of dimensions of the generated samples X
- n_classes - the number of classes in the generated dataset. In the binary classification problem, there should be only 2 classes

The generated dataset will have X with shape [n_samples, n_features] and Y with shape [n_samples, ].

```python
def get_dataset(n_samples=10000, n_features=20, n_classes=2):
    # https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification
    data_X, data_y = make_classification(n_samples=n_samples, n_features=n_features, n_classes=n_classes,
                                         n_informative=n_classes, n_redundant=0, n_clusters_per_class=2,
                                         random_state=42, class_sep=2)
    return data_X, data_y
```

Define functions to visualize and print out dataset statistics. The show_dataset function uses PCA to reduce the dimensionality of X from any number down to 2 for simplicity of visualizing X with a 2D plot.

```python
def print_dataset(X, y):
    print(f'X shape: {X.shape}, min: {X.min()}, max: {X.max()}')
    print(f'y shape: {y.shape}')
    print(y[:10])

def show_dataset(X, y, title=''):
    if X.shape[1] > 2:
        X_pca = PCA(n_components=2).fit_transform(X)
    else:
        X_pca = X
    fig = plt.figure(figsize=(4, 4))
    plt.scatter(x=X_pca[:, 0], y=X_pca[:, 1], c=y, alpha=0.5)
    # generate colors for all classes
    colors = plt.cm.rainbow(np.linspace(0, 1, number_of_classes))
    # iterate over classes and visualize them with the dedicated color
    for class_id in range(number_of_classes):
        class_mask = np.argwhere(y == class_id)
        X_class = X_pca[class_mask[:, 0]]
        plt.scatter(x=X_class[:, 0], y=X_class[:, 1],
                    c=np.full((X_class[:, 0].shape[0], 4), colors[class_id]),
                    label=class_id, alpha=0.5)
    plt.title(title)
    plt.legend(loc="best", title="Classes")
    plt.xticks()
    plt.yticks()
    plt.show()
```

Scale the dataset features X to the range [0,1] with a min-max scaler. This is usually done for faster and more stable training.

```python
def scale(x_in):
    return (x_in - x_in.min(axis=0)) / (x_in.max(axis=0) - x_in.min(axis=0))
```

Let's print out the generated dataset statistics and visualize the dataset with the functions from above.

```python
X, y = get_dataset(n_classes=number_of_classes, n_features=2)

print('before scaling')
print_dataset(X, y)
show_dataset(X, y, 'before')

X_scaled = scale(X)
print('after scaling')
print_dataset(X_scaled, y)
show_dataset(X_scaled, y, 'after')
```

The outputs you should get are below.

```
before scaling
X shape: (10000, 2), min: -6.049090666105036, max: 5.311074029997754
y shape: (10000,)
[0 0 1 1 0 1 1 0 1 0]
after scaling
X shape: (10000, 2), min: 0.0, max: 1.0
y shape: (10000,)
[0 0 1 1 0 1 1 0 1 0]
```

As you can see, min-max scaling does not distort the dataset features, it just transforms them into the range [0,1].
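If you prefer not to write the scaler by hand, scikit-learn provides an equivalent transform (a small sketch; like the scale function above, it assumes each feature column is scaled independently):

```python
from sklearn.preprocessing import MinMaxScaler

# column-wise min-max scaling to [0, 1], equivalent to the scale() function above
X_scaled_sklearn = MinMaxScaler().fit_transform(X)
print(X_scaled_sklearn.min(), X_scaled_sklearn.max())  # 0.0 1.0
```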
Create PyTorch data loaders. sklearn.datasets.make_classification generates the dataset as two numpy arrays. To create PyTorch dataloaders we need to transform the numpy dataset into torch.tensor first.

```python
def get_data_loaders(dataset, batch_size=32, shuffle=True):
    data_X, data_y = dataset
    # https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset
    torch_dataset = torch.utils.data.TensorDataset(torch.tensor(data_X, dtype=torch.float32),
                                                   torch.tensor(data_y, dtype=torch.float32))
    # https://pytorch.org/docs/stable/data.html#torch.utils.data.random_split
    train_dataset, val_dataset = torch.utils.data.random_split(torch_dataset,
                                                               [int(len(torch_dataset) * 0.8), int(len(torch_dataset) * 0.2)],
                                                               torch.Generator().manual_seed(42))
    # https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
    loader_train = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=shuffle)
    loader_val = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=shuffle)
    return loader_train, loader_val
```

Test the PyTorch data loaders.

```python
dataloader_train, dataloader_val = get_data_loaders(get_dataset(n_classes=number_of_classes), batch_size=32)

train_batch_0 = next(iter(dataloader_train))
print(f'Batches in the train dataloader: {len(dataloader_train)}, X: {train_batch_0[0].shape}, Y: {train_batch_0[1].shape}')

val_batch_0 = next(iter(dataloader_val))
print(f'Batches in the validation dataloader: {len(dataloader_val)}, X: {val_batch_0[0].shape}, Y: {val_batch_0[1].shape}')
```

The output:

```
Batches in the train dataloader: 250, X: torch.Size([32, 20]), Y: torch.Size([32])
Batches in the validation dataloader: 63, X: torch.Size([32, 20]), Y: torch.Size([32])
```

Create pre- and postprocessing functions. As you may have noted before, the current Y shape is [N,], and we need it to be [N,1]. To do that we can expand the Y shape to [N,1] with numpy.expand_dims or torch.unsqueeze, depending on the type of Y.

```python
def preprocessing(y):
    '''
    expand input labels shape [N,] to [N,1]
    input: y - [N,] numpy array or pytorch Tensor
    output: [N, 1] the same type as input
    '''
    assert type(y) == np.ndarray or torch.is_tensor(
        y), f'input should be numpy array or torch tensor. Received input is: {type(y)}'
    assert len(y.shape) == 1, f'input shape should be [N,]. Received input shape is: {y.shape}'
    if torch.is_tensor(y):
        return torch.unsqueeze(y, dim=1)
    else:
        return np.expand_dims(y, axis=1)
```

Postprocessing is simply thresholding the input values: if a value is larger than the threshold, set it to 1, if it is lower, set it to 0. Postprocessing is used to output class 0 or 1 based on the model's output probability.

```python
def postprocessing(y, threshold=0.5):
    '''
    set input y values larger than the threshold to 1 and values lower than the threshold to 0
    input: y - [N,1] numpy array or pytorch Tensor
    output: int array [N,1] the same class type as input
    '''
    assert type(y) == np.ndarray or torch.is_tensor(
        y), f'input should be numpy array or torch tensor. Received input is: {type(y)}'
    assert len(y.shape) == 2, f'input shape should be [N,classes]. Received input shape is: {y.shape}'
    if torch.is_tensor(y):
        return (y >= threshold).int()
    else:
        return (y >= threshold).astype(int)
```

Test the defined pre- and postprocessing functions.
```python
y = np.random.rand(10, )
y_preprocessed = preprocessing(y)
print(f'y shape: {y.shape}, y preprocessed shape: {y_preprocessed.shape}')

y_postprocessed = postprocessing(y_preprocessed, threshold=0.5)
print(f'y preprocessed shape: {y_preprocessed.shape},y postprocessed shape: {y_postprocessed.shape}')

print('Postprocessing sets array elements>=threshold to 1 and elements<threshold to 0:')
for i in range(10):
    print(f'\t{y_preprocessed[i, 0]:.2f} >> {y_postprocessed[i, 0]}')
```

The output:

```
y shape: (10,), y preprocessed shape: (10, 1)
y preprocessed shape: (10, 1),y postprocessed shape: (10, 1)
Postprocessing sets array elements>=threshold to 1 and elements<threshold to 0:
    0.81 >> 1
    0.67 >> 1
    0.66 >> 1
    0.10 >> 0
    0.39 >> 0
    0.50 >> 1
    0.54 >> 1
    0.06 >> 0
    0.92 >> 1
    0.93 >> 1
```

4.2 Creating and training the binary classification model

This section shows the implementation of all functions required to train a binary classification model.

4.2.1 Sigmoid activation

The PyTorch-based implementation of the sigmoid formula:

```python
def sigmoid(x):
    return 1 / (1 + torch.exp(-x))
```

Let's test sigmoid:

- generate a test_input tensor with values in the range [-10, 10] with step 1
- preprocess test_input - extend its shape from [21,] to [21,1]
- process test_input with the implemented sigmoid function and with the default PyTorch implementation torch.nn.functional.sigmoid
- compare the results (they should be identical)
- plot test_input processed by sigmoid

```python
test_input = torch.arange(-10, 11, 1, dtype=torch.float32)
test_input = preprocessing(test_input)
sigmoid_output = sigmoid(test_input)

print(f'Input data shape: {test_input.shape}')
print(f'input data range: [{test_input.min():.3f}, {test_input.max():.3f}]')
print(f'sigmoid output data range: [{sigmoid_output.min():.3f}, {sigmoid_output.max():.3f}]')
print(test_input[:2])
print(sigmoid_output[:2])

# compare the sigmoid implementation with pytorch implementation
torch_sigmoid_output = torch.nn.functional.sigmoid(test_input)
print(f'sigmoid output is the same with pytorch implementation: {(torch_sigmoid_output == sigmoid_output).all().numpy()}')

fig = plt.figure(figsize=(4, 2), facecolor=(0.0, 1.0, 0.0))
ax = fig.add_subplot(1, 1, 1)
ax.plot(test_input, sigmoid_output, color='red')
ax.set_ylim([0, 1])
ax.set_title('sigmoid')
ax.set_facecolor((0.0, 1.0, 0.0))
fig.show()
```

The output of the code above:

```
Input data shape: torch.Size([21, 1])
input data range: [-10.000, 10.000]
sigmoid output data range: [0.000, 1.000]
tensor([[-10.],
        [ -9.]])
tensor([[4.5398e-05],
        [1.2339e-04]])
sigmoid output is the same with pytorch implementation: True
```
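One detail worth noting (an illustrative sketch, not part of the original code): thresholding the probability at 0.5, as the postprocessing function does, is equivalent to thresholding the raw model output (the logit) at 0, because sigmoid(0) = 0.5 and sigmoid is monotonically increasing:

```python
# made-up raw model outputs (logits)
logits = torch.tensor([[-2.0], [-0.1], [0.0], [0.3], [4.0]])
probs = sigmoid(logits)

print((probs >= 0.5).int().flatten())   # tensor([0, 0, 1, 1, 1], dtype=torch.int32)
print((logits >= 0.0).int().flatten())  # the same labels
```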
4.2.2 Loss function: Binary cross-entropy

The PyTorch-based implementation of the BCE formula. To make sure that the inner term of log is never 0, use torch.clamp with min=epsilon and max=1-epsilon.

```python
def binary_cross_entropy(pred, y):
    # log(0)=-inf
    # to prevent that, clamp NN output values into [eps, 1-eps] values
    eps = 1e-8
    pred = torch.clamp(pred, min=eps, max=1 - eps)
    loss = -y * torch.log(pred) - (1 - y) * torch.log(1 - pred)
    return loss.mean()
```

Test the BCE implementation:

- generate an array test_input with shape [10,1] and values in the range [0,1) with torch.rand
- threshold test_input to set all values to 0 or 1 and use the result as the ground truth values
- compute the loss with the implemented binary_cross_entropy function and with the PyTorch implementation torch.nn.functional.binary_cross_entropy
- compare the results (they should be identical)

```python
test_input = torch.rand(10, 1, dtype=torch.float32)
# get "ground truth" for test input by thresholding test_input
test_input_gt = postprocessing(test_input).float()

print(f'test input shape: {test_input.shape}, gt shape: {test_input_gt.shape}')
print(f'test_input range: [{test_input.min().numpy():.2f}, {test_input.max().numpy():.2f}]')
print(f'test_input gt range: [{test_input_gt.min().numpy()}, {test_input_gt.max().numpy()}]')

# get loss with the binary_cross_entropy implementation
loss = binary_cross_entropy(test_input, test_input_gt)
# get loss with pytorch binary_cross_entropy implementation
loss_pytorch = torch.nn.functional.binary_cross_entropy(test_input, test_input_gt)
print(f'loss outputs are the same: {(loss == loss_pytorch).numpy()}')
```

The expected output:

```
test input shape: torch.Size([10, 1]), gt shape: torch.Size([10, 1])
test_input range: [0.02, 0.80]
test_input gt range: [0.0, 1.0]
loss outputs are the same: True
```

4.2.3 Accuracy metric

I will use the torchmetrics implementation to compute accuracy based on model predictions and ground truth. To create the binary classification accuracy metric, two parameters are required:

- task="binary"
- the threshold value that will be used to threshold model predictions

```python
# https://torchmetrics.readthedocs.io/en/stable/classification/accuracy.html#module-interface
accuracy_metric = torchmetrics.classification.Accuracy(task="binary", threshold=0.5)

def compute_accuracy(y_pred, y):
    return accuracy_metric(y_pred, y)
```

4.2.4 NN model

The NN used in this example is a deep NN with 2 hidden layers. The input and hidden layers use ReLU activation, and the final layer uses the activation function provided as the class input (it will be the sigmoid activation function that was implemented before).

```python
class ClassifierNN(torch.nn.Module):
    def __init__(self, loss_function, activation_function, input_dims=2, output_dims=1):
        super().__init__()
        self.linear1 = torch.nn.Linear(input_dims, input_dims * 4)
        self.linear2 = torch.nn.Linear(input_dims * 4, input_dims * 8)
        self.linear3 = torch.nn.Linear(input_dims * 8, input_dims * 4)
        self.output = torch.nn.Linear(input_dims * 4, output_dims)
        self.loss_function = loss_function
        self.activation_function = activation_function

    def forward(self, x):
        x = torch.nn.functional.relu(self.linear1(x))
        x = torch.nn.functional.relu(self.linear2(x))
        x = torch.nn.functional.relu(self.linear3(x))
        x = self.activation_function(self.output(x))
        return x
```
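To sanity-check the model definition, here is a small sketch (not part of the original post) that instantiates ClassifierNN with a made-up input size and runs a dummy batch through it; the output should have shape [N, 1] with values in [0, 1]:

```python
# quick shape check with a dummy batch of 8 samples and 20 features (values are random)
model_check = ClassifierNN(loss_function=binary_cross_entropy,
                           activation_function=sigmoid,
                           input_dims=20, output_dims=1)
dummy_out = model_check(torch.rand(8, 20))

print(dummy_out.shape)                             # torch.Size([8, 1])
print(dummy_out.min() >= 0, dummy_out.max() <= 1)  # both True, thanks to the sigmoid
```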
4.2.5 Train model for a single epoch

The figure above depicts the binary classification training logic for a single batch. Later, the train_epoch function will be called multiple times (the chosen number of epochs).

```python
def train_epoch(model, optimizer, dataloader_train):
    # set the model to the training mode
    # https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.train
    model.train()
    losses = []
    accuracies = []
    for step, (X_batch, y_batch) in enumerate(dataloader_train):
        ### forward propagation
        # get model output and use loss function
        y_pred = model(X_batch)  # get class probabilities with shape [N,1]
        # apply loss function on predicted probabilities and ground truth
        loss = model.loss_function(y_pred, y_batch)

        ### backward propagation
        # set gradients to zero before backpropagation
        # https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html
        optimizer.zero_grad()
        # compute gradients
        # https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html
        loss.backward()
        # update weights
        # https://pytorch.org/docs/stable/optim.html#taking-an-optimization-step
        optimizer.step()  # update model weights

        # calculate batch accuracy
        acc = compute_accuracy(y_pred, y_batch)

        # append batch loss and accuracy to corresponding lists for later use
        accuracies.append(acc)
        losses.append(float(loss.detach().numpy()))

    # compute average epoch accuracy
    train_acc = np.array(accuracies).mean()
    # compute average epoch loss
    loss_epoch = np.array(losses).mean()
    return train_acc, loss_epoch
```

4.2.6 Evaluate the model with the provided data loader

The evaluate function iterates over the provided PyTorch dataloader, computes the current model accuracy, and returns the average loss and average accuracy.

```python
def evaluate(model, dataloader_in):
    # set the model to the evaluation mode
    # https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.eval
    model.eval()

    val_acc_epoch = 0
    losses = []
    accuracies = []

    # disable gradient calculation for evaluation
    # https://pytorch.org/docs/stable/generated/torch.no_grad.html
    with torch.no_grad():
        for step, (X_batch, y_batch) in enumerate(dataloader_in):
            # get predictions
            y_pred = model(X_batch)

            # calculate loss
            loss = model.loss_function(y_pred, y_batch)
            # calculate batch accuracy
            acc = compute_accuracy(y_pred, y_batch)

            accuracies.append(acc)
            losses.append(float(loss.detach().numpy()))

    # compute average accuracy
    val_acc = np.array(accuracies).mean()
    # compute average loss
    loss_epoch = np.array(losses).mean()
    return val_acc, loss_epoch
```

4.2.7 Get predictions for the provided dataloader

The predict function iterates over the provided dataloader, collects the post-processed model predictions and ground truth values into [N,1] PyTorch arrays, and returns both arrays. Later this function will be used to compute the confusion matrix and visualize predictions.

```python
def predict(model, dataloader):
    # set the model to the evaluation mode
    # https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.eval
    model.eval()

    xs, ys = next(iter(dataloader))
    y_pred = torch.empty([0, ys.shape[1]])
    x = torch.empty([0, xs.shape[1]])
    y = torch.empty([0, ys.shape[1]])

    # disable gradient calculation for evaluation
    # https://pytorch.org/docs/stable/generated/torch.no_grad.html
    with torch.no_grad():
        for step, (X_batch, y_batch) in enumerate(dataloader):
            # get predictions
            y_batch_pred = model(X_batch)

            y_pred = torch.cat([y_pred, y_batch_pred])
            y = torch.cat([y, y_batch])
            x = torch.cat([x, X_batch])
            # print(y_pred.shape, y.shape)

    y_pred = postprocessing(y_pred)
    y = postprocessing(y)

    return y_pred, y, x
```
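The arrays returned by predict can also be compared element-wise to compute the accuracy by hand; the value reported by the torchmetrics accuracy metric is simply the fraction of thresholded predictions that match the ground truth. A tiny sketch with hypothetical values:

```python
# hypothetical post-processed predictions and ground truth with shape [N, 1]
demo_pred = torch.tensor([[1], [0], [1], [1]])
demo_gt = torch.tensor([[1], [0], [0], [1]])

# fraction of matching labels = accuracy
print((demo_pred == demo_gt).float().mean())  # tensor(0.7500)
```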
4.2.8 Model training for the given number of epochs

To train the model, we just need to call the train_epoch function N times, where N is the number of epochs. The evaluate function is called to log the current model accuracy on the validation dataset. Finally, the best model is updated based on the validation accuracy. The model_train function returns the best validation accuracy and the training history.

```python
def model_train(model, optimizer, dataloader_train, dataloader_val, n_epochs=50):
    best_acc = 0
    best_weights = None
    history = {'loss': {'train': [], 'validation': []},
               'accuracy': {'train': [], 'validation': []}}

    for epoch in range(n_epochs):
        # train on dataloader_train
        acc_train, loss_train = train_epoch(model, optimizer, dataloader_train)
        # evaluate on dataloader_val
        acc_val, loss_val = evaluate(model, dataloader_val)

        print(f'Epoch: {epoch} | Accuracy: {acc_train:.3f} / {acc_val:.3f} | ' +
              f'loss: {loss_train:.5f} / {loss_val:.5f}')

        # save epoch losses and accuracies in history dictionary
        history['loss']['train'].append(loss_train)
        history['loss']['validation'].append(loss_val)
        history['accuracy']['train'].append(acc_train)
        history['accuracy']['validation'].append(acc_val)

        # Save the best validation accuracy model
        if acc_val >= best_acc:
            print(f'\tBest weights updated. Old accuracy: {best_acc:.4f}. New accuracy: {acc_val:.4f}')
            best_acc = acc_val
            torch.save(model.state_dict(), 'best_weights.pt')

    # restore model and return best accuracy
    model.load_state_dict(torch.load('best_weights.pt'))
    return best_acc, history
```

4.2.9 Plot training history

```python
def plot_history(history):
    fig = plt.figure(figsize=(8, 4), facecolor=(0.0, 1.0, 0.0))

    ax = fig.add_subplot(1, 2, 1)
    ax.plot(np.arange(0, len(history['loss']['train'])), history['loss']['train'], color='red', label='train')
    ax.plot(np.arange(0, len(history['loss']['validation'])), history['loss']['validation'], color='blue', label='validation')
    ax.set_title('Loss history')
    ax.set_facecolor((0.0, 1.0, 0.0))
    ax.legend()

    ax = fig.add_subplot(1, 2, 2)
    ax.plot(np.arange(0, len(history['accuracy']['train'])), history['accuracy']['train'], color='red', label='train')
    ax.plot(np.arange(0, len(history['accuracy']['validation'])), history['accuracy']['validation'], color='blue', label='validation')
    ax.set_title('Accuracy history')
    ax.legend()

    fig.tight_layout()
    ax.set_facecolor((0.0, 1.0, 0.0))
    fig.show()
```

4.3 Get the dataset, create the model, and train it

Let's put everything together and train the binary classification model.
```python
#########################################
# Get the dataset
X, y = get_dataset(n_classes=number_of_classes)
print(f'Generated dataset shape. X:{X.shape}, y:{y.shape}')

# change y numpy array shape from [N,] to [N, 1] for binary classification
y = preprocessing(y)
print(f'Dataset shape prepared for binary classification with sigmoid activation and BCE loss.')
print(f'X:{X.shape}, y:{y.shape}')

# Get train and validation dataloaders
dataloader_train, dataloader_val = get_data_loaders(dataset=(scale(X), y), batch_size=32)

# get a batch from the dataloader and print the input and output shapes
X_0, y_0 = next(iter(dataloader_train))
print(f'Model input data shape: {X_0.shape}, output (ground truth) data shape: {y_0.shape}')

#########################################
# Create ClassifierNN for the binary classification problem
# input dims: [N, features]
# output dims: [N, 1]
# activation - sigmoid to output probability p in range [0,1]
# loss - binary cross-entropy
model = ClassifierNN(loss_function=binary_cross_entropy,
                     activation_function=sigmoid,
                     input_dims=X.shape[1],
                     output_dims=y.shape[1])

#########################################
# create optimizer and train the model on the dataset
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
print(f'Model size: {sum([x.reshape(-1).shape[0] for x in model.parameters()])} parameters')
print('#' * 10)
print('Start training')
acc, history = model_train(model, optimizer, dataloader_train, dataloader_val, n_epochs=20)
print('Finished training')
print('#' * 10)
print("Model accuracy: %.2f%%" % (acc * 100))
plot_history(history)
```

The expected output should be similar to the one provided below.

```
Generated dataset shape. X:(10000, 20), y:(10000,)
Dataset shape prepared for binary classification with sigmoid activation and BCE loss.
X:(10000, 20), y:(10000, 1)
Model input data shape: torch.Size([32, 20]), output (ground truth) data shape: torch.Size([32, 1])
Model size: 27601 parameters
##########
Start training
Epoch: 0 | Accuracy: 0.690 / 0.952 | loss: 0.65095 / 0.53560
    Best weights updated. Old accuracy: 0.0000. New accuracy: 0.9524
Epoch: 1 | Accuracy: 0.956 / 0.970 | loss: 0.33146 / 0.18328
    Best weights updated. Old accuracy: 0.9524. New accuracy: 0.9702
Epoch: 2 | Accuracy: 0.965 / 0.973 | loss: 0.14162 / 0.11417
    Best weights updated. Old accuracy: 0.9702. New accuracy: 0.9732
Epoch: 3 | Accuracy: 0.970 / 0.975 | loss: 0.10551 / 0.09519
    Best weights updated. Old accuracy: 0.9732. New accuracy: 0.9752
Epoch: 4 | Accuracy: 0.972 / 0.976 | loss: 0.09295 / 0.09127
    Best weights updated. Old accuracy: 0.9752. New accuracy: 0.9762
Epoch: 5 | Accuracy: 0.974 / 0.977 | loss: 0.08666 / 0.08467
    Best weights updated. Old accuracy: 0.9762. New accuracy: 0.9772
Epoch: 6 | Accuracy: 0.976 / 0.977 | loss: 0.08243 / 0.08312
    Best weights updated. Old accuracy: 0.9772. New accuracy: 0.9772
Epoch: 7 | Accuracy: 0.977 / 0.979 | loss: 0.07981 / 0.08914
    Best weights updated. Old accuracy: 0.9772. New accuracy: 0.9787
Epoch: 8 | Accuracy: 0.977 / 0.981 | loss: 0.07876 / 0.08224
    Best weights updated. Old accuracy: 0.9787. New accuracy: 0.9807
Epoch: 9 | Accuracy: 0.978 / 0.979 | loss: 0.07692 / 0.08362
Epoch: 10 | Accuracy: 0.979 / 0.979 | loss: 0.07478 / 0.07739
Epoch: 11 | Accuracy: 0.980 / 0.980 | loss: 0.07375 / 0.07708
Epoch: 12 | Accuracy: 0.980 / 0.980 | loss: 0.07253 / 0.07613
Epoch: 13 | Accuracy: 0.981 / 0.979 | loss: 0.07119 / 0.07788
Epoch: 14 | Accuracy: 0.982 / 0.982 | loss: 0.07148 / 0.07483
    Best weights updated. Old accuracy: 0.9807. New accuracy: 0.9816
Epoch: 15 | Accuracy: 0.982 / 0.981 | loss: 0.06973 / 0.07474
Epoch: 16 | Accuracy: 0.981 / 0.982 | loss: 0.06900 / 0.07401
    Best weights updated. Old accuracy: 0.9816. New accuracy: 0.9821
Epoch: 17 | Accuracy: 0.982 / 0.979 | loss: 0.06850 / 0.08130
Epoch: 18 | Accuracy: 0.982 / 0.980 | loss: 0.06796 / 0.07966
Epoch: 19 | Accuracy: 0.982 / 0.981 | loss: 0.06714 / 0.07458
Finished training
##########
Model accuracy: 98.21%
```

4.4 Evaluate the model

```python
acc_train, _ = evaluate(model, dataloader_train)
acc_validation, _ = evaluate(model, dataloader_val)
print(f'Accuracy - Train: {acc_train:.4f} | Validation: {acc_validation:.4f}')
```

```
Accuracy - Train: 0.9816 | Validation: 0.9816
```

```python
val_preds, val_y, _ = predict(model, dataloader_val)
print(val_preds.shape, val_y.shape)

binary_confusion_matrix = torchmetrics.classification.ConfusionMatrix('binary')
cm = binary_confusion_matrix(val_preds, val_y)
print(cm)

df_cm = pd.DataFrame(cm)
plt.figure(figsize=(6, 5), facecolor=(0.0, 1.0, 0.0))
sn.heatmap(df_cm, annot=True, fmt='d')
plt.show()
```

```python
val_preds, val_y, val_x = predict(model, dataloader_val)
show_dataset(val_x.numpy(), postprocessing(val_y).numpy(), 'Ground Truth')
show_dataset(val_x.numpy(), postprocessing(val_preds).numpy(), 'Predictions')
```

Conclusion

Binary classification is a foundation for many deep learning tasks. For binary classification, you need to use sigmoid activation and binary cross-entropy loss. If you understand how these two functions work, you will be able to understand not only classification NN models but also more complicated NN architectures.