DOI: https://doi.org/10.37811/cl_rcm.v6i5.3158

Hidden mathematics behind the learning process in convolutional neural networks

Ricardo Borja-Robalino

[email protected]

https://orcid.org/0000-0002-3899-1140

 

 Antonio Monleón-Getino

[email protected]

https://orcid.org/0000-0001-8214-3205

 

 José Rodellar

[email protected]

https://orcid.org/0000-0002-1514-7713

 

ABSTRACT

In the last decade, artificial intelligence has transformed the world. Big Data and large software companies are prompting researchers to create new algorithms that surpass human intelligence ever more quickly and efficiently. In 2012, convolutional neural networks (CNN) captured the attention of researchers working on image recognition, becoming popular and efficient. However, the lack of information on the detailed mathematical process, and the tendency of authors to describe this method as a black box, mean that a given architecture, set of parameters and hyperparameters generates results through an unknown internal process. The present study develops the detailed mathematical model of forward and backward propagation in a CNN for three-dimensional (color) images, providing researchers with solid tools for developing optimizations of the algorithm. In addition, the models were applied to a proposed architecture, which allowed recognizing the stages of the process, the number of learning parameters of the network, and the complexity in terms of computational cost. Finally, tables with the most used cost, activation and optimization functions are included, allowing readers to formulate their own model depending on the architecture, functions and hyperparameters chosen.

 

Keywords: convolutional neural networks; mathematical model; forward and backward propagation; cost function; optimization function.

 

 

 

 

Correspondence: [email protected]

Article received: 10 August 2022. Accepted for publication: 10 September 2022.

Conflicts of interest: none to declare.

All content of Ciencia Latina Revista Científica Multidisciplinar published on this site is available under a Creative Commons license.

How to cite: Borja Robalino, R., Monleón Getino, A., & Rodellar, J. (2022). Matemática oculta bajo el proceso de aprendizaje en redes neuronales convolucionales. Ciencia Latina Revista Científica Multidisciplinar, 6(5), 1031-1063. https://doi.org/10.37811/cl_rcm.v6i5.3158


 

Matemática oculta bajo el proceso de aprendizaje en redes neuronales convolucionales

RESUMEN

En la última década la inteligencia artificial ha transformado el mundo. El Big Data y grandes empresas de software impulsan a investigadores a crear nuevos algoritmos que superan la inteligencia humana con mayor rapidez y eficiencia. En el año 2012 las redes neuronales convolucionales (CNN) captaron la atención de investigadores en el tema del reconocimiento de imágenes; volviéndose populares y eficientes. Sin embargo, la falta de información de un proceso matemático minucioso y la tendencia de autores a describir este método como una caja negra, implicaron que una arquitectura, parámetros e hiperparámetros definidos generen resultados con un proceso interno desconocido. El presente estudio desarrolló un modelo matemático detallado de forward y backward propagation en una CNN para imágenes tridimensionales (a color), dotando a investigadores de herramientas solidas al momento de generar optimizaciones en el algoritmo. Además, los modelos se aplicaron a una arquitectura planteada que permitió reconocer las etapas del proceso, cantidad de parámetros de aprendizaje de la red y complejidad enfocada al gasto computacional. Se incluye tablas con funciones de costo, activación y optimización más utilizadas que permitan al lector formular su propio modelo dependiendo de la arquitectura, funciones e hiperparámetros seleccionados.

 

Palabras clave: redes neuronales convolucionales; modelo matemático; propagación hacia adelante y atrás; función de costo; función de optimización.


 

INTRODUCTION

Rapid technological growth has made artificial intelligence (AI) and its machine learning algorithms the best option for developing systems that carry out decision-making, reasoning, problem solving and pattern or feature learning in a way similar to the human brain, and even with greater precision (Namikawa et al., 2020), (Pesapane et al., 2020), (Ghosh & Kandasamy, 2020).

Among the algorithms currently flourishing are artificial neural networks (ANN) and, especially, convolutional neural networks (CNN), as a result of their high efficiency and effectiveness (LeCun et al., 2015). The use of ANNs and CNNs became popular among technology giants such as Google, Facebook, Amazon and Microsoft. Since 2009, Google has used ANNs on captcha problems through the reCAPTCHA application; Stanford University managed to generate image captions automatically, while Microsoft developed the BrainMaker application to attract prospective customers (Van Houdt et al., 2020). There are also applications for translation, personal assistants, autonomous vehicles, among others (Perconti & Plebe, 2020).

The existence of various classification algorithms, such as logistic regression, support vector machines, decision trees, XGBoost, random forests, neural networks and deep learning, offers researchers numerous alternatives depending on the type of problem to be solved (Lazzeri, 2020). However, the breakthrough of CNNs (deep learning) in computer vision came in 2012, when a team led by Geoffrey Hinton won first place in the "ImageNet" image classification contest using the AlexNet architecture, achieving results that once again drew researchers' attention to this algorithm (Krizhevsky et al., 2017).

Various authors describe CNNs as algorithms whose process, although based on logic and mathematical models, is presented as a kind of black box whose responses must be accepted without understanding the internal process (Buhrmester et al., 2019) (Sturm et al., 2016) (Baselli et al., 2020). From a mathematical and scientific perspective, it is important that researchers possess the necessary tools and information about this type of network, so that they can analyze the architecture and its learning process, recognize critical points of failure in the algorithm, and be able to optimize the proposed methodology and programming.

The present research aims to demystify CNNs as black boxes through the detailed mathematical development of the learning process of a network (forward and backward propagation), providing researchers with solid and complete mathematical knowledge of this type of network for the three-dimensional case, considering that detailed information is scarce even for the two-dimensional case. In addition, the application of the models is exemplified on a CNN architecture, applying standard activation, cost and optimization functions. Finally, tables with the most used functions are presented, making it easier for readers to introduce any of them into the model as needed.

THEORETICAL BACKGROUND

For the mathematical development, basic knowledge of the following was considered: convolutional neural networks, the mathematical formulas of the convolution process, the introduction of non-linearity, loss functions, optimization and the network learning process.

Convolutional Neural Networks (CNN)

Convolutional neural networks (CNN) seek to imitate the behavior of the human visual cortex, which processes both simple and highly complex images at extraordinary speed (Asghar et al., 2019). CNNs are quite similar to artificial neural networks, with the following differences: the network takes an image directly as input, a convolution operation is executed between the input matrix and the established kernel, and there is a pooling layer that reduces the dimension and thus the number of parameters (Wadawadagi & Pagi, 2020).

CNNs are generally composed of three essential layers allowing the network to learn different levels of abstraction. The first is a convolution layer, in many cases followed by a pooling layer and finally a fully connected layer (exactly the same as an ANN).

Layers

Convolutional Layer

The convolution operation (Equation 1) allows the network to learn spatial hierarchies of local patterns, working on regions of dimensions established by the user, which facilitates the detection of visual features such as contours, edges, color transitions, lines, etc. (Dhillon & Verma, 2020). This process allows a certain characteristic to be recognized in a specific place in the image and then detected in any location where it appears, unlike ANNs, which only learn global patterns. In this layer the parameters are a set of three-dimensional filters that are adjusted as the network learns, transforming the data so that certain characteristics become more dominant in the output image (Minaee et al., 2020). Its mathematical formulation is:

$$C = g(W \ast X + b) \qquad (1)$$

where: W = weights or kernel; X = input matrix; b = bias; $\ast$ = convolution operation; g = activation function.
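To make Equation 1 concrete, the following is a minimal NumPy sketch of a valid convolution over a three-channel image (an illustration only; the function and variable names are assumptions, not the authors' code):

```python
import numpy as np

def conv2d_forward(X, W, b, g):
    """Valid convolution of a (H, W, C) input with a (kh, kw, C) kernel,
    followed by a bias and an element-wise activation g (Equation 1)."""
    H, Wd, C = X.shape
    kh, kw, _ = W.shape
    out = np.zeros((H - kh + 1, Wd - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # element-wise product of the kernel with the local 3D region, then sum
            out[i, j] = np.sum(X[i:i + kh, j:j + kw, :] * W) + b
    return g(out)

# example: 28x28 RGB image, 5x5x3 kernel, tanh activation -> 24x24 feature map
X = np.random.rand(28, 28, 3)
W = np.random.randn(5, 5, 3) * 0.01
C1 = conv2d_forward(X, W, b=0.0, g=np.tanh)
print(C1.shape)  # (24, 24)
```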

Pooling layer

Pooling layers are often located after a convolutional layer; their function is to progressively simplify the representation and reduce the number of parameters, avoiding over-fitting of the network. As a result, a region of condensed information is obtained that maintains the spatial relationships of the input (Lopez-Pinaya et al., 2020). This type of layer generally works with two possible options:

§  Max-pooling (Equation 2). It is the most widely used; the size of the sub-region can be chosen for convenience, and its objective is to find the maximum value within a two-dimensional window and pass it on as the result. Its process is very similar to that of the human visual cortex (Wu, 2017).

$$P_{i,j} = \max_{(a,b)\,\in\, R_{i,j}} C_{a,b} \qquad (2)$$

where $R_{i,j}$ = two-dimensional window in the output C.

§  Average-pooling (Equation 3). Unlike max-pooling, its result is the mean (average) value within the two-dimensional window (Lee et al., 2009; Wu, 2017).

$$P_{i,j} = \frac{1}{Po^{2}} \sum_{(a,b)\,\in\, R_{i,j}} C_{a,b} \qquad (3)$$

where Po = pooling window size. A code sketch of both operations follows below.
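A minimal sketch of both pooling operations, assuming a square, non-overlapping window of size Po (illustrative names, not the authors' code):

```python
import numpy as np

def pool2d(C, Po=2, mode="max"):
    """Max- or average-pooling (Equations 2 and 3) of a 2D feature map C
    with a non-overlapping Po x Po window."""
    H, W = C.shape
    out = np.zeros((H // Po, W // Po))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = C[i * Po:(i + 1) * Po, j * Po:(j + 1) * Po]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

C = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(C, 2, "max"))   # 2x2 map of window maxima
print(pool2d(C, 2, "mean"))  # 2x2 map of window averages
```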

Fully or densely connected layer

In this layer, the operating process is similar to that of an artificial neural network (Equation 4), using the perceptron as the fundamental unit. The sum of the inputs multiplied by their associated weights, plus a bias (polarization), constitutes the linear activation of the neuron (Martineau et al., 2020). Non-linearity is introduced into the system through an activation function, and in the CNN case this layer constitutes the final image-classification stage. A convolutional network architecture can contain one or more fully connected layers. The convolutional layer and the densely connected layer have an identical functional form, which allows conversions between them, taking into account that in the first case the connections are made to a local region of the input, while the second works on the global region (Dhillon & Verma, 2020) (Kim, 2014).

$$Y = g(W \cdot X + b) \qquad (4)$$

where $\cdot$ = matrix product.
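In code, Equation 4 reduces to a matrix product plus a bias followed by an activation; a brief sketch with assumed shapes:

```python
import numpy as np

def dense_forward(X, W, b, g):
    """Densely connected layer (Equation 4): Y = g(W·X + b)."""
    return g(W @ X + b)

X = np.random.rand(2048, 1)             # flattened input
W = np.random.randn(500, 2048) * 0.01   # 500 neurons
b = np.zeros((500, 1))
Y = dense_forward(X, W, b, np.tanh)
print(Y.shape)  # (500, 1)
```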

Parameters and hyperparameters

In a CNN, the parameters are the variables internal to the mathematical model of the network (weights and biases), which can be estimated or learned from the input data. Hyperparameters are external configuration elements that adjust the learning process and are established by the programmer through experience and intuition, depending on the type of architecture or problem to be analyzed. Among the best-known hyperparameters at the structure, topology or algorithm level are: number of layers, activation functions, loss functions, number of neurons, learning rate, decay, momentum, Nesterov, epochs, batch size, dilation, dropout, normalization, kernel size, number of filters, padding, stride, etc. (Rawat & Wang, 2017).

Cost functions

The cost (loss) function quantifies the error between the expected and predicted output during the network learning process; in short, it assesses how well the algorithm predicts (Janocha & Czarnecki, 2017). Loss functions can be for regression or classification; among the most used are the following (See Table 1):


 

Table 1. Cost functions.

Regression

§  Mean squared error (MSE) (Parmar, 2018):
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2} \qquad (5)$$
It is the average of the squared differences between the predicted and actual values.

§  Mean absolute error (MAE) (Parmar, 2018):
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \qquad (6)$$
It is the average of the absolute differences between the predicted and actual values.

Binary classification

§  Binary cross-entropy (log loss) (Parmar, 2018):
$$E = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right] \qquad (7)$$
It is commonly used where the loss should increase as the predicted probability diverges from the actual label; it is also called the negative log-likelihood.

Multiclass classification

§  Multi-class cross-entropy (categorical cross-entropy) (Haykin, 2009):
$$E = -\sum_{i}\sum_{j} y_{ij}\log(\hat{y}_{ij}) \qquad (8)$$
where $\hat{y}_{ij}$ is the predicted probability of the i-th element in class j; $y_{ij} = 1$ if the i-th element belongs to class j and 0 otherwise. It is a generalization of binary cross-entropy.

§  Kullback–Leibler divergence (Kullback & Leibler, 1951):
$$D_{KL}(P\,\|\,Q) = \sum_{i}\sum_{j} p_{ij}\log\frac{p_{ij}}{q_{ij}} \qquad (9)$$
where $p_{ij}$ = actual probability of the i-th element in class j and $q_{ij}$ = predicted probability of the i-th element in class j. It is similar to multi-class cross-entropy and is called the relative entropy of P with respect to Q; if the divergence is 0, the distributions are equal.
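For reference, a minimal NumPy sketch of the loss functions of Table 1 (vectorized forms; the function names are illustrative):

```python
import numpy as np

def mse(y, y_hat):                                    # Equation 5
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):                                    # Equation 6
    return np.mean(np.abs(y - y_hat))

def binary_cross_entropy(y, y_hat, eps=1e-12):        # Equation 7
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(Y, Y_hat, eps=1e-12):   # Equation 8, one-hot Y
    return -np.mean(np.sum(Y * np.log(np.clip(Y_hat, eps, 1.0)), axis=1))

def kl_divergence(P, Q, eps=1e-12):                   # Equation 9
    P, Q = np.clip(P, eps, 1.0), np.clip(Q, eps, 1.0)
    return np.sum(P * np.log(P / Q))
```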

 

Activation functions (FA)

Activation functions are used in each convolutional and densely connected layer, introducing non-linearity into the mathematical model (activation potential z) of the proposed architecture. This is necessary because an image contains many non-linear elements, and such a model is not subject to the superposition principle, by which a linear problem could be decomposed into two or more simpler sub-problems (Berzal, 2018). FAs are chosen depending on the task the neurons must perform; among the most widely used are the following (See Table 2):

Table 2. Activation functions.

§  Sigmoid (Berzal, 2018):
$$g(z) = \frac{1}{1 + e^{-z}} \qquad (10) \qquad\quad g'(z) = g(z)\,\big(1 - g(z)\big) \qquad (11)$$
g(z) tends to 0 when z tends to $-\infty$ and tends to 1 when z tends to $+\infty$. The function range is (0, 1).

§  Hyperbolic tangent (TanH) (Berzal, 2018):
$$g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \qquad (12) \qquad\quad g'(z) = 1 - g(z)^{2} \qquad (13)$$
It is a re-scaled version of the Sigmoid with larger derivatives, which contributes to learning. The function range is (-1, 1).

§  Rectified linear unit (ReLU) (Gu et al., 2018):
$$g(z) = \max(0, z) \qquad (14) \qquad\quad g'(z) = \begin{cases} 1 & z > 0 \\ 0 & z \le 0 \end{cases} \qquad (15)$$
It is the most used in deep learning. The function range is [0, $\infty$).

§  LeakyReLU (Gu et al., 2018):
$$g(z) = \begin{cases} z & z > 0 \\ \alpha z & z \le 0 \end{cases} \qquad (16) \qquad\quad g'(z) = \begin{cases} 1 & z > 0 \\ \alpha & z \le 0 \end{cases} \qquad (17)$$
It is similar to ReLU, with the difference that for negative values there is a small constant slope $\alpha$ (generally 0.01). The function range is ($-\infty$, $\infty$).

§  SoftPlus (Glorot et al., 2011):
$$g(z) = \ln\!\left(1 + e^{z}\right) \qquad (18) \qquad\quad g'(z) = \frac{1}{1 + e^{-z}} \qquad (19)$$
It is a smooth approximation of ReLU and is differentiable everywhere; in some cases it is used in the last layer. The function range is (0, $\infty$).

§  Softmax (Zhang et al., 2018):
$$g(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{J} e^{z_k}}, \quad J = \text{number of classes} \qquad (20) \qquad\quad \frac{\partial g(z)_j}{\partial z_k} = g(z)_j\big(\delta_{jk} - g(z)_k\big) \qquad (21)$$
It takes a vector z as input and normalizes it into a probability distribution. The function range is (0, 1).
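The functions and derivatives of Table 2 translate directly into code; a brief, illustrative sketch:

```python
import numpy as np

def sigmoid(z):        return 1.0 / (1.0 + np.exp(-z))            # Eq. 10
def d_sigmoid(z):      s = sigmoid(z); return s * (1.0 - s)        # Eq. 11
def tanh(z):           return np.tanh(z)                           # Eq. 12
def d_tanh(z):         return 1.0 - np.tanh(z) ** 2                # Eq. 13
def relu(z):           return np.maximum(0.0, z)                   # Eq. 14
def d_relu(z):         return (z > 0).astype(float)                # Eq. 15
def leaky_relu(z, a=0.01):   return np.where(z > 0, z, a * z)      # Eq. 16
def d_leaky_relu(z, a=0.01): return np.where(z > 0, 1.0, a)        # Eq. 17
def softplus(z):       return np.log1p(np.exp(z))                  # Eq. 18
def d_softplus(z):     return 1.0 / (1.0 + np.exp(-z))             # Eq. 19
def softmax(z):                                                    # Eq. 20 (vector input)
    e = np.exp(z - np.max(z))
    return e / e.sum()
```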

 

Optimization functions (FO)

The goal of network optimization is to minimize the cost function, reducing the error between the desired and predicted output by estimating and adjusting the parameters of the model (architecture) (Rao et al., 2018). The FO uses the partial derivatives (gradient) of the loss function evaluated with respect to the current weights, iteratively and gradually determining the combination of weights W and biases b that reaches the global minimum of the function, fulfilling the optimality condition in which the gradient of the cost with respect to the parameters is zero. The main algorithm used for optimization in convolutional neural networks is stochastic gradient descent (Bottou, 2012) (See Table 3).

Table 3. Stochastic gradient descent.

§  Stochastic gradient descent (SGD) (Bottou, 2012):
$$\theta = \theta - \eta\,\nabla_{\theta} J\!\left(\theta;\, x^{(i)},\, y^{(i)}\right) \qquad (22)$$
Similar to gradient descent, with the difference that the parameters are updated using small random samples of the training set (mini-batches), which makes it more scalable and faster.

There are different optimization algorithms built on stochastic gradient descent; among the best known are the following (See Table 4):

Table 4. Optimization functions based on SGD.

§  Momentum (M) (Ruder, 2017):
$$v_t = \gamma v_{t-1} + \eta\,\nabla_{\theta} J(\theta), \qquad \theta = \theta - v_t \qquad (23)$$
Speeds up SGD in the direction of convergence and dampens oscillations. Generally $\gamma = 0.9$.

§  Nesterov accelerated gradient (NAG) (Ruder, 2017):
$$v_t = \gamma v_{t-1} + \eta\,\nabla_{\theta} J\!\left(\theta - \gamma v_{t-1}\right), \qquad \theta = \theta - v_t \qquad (24)$$
Optimizes SGD with momentum, avoiding very long jumps through a correction based on the previous update $\gamma v_{t-1}$; it prevents moving too fast and diverging.

§  Adaptive gradient algorithm (Adagrad) (Ruder, 2017):
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}\, g_{t,i} \qquad (25)$$
Adapts the learning rate, making smaller updates for parameters associated with common image characteristics and larger updates for unusual ones. It scales with respect to the squared gradients $G_t$ accumulated up to iteration t in each dimension.

§  Adadelta (Ruder, 2017):
$$\theta_{t+1} = \theta_t - \frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t}\, g_t \qquad (26)$$
It is an Adagrad extension that reduces its aggressive, monotonically decreasing learning rate. Adadelta restricts the accumulated past gradients to a window of fixed size over the most recent gradients.

§  Adaptive learning rate algorithm (RMSprop) (Ruder, 2017):
$$E[g^{2}]_t = \gamma E[g^{2}]_{t-1} + (1-\gamma)\, g_t^{2}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^{2}]_t + \epsilon}}\, g_t \qquad (27)$$
It is similar to Adadelta. Generally $\gamma = 0.9$ and $\eta = 0.001$ are recommended.

§  Adam (Ruder, 2017):
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^{2}, \quad \hat{m}_t = \frac{m_t}{1-\beta_1^{t}}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^{t}}, \quad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t \qquad (28)$$
Considered one of the fastest optimizers (e.g., by fast.ai), it combines RMSprop and momentum. The authors propose $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$; $m_t$ and $v_t$ are initialized to zero.
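A compact sketch of two of the update rules in Table 4, momentum (Equation 23) and Adam (Equation 28), using the default hyperparameter values cited there (framework-free NumPy; the learning-rate values are only illustrative):

```python
import numpy as np

def sgd_momentum_step(theta, grad, v, lr=0.01, gamma=0.9):
    """One momentum update (Equation 23): v = gamma*v + lr*grad; theta -= v."""
    v = gamma * v + lr * grad
    return theta - v, v

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (Equation 28) with bias-corrected first and second moments."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```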

 

How does a CNN learn?

At the beginning of training, the previously converted and normalized images pass through the convolution, pooling and classification layers (forward propagation), obtaining the prediction of the target for each image or group of images. This is the starting point of learning through the backward propagation process: the cost function measures the error and the optimization function propagates it through the entire network (from back to front). This process adjusts and updates the network weights and biases for each batch or mini-batch established by the user. This parameter adjustment allows the network to learn the image characteristics, improving its prediction capacity at each epoch (Liu et al., 2015), (Koushik, 2016).
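As a toy, runnable illustration of this cycle, the following sketch trains a single sigmoid neuron (not a CNN) with binary cross-entropy and mini-batch SGD; the data and names are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 8))                                 # 100 samples, 8 features
y = (X.sum(axis=1) > 4).astype(float).reshape(-1, 1)     # invented binary target
W = rng.normal(0.0, 0.01, (8, 1)); b = 0.0; lr = 0.1     # parameters and learning rate

for epoch in range(50):
    for i in range(0, 100, 10):                          # mini-batches of 10
        Xb, yb = X[i:i + 10], y[i:i + 10]
        y_hat = 1.0 / (1.0 + np.exp(-(Xb @ W + b)))      # forward pass
        dz = (y_hat - yb) / len(yb)                      # backward: dE/dz for sigmoid + BCE
        W -= lr * Xb.T @ dz                              # SGD update of the weights
        b -= lr * dz.sum()                               # SGD update of the bias

y_pred = 1.0 / (1.0 + np.exp(-(X @ W + b))) > 0.5
print("training accuracy:", (y_pred == y).mean())
```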

MODELING AND APPLICATION

Mathematical modeling of the learning process in CNN

The forward and backward propagation process updates the parameters and inputs in each layer of the convolutional neural network (See Figure 1).


 

Figure 1. Convolutional neural network with the nomenclature of the mathematical model.

The figure represents the nomenclature used for the modeling in relation to the input, the number of channels and the number of filters used in the convolution and pooling layers, as well as the conversion applied in the densely connected layer.

The modeling used the following nomenclature:

§  = number of layers (each generally made up of a convolution and a pooling layer).

§  X = three-dimensional input (height, width and depth).

§  i, j = position of X in the two-dimensional plane; k = its depth.

§  W = kernel (filter), positioned in its three dimensions.

§  = number of output filters.

§  = pooling window position.

§  Pr = pooling reduction scale.

§  = dimensions of the vector in the densely connected layer (rows, columns).

§  = learning rate.

§  g = activation function (non-linear operation).

§  = fully connected layer.

§  = number of labels (output classes).

§  = number of output feature maps.


 

Forward propagation

Forward propagation in a CNN begins with the image, considered as an input matrix X of dimensions height x width x depth. The extraction of characteristics, from general to specific, is carried out through the different layers established in the architecture, taking into account that:

§  The convolution operation (Equation 1) was restructured for a color (three-channel) image, giving Equation 29; a standard form is sketched below.
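Using the nomenclature above, a standard way of writing the three-dimensional convolution of layer $\ell$ for output filter n is the following (a sketch consistent with the dimensions used in the application; K, $k_h$ and $k_w$ are assumed symbols for the number of input channels and the kernel size, and the exact indexing of the original equation may differ):

$$C^{\ell}_{i,j,n} = g\!\left(\sum_{k=1}^{K}\sum_{p=1}^{k_h}\sum_{q=1}^{k_w} W^{\ell}_{p,q,k,n}\, X^{\ell}_{i+p-1,\, j+q-1,\, k} \;+\; b^{\ell}_{n}\right) \qquad (29)$$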

§  The non-linear activation function is generally represented as g; for the application, the activation functions from Table 2 were used.

§  Using and modifying Equations 2 and 3, the pooling layer is represented by Equation 30, sketched below.
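For a window of size Po and reduction scale Pr, the max-pooling case can be written, for instance, as

$$P^{\ell}_{i,j,n} = \max_{0 \le a,\,b < Po} \; C^{\ell}_{Pr\,i+a,\; Pr\,j+b,\; n} \qquad (30)$$

with the average-pooling variant replacing the maximum by the mean over the same window (a sketch of the standard form; the original indexing may differ).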

 

 

§  Based on Equation 4, the densely connected layer model is given by Equation 32, sketched below.
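Following Equation 4 and writing $X^{fc}$ for the flattened (vectorized and concatenated) output of the last pooling layer, the densely connected layer can be sketched as

$$Y = g\!\left(W \cdot X^{fc} + b\right) \qquad (32)$$

where the superscript fc is only an illustrative label, not necessarily the authors' notation.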

 

 

Backward propagation

The objective of backpropagation is to update the weights and biases in each layer, starting from back to front with the propagation of the error between the forward-propagation output $\hat{y}$ and the desired output y. For the mathematical modeling in the present study, the following was considered:

§  Error (cost) function E.

§  When there are several convolutional layers, the input of each layer is the output of the previous one (Koushik, 2016).


 

Next, the mathematical models for the backpropagation of the convolution, pooling and densely connected layers are developed, considering stochastic gradient descent as the optimization function.

Output layer – densely connected

Weight update in the densely connected layer.

 

where:

Applying the chain rule:

 

 

 

          Where:


 

The gradient propagated to the current weight is:


 

Therefore, the weight update in the densely connected layer would be:

Bias update in the densely connected layer.

 

 

where:

Applying the chain rule, we have:

 

 

 

 


 

The gradient propagated to the current bias is:


Therefore, the bias update in the densely connected layer would be:

Input update in the densely connected layer.

 

The input update would be:

 

 

where:

Applying the chain rule, we have:

 

 

 

 

      Therefore, to make the matrix product, the following must be taken into account:

 

 

The gradient propagated to the input is:


 

The input update in the densely connected layer would be:
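As a compact, hedged summary of the standard chain-rule results behind the weight, bias and input updates of this layer (writing $z = W\cdot X + b$, $\hat{y} = g(z)$, E for the cost and $\alpha$ for the learning rate; the exact layout of the original equations is assumed):

$$\frac{\partial E}{\partial W} = \frac{\partial E}{\partial \hat{y}}\, g'(z)\, X^{T}, \qquad
\frac{\partial E}{\partial b} = \frac{\partial E}{\partial \hat{y}}\, g'(z), \qquad
\frac{\partial E}{\partial X} = W^{T}\!\left(\frac{\partial E}{\partial \hat{y}}\, g'(z)\right)$$

$$W \leftarrow W - \alpha\,\frac{\partial E}{\partial W}, \qquad
b \leftarrow b - \alpha\,\frac{\partial E}{\partial b}$$

The same structure applies to the hidden densely connected layers of the next subsection, with $\partial E/\partial \hat{y}$ replaced by the gradient propagated from the following layer.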

 

Hidden or intermediate densely connected layers

Weight update in the densely connected layer.

 

where:

Applying the chain rule:

 

 

 

     

The gradient propagated to the current weight is:



 

Therefore, the weight update in the densely connected layer would be:

Bias update in the densely connected layer.

 

where:

Applying the chain rule, we have:

 

 

 

 

The gradient propagated to the current bias is:


Therefore, the bias update in the densely connected layer would be:

Input update in the densely connected layer.

In the backpropagation process, the last convolution or pooling layer becomes the input of the densely connected layer; therefore, the update of this input would be:

 

 

Applying the chain rule, we have:

 

The gradient propagated to the input of a hidden or intermediate densely connected layer is:

Therefore, the input update in the densely connected layer would be (Equation 39):

 

Vector to matrix

When a pooling layer is located before a densely connected one, its values are the result of the vectorization and concatenation carried out in forward propagation. Executing the inverse process yields a matrix that represents the pooling-layer output (Equation 40).

Pooling layer

§  In the average-pooling case (Equation 41), each average is returned to every cell of the corresponding pooling window:

§  In the max-pooling case, the maximum is placed in the position from which it was extracted in forward propagation, while the other cells take a value of zero (Equation 42):

where: (i, j) = extraction position in the forward pass; the maximum value is the one to be propagated.

After propagating through the pooling layer, the result becomes the gradient with respect to the convolutional layer output; a sketch of this gradient routing follows below.
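An illustrative NumPy sketch of this gradient routing for non-overlapping Po x Po windows (the names are assumptions, not the authors' code):

```python
import numpy as np

def maxpool_backward(C, dP, Po=2):
    """Route the pooled gradient dP back to the positions of the maxima of C
    (Equation 42); all other positions receive zero."""
    dC = np.zeros_like(C)
    for i in range(dP.shape[0]):
        for j in range(dP.shape[1]):
            window = C[i * Po:(i + 1) * Po, j * Po:(j + 1) * Po]
            a, b = np.unravel_index(np.argmax(window), window.shape)
            dC[i * Po + a, j * Po + b] = dP[i, j]
    return dC

C = np.arange(16, dtype=float).reshape(4, 4)
dP = np.ones((2, 2))
print(maxpool_backward(C, dP))  # gradient only at each window's maximum
```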

Convolutional layer

1.      Weight (kernel) update.

where:

Using a triple summation so that the weight slides over the entire feature map, the following is posed:

Convolution is an operation that combines two functions into a third one, representing the degree to which one overlaps the other. Because it involves a translated and inverted version, in backpropagation the weight (kernel) must be rotated 180 degrees (in practice, flipping its rows and columns) to return to the original orientation. Therefore, the weight update (Equation 43) is:

2.      Bias update.

where:

Applying the chain rule:

The bias update (Equation 44) is:

 

 

3.      Input update in convolutional layers.

When working with several convolutional layers, it is necessary to update the input; this solves the problem of obtaining the desired output of the intermediate layers for the error calculation. If it is the first layer, however, only the weights and biases are updated.

where:

Considering that:

Substituting into the equation, we have:

Since this is a convolution, the kernel must again be rotated 180 degrees.

The input update (Equation 45) is given below; a compact sketch of the standard forms behind Equations 43-45 follows.
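Writing $\delta$ for the error propagated to the layer's pre-activation (the incoming gradient multiplied element-wise by $g'(z)$), $\ast$ for the sliding-window operation and $\mathrm{rot}_{180}$ for the 180-degree rotation discussed above, the standard results behind Equations 43–45 can be sketched as (where the rotation appears depends on whether the forward pass is written as a convolution or a cross-correlation; the exact indexing of the original equations is assumed):

$$\frac{\partial E}{\partial W^{\ell}} = X^{\ell} \ast \delta^{\ell}, \qquad
\frac{\partial E}{\partial b^{\ell}} = \sum_{i,j} \delta^{\ell}_{i,j}, \qquad
\frac{\partial E}{\partial X^{\ell}} = \delta^{\ell} \ast_{\text{full}} \mathrm{rot}_{180}\!\big(W^{\ell}\big)$$

with the updates $W^{\ell} \leftarrow W^{\ell} - \alpha\,\partial E/\partial W^{\ell}$ and $b^{\ell} \leftarrow b^{\ell} - \alpha\,\partial E/\partial b^{\ell}$.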

 

Application of the mathematical modeling to a CNN architecture

To exemplify the mathematical modeling of forward and backward propagation, an architecture with two convolutional layers, two pooling layers, one densely connected layer and an output layer was designed. The following were used: binary cross-entropy as the cost function, tanh and sigmoid as activation functions (FA), a fixed learning rate, and stochastic gradient descent (SGD) as the optimization function. The network aimed to classify images of motorcycles and planes (binary classification) (See Figure 2).

Figure 2. Architecture of a CNN with two convolutional layers, two pooling layers, a densely connected layer and an output layer. Each phase indicates the number of layers, filters and dimensionality (two- or three-dimensional).

 

Based on Equations 29, 30 and 32, the simplified mathematical model of the architecture proposed in Figure 2 is represented by:

 

Where:

§  C = two-layer convolutional and two-layer pooling process.

§  P = application of the pooling layer.

§  g = activation function.

Whereby:


 

Where:

§     flattening process of phase C.

§      corresponds to the number of neurons in the hidden layer.

§      a single target (motorcycle or plane).

The step-by-step forward and backward propagation process is presented below.
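For reference, a minimal sketch of this architecture in Keras is shown below; it is an illustration rather than the authors' implementation, and the 28x28x3 input size and the learning-rate value are assumptions consistent with the dimensions used in the following subsections:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(64, (5, 5), activation="tanh", input_shape=(28, 28, 3)),  # C1: 28x28x3 -> 24x24x64
    layers.MaxPooling2D((2, 2)),                                            # P1: -> 12x12x64
    layers.Conv2D(128, (5, 5), activation="tanh"),                          # C2: -> 8x8x128
    layers.MaxPooling2D((2, 2)),                                            # P2: -> 4x4x128
    layers.Flatten(),                                                       # -> 2048
    layers.Dense(500, activation="tanh"),                                   # C3
    layers.Dense(1, activation="sigmoid"),                                  # C4: binary output
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),  # assumed value; not given in the text
              loss="binary_crossentropy", metrics=["accuracy"])
model.summary()  # the parameter counts reproduce Table 5
```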

Forward propagation

Convolution layer C1

Considering that the input images have three RGB channels (red, green, blue) with dimensions 28x28x3, the first convolutional layer has a 5x5x3 kernel, FA = tanh and 64 output filters. Using Equations 12 and 29:

Pooling layer P1

Window size 2x2, Pr = 2 and max-pooling. With an input of (24x24x64), an output of (12x12x64) is obtained. Based on Equation 30:

Convolutional layer C2

With an input of dimensions 12x12x64, a 5x5x64 kernel, FA = tanh and 128 output filters. Using Equations 12 and 29:

Pooling layer P2

Window size 2x2, Pr = 2 and max-pooling. With an input of (8x8x128), an output of (4x4x128) is obtained. Based on Equation 30:

Vectorization and concatenation.

Since the output of the second pooling layer has size X = (4x4x128), flattening it into one dimension gives (2048, 1).

Densely connected layer C3.

The input X is a (2048, 1) vector and the weight matrix W is (500, 2048); the activation function is tanh, with 2048 inputs and 500 output neurons. Applying Equations 12 and 32:

Output layer C4.

The input X is a (500, 1) vector and the weight matrix W is (1, 500); the activation function is the sigmoid, which allows a binary classification, with 500 inputs and one output neuron. Applying Equations 10 and 32:

Cost function

The binary cross-entropy cost function (Equation 7) is commonly used in binary classification problems such as this one. Deriving it with respect to the output predicted by the network, $\hat{y}$, gives Equation 46; for a single example,

$$\frac{\partial E}{\partial \hat{y}} = -\left(\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}\right) \qquad (46)$$

Backward propagation

The backpropagation process updates the following number of parameters in each network layer (See Table 5).

Table 5. Calculation of the number of network parameters.

Layer | Calculation | Weights | Biases | Total per layer
Convolutional C1 | (5 x 5 x 3) x 64 + 64 | 4800 | 64 | 4864
Convolutional C2 | (5 x 5 x 64) x 128 + 128 | 204800 | 128 | 204928
Densely connected C3 | 2048 x 500 + 500 | 1024000 | 500 | 1024500
Densely connected C4 (output) | 500 x 1 + 1 | 500 | 1 | 501
Total parameters | | 1234100 | 693 | 1234793
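The totals in Table 5 can be verified with a few lines of plain arithmetic, matching the architecture of Figure 2:

```python
layers = {
    "C1 conv":  (5 * 5 * 3) * 64 + 64,      # 4864
    "C2 conv":  (5 * 5 * 64) * 128 + 128,   # 204928
    "C3 dense": 2048 * 500 + 500,           # 1024500
    "C4 dense": 500 * 1 + 1,                # 501
}
print(sum(layers.values()))  # 1234793 learning parameters
```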

 

Densely connected layer C4

Weight update through Equations 11, 34 and 46.

If

Bias update through Equations 11, 35 and 46.

Input update, replacing the values obtained previously, using Equation 36.

Densely connected layer C3

Weight update through Equation 37, considering the gradient propagated from layer C4.

If

Bias update through Equation 38.

Input update using Equation 39.

 

 

 

Inverse vectorization and concatenation

Through Equation 40, the flattened vector becomes 128 filters of dimension 4 x 4, that is, the output of the second pooling layer with dimensions (4x4x128).

Pooling layer P2

For this pooling layer, the position (i, j) of the maximum value of each region from the forward propagation process is kept, and that position in each filter receives the propagated value (See Equation 42). The other positions take a value of zero.

Convolutional layer C2

Weight update through Equation 43.

where:

Bias update using Equation 44.

Input update through Equation 45.

Pooling layer P1

By storing the position (i, j) of the maximum value of each region in the forward propagation process, each filter receives the propagated value at that position (See Equation 42). The other positions take a value of zero.

Convolutional layer C1

Weight update through Equation 43.

where the initial matrix is the input image; therefore:

Bias update using Equation 44.

DISCUSSION

The development of the mathematical modeling of a convolutional neural network (CNN) allowed us to demystify and address the problem posed by several authors, such as Buhrmester (Buhrmester et al., 2019), Sturm (Sturm et al., 2016) and Baselli (Baselli et al., 2020). They describe the process of a CNN as a black box, forcing researchers, in the absence of other studies, to work only with its general equations and limited information.

The detailed mathematical process of a CNN confirms the idea outlined by Buhrmester (Buhrmester et al., 2019) in the article "Analysis of Explainers of Black Box Deep Neural Networks for Computer Vision: A Survey", which emphasizes the importance of a complete knowledge of the mathematical process a CNN follows for effective optimization.

The modeling of the convolution, pooling and densely connected layers for three-dimensional images in the forward propagation process required several modifications for the color-image case. However, it agrees essentially with the final equations proposed by Dumoulin (Dumoulin & Visin, 2018), Kuo (Kuo, 2016) and Koushik (Koushik, 2016) for the two-dimensional (black-and-white) case. In addition, a contribution was made to the updating of the network parameters (backpropagation in the three-dimensional case) using the stochastic gradient descent (SGD) optimization equation posed by Bottou (Bottou, 2012).

CONCLUSIONS AND SOCIAL IMPLICATIONS

§  From the exhaustive bibliographic review, it is concluded that there are no articles that present a detailed, step-by-step mathematical treatment of forward and backward propagation in a CNN for three-dimensional (color) images; even for the two-dimensional case, the information is concise and scarce, which prevents its understanding by researchers who do not master the mathematical area.

 

§  This work reaches readers with different levels of knowledge (beginner, intermediate and specialist), providing consistent, effective and understandable information on the process of one of the algorithms with the greatest impact, convolutional neural networks. It is a relevant contribution to the scientific community, allowing intermediate processes to be identified and, in turn, optimizations to be proposed on solid grounds with respect to the mathematical model.

§  The exemplification of the modeled formulas for the forward and backward propagation process demonstrated their easy application to any architecture used by researchers. Additionally, it allows verifying the number of learning parameters of the network, understanding the complexity of the model and the differences in execution times on CPU or GPU.

§  The compilation of the main cost, activation and optimization functions into tables makes it easier for the reader to model different architectures with specific functions depending on the problem to be solved, without needing to adapt the variable notation used by other authors, following the models posed in this article.


REFERENCES

Asghar, M. Z., Habib, A., Habib, A., Khan, A., Ali, R., & Khattak, A. (2019). Exploring deep neural networks for rumor detection. Journal of Ambient Intelligence and Humanized Computing, 4315–4333. https://doi.org/10.1007/s12652-019-01527-4

Baselli, G., Codari, M., & Sardanelli, F. (2020). Opening the black box of machine learning in radiology: Can the proximity of annotated cases be a way? European Radiology Experimental, 4(1), 30. https://doi.org/10.1186/s41747-020-00159-0

Berzal, F. (2018). Redes Neuronales & Deep Learning. In Redes Neuronales & Deep Learning (pp. 194–200).

Bottou, L. (2012). Stochastic Gradient Descent Tricks. In G. Montavon, G. B. Orr, & K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade (Vol. 7700, pp. 421–436). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-35289-8_25

Buhrmester, V., Münch, D., & Arens, M. (2019). Analysis of Explainers of Black Box Deep Neural Networks for Computer Vision: A Survey. ArXiv:1911.12116 [Cs]. http://arxiv.org/abs/1911.12116

Dhillon, A., & Verma, G. K. (2020). Convolutional neural network: A review of models, methodologies and applications to object detection. Progress in Artificial Intelligence, 9(2), 85–112. https://doi.org/10.1007/s13748-019-00203-0

Dumoulin, V., & Visin, F. (2018). A guide to convolution arithmetic for deep learning. ArXiv:1603.07285 [Cs, Stat]. http://arxiv.org/abs/1603.07285

Ghosh, A., & Kandasamy, D. (2020). Interpretable Artificial Intelligence: Why and When. American Journal of Roentgenology, 214(5), 1137–1138. https://doi.org/10.2214/AJR.19.22145

Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks. 15, 9.

Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Wang, G., Cai, J., & Chen, T. (2018). Recent advances in convolutional neural networks. Pattern Recognition, 77, 354–377. https://doi.org/10.1016/j.patcog.2017.10.013

Haykin, S. (2009). Neural Networks and Learning Machines. In Neural Networks  and Learning Machines (Third, pp. 478–481). Pearson-Prentice Hall.

Janocha, K., & Czarnecki, W. M. (2017). On Loss Functions for Deep Neural Networks in Classification. ArXiv:1702.05659. http://arxiv.org/abs/1702.05659

Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. ArXiv:1408.5882 [Cs]. http://arxiv.org/abs/1408.5882

Koushik, J. (2016). Understanding Convolutional Neural Networks. ArXiv:1605.09081 [Stat]. http://arxiv.org/abs/1605.09081

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386

Kullback, S., & Leibler, R. A. (1951). On Information and Sufficiency. Annals of Mathematical Statistics, 22(1), 79–86. https://doi.org/10.1214/aoms/1177729694

Kuo, C.-C. J. (2016). Understanding convolutional neural networks with a mathematical model. Journal of Visual Communication and Image Representation, 41, 406–413. https://doi.org/10.1016/j.jvcir.2016.11.003

Lazzeri, F. (2020, May 7). Selección de un algoritmo de Machine Learning. Azzure Machine Learning. https://docs.microsoft.com/es-es/azure/machine-learning/how-to-select-algorithms

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539

Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Proceedings of the 26th Annual International Conference on Machine Learning - ICML ’09, 1–8. https://doi.org/10.1145/1553374.1553453

Liu, T., Fang, S., Zhao, Y., Wang, P., & Zhang, J. (2015). Implementation of Training Convolutional Neural Networks. ArXiv:1506.01195 [Cs]. http://arxiv.org/abs/1506.01195

Lopez-Pinaya, W., Vieira, S., Garcia-Dias, R., & Mechelli, A. (2020). Chapter 10—Convolutional neural networks. In A. Mechelli & S. Vieira (Eds.), Machine Learning (pp. 173–191). Academic Press. https://doi.org/10.1016/B978-0-12-815739-8.00010-9

Martineau, M., Raveaux, R., Conte, D., & Venturini, G. (2020). A Convolutional Neural Network into graph space. ArXiv:2002.09285 [Cs]. http://arxiv.org/abs/2002.09285

Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., & Gao, J. (2020). Deep Learning Based Text Classification: A Comprehensive Review. ArXiv:2004.03705. http://arxiv.org/abs/2004.03705

Namikawa, K., Hirasawa, T., Yoshio, T., Fujisaki, J., Ozawa, T., Ishihara, S., Aoki, T., Yamada, A., Koike, K., Suzuki, H., & Tada, T. (2020). Utilizing artificial intelligence in endoscopy: A clinician’s guide. Expert Review of Gastroenterology & Hepatology, 1–18. https://doi.org/10.1080/17474124.2020.1779058

Parmar, R. (2018, September 2). Common Loss functions in machine learning. Medium. https://towardsdatascience.com/common-loss-functions-in-machine-learning-46af0ffc4d23

Perconti, P., & Plebe, A. (2020). Deep learning and cognitive science. Cognition, 203, 104365. https://doi.org/10.1016/j.cognition.2020.104365

Pesapane, F., Tantrige, P., Patella, F., Biondetti, P., Nicosia, L., Ianniello, A., Rossi, U. G., Carrafiello, G., & Ierardi, A. M. (2020). Myths and facts about artificial intelligence: Why machine- and deep-learning will not replace interventional radiologists. Medical Oncology, 37(5), 40. https://doi.org/10.1007/s12032-020-01368-8

Rao, G. A., Syamala, K., Kishore, P. V. V., & Sastry, A. S. C. S. (2018). Deep convolutional neural networks for sign language recognition. 2018 Conference on Signal Processing And Communication Engineering Systems (SPACES), 194–197. https://doi.org/10.1109/SPACES.2018.8316344

Rawat, W., & Wang, Z. (2017). Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review. Neural Computation, 29(9), 2352–2449. https://doi.org/10.1162/neco_a_00990

Ruder, S. (2017). An overview of gradient descent optimization algorithms. ArXiv:1609.04747 [Cs]. http://arxiv.org/abs/1609.04747

Sturm, I., Lapuschkin, S., Samek, W., & Müller, K.-R. (2016). Interpretable deep neural networks for single-trial EEG classification. Journal of Neuroscience Methods, 274, 141–145. https://doi.org/10.1016/j.jneumeth.2016.10.008

Van Houdt, G., Mosquera, C., & Napoles, G. (2020). A review on the long short-term memory model. Artificial Intelligence Review. https://doi.org/10.1007/s10462-020-09838-1

Wadawadagi, R., & Pagi, V. (2020). Sentiment analysis with deep neural networks: Comparative study and performance assessment. Artificial Intelligence Review. https://doi.org/10.1007/s10462-020-09845-2

Wu, J. (2017). Introduction to Convolutional Neural Networks. 2–28. https://cs.nju.edu.cn/wujx/paper/CNN.pdf

Zhang, L., Wang, S., & Liu, B. (2018). Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4). https://doi.org/10.1002/widm.1253