DOI: https://doi.org/10.37811/cl_rcm.v6i5.3159
Method of bayesian concordance and its application in problems of multiclass classification with unbalanced categories
Ricardo Borja-Robalino
https://orcid.org/0000-0002-3899-1140
Antonio Monleón-Getino
https://orcid.org/0000-0001-8214-3205
José Rodellar
https://orcid.org/0000-0002-1514-7713
ABSTRACT
The Machine or Deep Learning classification techniques use several performance evaluations measures. The kappa index is a highly undervalued measure regardless of its reliability in problems with unbalanced classes. On the other hand, Bayesian methods generate great contributions to statistics, adding uncertainty to the probabilistic model that allows estimating parameters with better adjustments. This research offers an innovative alternative for researchers by designing a free access library in the RStudio environment that evaluates classifiers through a measure of Bayesian-frequentist agreement. It uses three Bayesian models (Dirichlet, Multinomial-Dirichlet and Beta) with the Markov Monte Carlo chain method. The library was applied to the classification of leukemic cells at the Hospital Clínic (Barcelona), demonstrating its effectiveness in using the Bayesian kappa index for unbalanced data in relation to other measures, as well as the robustness and sensitivity of the design. For teaching use, the library has an additional function that simulates classifiers through a multinomial distribution, allowing them to be evaluated.
Keywords: Markov chains Monte Carlo; Bayesian inference; kappa index; Bayes Theorem; Decision Theory.
Correspondencia: [email protected]
Artículo recibido 10 agosto 2022 Aceptado para publicación: 10 septiembre 2022
Conflictos de Interés: Ninguna que declarar
Todo el contenido de Ciencia Latina Revista Científica Multidisciplinar, publicados en este sitio están disponibles bajo Licencia Creative Commons .
Cómo citar: Borja Robalino, R., Monleón Getino, A., & Rodellar, J. (2022). Método de concordancia bayesiana y su aplicación en problemas de clasificación multiclase con categorías desequilibradas. Ciencia Latina Revista Científica Multidisciplinar, 6(5), 1064-1090. https://doi.org/10.37811/cl_rcm.v6i5.3159
Método de concordancia bayesiana y su aplicación en problemas de clasificación multiclase con categorías desequilibradas
RESUMEN
Las técnicas de clasificación Machine o Deep Learning utilizan varias medidas de evaluación del rendimiento. La índice kappa es una medida muy infravalorada independientemente de su fiabilidad en problemas con clases desequilibradas. Por otro lado, los métodos bayesianos generan grandes aportes a la estadística, agregando incertidumbre al modelo probabilístico que permite estimar parámetros con mejores ajustes. Esta investigación ofrece una alternativa innovadora para los investigadores al diseñar una biblioteca de libre acceso en el entorno RStudio que evalúa clasificadores a través de una medida de concordancia bayesiana-frecuentista. Utiliza tres modelos Bayesianos (Dirichlet, Multinomial-Dirichlet y Beta) con el método de cadena Markov Monte Carlo. La biblioteca se aplicó a la clasificación de células leucémicas en el Hospital Clínic (Barcelona), demostrando su eficacia en el uso del índice bayesiano kappa para datos desequilibrados en relación con otras medidas, así como la robustez y sensibilidad del diseño. Para uso docente, la biblioteca cuenta con una función adicional que simula clasificadores a través de una distribución multinomial, lo que permite evaluarlos.
Palabras clave: cadenas de markov monte carlo; inferencia bayesiana; indice kappa; teorema de bayes; teoría de la decisión.
1. INTRODUCTION
Currently, multiclass classification problems are essentially focused on the development and continuous improvement of machine learning algorithms applied to large volumes of data (Maxwell et al., 2018). However, the efficiency of these classifiers is affected by the continuous occurrence of cases with categories significantly less represented than others (unbalanced). This has promoted techniques that reduce the distance in relation to proportions between classes, such as the algorithm Adaboost and others. In the same way, these alternative solutions still generate a crossroad for the research with very few options when evaluating classifiers (Shuo & Xin, 2012). Making accuracy the most widely used discriminatory measure when comparing observers or classification algorithms. In conceptual terms, the accuracy represent an overall measure of how close the result is with respect to a reference (Westgard, 2008).
An alternative is the Cohen's proposal with the kappa index as a measure of the concordance analysis between two human observers or classifiers. It offers a reliable comparison for the unbalanced multiclass case to know which is the best or worst classifier through the agreement observed, corrected by random effects that present susceptibility to biased classes. It is applicable to all areas, especially in medical areas such as the case of diagnosis and interpretation of findings related to examinations. Kappa formulation relates to the agreement observed as the number of elements correctly classified and what would be expected by chance to the instances of each class together with the elements that the observer or classifier agreed with the absolute truth (McHugh, 2012). The kappa index is taken as a parameter of reliability (accuracy) in cases where the absolute truth is known (gold standard). Otherwise, its validity is treated under the sensitivity and specificity (Brennan & Prediger, 1981).
Based on frequentist theory, the kappa index uses the information of a sample, based on the probability that an event occurs in relation to the pattern observed so far. However, the appearance of the Bayesian method, which introduces information such as the degree of belief that is given to an event through previous knowledge, expectations, experience of the researcher and others, make this alternative approach highly promising.
The Bayesian inference assumes the parameter as a random variable under a certain distribution, computed through the product of a prior distribution by the probabilistic model (Likelihood), which describes the prior knowledge of the variable of interest contained in the a posteriori distribution. The estimation of the desired value is a decision problem that uses the Bayes Theorem, allowing to achieve solid and robust results.
This paper focuses on the analysis of agreement for the case of unbalanced classes using Cohen's kappa index, using Bayesian methods, showing robustness and effectiveness when making specific calculations according to observers or classifiers. The specific objective is the design of the KFreqBay library in RStudio, which allows concordance analysis (kappa index), applicable to multiclass data with unbalanced categories. The library includes both the frequentist and Bayesian method with Markov Monte Carlo chains (MCMC), which can simulate a gold standard and classifiers with multinomial distribution or work with a set of data preset by the user. Three Bayesian models are applied: two Dirichlet distributions, one Dirichlet mixture with one Multinomial, and finally the use of two Beta distributions. They allow to effectively estimate the kappa index.
It is important to emphasize that the library developed is unique at the level of the R software, considering that it merges the two methods (frequentist and Bayesian) in the evaluation of classifiers. The user does not need the tedious job of installing several additional libraries. The library presents an additional report and relevant images in pdf format with the basic and necessary statistics of each pair of existing classifiers in the database, thus reducing working time of researchers when issuing a decision based on the best or worst observer or classifier algorithm.
The library was validated in two ways: 1) applied to a set of values obtained by simulation; and 2) applied to the classification results of an unbalanced database of digital microscope images of leukemic cells from peripheral blood of patients of the Hospital Clinic (Barcelona - Spain). Classifications were performed using machine learning techniques such as Linear Discriminant Analysis, Support Vector Machine and Random Forest.
2. THEORETICAL BACKGROUND
2.1 Concordance in categorical data
The concordance can be defined as the analysis that allows to measure the degree of agreement between two or more classifiers or observers, determining to what extent their results coincide in relation to the same phenomenon (Fleiss et al., 2003).
The Cohen kappa index is used in dichotomous cases and constitutes the observed agreement corrected for the effects of chance, proposing a standardized measure between [-1, 1]. It is formulated from the contingency table or well-known as the confusion matrix, which represents the frequency of hits and disagreements between methods for each category analyzed (See Table 1).
Table 1. Table of frequencies - dichotomous variables
OBSERVER 2 |
|
OBSERVER 1 |
|||
POSITIVE |
NEGATIVE |
TOTAL |
|
||
POSITIVE |
f11 |
f10 |
F1T |
|
|
NEGATIVE |
f01 |
f00 |
F0T |
|
|
TOTAL |
f1T |
f0T |
N |
|
The most common agreement measures are (Borja, 2019):
· Index Kappa:
where:
Po = Proportion of observed agreement.
Pe = Proportion of expected agreement by chance (product of marginal frequencies).
· Classification error or average error. - Proportion of misclassified cases.
· Positive True or sensitive. - Proportion of positive cases well classified.
· Negative True or specificity. - Proportion of well-ranked negative cases.
· False Positive. - Proportion of misclassified positive cases (Type I error).
· False Negative. - Proportion of badly classified negative cases (Type II error).
· Accuracy. - It indicates the degree of reproducibility of responses between observers.
· Confidence interval of the kappa coefficient (95%). - Corresponds to kappa ± the approximate standard error.
· McNemar test. - Determines whether or not there is a systematic difference between two observers.
The value of the agreement increases while the classes are distributed asymmetrically by the observers. Its effect is contrary to the measure that increases the number of classes and even more if they are biased, showing great sensitivity to unbalanced cases (Watson & Petrie, 2010). For the assessment of the degree of agreement, the proposal of Landis & Koch (1977) is used, assuming that the agreement is exactly what was expected by chance in the case of having (see Table 2).
Table 2. Valuation table of kappa index
Kappa |
Degree of agreement |
< 0 |
Without agreement (less than expected by chance) |
(0 – 0.2] |
Insignificant. |
(0.2 – 0.4] |
Low |
(0.4 – 0.6] |
Moderate |
(0.6 – 0.8] |
Good |
(0.8 – 1] |
Very good |
In cases of comparison of more than two evaluators (multinomial), the main measure of agreement is the Fleiss kappa index. It represents the corrected observed agreement between classifiers in the case where all the evaluators take a random result (Garabedian et al., 2017). It is formulated as:
where:
· r= number of categories; p = proportion of positive agreements; q= proportion of negative agreements.
· n= number of samples; m = the number of trials of each evaluator for each case.
· = the number of observers who assign the i-th subject to the j-th category.
A hypothetical example in the dichotomous case may be the need to know if a new image processing equipment, which allows detecting lung cancer more economically and quickly, can replace the old device (gold standard). For this, 900 images have been analyzed with the two teams, obtaining the following results (see Table 3):
Table 3. Results of the illustrative example
EQUIPMENT 2 |
|
EQUIPMENT 1 |
||
|
POSITIVE |
NEGATIVE |
TOTAL |
|
POSITIVE |
59 |
12 |
71 |
|
NEGATIVE |
4 |
825 |
829 |
|
TOTAL |
63 |
837 |
900 |
Applying the equations (2.1, 2.5, 2.6, 2.7, 2.8, 2.9 and 2.10) we have:
Kappa: 0.8712 ; Confidence interval:
Sensitivity: =0.93 ; Specificity: 0.98; Accuracy: 0.982
The results show that the new equipment has a high accuracy. However, taking into account that we work with unbalanced classes, we observe the kappa index where we can assess as a very good agreement between two devices (87.12%), correcting the effect of chance (86.21%). It indicates that there is greater accuracy when the team gives negative the existence of lung cancer (98%), more than in a positive response (93%). This concludes that it is an excellent option to replace the old equipment.
2.2 Probability distributions
This section presents a summary of the main probability distributions related to categorical variables in the Bayesian environment.
Multinomial Distribution
The multinomial distribution (Sheldom, 2014) is a generalization of the binomial for a multinomial random variable , with excluding events , respective probabilities :
, . . . . . , ,
The probability that the event happens times, successively forming a partition of the sample space , is called multinomial distribution and its mass function is:
where:
·
· For k=2 it is reduced to a binomial distribution.
Beta Distribution
It is widely used in continuous variables with restrictions in a range of length (0.1), and the most used in Bayesian inference as a priori distribution, due to its good adjustment to a wide variety of empirical distributions (Gupta & Nadarajah, 2004).
In the beta distribution its density function is:
where:
· is the gamma function.
· .
· y are profile parameters.
· It is asymmetric if , with have positive asymmetry and negative asymmetry.
· If then .
Dirichlet distribution
This distribution is one of the most used within Bayesian inference as a priori distribution representing uncertainty in results of categorical and multinomial distributions (Blei, Nigle, & Jordan, 2003). It is the multivariate generalization of the beta distribution (k = 2) and of a continuous multivariate family. Its density function is:
where:
· for all
· Probability of each category= number of categories.
·
2.3 Bayesian inference
It is a statistical method that allows obtaining a more precise prediction of a parameter of interest (posteriori), adding previous information of the event (priori) to the probabilistic model (likelihood). The Bayesian inference is characterized by assuming as a random variable under a certain distribution. The estimation of the desired value is a decision problem that uses the Bayes Theorem, allowing to achieve solid and robust results. This type of inference allows introduce uncertainty into the data and regulate predictions (Shridhar et al., 2019), adjusting the parameters of the distribution in the continuous case or by depending on the prevalence of the classes in the categorical case. It is very useful for unbalanced multiclass cases (Sanjib et al., 2000).
2.3.1 Decision theory
It is a process based on established criteria that allows responding with the highest reliability to an observer who faces a decision problem in an environment of uncertainty.
It is defined as a quatrain , which starts with a problem that comprises a set of decisions, associated with a set of relevant uncertain events . Each event has a consequence , which if there is more than one the order relation is used , which determines which is the most appropriate (Laurence & Pascal, 2009) (see Figure 1).
Figure 1: Decision tree
2.3.2 Bayes theorem
Fulfilling the assumptions of disjoint and exhaustive events, Bayes proposes the following theorem that presents the probability of a random event mutually exclusive given in terms of the conditional probability of the event given and the marginal distribution of (Bradley 2013; Press, 2009):
where = a posteriori probability, = conditional probability, = a priori probability and = total probability.
In a simpler and concise way, it can be stated:
The assigned a priori distribution can be of three types: informative if it incorporates information from previous analyzes, not informative if it is constructed based on subjective considerations and finally of structural type in the case that it incorporates information on relationships between parameters (D`Agostini, 2003).
The calculation of a posteriori probability starting from a priori generates multiple numerical difficulties that can trigger illogical results and with great complexity in their interpretation. However, this shortcoming can be covered by working with conjugated distributions that comply with the following property:
A family of a prior distributions on is said to be conjugated for sampling if: for any prior in , the corresponding posteriori also belongs to (Cristóbal, 2000).
The Table 4 presents different mixtures of conjugated families.
Table 4. Conjugated families (Cristóbal, 2000).
Priori |
Likelihood |
Posteriori |
Non-informative Prior parameter |
Beta |
Bernoulli |
Beta |
|
Dirichlet |
Multinomial |
Dirichlet |
|
Multinomial |
Dirichlet |
Beta |
|
Gamma |
Poisson |
Gamma |
|
Normal |
Normal |
Normal |
|
Gamma |
Normal |
Gamma |
|
Beta |
Binomial |
Beta |
|
Binomial |
Binomial |
Beta |
|
2.3.3 Markov Chains Monte Carlo
Markov Chains describe a discrete stochastic process that evolves probabilistically over time (Hillier & Lieberman, 2010), where, the probability of a subsequent event depends on the immediately preceding event (markovian property). Generating a short memory effect in the chains that allow conditioning future probabilities:
Monte Carlo simulation is defined as the way to estimate a fixed parameter through the repeated generation of random numbers (Chib et al., 2002).
Monte Carlo Markov Chains (MCMC) are defined as a simulation method that allows generate samples of the distribution afterwards, estimating quantities or parameters of interest through random sampling in a probabilistic space. MCMC are used in Bayesian inference to solve the difficult task of calculating the a posteriori probability of the Bayes Theorem, in cases with complex distributions. MCMC perform a series of repetitions of points of the M-dimensional space through a random number generator, recognizing the behavior of the system (Lebreton et al., 2004). Calculations can be developed through several algorithms, the most common being Gibbs sampling, which is considered as a particular case of the Hasting Metropolis.
The algorithm has a burn-in phase, which is the process that accelerates the convergence of the chain by eliminating points that are outside the contour of the stationary process, due to its low probability when starting the algorithm. For the diagnosis of convergence of one or more Markov chains to the estimated value, Gelman Rubin scale reduction factors are commonly used to compare variations within and between the chains. Figure 2 represents a two-dimensional MCMC algorithm.
Figure 2. Two-dimensional MCMC algorithm (Ford, 2015).
3. DEVELOPMENT OF THE LIBRARY
The motivation for the library is to help solve common real-life problems in relation to multiclass classification with unbalanced categories due to the continuous development of new machine learning algorithms. Focused on the methods of concordance with the application of statistical inference and the punctual estimation of the kappa index and other general statistics, the library allows a robust and efficient concordance analysis. It is applicable to data with dichotomous and politomic variables with unbalanced categories using the Frequentist and Bayesian method using Monte Carlo Markov chains (MCMC), either by creating a standard gold and classifiers with multinomial distribution by simulation or through a set of preset data. The KfreqBay library has a wide range of use either for research, educational or other applications, in general related to the evaluation of unbalanced multiclass classifiers based on concordance analysis, allowing both frequent and Bayesian perspectives in a robust, efficient way and fast.
In practice, Bayesian inference was implemented to estimate the Cohen's kappa index, by designing a library in the R language, obtaining as a result a frequentist and Bayesian concordance analysis, very effective in the unbalanced multiclass case.
The software JAGS (Just Another Gibbs Sampler) linked to the integrated development environment RStudio with the rjags library, made it possible to analyze Bayesian hierarchical models by applying Monte Carlo Markov Chains using the Gibbs sampling algorithm. We worked with the Cohen kappa index comparing several classifiers in pairs, because Fleiss multinomial kappa can sometimes return low values even when the agreement is really high (Powers, 2012).
A number of primary and secondary functions were programmed that, in the first instance, convert the input data into the appropriate format for the respective calculation. The frequentist analysis was made by extracting the general statistics, and a descriptive graphical analysis of the proportion of the classes was obtained. Three Bayesian models were developed that estimate the parameter of interest, demonstrating robustness and sensitivity of the proposed model with the significant contribution of the chosen distributions for the likelihood. The library allows to add information about the prevalence of the classes in the form of probability at the moment of performing the Bayesian calculation.
For educational and experimental purposes, we have the option of simulating the response of a classifier through a multinomial distribution, building the gold standard and several observers according to the characteristics pre-established by the user. Consequently, the analysis of frequentist and Bayesian concordance is carried out, simultaneously.
3.1 Bayesian models
In order to achieve logical and interpretable results, the Bayesian models were based on the mixture of conjugated families in Table 4 proposed by Cristóbal (2000). It was designed in text format using the function textconnection (model) for its analysis within the JAGS environment.
For the three models, the following likelihood function was proposed:
This is formulated starting from a Bernoulli distribution , with , considering that we work with categorical variables in the dichotomous case and with values generally between (0,1).
For the first model we used a Dirichlet distribution:
Applying the Bayes theorem we may write:
Regarding the programming of the models, only the first will be detailed. Using equation (2.1) of the kappa index, the following algorithm was proposed:
#Programming in R
Model <- "model {
# Verosimilitud
kappa <- (p_agreement - expected_agreement) / (1 -expected_agreement)
expected_agreement <- sum(p1 * p2)
for (i in 1:n_ratings) {
rater1[i] ~ dcat(p1)
rater2[i] ~ dcat(p2)
agreement[i] ~ dbern(p_agreement) }
# Parámetros priori
p1 ~ ddirch(alpha)
p2 ~ ddirch(alpha)
p_agreement ~ dbeta(1, 1)
alpha <- prob }"
For the second model a function is proposed that represents the mixture of a Dirichlet - Multinomial distributions, described and developed by Monleón-Getino (2018) and Monleón-Getino et al. (2019):
Therefore, by using the Bayes theorem, we have:
For the third model, we worked with two prior Beta distributions with density function:
Then,
Therefore, it was assumed that the responses of each classifier follow a previous distribution (beliefs), with prevalence introduced by the parameters of the priori. Together with our probabilistic model, they adjust the kappa estimate to different realities.
3.2 Library in the RStudio environment
The library created in R language has the name of KfreqBay, is free access, in zip format, installable in RStudio and downloadable from the address: https://github.com/RicardoBorja. It has the function K_Freq_Bay that allows running a frequentist and Bayesian concordance analysis with either a specific database or by simulating a gold standard and observers. It also includes a help menu (? K_Freq_Bay), which allows the user to know information about the parameters and illustrative examples that familiarize them with the process.
The K_Freq_Bay function has default values, with accessibility to changes according to the needs of the user. The function has the following form, where the arguments used are included:
K_Freq_Bay(data=FALSE,setseed=1234,num_mult=1000,burn_in=10000,chains=2,updat=1000,thin_=1,iter_thin_=20000,models=1, DIC_=0)
The designed library avoids the user the tedious activity of installing additional functions that a package requires for its proper functioning, automatically installing or activating everything necessary for the optimal execution of K_freq_Bay. In addition, it has a friendly environment that guides the process step by step, in the two cases of simulation or the use of a specific database. The required information is entered through the keyboard in numerical form.
In the case of simulation of a gold standard and classifiers, the process starts with previous information, followed by two options to create the sample: enter the size of the categories or their probabilities (see Figure3).
Figure 3: Generation of data through the number of categories and sample size
Consequently, the number of observers and the desired precision are chosen. Once the database is created by simulation, the frequentist and Bayesian analysis is carried out, requesting if the user wishes that the prevalence of the a priori distribution is equiprobable or not. Figure 4 shows the case of information addition.
Figure 4. Bayesian analysis including a prior information
For the case of analyzing a given database, the previous information is known and the user goes directly to the option of Figure 4.
The library in any of the two cases presents as outputs a descriptive graph of the proportion of the classes, density graphs of the frequentist versus Bayesian kappa index, graphs of convergence diagnosis of Gelman Rubin, self-correlation, stationarity and final report of statistical values frequentist and Bayesian generals. They are generated for all possible pairs of classifiers (with the gold standard and with each other), with pdf format in the work folder. In addition, in the environment RStudio returns a list with:
1. Report of Gelman Rubin, Raftery Lewis and Cramer Von Mises (Methods to assessing Markov Chain Convergence).
2. Final report of the general statistics.
3. Final report in case of sample size changes.
In the case of simulation, at the end of the process the user has the option of changing the sample size while maintaining the same probabilities in the classes. This allows to know the different variations according to the increase or decrease of data.
A more detailed explanation is available in the thesis project through the link: https://upcommons.upc.edu/handle/2117/127344.
4. RESULTS
We evaluated the accuracy and sensitivity in the estimate of the Bayesian kappa index with the KFreqBay library, presenting different use cases. Three observers were simulated with a gold standard and five categories, under two scenarios: the first with frequencies of 200, 300, 400, 20, 1; in the second, the probability of each class was retained and the sample size was changed from 921 to 9000.
In the Bayesian part of each process, tests were carried out assuming equiprobability and with prevalence of 0.15, 0.40, 0.05, 0.20 and 0.20 in the classes for a prior distribution. Tables 5 and 6 summarize the results obtained in the estimation of the kappa index only of the gold standard compared to the first classifier. Final reports and graphs of all pairs of observers can be observed at: https://github.com/RicardoBorja.
Table 5. Results frequentist method - validation process
FREQUENTIST METHOD |
|
|||||
SAMPLE SIZE |
PAIR OF OBSER. |
KAPPA LOWER |
KAPPA |
KAPPA UPPER |
ACCURACY |
|
921 |
1-2 |
0.8622 |
0.8873 |
0.9124 |
0.9251 |
|
9000 |
1-2 |
0.8720 |
0.8802 |
0.8885 |
0.92 |
|
Table 6. Bayesian method results with three models - validation process
BAYESIAN METHOD |
|||||||
SAMPLE SIZE |
MODEL |
PAIR OF OBSER. |
KAPPA LOWER |
KAPPA |
KAPPA UPPER |
P-VALUE 2 CHAINS |
EQUIPROBABLE CATEGORIES |
921 |
DI-DI |
1-2 |
0.8591 |
0.8867 |
0.9106 |
(0.37;0.13) |
YES |
921 |
DI-DI |
1-2 |
0.8590 |
0.8864 |
0.9104 |
(0.33;0.41) |
NOT |
921 |
DI-MUL |
1-2 |
0.8030 |
0.9084 |
0.9339 |
(0.64;0.30) |
YES |
921 |
DI-MUL |
1-2 |
-2.7271 |
0.9120 |
0.9372 |
(0.60;0.72) |
NOT |
921 |
BE-BE |
1-2 |
0.9062 |
0.9244 |
0.9404 |
(0.27;0.05) |
YES |
921 |
BE-BE |
1-2 |
0.9061 |
0.9245 |
0.9404 |
(0.43;0.96) |
NOT |
9000 |
DI-DI |
1-2 |
0.8716 |
0.8802 |
0.8884 |
(0.95;0.89) |
YES |
9000 |
DI-DI |
1-2 |
0.8718 |
0.8802 |
0.8884 |
(0.34;0.70) |
NOT |
9000 |
DI-MUL |
1-2 |
0.7984 |
0.9047 |
0.9215 |
(0.20;0.39) |
YES |
9000 |
DI-MUL |
1-2 |
-2.7848 |
0.9108 |
0.9239 |
(0.84;0.88) |
NOT |
9000 |
BE - BE |
1-2 |
0.9142 |
0.9199 |
0.9254 |
(0.17;0.56) |
YES |
9000 |
BE-BE |
1-2 |
0.9142 |
0.9200 |
0.9255 |
(0.11;0.09) |
NOT |
As observed in Tables 5 and 6, the reports generated by the three models were analyzed by checking the robustness and sensitivity of the proposed distributions within the probabilistic model and a prior probability in unbalanced multiclass cases. The Dirichlet-Dirichlet model presented a mesokurtic density with greater stability in both sample sizes, whereas the Dirichlet - Multinomial was leptokurtic in the equiprobable case and totally opposed when entering information. Finally, the Beta - Beta model presented a very narrow credibility interval that makes it very restrictive. In all cases, the chains converge to the estimated value with good precision in relation to the frequentist method. However, the Dirichlet - Dirichlet distribution is considered the most optimal and stable distribution for calculating the Bayesian kappa index.
4.1 Application of the Bayesian concordance analysis by K_Freq_Bay to the database of classification of leukemic cells in peripheral blood – Hospital Clinic.
The library was applied to the results of the automatic classification of peripheral blood digital images used for the initial diagnosis of leukemias and lymphomas (Boldú et al., 2019). They were obtained by the Cellsilab group formed by researchers from the CORE Laboratory of the Biomedical Diagnostic Center of the Hospital Clinic of Barcelona and the Mathematics Department of the Technical University of Catalonia.
The classifications were generated by three types of machine Learning algorithms (Linear Discriminant Analysis LDA, Support Vector Machine SVM and Random Forest RF) before and after the application of techniques of down – sampling and up – sampling to compensate for unbalanced classes. In this way six classification results were available for our study. For the gold standard we worked with 4365 data distributed in four categories: CLR reactive cells (338), acute lymphoid leukemia LAL (521), acute myeloid leukemia LAM (2839) and acute myeloid leukemia promyelocytic LAM_ PROM (667) (see Figure 5).
Figure 5. Frequency chart - gold standard
Figure 6 shows a decision tree that expresses the problem posed, taking the following considerations: CLR =L1, LAL= L2, LAM=L3 Y LAM_PROM = L4, CT = The classifier gave a positive interpretation of the cell when it was correct, CF = The classifier gave a negative interpretation of the cell when it was incorrect (see Figure 6).
Figure 6. Decision tree – application to leukemic cells
The positive and negative classification of the four cell types analyzed, applying equation (2.19) of the Bayes Theorem, follow the same pattern as that described below for the reactive cells (L1):
Next, we present the results of the best classifier analyzed with the Dirichlet - Dirichlet model, in this case the Linear Discriminant Analysis (LDA) (see Figure 7). In addition, the final and graphic reports of all the pairs of classifiers are published at: https://github.com/RicardoBorja.
Figure 7. Algorithm results LDA-TRUE
A slight increase in the credibility intervals (K Bayesian) and greater shoring in the posteriori kappa distribution are visualized, considering that a 95% credibility interval was worked on, representing the interval where there is a probability equal to 0.95 that contain kappa. In addition, it is observed that the two chains do not show correlation and converge with a burn-in of 10000.
Two more tests were performed adding information in the a priori distribution, taking into account the prevalence of each leukemic cell at Hospital Clinic level (inside) and Spain (outside). Tables 7 and 8 summarize the results obtained in the estimation of the kappa index only of the gold standard compared to LDA. Final reports and graphs of all pairs of observers can be observed at: https://github.com/RicardoBorja.
Table 7. Results frequentist method - application to leukemic cells LDA-TRUE.
FREQUENTIST METHOD |
|||||
SAMPLE SIZE |
PAIR OF OBSER. |
KAPPA LOWER |
KAPPA |
KAPPA UPPER |
ACCURACY |
4365 |
1-2 |
0.7251 |
0.7444 |
0.7639 |
0.8685 |
Table 8. Bayesian method results with prevalence of leukemic cells LDA-TRUE.
BAYESIAN METHOD |
||||||||
SAMPLE SIZE |
MODEL |
PAIR OF OBSER. |
KAPPA LOWER |
KAPPA |
KAPPA UPPER |
P-VALUE 2 CHAINS |
PREVALENCE |
|
4365 |
DI-DI |
1-2 |
0.7236 |
0.7444 |
0.7641 |
(0.84;0.94) |
NOT |
|
4365 |
DI-DI |
1-2 |
0.7239 |
0.7443 |
0.7644 |
(0.97;0.14) |
INSIDE |
|
4365 |
DI-DI |
1-2 |
0.7236 |
0.7422 |
0.7638 |
(0.97;0.77) |
OUTSIDE |
|
4365 |
DI-MUL |
1-2 |
0.5451 |
0.8339 |
0.8707 |
(0.31;0.53) |
NOT |
|
4365 |
DI-MUL |
1-2 |
-55.26 |
0.8264 |
0.8747 |
(0.05;0.67) |
INSIDE |
|
4365 |
DI-MUL |
1-2 |
- |
- |
- |
- |
OUTSIDE |
|
4365 |
BE-BE |
1-2 |
0.8582 |
0.8683 |
0.8782 |
(0.51;0.80) |
NO |
|
4365 |
BE-BE |
1-2 |
0.8581 |
0.8683 |
0.8783 |
(0.97;0.80) |
INSIDE |
|
4365 |
BE-BE |
1-2 |
0.8583 |
0.8683 |
0.8782 |
(0.11;0.41) |
OUTSIDE |
|
The study showed that the Dirichlet - Dirichlet model was the most optimal and robust in the estimation of the kappa index, demonstrating a high convergence value of its two chains in all cases of prevalence, especially in the most extreme (outside – p-value = (0.97;0.77)), unlike the other two models. In addition, their credibility intervals become more leptokurtic while the prevalence is more extreme, adjusting kappa effectively for unbalances cases. The percentage of variation of kappa in each model is small due to the high amount of sample data with which we work. The best algorithm was LDA, which presented a good agreement in relation to the gold standard with an observed agreement of 86.8% and expected by chance of 48.55%; while SVM and RF had a moderate agreement.
4.2 Index Kappa versus accuracy
The KFreqBay library was applied using the Dirichlet - Dirichlet model to the algorithm with higher and lower accuracy (LDA and RF), in the leukemic cell database, randomly selecting 10%, 25%, 50%, 75% and 100% of the sample size (4365). We worked under three scenarios: equiprobable, prevalence of the Hospital and of Spain. Figures 8 represent the results obtained.
Figure 8. Kappa evolution graph and accuracy by sample size – LDA (1-2) y RF (1-6).
It was confirmed that in both classifiers with high (observer: 1-2) or low precision (observer: 1-6) the kappa index especially Bayesian, when adding a prior information (Bay-in and Bay-out) shows a greater sensitivity to sample change and class proportion. Their credibility intervals increase in relation to the frequentist kappa by adding information, taking into account that in the Bayesian results in both algorithms the more critical the prevalence between them the credibility intervals decrease, thus improving the parameter estimation. However, it is evident that the accuracy is almost invariable, even more so in algorithms with lower performance. It was shown that the best way to compare classifiers, especially with unbalanced classes, is through Bayesian concordance methods.
5. CONCLUSIONS
The kappa index is a very efficient but underutilized metric, which allows to know the performance of a classifier especially in the case of unbalanced multiclass problems, in comparison with the accuracy, which is a widely used measure but does not provide a complete picture of the performance of the analyzed classifiers.
The Bayesian kappa index (BKI), that we propose, is the optimal tool to evaluate the degree of agreement between two observers or classifiers in the unbalanced multiclass case, due to the correction of the chance effect. It allows enter information of the prevalence within the prior distribution. The three Bayesian models implemented in this paper demonstrated the robustness and sensitivity of the KFreqBay library executable in the R environment with free access. It allows to develop a frequentist and Bayesian concordance analysis either with a pre-established database or through the simulation of a gold standard and observers through a multinomial distribution. When the sample size decreases and the frequencies of the classes are more extreme, the kappa index shows sensitivity and experiences a widening of the credibility intervals. The expected agreement by chance reacts inversely proportional to kappa index.
The Bayesian concordance analysis applied in the case study of the classification of leukemic cells highlights the advantages and effectiveness of the method proposed with the designed library. In fact, the adjustments in the value of kappa under extreme prevalence scenarios allowed us to know the differences when evaluating a classifier depending on the reality to which it is exposed, in this case at the level of the Hospital Clinic or at the level of Spain.
REFERENCES
Blei, D. M., Nigle, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
Boldú, L., Merino, A., Alférez, S., Molina, A., Acevedo, A., Rodellar, J. (2019). Automatic recognition of different types of acute leukemia in peripheral blood by images analysis. Journal of clinical Pathology, DOI: 10.1136/jclinpath-2019-205949.
Borja, R. S. (2019). Método de concordancia bayesiano y su aplicación en problemas de clasificación multiclase con categorías desequilibradas (Universidad Politécnica de Cataluña - UB). Available at: https://upcommons.upc.edu/handle/2117/127344
Bradley, E. (2013). Bayes’ Theorem in the 21st Century. Science, 340(6137), 1177-1178, DOI:10.1126/science.1236536.
Brennan, R. L., & Prediger, D. J. (1981). Coefficient Kappa: Some Uses, Misuses, and Alternatives. Educational and Psychological Measurement, 41(3), 687-699, DOI:10.1177/001316448104100307.
Chib, S., Nardari, F., & Shephard, N. (2002). Markov chain Monte Carlo methods for stochastic volatility models. Journal of Econometrics, 108(2), 281-316, DOI: 10.1016/S0304-4076(01)00137-3.
Cristobal, A. (2000). Inferencia Estadística (2da ed.). Zaragoza: Prensas Universitarias de Zaragoza.
D`Agostini, G. (2003). Bayesian inference in processing experimental data: Principles and basic applications. Reports on Progress in Physics, 66(9), 1383–1419, DOI: 10.1088/0034-4885/66/9/201
Fleiss, J., Levin, B., & Cho, M. (2003). Statistical Methods for Rates and Proportions (3era ed.). New Jersey: John Wiley & Sons.
Ford, E. B. (2015, junio 5). Convergence Diagnostics For Markov Chain Monte Carlo [online]. Available at: https://astrostatistics.psu.edu/RLectures/diagnosticsMCMC.pdf
Garabedian, C., Butruille, L., Drumez, E., Servan Schreiber, E., Bartolo, S., Bleu, G., … Houfflin-Debarge, V. (2017). Inter-observer reliability of 4 fetal heart rate classifications. Journal of Gynecology Obstetrics and Human Reproduction, 46(2), 131-135, DOI:10.1016/j.jogoh.2016.11.002.
Gupta, A., & Nadarajah, S. (2004). Handbook of Beta Distribution and Its Applications. Broken Sound Parkway NW: CRC Press.
Hillier, F., & Lieberman, G. (2010). Introducción a la Investigación de Operaciones (9na ed.). México: McGraw Hill.
Koch, K. (1990). Bayes Theorem (Vol. 31). Berlin: Springer.
Landis, J. R., & Koch, G. G. (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1), 159-174, DOI: 10.2307/2529310.
Laurence, M., & Pascal, M. (2009). Bayesian decision theory as a model of human visual perception: Testing Bayesian transfer. Visual Neuroscience, 26(1), 147-155, DOI:10.1017/S0952523808080905.
Lebreton, J. M., Ployhart, R. E., & Ladd, R. T. (2004). A Monte Carlo Comparison of Relative Importance Methodologies. Organizational Research Methods, 7(3), 258-282, DOI:10.1177/1094428104266017
Maxwell, A. E., Warner, T. A., & Fang, F. (2018). Implementation of machine-learning classification in remote sensing: An applied review: International Journal of Remote Sensing: Vol 39, No 9. International Journal of Remote Sensing, 39(9), 2784-2817, DOI:10.1080/01431161.2018.1433343.
McHugh, M. (2012, 15). Interrater reliability: The kappa statistic. Biochemia Medica, 22(3), 276-282.
Monleón-Getino T, Rodríguez-Casado CI, Verde PE. 2019. Shannon Entropy Ratio, a Bayesian Biodiversity Index Used in the Uncertainty Mixtures of Metagenomic Populations. Journal of Advanced statistics 4(4) 1-23.
Powers, D. M. W. (2012). The Problem with Kappa. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 345–355. Stroudsburg, PA, USA: Association for Computational Linguistics.
Press, J. (1989). Bayesian Statistics principles, models and applications. Califonia: John Wiley & Sons.
Press, James. (2009). Subjective and Objective Bayesian Statistics: Principles, Models, and Applications (2da ed.). New Yersey: John Wiley & Sons.
Sanjib, B., Mousumi, B., & Ananda, S. (2000). Bayesian Inference for Kappa from Single and Multiple Studies. Biometrics, 56(2), 577-582, DOI:10.1111/j.0006-341X.2000.00577.x
Sheldom, R. (2014). Introduction to Probability Models (11th ed.). Los Angeles, California: Academic Press.
Shridhar, K., Laumann, F., & Liwicki, M. (2019). A Comprehensive guide to Bayesian Convolutional Neural Network with Variational Inference. arXiv:1901.02731 [cs, stat].
Shuo, W., & Xin, Y. (2012). Multiclass Imbalance Problems: Analysis and Potential Solutions. IEEE Transactions on Systems, 42(4), 1119-1130, DOI:10.1109/TSMCB.2012.2187280
Watson, P., & Petrie, A. (2010). Method agreement analysis: A review of correct methodology. 73, 1167-1179.
Westgard, J. (2008). Basic method validation (3ra ed.). Wisconsin: Madison.