Phytopathometry

Author

Marcelo Gonçalves

Diagrammatic Scales Validation

This report is a simple guide for conducting and reproducing the analyses carried out in the development and validation of a diagrammatic scale. Here, we assess the ability of raters to estimate the severity of coffee rust using a scale developed by Mary Paz (2025).

Initially, we will load the dependencies and our data.

library(tidyverse)
library(gsheet)
library(r4pde)
library(patchwork)
library(purrr)

rust_unaided <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1eJksk_X5suVChmpEMBQLiU6tyqHF8R5eLhY_MRQTodY/edit?gid=0#gid=0")

rust_old <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1eJksk_X5suVChmpEMBQLiU6tyqHF8R5eLhY_MRQTodY/edit?gid=1615494808#gid=1615494808")

rust_new <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1eJksk_X5suVChmpEMBQLiU6tyqHF8R5eLhY_MRQTodY/edit?gid=1806089061#gid=1806089061")

Some analyses require the data to be in long format. Therefore, we will reshape each evaluation dataset into this format, adding a column to identify the session (type of scale) and removing unnecessary columns.

unaided_long <- rust_unaided |>
  pivot_longer(3:8, names_to = "rater",
               values_to = "values") |>
  mutate(session = "Unaided") |>
  select(-2)

old_long <- rust_old |>
  pivot_longer(3:8, names_to = "rater",
               values_to = "values") |>
  mutate(session = "old") |>
  select(-2)

new_long <- rust_new |>
  pivot_longer(3:8, names_to = "rater",
               values_to = "values") |>
  mutate(session = "new") |>
  select(-2)

Now, we will combine the values from each evaluator with the corresponding leaf and its actual severity.

actual <- rust_unaided |>
  select(leaf, actual)


df_cf <- rbind(unaided_long, old_long, new_long)

df_cf2 <- left_join(df_cf, actual)

Using the purrr package, we generated a series of plots to compare the values estimated by the evaluators with the actual severity of coffee rust across different sessions (Unaided, Old SAD, and New SAD). In each plot, the dashed line represents perfect concordance, while the solid blue line shows the fitted linear regression.

sessions <- c("Unaided", "old", "new")
titles <- c("Unaided Assessment", "Old SAD", "New SAD")

plots <- map2(sessions, titles, ~ {
  df_cf2 |> 
    filter(session == .x) |> 
    ggplot(aes(actual, values)) +
    geom_abline(slope = 1, intercept = 0, linetype = 2, linewidth = 1) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(x = "Actual", y = "Estimate") +
    ggtitle(.y) +
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5)) +
    facet_wrap(~rater)
})

plots[[1]]

plots[[2]]

plots[[3]]

We can compare the scales using each evaluator as a replicate, allowing us to observe the overall pattern.

df_cf2 <- df_cf2 |> 
  mutate(session = factor(session,
                          levels = c("Unaided", "old", "new")))
df_cf2 |> 
  ggplot(aes(actual, values, color = session)) +
  geom_abline(slope = 1, intercept = 0,
              linetype = 2, linewidth = 1) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  labs(x = "Actual", y = "Estimate") +
  scale_color_manual(values = c(
    "Unaided" = "black",  
    "old" = "#ff7f0e",      
    "new" = "#3A5FCD"       
  ))+
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(~session)

# Inter-rater Reliability

Here we want to see whether different evaluators, using the same scale, arrive at similar results for the same sample.

For each leaf, a vertical segment was plotted connecting the values estimated by the different evaluators, showing the range of variation relative to the actual severity. This variation is much greater in the unaided assessments than in those performed with the scales (old SAD and new SAD), especially for leaves with higher severity. This suggests that as severity increases, the discrepancy among evaluators also tends to increase, whereas for leaves with lower severity the variation is less pronounced. In any case, the positive effect of the scales in reducing this variability is evident.

df_cf2 |> 
  ggplot(aes(actual, values, color = session,
             group = leaf))+
  geom_line(color = "black")+
  geom_point(size = 2, alpha = 0.3)+
  labs(y = "Severity estimate (%)",
       x = "Real severity (%)") +
  scale_color_manual(values = c(
    "Unaided" = "black",  
    "old" = "#ff7f0e",      
    "new" = "#3A5FCD"       
  ))+
  theme_bw()+
  facet_wrap(~session)

This code takes the correlation matrix between evaluators, removes the diagonal (self-correlations of 1), reshapes it into long format, squares the pairwise correlations, and returns the mean coefficient of determination (R²). Here the R² quantifies inter-rater reliability: how much of the variation in one evaluator's estimates is explained by another evaluator's estimates of the same leaves.

Based on our data, the unaided estimates show lower reliability than those made with either the old or the new scale, with the new scale giving the highest mean R² (0.847).

library(GGally)

# sessions and the object names we want
sessions <- c("Unaided", "old", "new")
names_list <- c("raters", "raters_old", "raters_new")
labels <- c("Unaided", "Old SAD", "New SAD")  # <- nomes bonitos p/ tabela

# create the data frames using map2
raters_list <- map2(sessions, names_list, ~ {
  df_cf2 |> 
    filter(session == .x) |> 
    pivot_wider(
      names_from = rater,
      values_from = values
    ) |> 
    select(4:9)
})

names(raters_list) <- names_list

# compute the correlations and the mean R²
raters_cor_list <- map(raters_list, ~ {
  cor_mat <- cor(.x)
  cor_melt <- reshape2::melt(cor_mat) |> 
    filter(value != 1)
  
  list(
    cor_matrix = cor_mat,
    cor_melt = cor_melt,
    mean_r2 = round(mean(cor_melt$value^2), 3)
  )
})

# build the results table with the display labels
results_tbl <- map2_dfr(names(raters_cor_list), labels, ~ {
  tibble(
    session = .y,  # usa o label
    mean_r2 = raters_cor_list[[.x]]$mean_r2
  )
})

# display the formatted table
knitr::kable(results_tbl, digits = 3, caption = "Mean R² values by session")
Table: Mean R² values by session

| session | mean_r2 |
|:--------|--------:|
| Unaided |   0.772 |
| Old SAD |   0.815 |
| New SAD |   0.847 |

# ICC - Intraclass Correlation Coefficient

The ICC (Intraclass Correlation Coefficient) is a measure that indicates the degree of consistency or agreement among evaluations made by different evaluators on the same set of objects, such as plant leaves.

There are different types of ICC, depending on two factors:

Whether you are using a single measurement or the average of multiple measurements;

Whether all evaluators assess all objects (two-way) or each object may be evaluated by different evaluators (one-way).

One-way ICC: each object (or leaf) can be evaluated by different evaluators. Here, it does not matter exactly who evaluated each object; what matters is whether the objects differ consistently. This model is ideal when not all evaluators can assess all objects.

Two-way ICC: all objects are evaluated by the same evaluators. In this case, the ICC accounts for who evaluated and which object was evaluated, separating the variation between objects from the variation between evaluators. This allows you to see how much of the observed difference is due to the objects and how much is due to the evaluators, assuming both are random.

When using the icc function from the irr package, you need to specify which model you are using (“oneway” or “twoway”).

Here, we will use the two-way model because all evaluators evaluated all leaves individually.
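
As a cross-check of what the irr::icc() call below reports, ICC(C,1), the two-way single-score consistency ICC, can be reproduced from the mean squares of a two-way ANOVA with leaves and raters as factors. A minimal sketch, assuming the wide objects created above (one row per leaf, one column per rater); the helper icc_c1() is illustrative and not part of the original analysis:

icc_c1 <- function(ratings) {
  # reshape the wide ratings (leaves x raters) into long format
  long <- as.data.frame(ratings) |>
    dplyr::mutate(subject = factor(dplyr::row_number())) |>
    tidyr::pivot_longer(-subject, names_to = "rater", values_to = "value")
  k   <- dplyr::n_distinct(long$rater)            # number of raters
  tab <- anova(aov(value ~ subject + rater, data = long))
  ms_subject  <- tab["subject", "Mean Sq"]        # between-leaf mean square
  ms_residual <- tab["Residuals", "Mean Sq"]      # residual mean square
  (ms_subject - ms_residual) / (ms_subject + (k - 1) * ms_residual)
}

# should agree, up to rounding, with icc(raters_list$raters, "twoway") below
icc_c1(raters_list$raters)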

library(irr)
icc(raters_list$raters, "twoway")
 Single Score Intraclass Correlation

   Model: twoway 
   Type : consistency 

   Subjects = 40 
     Raters = 6 
   ICC(C,1) = 0.828

 F-Test, H0: r0 = 0 ; H1: r0 > 0 
  F(39,195) = 29.9 , p = 3.83e-63 

 95%-Confidence Interval for ICC Population Values:
  0.751 < ICC < 0.893
icc(raters_list$raters_old, "twoway")
 Single Score Intraclass Correlation

   Model: twoway 
   Type : consistency 

   Subjects = 40 
     Raters = 6 
   ICC(C,1) = 0.866

 F-Test, H0: r0 = 0 ; H1: r0 > 0 
  F(39,195) = 39.7 , p = 2.85e-73 

 95%-Confidence Interval for ICC Population Values:
  0.802 < ICC < 0.917
icc(raters_list$raters_new, "twoway")
 Single Score Intraclass Correlation

   Model: twoway 
   Type : consistency 

   Subjects = 40 
     Raters = 6 
   ICC(C,1) = 0.907

 F-Test, H0: r0 = 0 ; H1: r0 > 0 
  F(39,195) = 59.6 , p = 1.31e-88 

 95%-Confidence Interval for ICC Population Values:
  0.86 < ICC < 0.944

# OCCC - Overall Concordance Correlation Coefficient

In the ICC, the focus is on the relative consistency of the measurements. In other words, if one evaluator gives a higher score to an object than another evaluator, ICC checks whether the ranking/order is maintained.

The OCCC (Overall Concordance Correlation Coefficient), in contrast, also considers whether the scores are close to the actual value, so it reflects accuracy as well as precision.

library(epiR)
epi.occc(raters_list$raters, na.rm = FALSE, pairs = TRUE)

Overall CCC           0.7899
Overall precision     0.8750
Overall accuracy      0.9027
epi.occc(raters_list$raters_old, na.rm = FALSE, pairs = TRUE)

Overall CCC           0.8414
Overall precision     0.9030
Overall accuracy      0.9318
epi.occc(raters_list$raters_new, na.rm = FALSE, pairs = TRUE)

Overall CCC           0.8844
Overall precision     0.9209
Overall accuracy      0.9603
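
By construction, the overall CCC is the product of the overall precision and the overall accuracy, which can be verified with the values printed above (small discrepancies are possible because the printed values are rounded):

# overall precision x overall accuracy should reproduce the overall CCC
c(unaided = 0.8750 * 0.9027,
  old_sad = 0.9030 * 0.9318,
  new_sad = 0.9209 * 0.9603)
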
df_cf2 |> 
  ggplot(aes(actual, values))+
  geom_point(size = 2, alpha = 0.7)+
  facet_wrap(~rater)+
  ylim(0,45)+
  xlim(0,45)+
  geom_abline(intercept = 0, slope =1)+
  theme_r4pde()+
  labs(x = "Actual severity (%)",
       y = "Estimate severity (%)")

sessions <- c("old", "new", "Unaided")

# build a list of correlation tables, one per session
precision_tbl <- map(sessions, function(s) {
  df_cf2 |> 
    filter(session == s) |> 
    select(-leaf) |> 
    group_by(rater) |> 
    correlation::correlation() |> 
    mutate(session = s)   # add the session column
})

knitr::kable(precision_tbl)
| Group | Parameter1 | Parameter2 | r | CI | CI_low | CI_high | t | df_error | p | Method | n_Obs | session |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ana Rita | values | actual | 0.9415788 | 0.95 | 0.8915791 | 0.9688996 | 17.23399 | 38 | 0 | Pearson correlation | 40 | old |
| Carlos | values | actual | 0.9560704 | 0.95 | 0.9179491 | 0.9766955 | 20.10531 | 38 | 0 | Pearson correlation | 40 | old |
| Igor | values | actual | 0.9406220 | 0.95 | 0.8898498 | 0.9683830 | 17.08141 | 38 | 0 | Pearson correlation | 40 | old |
| Marcelo | values | actual | 0.9260746 | 0.95 | 0.8637368 | 0.9604985 | 15.12880 | 38 | 0 | Pearson correlation | 40 | old |
| Richard | values | actual | 0.8832391 | 0.95 | 0.7887420 | 0.9369567 | 11.61097 | 38 | 0 | Pearson correlation | 40 | old |
| Sibila | values | actual | 0.9433311 | 0.95 | 0.8947497 | 0.9698452 | 17.52307 | 38 | 0 | Pearson correlation | 40 | old |

| Group | Parameter1 | Parameter2 | r | CI | CI_low | CI_high | t | df_error | p | Method | n_Obs | session |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ana Rita | values | actual | 0.9496017 | 0.95 | 0.9061363 | 0.9732223 | 18.67464 | 38 | 0 | Pearson correlation | 40 | new |
| Carlos | values | actual | 0.9446290 | 0.95 | 0.8971015 | 0.9705451 | 17.74572 | 38 | 0 | Pearson correlation | 40 | new |
| Igor | values | actual | 0.9551365 | 0.95 | 0.9162396 | 0.9761948 | 19.88027 | 38 | 0 | Pearson correlation | 40 | new |
| Marcelo | values | actual | 0.9261182 | 0.95 | 0.8638146 | 0.9605222 | 15.13380 | 38 | 0 | Pearson correlation | 40 | new |
| Richard | values | actual | 0.8891064 | 0.95 | 0.7988506 | 0.9402104 | 11.97468 | 38 | 0 | Pearson correlation | 40 | new |
| Sibila | values | actual | 0.9362112 | 0.95 | 0.8818970 | 0.9659982 | 16.42166 | 38 | 0 | Pearson correlation | 40 | new |

| Group | Parameter1 | Parameter2 | r | CI | CI_low | CI_high | t | df_error | p | Method | n_Obs | session |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ana Rita | values | actual | 0.9400131 | 0.95 | 0.8887500 | 0.9680541 | 16.98615 | 38 | 0 | Pearson correlation | 40 | Unaided |
| Carlos | values | actual | 0.9448193 | 0.95 | 0.8974465 | 0.9706476 | 17.77900 | 38 | 0 | Pearson correlation | 40 | Unaided |
| Igor | values | actual | 0.9002181 | 0.95 | 0.8181356 | 0.9463469 | 12.74418 | 38 | 0 | Pearson correlation | 40 | Unaided |
| Marcelo | values | actual | 0.9338765 | 0.95 | 0.8777000 | 0.9647339 | 16.09863 | 38 | 0 | Pearson correlation | 40 | Unaided |
| Richard | values | actual | 0.8493336 | 0.95 | 0.7313112 | 0.9179706 | 9.91869 | 38 | 0 | Pearson correlation | 40 | Unaided |
| Sibila | values | actual | 0.9419327 | 0.95 | 0.8922189 | 0.9690906 | 17.29134 | 38 | 0 | Pearson correlation | 40 | Unaided |
# combine all data frames in the list into one
precision_all <- bind_rows(precision_tbl)

precision_all %>%
  group_by(session) %>%
  summarise(mean_r = mean(r))
# A tibble: 3 × 2
  session mean_r
  <chr>    <dbl>
1 Unaided  0.918
2 new      0.933
3 old      0.932

# Precision

## Absolute Errors

This analysis allows us to check for patterns of overestimation or underestimation in the evaluations when compared to the actual values.

Based on our data, it is possible to observe that, with the aid of the scales, all evaluators except Igor showed a tendency to underestimate the severity values.

# Create one error plot per session, with custom titles.
# Re-declare sessions and titles together so their order matches
# (sessions was reordered above, which would otherwise mislabel the plots).
sessions <- c("Unaided", "old", "new")
titles <- c("Unaided Assessment", "Old SAD", "New SAD")

plots <- map2(sessions, titles, function(s, t) {
  df_cf2 |> 
    filter(session == s) |> 
    ggplot(aes(actual, values - actual)) +
    geom_point(size = 3, alpha = 0.7) +
    geom_hline(yintercept = 0) +
    facet_wrap(~rater) +
    theme_r4pde() +
    labs(
      x = "Actual severity (%)",
      y = "Error (Estimate - Actual)",
      title = t
    )
})

# display the plot for each session
plots[[1]]

plots[[2]]

plots[[3]]
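
As a quick numeric complement to these plots (not part of the original analysis), the mean signed error per rater and session summarizes the same tendency; negative values indicate underestimation:

# mean signed error (estimate - actual) by session and rater;
# negative values indicate a tendency to underestimate severity
df_cf2 |>
  group_by(session, rater) |>
  summarise(mean_error = mean(values - actual), .groups = "drop") |>
  arrange(session, mean_error)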

# Lin’s Concordance Correlation Coefficient (CCC)

In the validation of diagrammatic scales, Lin’s Concordance Correlation (CCC) is one of the most widely used metrics, as it combines different aspects of the relationship between estimated and actual values. The Agreement value (overall concordance) summarizes how close the estimates are to the actual values, considering both accuracy and precision. The Bias coefficient indicates whether there is a systematic tendency to overestimate or underestimate; the closer it is to 1, the lower the bias. Precision reflects the consistency of the estimates among evaluators, regardless of whether they are accurate. Additionally, Scale-shift shows whether the variability of the estimated values is greater or smaller than the actual variability, while Location-shift indicates whether there is a constant average difference between the estimates and the true values (for example, consistently estimating 5% higher).
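
In Lin's formulation, which epi.ccc implements, these quantities are linked as follows, where $r$ is the Pearson correlation (precision), $C_b$ the bias coefficient (accuracy), $v$ the scale-shift and $u$ the location-shift; which series enters the numerator of $v$ and $u$ depends on the order of the arguments passed to epi.ccc:

$$
\rho_c = r \, C_b, \qquad
C_b = \frac{2}{v + 1/v + u^2}, \qquad
v = \frac{\sigma_1}{\sigma_2}, \qquad
u = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1 \sigma_2}}
$$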

library(epiR)
sessions <- unique(df_cf2$session)

# build a list with the CCC results for each session
ccc_list <- map(sessions, function(s) {
  
  df_s <- df_cf2 |> filter(session == s)
  
  ccc_res <- epi.ccc(df_s$actual, df_s$values)
  
  rho <- ccc_res$rho.c[,1]         # Concordance coefficient
  Cb  <- ccc_res$C.b               # Bias coefficient
  r   <- rho / Cb                  # Precision
  ss  <- ccc_res$s.shift           # Scale-shift
  ls  <- ccc_res$l.shift           # Location-shift
  
  Metrics <- c("Agreement", "Bias coefficient", "Precision", "Scale-shift", "Location-shift")
  Value <- c(rho, Cb, r, ss, ls)
  
  res <- data.frame(Metrics, Value)
  res$session <- s  # add the session column
  return(res)
})

# combine everything into a single data frame
ccc_all <- bind_rows(ccc_list)

# display as a wide table
ccc_wide <- ccc_all |> 
  pivot_wider(
    names_from = session,   # the columns will be the session names
    values_from = Value     # the values that fill the table
  )

ccc_wide
# A tibble: 5 × 4
  Metrics          Unaided    old   new
  <chr>              <dbl>  <dbl> <dbl>
1 Agreement          0.826  0.886 0.912
2 Bias coefficient   0.938  0.977 0.993
3 Precision          0.880  0.907 0.919
4 Scale-shift        1.09   0.854 0.976
5 Location-shift     0.351 -0.151 0.120

Across these metrics, the new diagrammatic scale (SAD) consistently shows improved validation parameters. But are these differences statistically significant?

Here, we can see a boxplot of the evaluations for each condition.

theme_set(theme_r4pde())
df_cf2 |> 
  ggplot(aes(session, values, color = session))+
  geom_boxplot(outlier.colour = NA)+
  geom_jitter(width = 0.23, size = 2, alpha = 0.5)+
   scale_color_manual(values = c(
    "Unaided" = "black",  
    "old" = "#ff7f0e",      
    "new" = "#3A5FCD"       
  ))+
  theme_bw()

For each condition, a Shapiro-Wilk test was performed to check the normality of the data. The p-values are very small in all three cases, indicating that the data do not follow a normal distribution.

# normality test
shapiro.test(unaided_long$values)

    Shapiro-Wilk normality test

data:  unaided_long$values
W = 0.87216, p-value = 2.684e-13
shapiro.test(old_long$values)

    Shapiro-Wilk normality test

data:  old_long$values
W = 0.7936, p-value < 2.2e-16
shapiro.test(new_long$values)

    Shapiro-Wilk normality test

data:  new_long$values
W = 0.8656, p-value = 1.135e-13

Additionally, homoscedasticity was checked with Levene’s test to verify whether the variances are equal across treatments. The result provides evidence that the variances of the three groups differ (heteroscedasticity).

library(car)
leveneTest(values ~ session, data = df_cf2)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value   Pr(>F)   
group   2  6.2627 0.002012 **
      717                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# GLMM

Due to the lack of normality, and because the data are dependent (the same evaluator assessed the same leaf in each session), a generalized linear mixed-effects model with a beta distribution was fitted, with rater as a random effect.

library(glmmTMB)
mix2 <-  glmmTMB((values/100) ~ session + (1| rater), 
                 data = df_cf2, 
                 family = beta_family())
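
Note that beta_family() requires the response to lie strictly between 0 and 1. If any estimates had been exactly 0% or 100%, the proportions would first need to be squeezed away from the boundaries; a minimal sketch of the usual Smithson and Verkuilen (2006) transformation (illustrative only; the column prop_adj is not used elsewhere in this report):

# squeeze proportions away from 0 and 1: y' = (y * (n - 1) + 0.5) / n
n_obs <- nrow(df_cf2)
df_cf2 <- df_cf2 |>
  mutate(prop_adj = (values / 100 * (n_obs - 1) + 0.5) / n_obs)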

library(DHARMa)

plot(simulateResiduals(mix2))

An analysis of variance (Type II Wald test) was performed, which indicated a highly significant effect of the type of evaluation (with or without a scale) on disease severity (χ² = 45.08; df = 2; p < 0.001), showing that there are differences between the evaluation methods.

car::Anova(mix2)
Analysis of Deviance Table (Type II Wald chisquare tests)

Response: (values/100)
         Chisq Df Pr(>Chisq)    
session 45.078  2  1.627e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When the estimated marginal means were contrasted, significant differences were observed between all evaluation methods; p-values were adjusted using the Tukey method.

library(emmeans)
em2 <- emmeans(mix2, ~ session)
pairs(em2)
 contrast      estimate     SE  df z.ratio p.value
 Unaided - old    0.551 0.0826 Inf   6.678  <.0001
 Unaided - new    0.306 0.0809 Inf   3.783  0.0005
 old - new       -0.245 0.0838 Inf  -2.925  0.0096

Results are given on the log odds ratio (not the response) scale. 
P value adjustment: tukey method for comparing a family of 3 estimates 
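
Because the beta model uses a logit link, the contrasts above are reported on the log-odds scale. To interpret the results on the proportion scale, the estimated marginal means and pairwise comparisons can be back-transformed; a short sketch (output omitted):

# marginal mean severities back-transformed to the proportion scale
emmeans(mix2, ~ session, type = "response")

# pairwise comparisons back-transformed (odds ratios), Tukey-adjusted
pairs(em2, type = "response")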

# Results

Based on our results, it is possible to infer that the new diagrammatic scale (SAD) improves coffee leaf rust severity assessments in a statistically significant way, performing even better than the previous scale, which had already shown good results.