Phytopathometry

Author

Marcelo Gonçalves

Diagrammatic Scales Validation

This report is a simple guide for conducting and reproducing the analyses carried out in the development and validation of a diagrammatic scale. Here, we assess the ability of raters to estimate the severity of coffee rust using a scale developed by Mary Paz (2025).

Initially, we will load the dependencies and our data.

library(tidyverse)
library(gsheet)
library(r4pde)
library(patchwork)
library(purrr)

rust_unaided <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1eJksk_X5suVChmpEMBQLiU6tyqHF8R5eLhY_MRQTodY/edit?gid=0#gid=0")

rust_old <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1eJksk_X5suVChmpEMBQLiU6tyqHF8R5eLhY_MRQTodY/edit?gid=1615494808#gid=1615494808")

rust_new <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1eJksk_X5suVChmpEMBQLiU6tyqHF8R5eLhY_MRQTodY/edit?gid=1806089061#gid=1806089061")

Some analyses require the data to be in long format. Therefore, we will reshape each evaluation dataset into this format, adding a column to identify the session (type of scale) and removing unnecessary columns.

unaided_long <- rust_unaided |>
  pivot_longer(3:8, names_to = "rater",
               values_to = "values") |>
  mutate(session = "Unaided") |>
  select(-2)

old_long <- rust_old |>
  pivot_longer(3:8, names_to = "rater",
               values_to = "values") |>
  mutate(session = "old") |>
  select(-2)

new_long <- rust_new |>
  pivot_longer(3:8, names_to = "rater",
               values_to = "values") |>
  mutate(session = "new") |>
  select(-2)

Now, we will combine the values from each evaluator with the corresponding leaf and its actual severity.

actual <- rust_unaided |>
  select(leaf, actual)


df_cf <- rbind(unaided_long, old_long, new_long)

df_cf2 <- left_join(df_cf, actual)

Using the purrr package, we generated a series of plots to compare the values estimated by the evaluators with the actual severity of coffee rust across different sessions (Unaided, Old SAD, and New SAD). In each plot, the dashed line represents perfect concordance, while the solid blue line shows the fitted linear regression.

sessions <- c("Unaided", "old", "new")
titles <- c("Unaided Assessment", "Old SAD", "New SAD")

plots <- map2(sessions, titles, ~ {
  df_cf2 |> 
    filter(session == .x) |> 
    ggplot(aes(actual, values)) +
    geom_abline(slope = 1, intercept = 0, linetype = 2, linewidth = 1) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(x = "Actual", y = "Estimate") +
    ggtitle(.y) +
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5)) +
    facet_wrap(~rater)
})

plots[[1]]

plots[[2]]

plots[[3]]

We can compare the scales using each evaluator as a replicate, allowing us to observe the overall pattern.

df_cf2 <- df_cf2 |> 
  mutate(session = factor(session,
                          levels = c("Unaided", "old", "new")))
df_cf2 |> 
  ggplot(aes(actual, values, color = session)) +
  geom_abline(slope = 1, intercept = 0,
              linetype = 2, linewidth = 1) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  labs(x = "Actual", y = "Estimate") +
  scale_color_manual(values = c(
    "Unaided" = "black",  
    "old" = "#ff7f0e",      
    "new" = "#3A5FCD"       
  ))+
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(~session)

# Inter-rater Reliability

Here we want to see whether different evaluators, using the same scale, arrive at similar results for the same sample.

For each leaf, a vertical segment was plotted connecting the values estimated by the different evaluators, showing the range of variation relative to the actual severity. This variation is much greater in the unaided assessments than in those performed with the scales (old SAD and new SAD), especially for leaves with higher severity. This suggests that as severity increases, the discrepancy among evaluators also tends to increase, whereas for leaves with lower severity the variation is less pronounced. In any case, the positive effect of the scales in reducing this variability is evident.

df_cf2 |> 
  ggplot(aes(actual, values, color = session,
             group = leaf))+
  geom_line(color = "black")+
  geom_point(size = 2, alpha = 0.3)+
  labs(y = "Severity estimate (%)",
       x = "Real severity (%)") +
  scale_color_manual(values = c(
    "Unaided" = "black",  
    "old" = "#ff7f0e",      
    "new" = "#3A5FCD"       
  ))+
  theme_bw()+
  facet_wrap(~session)

This code takes the correlation matrix between evaluators, removes the diagonal (self-correlations of 1), reshapes it into long format, squares the pairwise correlations, and returns the mean coefficient of determination (R²). Here the R² quantifies inter-rater reliability: how much of the variation in one evaluator's estimates is explained by another evaluator's estimates of the same leaves.

Based on our data, the unaided estimates show lower reliability than those made with either the old or the new scale, with the new scale giving the highest mean R² (0.847).

library(GGally)

# sessions and the object names we want
sessions <- c("Unaided", "old", "new")
names_list <- c("raters", "raters_old", "raters_new")
labels <- c("Unaided", "Old SAD", "New SAD")  # <- nomes bonitos p/ tabela

# create the data frames using map2
raters_list <- map2(sessions, names_list, ~ {
  df_cf2 |> 
    filter(session == .x) |> 
    pivot_wider(
      names_from = rater,
      values_from = values
    ) |> 
    select(4:9)
})

names(raters_list) <- names_list

# compute the correlations and the mean R²
raters_cor_list <- map(raters_list, ~ {
  cor_mat <- cor(.x)
  cor_melt <- reshape2::melt(cor_mat) |> 
    filter(value != 1)
  
  list(
    cor_matrix = cor_mat,
    cor_melt = cor_melt,
    mean_r2 = round(mean(cor_melt$value^2), 3)
  )
})

# build the results table with the display labels
results_tbl <- map2_dfr(names(raters_cor_list), labels, ~ {
  tibble(
    session = .y,  # usa o label
    mean_r2 = raters_cor_list[[.x]]$mean_r2
  )
})

# display the formatted table
knitr::kable(results_tbl, digits = 3, caption = "Mean R² values by session")
Table: Mean R² values by session

| session | mean_r2 |
|:--------|--------:|
| Unaided |   0.772 |
| Old SAD |   0.815 |
| New SAD |   0.847 |

# ICC - Intraclass Correlation Coefficient

The ICC (Intraclass Correlation Coefficient) is a measure that indicates the degree of consistency or agreement among evaluations made by different evaluators on the same set of objects, such as plant leaves.

There are different types of ICC, depending on two factors:

Whether you are using a single measurement or the average of multiple measurements;

Whether all evaluators assess all objects (two-way) or each object may be evaluated by different evaluators (one-way).

One-way ICC: each object (or leaf) can be evaluated by different evaluators. Here, it does not matter exactly who evaluated each object; what matters is whether the objects differ consistently. This model is ideal when not all evaluators can assess all objects.

Two-way ICC: all objects are evaluated by the same evaluators. In this case, the ICC accounts for who evaluated and which object was evaluated, separating the variation between objects from the variation between evaluators. This allows you to see how much of the observed difference is due to the objects and how much is due to the evaluators, assuming both are random.

When using the icc function from the irr package, you need to specify which model you are using (“oneway” or “twoway”).

Here, we will use the two-way model because all evaluators evaluated all leaves individually.
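
As a cross-check of what the irr::icc() call below reports, ICC(C,1), the two-way single-score consistency ICC, can be reproduced from the mean squares of a two-way ANOVA with leaves and raters as factors. A minimal sketch, assuming the wide objects created above (one row per leaf, one column per rater); the helper icc_c1() is illustrative and not part of the original analysis:

icc_c1 <- function(ratings) {
  # reshape the wide ratings (leaves x raters) into long format
  long <- as.data.frame(ratings) |>
    dplyr::mutate(subject = factor(dplyr::row_number())) |>
    tidyr::pivot_longer(-subject, names_to = "rater", values_to = "value")
  k   <- dplyr::n_distinct(long$rater)            # number of raters
  tab <- anova(aov(value ~ subject + rater, data = long))
  ms_subject  <- tab["subject", "Mean Sq"]        # between-leaf mean square
  ms_residual <- tab["Residuals", "Mean Sq"]      # residual mean square
  (ms_subject - ms_residual) / (ms_subject + (k - 1) * ms_residual)
}

# should agree, up to rounding, with icc(raters_list$raters, "twoway") below
icc_c1(raters_list$raters)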

library(irr)
icc(raters_list$raters, "twoway")
 Single Score Intraclass Correlation

   Model: twoway 
   Type : consistency 

   Subjects = 40 
     Raters = 6 
   ICC(C,1) = 0.828

 F-Test, H0: r0 = 0 ; H1: r0 > 0 
  F(39,195) = 29.9 , p = 3.83e-63 

 95%-Confidence Interval for ICC Population Values:
  0.751 < ICC < 0.893
icc(raters_list$raters_old, "twoway")
 Single Score Intraclass Correlation

   Model: twoway 
   Type : consistency 

   Subjects = 40 
     Raters = 6 
   ICC(C,1) = 0.866

 F-Test, H0: r0 = 0 ; H1: r0 > 0 
  F(39,195) = 39.7 , p = 2.85e-73 

 95%-Confidence Interval for ICC Population Values:
  0.802 < ICC < 0.917
icc(raters_list$raters_new, "twoway")
 Single Score Intraclass Correlation

   Model: twoway 
   Type : consistency 

   Subjects = 40 
     Raters = 6 
   ICC(C,1) = 0.907

 F-Test, H0: r0 = 0 ; H1: r0 > 0 
  F(39,195) = 59.6 , p = 1.31e-88 

 95%-Confidence Interval for ICC Population Values:
  0.86 < ICC < 0.944

# OCCC - Overall Concordance Correlation Coefficient

In the ICC, the focus is on the relative consistency of the measurements. In other words, if one evaluator gives a higher score to an object than another evaluator, ICC checks whether the ranking/order is maintained.

The OCCC (Overall Concordance Correlation Coefficient), in contrast, also considers whether the scores are close to the actual value, so it reflects accuracy as well as precision.

library(epiR)
epi.occc(raters_list$raters, na.rm = FALSE, pairs = TRUE)

Overall CCC           0.7899
Overall precision     0.8750
Overall accuracy      0.9027
epi.occc(raters_list$raters_old, na.rm = FALSE, pairs = TRUE)

Overall CCC           0.8414
Overall precision     0.9030
Overall accuracy      0.9318
epi.occc(raters_list$raters_new, na.rm = FALSE, pairs = TRUE)

Overall CCC           0.8844
Overall precision     0.9209
Overall accuracy      0.9603
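
By construction, the overall CCC is the product of the overall precision and the overall accuracy, which can be verified with the values printed above (small discrepancies are possible because the printed values are rounded):

# overall precision x overall accuracy should reproduce the overall CCC
c(unaided = 0.8750 * 0.9027,
  old_sad = 0.9030 * 0.9318,
  new_sad = 0.9209 * 0.9603)
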
df_cf2 |> 
  ggplot(aes(actual, values))+
  geom_point(size = 2, alpha = 0.7)+
  facet_wrap(~rater)+
  ylim(0,45)+
  xlim(0,45)+
  geom_abline(intercept = 0, slope =1)+
  theme_r4pde()+
  labs(x = "Actual severity (%)",
       y = "Estimate severity (%)")

sessions <- c("old", "new", "Unaided")

# build a list of correlation tables, one per session
precision_tbl <- map(sessions, function(s) {
  df_cf2 |> 
    filter(session == s) |> 
    select(-leaf) |> 
    group_by(rater) |> 
    correlation::correlation() |> 
    mutate(session = s)   # add the session column
})

knitr::kable(precision_tbl)
| Group | Parameter1 | Parameter2 | r | CI | CI_low | CI_high | t | df_error | p | Method | n_Obs | session |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ana Rita | values | actual | 0.9415788 | 0.95 | 0.8915791 | 0.9688996 | 17.23399 | 38 | 0 | Pearson correlation | 40 | old |
| Carlos | values | actual | 0.9560704 | 0.95 | 0.9179491 | 0.9766955 | 20.10531 | 38 | 0 | Pearson correlation | 40 | old |
| Igor | values | actual | 0.9406220 | 0.95 | 0.8898498 | 0.9683830 | 17.08141 | 38 | 0 | Pearson correlation | 40 | old |
| Marcelo | values | actual | 0.9260746 | 0.95 | 0.8637368 | 0.9604985 | 15.12880 | 38 | 0 | Pearson correlation | 40 | old |
| Richard | values | actual | 0.8832391 | 0.95 | 0.7887420 | 0.9369567 | 11.61097 | 38 | 0 | Pearson correlation | 40 | old |
| Sibila | values | actual | 0.9433311 | 0.95 | 0.8947497 | 0.9698452 | 17.52307 | 38 | 0 | Pearson correlation | 40 | old |

| Group | Parameter1 | Parameter2 | r | CI | CI_low | CI_high | t | df_error | p | Method | n_Obs | session |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ana Rita | values | actual | 0.9496017 | 0.95 | 0.9061363 | 0.9732223 | 18.67464 | 38 | 0 | Pearson correlation | 40 | new |
| Carlos | values | actual | 0.9446290 | 0.95 | 0.8971015 | 0.9705451 | 17.74572 | 38 | 0 | Pearson correlation | 40 | new |
| Igor | values | actual | 0.9551365 | 0.95 | 0.9162396 | 0.9761948 | 19.88027 | 38 | 0 | Pearson correlation | 40 | new |
| Marcelo | values | actual | 0.9261182 | 0.95 | 0.8638146 | 0.9605222 | 15.13380 | 38 | 0 | Pearson correlation | 40 | new |
| Richard | values | actual | 0.8891064 | 0.95 | 0.7988506 | 0.9402104 | 11.97468 | 38 | 0 | Pearson correlation | 40 | new |
| Sibila | values | actual | 0.9362112 | 0.95 | 0.8818970 | 0.9659982 | 16.42166 | 38 | 0 | Pearson correlation | 40 | new |

| Group | Parameter1 | Parameter2 | r | CI | CI_low | CI_high | t | df_error | p | Method | n_Obs | session |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ana Rita | values | actual | 0.9400131 | 0.95 | 0.8887500 | 0.9680541 | 16.98615 | 38 | 0 | Pearson correlation | 40 | Unaided |
| Carlos | values | actual | 0.9448193 | 0.95 | 0.8974465 | 0.9706476 | 17.77900 | 38 | 0 | Pearson correlation | 40 | Unaided |
| Igor | values | actual | 0.9002181 | 0.95 | 0.8181356 | 0.9463469 | 12.74418 | 38 | 0 | Pearson correlation | 40 | Unaided |
| Marcelo | values | actual | 0.9338765 | 0.95 | 0.8777000 | 0.9647339 | 16.09863 | 38 | 0 | Pearson correlation | 40 | Unaided |
| Richard | values | actual | 0.8493336 | 0.95 | 0.7313112 | 0.9179706 | 9.91869 | 38 | 0 | Pearson correlation | 40 | Unaided |
| Sibila | values | actual | 0.9419327 | 0.95 | 0.8922189 | 0.9690906 | 17.29134 | 38 | 0 | Pearson correlation | 40 | Unaided |
# combine all data frames in the list into one
precision_all <- bind_rows(precision_tbl)

precision_all %>%
  group_by(session) %>%
  summarise(mean_r = mean(r))
# A tibble: 3 × 2
  session mean_r
  <chr>    <dbl>
1 Unaided  0.918
2 new      0.933
3 old      0.932

# Precision

## Absolute Errors

This analysis allows us to check for patterns of overestimation or underestimation in the evaluations when compared to the actual values.

Based on our data, it is possible to observe that, with the aid of the scales, all evaluators except Igor showed a tendency to underestimate the severity values.

# Create one error plot per session, with custom titles.
# Re-declare sessions and titles together so their order matches
# (sessions was reordered above, which would otherwise mislabel the plots).
sessions <- c("Unaided", "old", "new")
titles <- c("Unaided Assessment", "Old SAD", "New SAD")

plots <- map2(sessions, titles, function(s, t) {
  df_cf2 |> 
    filter(session == s) |> 
    ggplot(aes(actual, values - actual)) +
    geom_point(size = 3, alpha = 0.7) +
    geom_hline(yintercept = 0) +
    facet_wrap(~rater) +
    theme_r4pde() +
    labs(
      x = "Actual severity (%)",
      y = "Error (Estimate - Actual)",
      title = t
    )
})

# display the plot for each session
plots[[1]]

plots[[2]]

plots[[3]]
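
As a quick numeric complement to these plots (not part of the original analysis), the mean signed error per rater and session summarizes the same tendency; negative values indicate underestimation:

# mean signed error (estimate - actual) by session and rater;
# negative values indicate a tendency to underestimate severity
df_cf2 |>
  group_by(session, rater) |>
  summarise(mean_error = mean(values - actual), .groups = "drop") |>
  arrange(session, mean_error)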

# Lin’s Concordance Correlation Coefficient (CCC)

In the validation of diagrammatic scales, Lin’s Concordance Correlation (CCC) is one of the most widely used metrics, as it combines different aspects of the relationship between estimated and actual values. The Agreement value (overall concordance) summarizes how close the estimates are to the actual values, considering both accuracy and precision. The Bias coefficient indicates whether there is a systematic tendency to overestimate or underestimate; the closer it is to 1, the lower the bias. Precision reflects the consistency of the estimates among evaluators, regardless of whether they are accurate. Additionally, Scale-shift shows whether the variability of the estimated values is greater or smaller than the actual variability, while Location-shift indicates whether there is a constant average difference between the estimates and the true values (for example, consistently estimating 5% higher).
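
In Lin's formulation, which epi.ccc implements, these quantities are linked as follows, where $r$ is the Pearson correlation (precision), $C_b$ the bias coefficient (accuracy), $v$ the scale-shift and $u$ the location-shift; which series enters the numerator of $v$ and $u$ depends on the order of the arguments passed to epi.ccc:

$$
\rho_c = r \, C_b, \qquad
C_b = \frac{2}{v + 1/v + u^2}, \qquad
v = \frac{\sigma_1}{\sigma_2}, \qquad
u = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1 \sigma_2}}
$$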

library(epiR)
sessions <- unique(df_cf2$session)

# build a list with the CCC results for each session
ccc_list <- map(sessions, function(s) {
  
  df_s <- df_cf2 |> filter(session == s)
  
  ccc_res <- epi.ccc(df_s$actual, df_s$values)
  
  rho <- ccc_res$rho.c[,1]         # Concordance coefficient
  Cb  <- ccc_res$C.b               # Bias coefficient
  r   <- rho / Cb                  # Precision
  ss  <- ccc_res$s.shift           # Scale-shift
  ls  <- ccc_res$l.shift           # Location-shift
  
  Metrics <- c("Agreement", "Bias coefficient", "Precision", "Scale-shift", "Location-shift")
  Value <- c(rho, Cb, r, ss, ls)
  
  res <- data.frame(Metrics, Value)
  res$session <- s  # add the session column
  return(res)
})

# combine everything into a single data frame
ccc_all <- bind_rows(ccc_list)

# display as a wide table
ccc_wide <- ccc_all |> 
  pivot_wider(
    names_from = session,   # the columns will be the session names
    values_from = Value     # the values that fill the table
  )

ccc_wide
# A tibble: 5 × 4
  Metrics          Unaided    old   new
  <chr>              <dbl>  <dbl> <dbl>
1 Agreement          0.826  0.886 0.912
2 Bias coefficient   0.938  0.977 0.993
3 Precision          0.880  0.907 0.919
4 Scale-shift        1.09   0.854 0.976
5 Location-shift     0.351 -0.151 0.120

Across these metrics, the new diagrammatic scale (SAD) consistently shows improved validation parameters. But are these differences statistically significant?

Here, we can see a boxplot of the evaluations for each condition.

theme_set(theme_r4pde())
df_cf2 |> 
  ggplot(aes(session, values, color = session))+
  geom_boxplot(outlier.colour = NA)+
  geom_jitter(width = 0.23, size = 2, alpha = 0.5)+
   scale_color_manual(values = c(
    "Unaided" = "black",  
    "old" = "#ff7f0e",      
    "new" = "#3A5FCD"       
  ))+
  theme_bw()

For each condition, a Shapiro-Wilk test was performed to check the normality of the data. The p-values are very small in all three cases, indicating that the data do not follow a normal distribution.

# normality test
shapiro.test(unaided_long$values)

    Shapiro-Wilk normality test

data:  unaided_long$values
W = 0.87216, p-value = 2.684e-13
shapiro.test(old_long$values)

    Shapiro-Wilk normality test

data:  old_long$values
W = 0.7936, p-value < 2.2e-16
shapiro.test(new_long$values)

    Shapiro-Wilk normality test

data:  new_long$values
W = 0.8656, p-value = 1.135e-13

Additionally, homoscedasticity was checked with Levene’s test to verify whether the variances are equal across treatments. The result provides evidence that the variances of the three groups differ (heteroscedasticity).

library(car)
leveneTest(values ~ session, data = df_cf2)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value   Pr(>F)   
group   2  6.2627 0.002012 **
      717                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# GLMM

Due to the lack of normality, and because the data are dependent (the same evaluator assessed the same leaf in each session), a generalized linear mixed-effects model with a beta distribution was fitted, with rater as a random effect.

library(glmmTMB)
mix2 <-  glmmTMB((values/100) ~ session + (1| rater), 
                 data = df_cf2, 
                 family = beta_family())
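
Note that beta_family() requires the response to lie strictly between 0 and 1. If any estimates had been exactly 0% or 100%, the proportions would first need to be squeezed away from the boundaries; a minimal sketch of the usual Smithson and Verkuilen (2006) transformation (illustrative only; the column prop_adj is not used elsewhere in this report):

# squeeze proportions away from 0 and 1: y' = (y * (n - 1) + 0.5) / n
n_obs <- nrow(df_cf2)
df_cf2 <- df_cf2 |>
  mutate(prop_adj = (values / 100 * (n_obs - 1) + 0.5) / n_obs)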

library(DHARMa)

plot(simulateResiduals(mix2))

An analysis of variance (Type II Wald test) was performed, which indicated a highly significant effect of the type of evaluation (with or without a scale) on disease severity (χ² = 45.08; df = 2; p < 0.001), showing that there are differences between the evaluation methods.

car::Anova(mix2)
Analysis of Deviance Table (Type II Wald chisquare tests)

Response: (values/100)
         Chisq Df Pr(>Chisq)    
session 45.078  2  1.627e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When the estimated marginal means were contrasted, significant differences were observed between all evaluation methods; p-values were adjusted using the Tukey method.

library(emmeans)
em2 <- emmeans(mix2, ~ session)
pairs(em2)
 contrast      estimate     SE  df z.ratio p.value
 Unaided - old    0.551 0.0826 Inf   6.678  <.0001
 Unaided - new    0.306 0.0809 Inf   3.783  0.0005
 old - new       -0.245 0.0838 Inf  -2.925  0.0096

Results are given on the log odds ratio (not the response) scale. 
P value adjustment: tukey method for comparing a family of 3 estimates 
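
Because the beta model uses a logit link, the contrasts above are reported on the log-odds scale. To interpret the results on the proportion scale, the estimated marginal means and pairwise comparisons can be back-transformed; a short sketch (output omitted):

# marginal mean severities back-transformed to the proportion scale
emmeans(mix2, ~ session, type = "response")

# pairwise comparisons back-transformed (odds ratios), Tukey-adjusted
pairs(em2, type = "response")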

# Results

Based on our results, it is possible to infer that the new diagrammatic scale (SAD) improves coffee leaf rust severity assessments in a statistically significant way, performing even better than the previous scale, which had already shown good results.