# Phytopathometry
## Diagrammatic Scale Validation

This report is a simple guide for conducting and reproducing the analyses carried out in the development and validation of a diagrammatic scale. For this, we assess the raters' ability to estimate the severity of coffee rust using a scale developed by Mary Paz (2025).

Initially, we will load the dependencies and our data.

```r
library(tidyverse)
library(gsheet)
library(r4pde)
library(patchwork)
library(purrr)

rust_unaided <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1eJksk_X5suVChmpEMBQLiU6tyqHF8R5eLhY_MRQTodY/edit?gid=0#gid=0")
rust_old <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1eJksk_X5suVChmpEMBQLiU6tyqHF8R5eLhY_MRQTodY/edit?gid=1615494808#gid=1615494808")
rust_new <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1eJksk_X5suVChmpEMBQLiU6tyqHF8R5eLhY_MRQTodY/edit?gid=1806089061#gid=1806089061")
```
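Before reshaping, it can help to confirm the structure that the code below assumes: each sheet is expected to have a `leaf` column, an `actual` severity column, and the six rater columns in positions 3 to 8. A minimal check (not part of the original analysis; the column layout is an assumption inferred from the code that follows):

```r
# Inspect the unaided data: the pivoting below assumes columns 3 to 8 hold the raters
glimpse(rust_unaided)

# Confirm that columns 3 to 8 really are the rater columns before pivoting
names(rust_unaided)[3:8]
```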
Some analyses require the data to be in long format. Therefore, we will reshape the evaluation dataset into this format, adding a column to identify the session (type of scale) and removing unnecessary columns.
```r
unaided_long <- rust_unaided |>
  pivot_longer(3:8, names_to = "rater",
               values_to = "values") |>
  mutate(session = "Unaided") |>
  select(-2)

old_long <- rust_old |>
  pivot_longer(3:8, names_to = "rater",
               values_to = "values") |>
  mutate(session = "old") |>
  select(-2)

new_long <- rust_new |>
  pivot_longer(3:8, names_to = "rater",
               values_to = "values") |>
  mutate(session = "new") |>
  select(-2)
```
Now, we will combine the values from each evaluator with the corresponding leaf and its actual severity.
```r
actual <- rust_unaided |>
  select(leaf, actual)

df_cf <- rbind(unaided_long, old_long, new_long)

df_cf2 <- left_join(df_cf, actual)
```
Using the purrr package, we generated a series of plots to compare the values estimated by the evaluators with the actual severity of coffee rust across different sessions (Unaided, Old SAD, and New SAD). In each plot, the dashed line represents perfect concordance, while the solid blue line shows the fitted linear regression.
<- c("Unaided", "old", "new")
sessions <- c("Unaided Assessment", "Old SAD", "New SAD")
titles
<- map2(sessions, titles, ~ {
plots |>
df_cf2 filter(session == .x) |>
ggplot(aes(actual, values)) +
geom_abline(slope = 1, intercept = 0, linetype = 2, size = 1) +
geom_point() +
geom_smooth(method = "lm") +
labs(x = "Actual", y = "Estimate") +
ggtitle(.y) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
facet_wrap(~rater)
})
1]] plots[[
2]] plots[[
3]] plots[[
We can compare the scales using each evaluator as a replicate, allowing us to observe the overall pattern.
```r
df_cf2 <- df_cf2 |>
  mutate(session = factor(session,
                          levels = c("Unaided", "old", "new")))

df_cf2 |>
  ggplot(aes(actual, values, color = session)) +
  geom_abline(slope = 1, intercept = 0,
              linetype = 2, linewidth = 1) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  labs(x = "Actual", y = "Estimate") +
  scale_color_manual(values = c(
    "Unaided" = "black",
    "old" = "#ff7f0e",
    "new" = "#3A5FCD"
  )) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(~session)
```
# Inter-rater Reliability
Here we want to see whether different evaluators, using the same scale, arrive at similar results for the same sample.
For each leaf, a vertical bar (a line connecting the estimates for that leaf) was plotted to show the range of variation in the values estimated by the evaluators relative to the actual observed severity. This variation is much greater in the unaided assessments than in those performed with the scales (old SAD and new SAD), especially for leaves with higher severity, suggesting that as severity increases, the discrepancy among evaluators also tends to grow. For leaves with lower severity, the variation is less pronounced. In any case, the positive effect of the scales in reducing this variability is evident.
```r
df_cf2 |>
  ggplot(aes(actual, values, color = session,
             group = leaf)) +
  geom_line(color = "black") +
  geom_point(size = 2, alpha = 0.3) +
  labs(y = "Severity estimate (%)",
       x = "Real severity (%)") +
  scale_color_manual(values = c(
    "Unaided" = "black",
    "old" = "#ff7f0e",
    "new" = "#3A5FCD"
  )) +
  theme_bw() +
  facet_wrap(~session)
```
This code takes the correlation matrix between evaluators, removes the diagonal values (self-correlations of 1), reshapes everything into long format, squares the correlations for all pairs of evaluators, and returns the mean coefficient of determination (R²). The mean R² is used here as a measure of inter-rater reliability, indicating how much of the variation in one evaluator's estimates is explained by another evaluator's estimates.
Based on our data, assessments made without the aid of a scale show lower reliability than those made with either the old or the new scale, with the new scale showing the highest mean R² of 0.847.
```r
library(GGally)

# sessions and the names for the resulting objects
sessions <- c("Unaided", "old", "new")
names_list <- c("raters", "raters_old", "raters_new")
labels <- c("Unaided", "Old SAD", "New SAD")  # display names for the table

# create one wide data frame of rater columns per session
raters_list <- map2(sessions, names_list, ~ {
  df_cf2 |>
    filter(session == .x) |>
    pivot_wider(
      names_from = rater,
      values_from = values
    ) |>
    select(4:9)
})
names(raters_list) <- names_list

# compute the correlations and the mean R² per session
raters_cor_list <- map(raters_list, ~ {
  cor_mat <- cor(.x)
  cor_melt <- reshape2::melt(cor_mat) |>
    filter(value != 1)
  list(
    cor_matrix = cor_mat,
    cor_melt = cor_melt,
    mean_r2 = round(mean(cor_melt$value^2), 3)
  )
})

# build the summary table with the display labels
results_tbl <- map2_dfr(names(raters_cor_list), labels, ~ {
  tibble(
    session = .y,  # use the display label
    mean_r2 = raters_cor_list[[.x]]$mean_r2
  )
})

# display the formatted table
knitr::kable(results_tbl, digits = 3, caption = "Mean R² values by session")
```
| session | mean_r2 |
|---------|---------|
| Unaided | 0.772 |
| Old SAD | 0.815 |
| New SAD | 0.847 |
# ICC - Intraclass Correlation Coefficient
The ICC (Intraclass Correlation Coefficient) is a measure that indicates the degree of consistency or agreement among evaluations made by different evaluators on the same set of objects, such as plant leaves.
There are different types of ICC, depending on two factors:

- Whether you are using a single measurement or the average of multiple measurements;
- Whether all evaluators assess all objects (two-way) or each object may be evaluated by different evaluators (one-way).
One-way ICC: each object (or leaf) can be evaluated by different evaluators. Here, it does not matter exactly who evaluated each object; what matters is whether the objects differ consistently. This model is ideal when not all evaluators can assess all objects.
Two-way ICC: all objects are evaluated by the same evaluators. In this case, the ICC accounts for who evaluated and which object was evaluated, separating the variation between objects from the variation between evaluators. This allows you to see how much of the observed difference is due to the objects and how much is due to the evaluators, assuming both are random.
When using the icc function from the irr package, you need to specify which model you are using (“oneway” or “twoway”).
Here, we will use the two-way model because all evaluators evaluated all leaves individually.
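For reference, these choices map onto the arguments of `irr::icc()` roughly as sketched below (a minimal illustration only; the `type` and `unit` values shown are the function's standard options, and `raters_list$raters` is the wide rater matrix built above):

```r
library(irr)

# ratings matrix: one row per leaf, one column per rater (e.g., raters_list$raters)
# model = "oneway"  -> raters may differ between objects (raters treated as random only)
# model = "twoway"  -> the same raters assessed every object (our case)
# unit  = "single"  -> reliability of a single rater's score
# unit  = "average" -> reliability of the mean score across all raters
icc(raters_list$raters, model = "twoway", type = "consistency", unit = "single")
icc(raters_list$raters, model = "twoway", type = "consistency", unit = "average")
```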
```r
library(irr)
icc(raters_list$raters, "twoway")
```

```
Single Score Intraclass Correlation

  Model: twoway
  Type : consistency

  Subjects = 40
    Raters = 6
  ICC(C,1) = 0.828

F-Test, H0: r0 = 0 ; H1: r0 > 0
  F(39,195) = 29.9 , p = 3.83e-63

95%-Confidence Interval for ICC Population Values:
  0.751 < ICC < 0.893
```

```r
icc(raters_list$raters_old, "twoway")
```

```
Single Score Intraclass Correlation

  Model: twoway
  Type : consistency

  Subjects = 40
    Raters = 6
  ICC(C,1) = 0.866

F-Test, H0: r0 = 0 ; H1: r0 > 0
  F(39,195) = 39.7 , p = 2.85e-73

95%-Confidence Interval for ICC Population Values:
  0.802 < ICC < 0.917
```

```r
icc(raters_list$raters_new, "twoway")
```

```
Single Score Intraclass Correlation

  Model: twoway
  Type : consistency

  Subjects = 40
    Raters = 6
  ICC(C,1) = 0.907

F-Test, H0: r0 = 0 ; H1: r0 > 0
  F(39,195) = 59.6 , p = 1.31e-88

95%-Confidence Interval for ICC Population Values:
  0.86 < ICC < 0.944
```
# OCCC - Overall Concordance Correlation Coefficient
In the ICC, the focus is on the relative consistency of the measurements: if one evaluator gives a higher score to an object than another evaluator does, the ICC checks whether that ordering is maintained.
The OCCC (Overall Concordance Correlation Coefficient), in contrast, also considers whether the evaluators' scores agree in absolute value, not just in ranking, so it reflects accuracy as well as precision.
```r
library(epiR)
epi.occc(raters_list$raters, na.rm = FALSE, pairs = TRUE)
```

```
Overall CCC        0.7899
Overall precision  0.8750
Overall accuracy   0.9027
```

```r
epi.occc(raters_list$raters_old, na.rm = FALSE, pairs = TRUE)
```

```
Overall CCC        0.8414
Overall precision  0.9030
Overall accuracy   0.9318
```

```r
epi.occc(raters_list$raters_new, na.rm = FALSE, pairs = TRUE)
```

```
Overall CCC        0.8844
Overall precision  0.9209
Overall accuracy   0.9603
```
```r
df_cf2 |>
  ggplot(aes(actual, values)) +
  geom_point(size = 2, alpha = 0.7) +
  facet_wrap(~rater) +
  ylim(0, 45) +
  xlim(0, 45) +
  geom_abline(intercept = 0, slope = 1) +
  theme_r4pde() +
  labs(x = "Actual severity (%)",
       y = "Estimated severity (%)")
```
<- c("old", "new", "Unaided")
sessions
# cria a tabela empilhada
<- map(sessions, function(s) {
precision_tbl |>
df_cf2 filter(session == s) |>
select(-leaf) |>
group_by(rater) |>
::correlation() |>
correlationmutate(session = s) # adiciona a coluna session
})
::kable(precision_tbl) knitr
```r
# combine all data frames in the list into one
precision_all <- bind_rows(precision_tbl)

precision_all %>%
  group_by(session) %>%
  summarise(mean_r = mean(r))
```

```
# A tibble: 3 × 2
  session mean_r
  <chr>    <dbl>
1 Unaided  0.918
2 new      0.933
3 old      0.932
```
# Precision
## Absolute Errors
This analysis allows us to check for patterns of overestimation or underestimation in the evaluations when compared to the actual values.
Based on our data, it is possible to observe that, with the aid of the scales, all evaluators except Igor showed a tendency to underestimate the severity values.
```r
# redefine sessions so their order matches the titles defined earlier
sessions <- c("Unaided", "old", "new")

# create one error plot per session, with custom titles
plots <- map2(sessions, titles, function(s, t) {
  df_cf2 |>
    filter(session == s) |>
    ggplot(aes(actual, values - actual)) +
    geom_point(size = 3, alpha = 0.7) +
    geom_hline(yintercept = 0) +
    facet_wrap(~rater) +
    theme_r4pde() +
    labs(
      x = "Actual severity (%)",
      y = "Error (Estimate - Actual)",
      title = t
    )
})

# display the plot for each session
plots[[1]]
plots[[2]]
plots[[3]]
```
# Lin’s Concordance Correlation (CCC)
In the validation of diagrammatic scales, Lin’s Concordance Correlation (CCC) is one of the most widely used metrics, as it combines different aspects of the relationship between estimated and actual values. The Agreement value (overall concordance) summarizes how close the estimates are to the actual values, considering both accuracy and precision. The Bias coefficient indicates whether there is a systematic tendency to overestimate or underestimate; the closer it is to 1, the lower the bias. Precision reflects the consistency of the estimates among evaluators, regardless of whether they are accurate. Additionally, Scale-shift shows whether the variability of the estimated values is greater or smaller than the actual variability, while Location-shift indicates whether there is a constant average difference between the estimates and the true values (for example, consistently estimating 5% higher).
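For reference, these quantities are tied together by Lin's (1989) decomposition of the concordance coefficient into a precision term and an accuracy term, which is also why the code below recovers Precision as Agreement divided by the Bias coefficient:

$$
\rho_c = r \cdot C_b, \qquad C_b = \frac{2}{v + 1/v + u^2}, \qquad v = \frac{\sigma_1}{\sigma_2}, \qquad u = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1 \sigma_2}}
$$

where $r$ is the Pearson correlation between estimated and actual values (precision), $C_b$ is the bias correction factor (accuracy), $v$ is the scale-shift, and $u$ is the location-shift.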
```r
library(epiR)

sessions <- unique(df_cf2$session)

# compute the CCC components for each session
ccc_list <- map(sessions, function(s) {

  df_s <- df_cf2 |> filter(session == s)

  ccc_res <- epi.ccc(df_s$actual, df_s$values)

  rho <- ccc_res$rho.c[, 1]  # concordance coefficient
  Cb  <- ccc_res$C.b         # bias coefficient
  r   <- rho / Cb            # precision
  ss  <- ccc_res$s.shift     # scale-shift
  ls  <- ccc_res$l.shift     # location-shift

  Metrics <- c("Agreement", "Bias coefficient", "Precision", "Scale-shift", "Location-shift")
  Value <- c(rho, Cb, r, ss, ls)

  res <- data.frame(Metrics, Value)
  res$session <- s  # add the session column
  return(res)
})

# combine everything into a single data frame
ccc_all <- bind_rows(ccc_list)

# display as a wide table
ccc_wide <- ccc_all |>
  pivot_wider(
    names_from = session,  # the columns are the session names
    values_from = Value    # the values fill the table
  )
ccc_wide
```
```
# A tibble: 5 × 4
  Metrics          Unaided    old    new
  <chr>              <dbl>  <dbl>  <dbl>
1 Agreement          0.826  0.886  0.912
2 Bias coefficient   0.938  0.977  0.993
3 Precision          0.880  0.907  0.919
4 Scale-shift        1.09   0.854  0.976
5 Location-shift     0.351 -0.151  0.120
```
Across all of these metrics, the new diagrammatic scale (SAD) shows improved validation parameters. However, are these differences statistically significant?
Here, we can see a boxplot of the evaluations for each condition.
```r
theme_set(theme_r4pde())

df_cf2 |>
  ggplot(aes(session, values, color = session)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(width = 0.23, size = 2, alpha = 0.5) +
  scale_color_manual(values = c(
    "Unaided" = "black",
    "old" = "#ff7f0e",
    "new" = "#3A5FCD"
  )) +
  theme_bw()
```
For each condition, a Shapiro-Wilk test was performed to check the normality of the data. The p-value is very small for each evaluation, indicating that the data do not follow a normal distribution.
```r
# normality test
shapiro.test(unaided_long$values)
```

```
	Shapiro-Wilk normality test

data:  unaided_long$values
W = 0.87216, p-value = 2.684e-13
```

```r
shapiro.test(old_long$values)
```

```
	Shapiro-Wilk normality test

data:  old_long$values
W = 0.7936, p-value < 2.2e-16
```

```r
shapiro.test(new_long$values)
```

```
	Shapiro-Wilk normality test

data:  new_long$values
W = 0.8656, p-value = 1.135e-13
```
Additionally, homoscedasticity was checked with Levene's test to verify whether the variances are equal across the evaluation methods. There is evidence that the variances of the three groups differ (heteroscedasticity).
```r
library(car)
leveneTest(values ~ session, data = df_cf2)
```

```
Levene's Test for Homogeneity of Variance (center = median)
       Df F value   Pr(>F)   
group   2  6.2627 0.002012 **
      717                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
# GLMM
Due to the lack of normality, and considering the dependency in the data (the same evaluator assessed the same leaf in different sessions), a generalized linear mixed-effects model was fitted, with severity expressed as a proportion, a beta distribution, and a random intercept for rater.
```r
library(glmmTMB)

mix2 <- glmmTMB((values / 100) ~ session + (1 | rater),
                data = df_cf2,
                family = beta_family())

library(DHARMa)
plot(simulateResiduals(mix2))
```
An analysis of variance (Type II Wald test) was performed, which indicated a highly significant effect of the type of evaluation (with or without a scale) on disease severity (χ² = 45.08; df = 2; p < 0.001), showing that there are differences between the evaluation methods.
```r
car::Anova(mix2)
```

```
Analysis of Deviance Table (Type II Wald chisquare tests)

Response: (values/100)
          Chisq Df Pr(>Chisq)    
session  45.078  2  1.627e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
When the estimated marginal means were contrasted, significant differences were observed between all evaluation methods. P-values were adjusted using the Tukey method.
```r
library(emmeans)

em2 <- emmeans(mix2, ~ session)
pairs(em2)
```

```
 contrast      estimate     SE  df z.ratio p.value
 Unaided - old    0.551 0.0826 Inf   6.678  <.0001
 Unaided - new    0.306 0.0809 Inf   3.783  0.0005
 old - new       -0.245 0.0838 Inf  -2.925  0.0096

Results are given on the log odds ratio (not the response) scale.
P value adjustment: tukey method for comparing a family of 3 estimates
```
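Because these contrasts are reported on the log-odds scale, it can also be convenient to inspect the estimates back-transformed to the proportion scale. A minimal sketch using the standard `type = "response"` argument of emmeans (not part of the original analysis):

```r
# Estimated mean severity (as a proportion) per session, back-transformed
# from the logit scale used by the beta GLMM
emmeans(mix2, ~ session, type = "response")

# Pairwise comparisons expressed as odds ratios instead of log odds ratios
pairs(emmeans(mix2, ~ session), type = "response")
```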
# Results

Based on our results, it is possible to infer that the new diagrammatic scale (SAD) statistically improves the severity assessments of coffee leaf rust, performing even better than the previous scale, which had already shown good results.