Analysis of Data from International Educational Studies in R Using the Rrepest Package
1 Introduction
The Rrepest package in R (Ilizaliturri et al., 2025) is an extension of the approach used in the repest module for Stata, originally developed for analysts at the Organisation for Economic Co-operation and Development (OECD) (Avvisati & Keslair, 2014). It was designed to facilitate the analysis of data from international educational studies conducted by OECD and the International Association for the Evaluation of Educational Achievement (IEA), such as:
- PISA, PIAAC, TALIS, SSES (OECD),
- TIMSS, PIRLS, ICCS, ICILS (IEA).
The package enables the accurate analysis of results in accordance with the methodology used in the aforementioned studies, supporting functionalities such as:
- calculation of means, proportions, and grouped frequencies,
- correlation and covariance analysis and linear regression,
- comparative tests between groups and the calculation of dispersion measures.
Rrepest automates the complex aspects of analysing data from international educational surveys, taking into account the intricate sampling designs and scaling techniques used in those studies. The package incorporates sampling weights, replicate weights, and plausible values (PVs) in its calculations, ensuring the proper estimation of measurement uncertainty. As a result, it facilitates working with large-scale datasets and allows for faster analyses that are consistent with OECD and IEA methodological standards.
Full Documentation – a PDF file containing detailed descriptions of the package’s functions along with practical examples of its use in the analysis of international educational survey data.
README on CRAN – a brief installation guide and an overview of the package’s main features, available on the package’s CRAN page.
Repository/wiki with examples (OECD GitLab) – a site maintained by the OECD where you can find package updates and additional examples of analyses.
1.1 Why use Rrepest?
Rrepest package
- Automation: it simplifies the processing of complex data by automatically accounting for OECD and IEA methodological requirements, such as replicate weights and plausible values (PVs). This eliminates the need to manually implement complex statistical formulas.
- Convenience: the
Rrepest()function provides a clear and user-friendly syntax for calculating descriptive statistics, correlations, linear regressions, group comparisons, and generating output tables. - Flexibility: the package supports the analysis of data from multiple international educational studies.
- Support: it is actively used and continuously developed by the OECD.
- Speed: the package utilizes parallel computing, distributing tasks across multiple CPU cores, allowing for up to twice as fast data processing compared to alternative solutions - especially when working with large datasets.
1.2 Limitations of Rrepest
Rrepest package
- The package supports only basic types of analyses: means, frequency distributions, group comparisons, correlations, and linear and logistic regressions. It does not support such features as estimating skill levels (benchmarks), percentiles, or complex modeling (e.g., multilevel models).
- Logistic regression performed using the
est("log", ...)parameter is only supported for OECD studies (e.g., PISA, TALIS). - In linear regression performed using the
Rrepest()function, all predictors are treated as continuous variables – even if they have been recoded as factors. - If SPSS value labels remain in the dataset, column names in the output tables will reflect category labels and will be sorted alphabetically, which may reduce the readability of results.
1.3 Supported Studies
| Study (Organization) | Supported by Rrepest |
|---|---|
| PISA (OECD) | ✔ |
| PIAAC (OECD) | ✔ |
| TALIS (OECD) | ✔ |
| SSES (OECD) | ✔ |
| TIMSS (IEA) | ✔ |
| PIRLS (IEA) | ✔ |
| ICCS (IEA) | ✔ |
| ICILS (IEA) | ✔ |
2 Installing Packages and Preparing Data for Analysis
To get started, you need to install the package (this step is required only once):
# Install the Rrepest package
install.packages("Rrepest")
# Install additional packages for working with data
install.packages(c("haven", "tidyverse"))
# haven: a package for importing .sav, .sas7bdat, and .dta files into R
# tidyverse: a collection of packages that streamline data analysis,
# including dplyr, purrr, readr, stringr, and others We present several best practices that can help you work more effectively with the Rrepest package below.
Rrepest Package:
Standardize variable names: Convert all variable names to lowercase (e.g.,
names(df) <- tolower(names(df))) to avoid issues with case sensitivity in theRrepest()function. Some of its arguments (e.g.,over) are case-sensitive with respect to the variable names provided as values.Check your data: Verify the structure of the dataset and the presence of missing values:
- Data structure:
glimpse(df),summary(df) - Number of observations per category:
count(df, variable) - Number of missing values per column: e.g.,
colSums(is.na(df))
- Data structure:
Reproducibility: Keep your code in scripts and document file paths and data transformations to ensure the reproducibility of your analyses.
Cross-check results: Compare your outputs with official OECD/IEA reports or the IEA IDB Analyzer tool to ensure alignment with the survey methodology.
3 Syntax and Key Features of the Rrepest() Function
Basic function syntax with its main arguments:
Rrepest(
data = <data frame>,
svy = <name of the study ( "ALL"|"IALS"|"IELS"|"PIAAC"|
"PIRLS"|"PISA"|"PISAOOS"|"SSES"|"TALIS" ("TALISSCH"/"TALISTCH")|
"TALIS3S" ("TALISEC_STAFF"/ "TALISEC_LEADER")"|"TIMSS"| "ICCS" (ICCS_T)"|
"ICILS" | "SVY" (custom survey))>,
est = est(
<"mean"|"var"|"std"|"quant"|"iqr"|"freq"|"corr"|"lm"|"log"|"cov"|"gen">,
target = <variable>,
regressor = <variable | vector of variables>
# the "regressor" argument is used for "lm" and "log" estimators;
# when using the "gen" estimator, you can specify a custom model
# without needing to provide "target" or "regressor"
),
by = <variable | vector of variables>,
over = <variable | vector of variables>,
test = TRUE/FALSE,
n.pvs = <number of plausible values (PVs)>,
cm.weights = c(<main weight>, <replicate weights_1:K>),
var.factor = <variance scaling factor>,
show_na = TRUE/FALSE
)Key Arguments of the Rrepest() Function:
svy– the name of the study being used in the analysis.est()– built-in estimators include"mean","var","std","quant","iqr","freq","corr","lm","log","cov", and the generalgenoption for running custom R models.target– a single variable or multiple PV imputations using the@symbol syntax (e.g.,PVREAD@,PVNUM@,ASMMAT0@). This syntax can also be used inby,over, andregressorarguments to specify PV variables.by– specifies grouping variable(s) to be used in estimates/tables.over+test=TRUE– combined group analysis with statistical tests for differences between levels of the variable specified in theoverargument.n.pvs– number of plausible values used when averaging results across imputations (by default,Rrepest()uses 5 PVs).cm.weights– the main weight and a set of replication weights (e.g.,totwgt,jr1:jrK). Setting this parameter is necessary for some analyses where it’s not possible to use the weights included by default in theRrepest()function.var.factor– a variance scaling factor based on the sample design. Setting this parameter is necessary for some analyses that do not use the weights included by default in theRrepest()function.show_na– if set to TRUE, missing values will be treated as a separate category in frequency tables.
by vs. over
by– performs the analysis separately for each group, treating them as independent datasets. It does not allow for the direct testing of differences between groups.overcombined withtest=TRUE– analyzes all groups jointly and automatically computes differences between them, along with standard errors.
3.1 Data Structure in International Educational Studies
To ensure that Rrepest functions properly, key variables in the dataset—such as plausible values (PVs) and analytical weights—must follow specific naming conventions that match those used in a given study cycle.
3.1.1 Variable Names for Plausible Values (PVs)
Plausible values (PVs) are sets of multiple imputations of latent (unobservable) traits—such as mathematical ability—that cannot be directly measured. They are used to better estimate measurement error. The Rrepest() function recognizes PVs through variable names that include the @ symbol (e.g., pv@math, pv@read, pv@civ). The function assumes that multiple imputations are available for each latent trait (e.g., pv1math, …, pvKmath, where K is the number of PVs provided in a given study).
By default, the
Rrepest()function assumes that 5 imputations are available.If the study includes a different number of plausible values (for example, there are 10 PVs in PISA cycles from 2015 onward), you should explicitly specify the correct number using the
n.pvsargument (e.g., n.pvs = 10).
3.1.2 Names of Analytical Weight Variables
Analytical weights are another type of key variable in international educational studies. For the Rrepest() function to work correctly, the names of these variables must match the expected format for a given study. Each study includes a main (total/final) weight variable (e.g., w_fstuwt) and a set of replicate weights (e.g., w_fsturwt1, ..., w_fsturwt80). The names and number of replicate weights differ between studies. In most cases, the Rrepest() function automatically identifies the correct weight variables in the dataset based on the study name specified in the svy argument.
Rrepest() function
- PISA – main weight:
w_fstuwt; 80 replicate weights:w_fsturwt1, ..., w_fsturwt80(student-level weights). - ICCS/ICILS – main weight:
totwgts; 75 replicate weights:srwgt1, ..., srwgt75(student-level weights).
For studies such as TIMSS and PIRLS (as well as other studies using the Jackknife Repeated Replication (JRR) weighting scheme), replicate weights must be manually recalculated and added to the dataset. Instructions for doing so are provided in the practical section of this guide.
Some studies differentiate study designs based on the analyzed respondent groups. For example, in the TALIS study, in the
svyargument, we can specify"TALISTCH"if we are analyzing teacher data, or"TALISSCH"if we are analyzing school data. Similarly, in the ICCS study, we can specify the study name in thesvyargument with the suffix_T(i.e.,"ICCS_T") if we are analyzing teacher data. Depending on the value assigned to thesvyargument, Rrepest will automatically apply the appropriate analytical weights.The methodology of TIMSS and PIRLS changed for study cycles conducted after 2011. In cycles prior to and including 2011, 150 replicate weights and one plausible value were used, whereas in subsequent cycles, 250 replicate weights and five plausible values were implemented.
4 Downloading Data
Data and documentation (e.g., codebooks, questionnaires and guides) for the studies can be accessed through the following websites:
- PISA: https://www.oecd.org/pisa/data/
- TIMSS/PIRLS/ICCS/ICILS: https://www.iea.nl/data-tools/repository
- PIAAC: https://www.oecd.org/skills/piaac/data/
- TALIS: https://www.oecd.org/en/about/programmes/talis.html#data
- SSES: https://www.oecd.org/en/about/programmes/oecd-survey-on-social-and-emotional-skills.html#data
In the case of OECD studies, you may be required to complete a short registration form before downloading the datasets.
4.1 Data Files Structure
Before starting the analysis, data from the international studies must be imported into the R environment, which requires an understanding of the complex structure of the files containing them. This structure differs between OECD and IEA studies.
4.1.1 IEA Data Structure
IEA datasets are typically divided into a large number of files, grouped by country, grade level, and the specific assessment instrument used. For example, after downloading the TIMSS 2023 Grade 4 data, you may find over 500 files in the extracted folder.
Example: the asapolm8.sav file contains data from Polish students from the TIMSS 2023 study for grade 4 in .sav format.
Where:
asa: student responses and achievement scores (Grade 4)pol: country code (Poland)m8: study cycle (here: cycle 8, conducted in 2023)
The first letter in the file name indicates the grade level:
a– Grade 4b– Grade 8
Subsequent letters specify the data type:
asa/bsa– student results and plausible values (PVs)asp/bsp– process data (e.g., response times)ash– home questionnaire dataasg/bsg– student questionnaire dataacg/bcg– school questionnaire dataatg/btg– teacher questionnaire data
4.1.2 OECD Data Structure
In OECD studies, data from all participating countries are typically combined into a single, large file, rather than being split by country. The files are organized by study cycle and data type.
Example: the file CY08MSP_STU_QQQ.sav contains student questionnaire data from all countries participating in the PISA 2022 study.
Where:
CY08– study cycle (here: cycle 8, conducted in 2022)MSP– Main StudySTU_QQQ– student questionnaire data
Other file identifiers include:
SCH_QQQ– school questionnaire dataTCH_QQQ– teacher questionnaire dataSTU_COG– student cognitive test results (e.g., reading, mathematics, science, etc.)STU_FLT– financial literacy test resultsSTU_ICT– ICT familiarity questionnaireSTU_WBQ– student well-being questionnaire
5 PISA 2022 Data Analysis
5.1 Preparing the Data
# Step 1: Specify the folder containing the SPSS files
dir_pisa_2022 <- "C:/DATA/PISA 2022/Data_SPSS"
# Step 2: Load the student questionnaire data (STU_QQQ)
data_students <- read_sav(paste0(dir_pisa_2022, "/CY08MSP_STU_QQQ.sav"))
# Step 3: Select countries for the analysis (Australia, Canada, Peru, Poland)
# and filter the dataset
countries <- c("AUS", "CAN", "PER", "POL")
pisa_2022 <- filter(data_students, CNT %in% countries)
# Step 4: Convert variable names to lowercase
names(pisa_2022) <- tolower(names(pisa_2022))5.2 Example 1 – Mean and Standard Deviation of Reading Scores by Country and Gender
Analysis_01 <- Rrepest(
data = pisa_2022,
svy = "PISA",
est = est(c("mean", "std"), target = "pv@read"),
by = c("cnt", "st004d01t"),
n.pvs = 10 # `Rrepest()` uses 5 PVs by default. For PISA 2022,
# set `n.pvs = 10` to reflect the correct number
# of plausible values.
)
print(Analysis_01)5.3 Example 2 – Frequency Table of Students’ Gender
Analysis_02 <- Rrepest(
data = pisa_2022,
svy = "PISA",
est = est("freq", target = "st004d01t"),
by = "cnt",
show_na = TRUE
)
print(Analysis_02, width = 200)5.4 Example 3 – Linear Regression: The Effect of Socio-Economic Status on Student Performance in Mathematics
Analysis_03 <- Rrepest(
data = pisa_2022,
svy = "PISA",
by = "cnt",
est = est("lm", target = "pv@math", regressor = "escs"),
n.pvs = 10,
show_na = TRUE
)
print(Analysis_03, width = 200)An Example Code for Improving the Readability of the Output Table
Table_03<- Analysis_03 %>%
transmute(
Country = cnt,
Intercept_Estimate = rowMeans(select(., starts_with("b.reg_pv") &
ends_with("intercept"))),
Intercept_SE = rowMeans(select(., starts_with("se.reg_pv") &
ends_with("intercept"))),
Intercept_t = Intercept_Estimate / Intercept_SE,
ESCS_Estimate = rowMeans(select(., starts_with("b.reg_pv") &
ends_with("escs"))),
ESCS_SE = rowMeans(select(., starts_with("se.reg_pv") &
ends_with("escs"))),
ESCS_t = ESCS_Estimate / ESCS_SE,
R2_Estimate = rowMeans(select(., starts_with("b.reg_pv") &
ends_with("rsqr"))),
R2_SE = rowMeans(select(., starts_with("se.reg_pv") &
ends_with("rsqr"))),
R2_t = R2_Estimate / R2_SE
)
make_reg_table <- function(row) {
data.frame(
Estimate = c(row$Intercept_Estimate,
row$ESCS_Estimate,
row$R2_Estimate),
`Std. Error` = c(row$Intercept_SE,
row$ESCS_SE,
row$R2_SE),
`t value` = c(row$Intercept_t,
row$ESCS_t,
row$R2_t),
row.names = c("(Intercept)", "ESCS", "R-squared")
)
}
final_table_03 <- setNames(
lapply(1:nrow(Table_03),
function(i) make_reg_table(Table_03[i, ])),
Table_03$Country
)
print(final_table_03)5.5 Example 4 - Logistic Regression: The Impact of Family Socioeconomic Status on Student Achievement in Mathematics (Obtaining a Score of at least 626)
# To implement logistic regression, we first need to create a new binary
# variable taking the value "1" or "0", depending on whether a given
# student achieved a score equal to or higher than 626 points. A separate
# variable is created for each PV imputation.
for (i in 1:10) {
varname <- paste0("pv", i, "math")
newvar <- paste0("y", i)
pisa_2022[[newvar]] <- ifelse(pisa_2022[[varname]] >= 626, 1, 0)
}
Analysis_04 <- Rrepest(
data = pisa_2022,
svy = "PISA",
est = est("log", target = "y@", regressor = "escs"),
n.pvs = 10,
by = "cnt"
)
print(Analysis_04, width = 200)An Example Code for Improving the Readability of the Output Table
Table_04 <- Analysis_04 %>%
transmute(
Country = cnt,
Coef = rowMeans(select(., starts_with("b.log") &
ends_with("intercept"))),
Std_Error = rowMeans(select(., starts_with("se.log") &
ends_with("intercept"))),
t_value = round(Coef / Std_Error, 2),
ESCS = rowMeans(select(., starts_with("b.log") &
ends_with("escs"))),
ESCS_SE = rowMeans(select(., starts_with("se.log") &
ends_with("escs"))),
ESCS_t = round(ESCS / ESCS_SE, 2),
OR = round(exp(Coef), 2),
CI95low = round(exp(Coef - 1.96 * `Std_Error`), 2),
CI95up = round(exp(Coef + 1.96 * `Std_Error`), 2),
OR_ESCS = round(exp(ESCS), 2),
CI95low_ESCS = round(exp(ESCS - 1.96 * `ESCS_SE`), 2),
CI95up_ESCS = round(exp(ESCS + 1.96 * `ESCS_SE`), 2)
)
make_logreg_table <- function(row) {
data.frame(
Coef = c(row$Coef, row$ESCS),
`Std. Error`= c(row$Std_Error, row$ESCS_SE),
`t value` = c(row$t_value, row$ESCS_t),
OR = c(row$OR, row$OR_ESCS),
CI95low = c(row$CI95low, row$CI95low_ESCS),
CI95up = c(row$CI95up, row$CI95up_ESCS),
row.names = c("(Intercept)", "ESCS")
)
}
final_table_04 <- setNames(lapply(1:nrow(Table_04),
function(i) make_logreg_table(Table_04[i, ])),
Table_04$Country)
print(final_table_04)6 TIMSS 2023 (Grade 4) Data Analysis
6.1 Step-by-Step Data Preparation
Preparing TIMSS data for analysis with the Rrepest() function is a multi-step process. This is due to both the structure of the data files and the specific methodological requirements of the TIMSS study.
6.1.1 Merging the Data
# Step 1: Specify the folder containing the SPSS files
dir_timss_2023 <- "C:/DATA/TIMSS2023_IDB_SPSS_G4/2_Data Files/SPSS Data"
# Step 2: Create lists of files for selected countries (Armenia, Australia,
# Bahrain, Poland) and specific datasets:
# Student achievement files (ASA), containing test scores
timss_2023_asa_files <- list.files(dir_timss_2023,
pattern = ".*ASA(POL|AUS|ARM|BHR).*\\.sav$",
ignore.case = TRUE, full.names = TRUE)
# Student context files (ASG), containing student questionnaire data
timss_2023_asg_files <- list.files(dir_timss_2023,
pattern = ".*ASG(POL|AUS|ARM|BHR).*\\.sav$",
ignore.case = TRUE, full.names = TRUE)
# School context files (ACG), containing school questionnaire data
timss_2023_acg_files <- list.files(dir_timss_2023,
pattern = ".*ACG(POL|AUS|ARM|BHR).*\\.sav$",
ignore.case = TRUE, full.names = TRUE)
# Home context files (ASH), containing parent questionnaire data
timss_2023_ash_files <- list.files(dir_timss_2023,
pattern = ".*ASH(POL|AUS|ARM|BHR).*\\.sav$",
ignore.case = TRUE, full.names = TRUE)
# Step 3: Merge questionnaire data from selected countries
# using the previously created file lists
timss_2023_asa <- timss_2023_asa_files %>% map(read_sav) %>%
bind_rows()
timss_2023_asg <- timss_2023_asg_files %>% map(read_sav) %>%
bind_rows()
timss_2023_acg <- timss_2023_acg_files %>% map(read_sav) %>%
bind_rows()
timss_2023_ash <- timss_2023_ash_files %>% map(read_sav) %>%
bind_rows()When merging datasets from different questionnaires using the left_join() function, the "by" parameter should include all variables that are duplicated in the datasets being merged. Depending on the datasets, these may be variables such as school or student identifiers (e.g., “IDSCHOOL” and “IDSTUD”), but also variables containing PVs. Otherwise, the left_join() function will create duplicate columns for the overlapping variables, appending suffixes like .x and .y to the column names (e.g., “variable01.x”, “variable01.y”). This can lead to confusion and complicate your analysis later on.
# Step 4: Merging datasets from all questionnaires
timss_2023 <- left_join(
timss_2023_asa,
select(timss_2023_acg, -any_of(names(timss_2023_asa)),
CTY,
IDSCHOOL),
by = c("CTY", "IDSCHOOL")
) %>% left_join(
select(timss_2023_asg, -any_of(names(timss_2023_asa)),
CTY,
IDSCHOOL,
IDSTUD),
by = c("CTY", "IDSCHOOL", "IDSTUD")
) %>% left_join(
select(timss_2023_ash, -any_of(names(timss_2023_asa)),
CTY,
IDCNTRY,
IDSCHOOL,
IDSTUD),
by = c("CTY", "IDCNTRY", "IDSCHOOL", "IDSTUD")
)6.1.2 Creation of 250 replicate weights using the Jackknife Repeated Replication (JRR) method applied in the TIMSS 2023 study
# Step 5: Creation of 250 replicate weights using the JRR method
for (i in 1:125) {
j <- 125 + i
timss_2023 <- timss_2023 %>%
mutate(!!paste0("jr", i) := case_when(
JKZONE == i & JKREP == 1 ~ 2 * TOTWGT,
JKZONE == i & JKREP == 0 ~ 0,
TRUE ~ TOTWGT
)) %>%
mutate(!!paste0("jr", j) := case_when(
JKZONE == i & JKREP == 0 ~ 2 * TOTWGT,
JKZONE == i & JKREP == 1 ~ 0,
TRUE ~ TOTWGT
))}6.1.3 Converting variable names to lowercase
# Step 6: Convert variable names to lowercase
names(timss_2023) <- tolower(names(timss_2023))In TIMSS/PIRLS analyses you should use the custom weights parameter cm.weights = c("totwgt", paste0("jr", 1:250)), specifying the main weight totwgt and the previously created replicate weights (jr1, ..., jr250). To correctly calculate standard errors that account for the replicate weights, it is also necessary to specify the variance factor argument: var.factor = 125. The value of the variance factor depends on the sampling design applied. To ensure the validity of your analyses, make sure to review the documentation on the replicate weight construction method used in the given study cycle.
6.2 Example 1 – Mean of Mathematics Scores by Country and Gender
Analysis_01 <- Rrepest(
data = timss_2023,
svy = "TIMSS",
cm.weights = c("totwgt", paste0("jr", 1:250)),
est = est(c("mean", "std"), target = "asmmat0@"),
by = c("cty", "itsex"),
var.factor = 125
)
print(Analysis_01)6.3 Example 2 – Mean of Mathematics Scores by Country and Gender, Using the over Parameter and Showing the Difference in Scores Between Genders (Boys - Girls) in Each Country
Analysis_02 <- Rrepest(
data = timss_2023,
svy = "TIMSS",
cm.weights = c("totwgt", paste0("jr", 1:250)),
est = est("mean", target = "asmmat0@"),
by = "cty",
over = "itsex",
test = TRUE,
var.factor = 125
)
print(Analysis_02, width = 200)6.4 Example 3 – Frequency Table of Students’ Gender
Analysis_03 <- Rrepest(
data = timss_2023,
svy = "TIMSS",
cm.weights = c("totwgt", paste0("jr", 1:250)),
est = est("freq", "itsex"),
by = "cty",
show_na = TRUE,
var.factor = 125
)
print(Analysis_03)6.5 Example 4 – Linear Regression: The Effect of Student Gender and Lack of Early Counting Activities on Students’ Mathematics Achievement
Analysis_04 <- Rrepest(
data = timss_2023,
svy = "TIMSS",
cm.weights = c("totwgt", paste0("jr", 1:250)),
by = "cty",
est = est("lm", target = "asmmat0@", regressor = c("itsex", "asbh01k")),
var.factor = 125
)
print(Analysis_04)An Example Code for Improving the Readability of the Output Table
Table_04 <- Analysis_04 %>%
transmute(
Country = cty,
Intercept_Estimate = rowMeans(select(., starts_with("b.reg_asmmat") &
ends_with("intercept"))),
Intercept_SE = rowMeans(select(., starts_with("se.reg_asmmat") &
ends_with("intercept"))),
Intercept_t = Intercept_Estimate / Intercept_SE,
ITSEX_Estimate = rowMeans(select(., starts_with("b.reg_asmmat") &
ends_with("itsex"))),
ITSEX_SE = rowMeans(select(., starts_with("se.reg_asmmat") &
ends_with("itsex"))),
ITSEX_t = ITSEX_Estimate / ITSEX_SE,
ASBH01K_Estimate = rowMeans(select(., starts_with("b.reg_asmmat") &
ends_with("asbh01k"))),
ASBH01K_SE = rowMeans(select(., starts_with("se.reg_asmmat") &
ends_with("asbh01k"))),
ASBH01K_t = ASBH01K_Estimate / ASBH01K_SE,
R2_Estimate = rowMeans(select(., starts_with("b.reg_asmmat") &
ends_with("rsqr"))),
R2_SE = rowMeans(select(., starts_with("se.reg_asmmat") &
ends_with("rsqr"))),
R2_t = R2_Estimate / R2_SE
)
make_reg_table <- function(row) {
data.frame(
Estimate = c(row$Intercept_Estimate,
row$ITSEX_Estimate,
row$ASBH01K_Estimate,
row$R2_Estimate),
`Std. Error` = c(row$Intercept_SE,
row$ITSEX_SE,
row$ASBH01K_SE,
row$R2_SE),
`t value` = c(row$Intercept_t,
row$ITSEX_t,
row$ASBH01K_t,
row$R2_t),
row.names = c("(Intercept)", "ITSEX", "ASBH01K", "R-squared")
)
}
final_table_04 <- setNames(lapply(1:nrow(Table_04),
function(i) make_reg_table(Table_04[i, ])),
Table_04$Country)
print(final_table_04)6.6 Analyses of Mathematics Teachers’ Data (TIMSS 2023)
To analyze teacher data, we need to merge files with the prefix ATG (teacher context), containing data from the teacher questionnaire, with files prefixed AST (student-teacher linkage), because this dataset includes variables with main weights assigned to teachers (“TCHWGT” – all teachers, “MATWGT” – mathematics teachers, “SCIWGT” – science teachers).
6.6.1 Data Preparation and Construction of Replicate Weights for Teachers in Accordance with the Jackknife Repeated Replication (JRR) Method Used in the TIMSS 2023 Study
# Step 1: Create lists of files containing teacher and student-teacher linkage
# data for selected countries
# ATG files – teacher context
# AST files - student-teacher linkage
timss_2023_atg_files <- list.files(dir_timss_2023,
pattern = ".*ATG(POL|AUS|ARM|BHR).*\\.sav$",
ignore.case = TRUE,
full.names = TRUE)
timss_2023_ast_files <- list.files(dir_timss_2023,
pattern = ".*AST(POL|AUS|ARM|BHR).*\\.sav$",
ignore.case = TRUE,
full.names = TRUE)
timss_2023_atg <- timss_2023_atg_files %>% map(read_sav) %>% bind_rows()
timss_2023_ast <- timss_2023_ast_files %>% map(read_sav) %>% bind_rows()
# Step 2: Merge the data from selected countries using the previously
# created file lists
timss_2023_teacher <- left_join(
timss_2023_atg,
select(timss_2023_ast, -any_of(names(timss_2023_atg)),
CTY,
IDCNTRY,
IDSCHOOL,
IDTEACH,
IDTEALIN),
by = c("CTY", "IDCNTRY", "IDSCHOOL", "IDTEACH", "IDTEALIN")
)
# Step 3: Create 250 replicate weights with the JRR method,
# using the main weight for mathematics teachers (MATWGT)
for (i in 1:125) {
j <- 125 + i
timss_2023_teacher <- timss_2023_teacher %>%
mutate(!!paste0("jr", i) := case_when(
JKZONE == i & JKREP == 1 ~ 2 * MATWGT,
JKZONE == i & JKREP == 0 ~ 0,
TRUE ~ MATWGT
)) %>%
mutate(!!paste0("jr", j) := case_when(
JKZONE == i & JKREP == 0 ~ 2 * MATWGT,
JKZONE == i & JKREP == 1 ~ 0,
TRUE ~ MATWGT
))
}
# Step 4: Keep only mathematics teachers in the dataset for further analysis
# (remove all rows where the variable "MATWGT" has a missing value)
timss_2023_teacher <- timss_2023_teacher %>% filter(!is.na(MATWGT))
# Step 5: Convert variable names to lowercase
names(timss_2023_teacher) <- tolower(names(timss_2023_teacher))In analyzing mathematics teachers in the TIMSS 2023 study, you should use the custom weights parameter cm.weights = c("matwgt", paste0("jr", 1:250)), specifying the main weight matwgt and the previously created replicate weights (jr1, ..., jr250). To correctly calculate standard errors that account for the applied weights, you must also set the variance factor argument: var.factor = 125.
6.6.2 Example 5 – Percentage of Mathematics Teachers by Age Group
# recoding age into a categorical variable
timss_2023_teacher <- timss_2023_teacher %>%
dplyr::mutate(
atbg03 = factor(atbg03, levels = 1:6,
labels = c(
"Under 25",
"25–29",
"30–39",
"40–49",
"50–59",
"60 or more"))
)
Analysis_05 <- Rrepest(
data = timss_2023_teacher,
svy = "TIMSS",
cm.weights = c("matwgt", paste0("jr", 1:250)),
est = est("freq", target = "atbg03"),
by = "cty",
show_na = TRUE,
var.factor = 125
)
print(Analysis_05)6.6.3 Example 6 – Mean Years of Service for Mathematics Teachers
Analysis_06 <- Rrepest(
data = timss_2023_teacher,
svy = "TIMSS",
cm.weights = c("matwgt", paste0("jr", 1:250)),
est = est("mean", target = "atbg01"),
by = "cty",
var.factor = 125
)
print(Analysis_06)7 ICCS 2022 Data Analysis
Some countries participating in the ICCS 2022 study conducted it for the first time using computer-based testing. In these countries, an additional component of the study was implemented in the form of paper-based questionnaires, the so-called “bridge study”. The data from the “bridge” sample are not included in the official ICCS 2022 reporting. Therefore, in our analyses, we exclude them by selecting only files with the suffix “C4” in their name.
7.1 Preparing the Data
# Step 1: Specify the folder containing the SPSS files
dir_iccs_2022 <- "C:/DATA/ICCS 2022/SPSS Data/ICCS_2022_G8"
# Step 2: Create lists of files for selected countries (Croatia, Poland,
# Romania, Spain, Sweden) and specific datasets:
# Note: Exclude files from the "bridge" edition by selecting only files
# with the "C4" suffix in the filename
# Student Civic Knowledge files (ISA), containing scores relating
# to civic competence
iccs_isa_files <- list.files(dir_iccs_2022,
pattern = ".*ISA(POL|HRV|ESP|SWE|ROU)C4.*\\.sav$",
ignore.case = TRUE, full.names = TRUE)
# International Student Questionnaire files (ISG), containing student
# questionnaire data
iccs_isg_files <- list.files(dir_iccs_2022,
pattern = ".*ISG(POL|HRV|ESP|SWE|ROU)C4.*\\.sav$",
ignore.case = TRUE, full.names = TRUE)
# School Questionnaire files (ICG), containing school questionnaire data
iccs_icg_files <- list.files(dir_iccs_2022,
pattern = ".*ICG(POL|HRV|ESP|SWE|ROU)C4.*\\.sav$",
ignore.case = TRUE, full.names = TRUE)
# Teacher Questionnaire files (ITG), containing teacher questionnaire data
iccs_itg_files <- list.files(dir_iccs_2022,
pattern = ".*ITG(POL|HRV|ESP|SWE|ROU)C4.*\\.sav$",
ignore.case = TRUE, full.names = TRUE)
# Step 3: Merge questionnaire data from selected countries
# using the previously created file lists
iccs_2022_isa <- iccs_isa_files %>% map(read_sav) %>% bind_rows()
iccs_2022_isg <- iccs_isg_files %>% map(read_sav) %>% bind_rows()
iccs_2022_icg <- iccs_icg_files %>% map(read_sav) %>% bind_rows()
iccs_2022_itg <- iccs_itg_files %>% map(read_sav) %>% bind_rows()When merging datasets from the ICCS 2022 study containing student civic competence scores (files prefixed ISA) and student survey results (files prefixed ISG), we include plausible values, the main weight variable, and the student replicate weights) in the by argument of the left_join() function. This is necessary because these variables appear in both datasets.
# Step 4: Merge student datasets (civic competence results and student
# survey data) with the school questionnaire dataset
iccs_2022 <- left_join(
iccs_2022_isa,
iccs_2022_isg,
by = c("COUNTRY", "IDSCHOOL", "IDSTUD", "std_GENDER",
"TOTWGTS", paste0("PV", 1:5, "CIV"), paste0("SRWGT0", 1:9),
paste0("SRWGT", 10:75))
) %>% left_join(
iccs_2022_icg,
by = c("COUNTRY", "IDSCHOOL")
)The ICCS study allows us to treat the teacher dataset as independent from the student dataset and analyze teachers using the teacher weights.
# We do not merge the teacher data with other datasets because
# it is not needed in our example analyses.
iccs_2022_teacher <- iccs_2022_itg
# Step5: Convert variable names to lowercase
names(iccs_2022) <- tolower(names(iccs_2022))
names(iccs_2022_teacher) <- tolower(names(iccs_2022_teacher))In the case of the ICCS study, the Rrepest() function expects that the replicate weight variables take names without leading zeros for replicates numbered from 1 to 9 (e.g., “srwgt1,…, srwgt9” or “trwgt1,…, trwgt9”). However, in the ICCS 2022 dataset provided by the IEA, the names of the replicate weight variables include leading zeros, appearing as “srwgt01,…, srwgt09”, “trwgt01,…, trwgt09”. These variables must be manually renamed for the function to work correctly.
# Step 6: Remove leading zeros from replicate weight variable names in the
# student and teacher datasets
colnames(iccs_2022) <- gsub("^srwgt0([1-9])$", "srwgt\\1", colnames(iccs_2022))
colnames(iccs_2022_teacher) <- gsub(
"^trwgt0([1-9])$",
"trwgt\\1",
colnames(iccs_2022_teacher))7.2 Example 1 – Mean Student Scores in Civic Knowledge and Understanding of Civic Issues, Broken Down by Gender
Analysis_01 <- Rrepest(
data = iccs_2022,
svy = "ICCS",
est = est(c("mean", "std"), target = "pv@civ"),
by = c("country", "std_gender")
)
print(Analysis_01)7.3 Example 2 - Linear Regression: The Effect Of Student Gender on Scores in Civic Knowledge and Understanding of Civic Issues
Analysis_02 <- Rrepest(
data = iccs_2022,
svy = "ICCS",
by = "country",
est = est("lm", target = "pv@civ", regressor = c("std_gender"))
)
print(Analysis_02, width = 200)An Example Code for Improving the Readability of the Output Table
Table_02 <- Analysis_02 %>%
transmute(
country,
Intercept_Estimate = rowMeans(select(., starts_with("b.reg_pv@civ") &
ends_with("intercept"))),
Intercept_SE = rowMeans(select(., starts_with("se.reg_pv@civ") &
ends_with("intercept"))),
Intercept_t = Intercept_Estimate / Intercept_SE,
gender_Estimate = rowMeans(select(., starts_with("b.reg_pv@civ") &
ends_with("std_gender"))),
gender_SE = rowMeans(select(., starts_with("se.reg_pv@civ") &
ends_with("std_gender"))),
gender_t = gender_Estimate / gender_SE,
R2_Estimate = rowMeans(select(., starts_with("b.reg_pv@civ") &
ends_with("rsqr"))),
R2_SE = rowMeans(select(., starts_with("se.reg_pv@civ") &
ends_with("rsqr"))),
R2_t = R2_Estimate / R2_SE
)
make_reg_table <- function(row) {
data.frame(
Estimate = c(row$Intercept_Estimate,
row$gender_Estimate,
row$R2_Estimate),
`Std. Error` = c(row$Intercept_SE,
row$gender_SE,
row$R2_SE),
`t value` = c(row$Intercept_t,
row$gender_t,
row$R2_t),
row.names = c("(Intercept)", "gender", "R-squared")
)
}
Table_02 <- setNames(lapply(1:nrow(Table_02),
function(i) make_reg_table(Table_02[i, ])),
Table_02$country)
print(Table_02)7.4 Example 3 – Student Participation in School Life According to School Principals
7.4.1 Example 3A - How Often Are Students Encouraged to Contribute to Activities Planning? Percentage of Students in Schools Where Principals Responded in a Specified Way
Analysis_03a <- Rrepest(
data = iccs_2022,
svy = "ICCS",
est = est("freq", target = "ic4g03d"),
by = "country"
)
print(Analysis_03a, width = 200)7.4.2 Example 3B – How Often Are Students Involved in Creating Rules and Regulations? Percentage of Students in Schools Where Principals Responded in a Specified Way
Analysis_03b <- Rrepest(
data = iccs_2022,
svy = "ICCS",
est = est("freq", target = "ic4g03b"),
by = "country"
)
print(Analysis_03b, width = 200)7.5 Example 4 - Teachers’ Responses to the Question: “How Important Are the Following Behaviors for Being a Good Citizen in Adult Life: Engaging in Political Discussions”
In order for the Rrepest() function to automatically retrieve the weights assigned to teachers, set the argument ‘svy = “ICCS_T”’. If we set the argument ‘svy = “ICCS”’, it is necessary to specify the custom weights using the argument: cm.weights = c("totwgtt", paste0("trwgt", 1:75)).
Analysis_04 <- Rrepest(
data = iccs_2022_teacher,
svy = "ICCS_T",
est = est("freq", target = "it4g17e"),
by = "country"
)
print(Analysis_04, width = 300)8 TALIS 2018 Data Analysis
In the case of the TALIS study, analyses can be conducted at two independent levels:
TALISTCH – Teachers (default weights:
TCHWGTas the main weight and 100 replicate weightsTRWGT1, ..., TRWGT100).TALISSCH – School principals/Schools (default weights: SCHWGT as the main weight and 100 replicate weights
SRWGT1, ..., SRWGT100).
In order for the Rrepest() function to use the correct weight variables, it is necessary to set the svy argument correctly: svy = "TALISTCH" for analyses performed on the teachers’ dataset, or svy = "TALISSCH" for analyses performed on the school dataset (data from the school principals’ questionnaire).
8.1 Preparing the Data
# Step 1: Specify the folder containing the SPSS files
dir_talis_2018 <- "C:/DATA/TALIS 2018/SPSS Data/SPSS_2018_international"
# Step 2: Load the datasets:
# Teacher Data – Files with the Prefix BTG:
talis_2018_tch <- read_sav(file.path(dir_talis_2018, "BTGINTT3.sav"))
# School Data – Files with the Prefix BCG:
talis_2018_sch <- read_sav(file.path(dir_talis_2018, "BCGINTT3.sav"))
# Step 3: Select countries for the analysis (Colombia, Czech Republic,
# Japan, Kazakhstan) and filter the dataset
countries <- c("JPN", "CZE", "COL", "KAZ")
talis_2018_tch <- talis_2018_tch %>% filter(CNTRY %in% countries)
talis_2018_sch <- talis_2018_sch %>% filter(CNTRY %in% countries)
# Step 4: Convert variable names to lowercase
names(talis_2018_tch) <- tolower(names(talis_2018_tch))
names(talis_2018_sch) <- tolower(names(talis_2018_sch))8.2 Analyses (TALISTCH – Teacher Level)
8.2.1 Example 1 - Frequency Table of Teachers’ Gender
Analysis_01 <- Rrepest(
data = talis_2018_tch,
svy = "TALISTCH",
est = est("freq", target = "tt3g01"),
by = "cntry"
)
print(Analysis_01)8.2.2 Example 2 - Mean Number of Hours Teachers Spend Weekly on Communication and Cooperation with Parents
Analysis_02 <- Rrepest(
data = talis_2018_tch,
svy = "TALISTCH",
est = est(c("mean","std"), target = "tt3g18h"),
by = "cntry"
)
print(Analysis_02)8.2.3 Example 3 - Mean Number of Hours Teachers Spend Weekly on Professional Development Activities
Analysis_03 <- Rrepest(
data = talis_2018_tch,
svy = "TALISTCH",
est = est(c("mean","std"), target = "tt3g18g"),
by = "cntry"
)
print(Analysis_03)8.2.4 Example 4 - Linear Regression: The Effect of the Opportunity to Actively Participate in School Decisions on Teachers’ Overall Job Satisfaction
Analysis_04 <- Rrepest(
data = talis_2018_tch,
svy = "TALISTCH",
by = "cntry",
est = est("lm", target = "t3jobsa", regressor = c("tt3g48a"))
)
print(Analysis_04, width = 200)An Example Code for Improving the Readability of the Output Table
Table_04 <- Analysis_04 %>%
transmute(
Country = cntry,
Intercept_Estimate = rowMeans(select(., starts_with("b.reg_t3jobsa") &
ends_with("intercept"))),
Intercept_SE = rowMeans(select(., starts_with("se.reg_t3jobsa") &
ends_with("intercept"))),
Intercept_t = Intercept_Estimate / Intercept_SE,
tt3g48a_Estimate = rowMeans(select(., starts_with("b.reg_t3jobsa") &
ends_with("tt3g48a"))),
tt3g48a_SE = rowMeans(select(., starts_with("se.reg_t3jobsa") &
ends_with("tt3g48a"))),
tt3g48a_t = tt3g48a_Estimate / tt3g48a_SE,
R2_Estimate = rowMeans(select(., starts_with("b.reg_t3jobsa") &
ends_with("rsqr"))),
R2_SE = rowMeans(select(., starts_with("se.reg_t3jobsa") &
ends_with("rsqr"))),
R2_t = R2_Estimate / R2_SE
)
make_reg_table <- function(row) {
data.frame(
Estimate = c(row$Intercept_Estimate,
row$tt3g48a_Estimate,
row$R2_Estimate),
`Std. Error` = c(row$Intercept_SE,
row$tt3g48a_SE,
row$R2_SE),
`t value` = c(row$Intercept_t,
row$tt3g48a_t,
row$R2_t),
row.names = c("(Intercept)", "tt3g48a", "R-squared")
)
}
Final_04 <- setNames(lapply(1:nrow(Table_04),
function(i) make_reg_table(Table_04[i, ])),
Table_04$Country)
print(Final_04)8.3 Analyses (TALISSCH – School Level)
8.3.1 Example 5 - Mean Number of Years of Work Experience as a Principal at the Current School
Analysis_05 <- Rrepest(
data = talis_2018_sch,
svy = "TALISSCH",
est = est(c("mean","std"), target = "tc3g04a"),
by = "cntry"
)
print(Analysis_05)8.3.2 Example 6 - Percentage of School Principals Who Reported Collaborating with Teachers to Solve Classroom Discipline Problems
Analysis_06 <- Rrepest(
data = talis_2018_sch,
svy = "TALISSCH",
est = est("freq", target = "tc3g22a"),
by = "cntry"
)
print(Analysis_06, width = 200)9 Summary
The Rrepest package can be considered a convenient alternative to the IEA IDB Analyzer program for conducting analyses of data from international educational surveys, especially for those working primarily in the R environment. Rrepest enables the automation of many repetitive procedures, ensuring support for multiple studies and compliance of the conducted analyses with OECD and IEA research methodology. At the same time, it should be noted that the package does not cover the full range of analyses available, for example, in the IEA IDB Analyzer or in the intsvy package.
We encourage you to use the package in your own research projects, test its functionality in the analysis of data from international educational research, and to submit comments and improvement proposals directly to the package authors via the GitLab repository.
10 Bibliography
Avvisati, F., & Keslair, F. (2014). REPEST: Stata module to run estimations with weighted replicate samples and plausible values (Statistical Software Components S457918). Boston College Department of Economics. (Revised December 11, 2024).
Ilizaliturri, R., Avvisati, F., & Keslair, F. (2025). Rrepest: An analyzer of international large-scale assessments in education (Version 1.5.4) [R package]. CRAN. https://CRAN.R-project.org/package=Rrepest