Analysis of Data from International Educational Studies in R Using the Rrepest Package

Authors

Jakub Łobaszewski (j.lobaszewski@ibe.edu.pl)

Mateusz Kleczaj (m.kleczaj@ibe.edu.pl)

1 Introduction

The Rrepest package in R (Ilizaliturri et al., 2025) is an extension of the approach used in the repest module for Stata, originally developed for analysts at the Organisation for Economic Co-operation and Development (OECD) (Avvisati & Keslair, 2014). It was designed to facilitate the analysis of data from international educational studies conducted by OECD and the International Association for the Evaluation of Educational Achievement (IEA), such as:

  • PISA, PIAAC, TALIS, SSES (OECD),
  • TIMSS, PIRLS, ICCS, ICILS (IEA).

The package enables the accurate analysis of results in accordance with the methodology used in the aforementioned studies, supporting functionalities such as:

  • calculation of means, proportions, and grouped frequencies,
  • correlation and covariance analysis and linear regression,
  • comparative tests between groups and the calculation of dispersion measures.

Rrepest automates the complex aspects of analysing data from international educational surveys, taking into account the intricate sampling designs and scaling techniques used in those studies. The package incorporates sampling weights, replicate weights, and plausible values (PVs) in its calculations, ensuring the proper estimation of measurement uncertainty. As a result, it facilitates working with large-scale datasets and allows for faster analyses that are consistent with OECD and IEA methodological standards.

Additional Resources
  • Full Documentation – a PDF file containing detailed descriptions of the package’s functions along with practical examples of its use in the analysis of international educational survey data.

  • README on CRAN – a brief installation guide and an overview of the package’s main features, available on the package’s CRAN page.

  • Repository/wiki with examples (OECD GitLab) – a site maintained by the OECD where you can find package updates and additional examples of analyses.


1.1 Why use Rrepest?

Advantages of the Rrepest package
  • Automation: it simplifies the processing of complex data by automatically accounting for OECD and IEA methodological requirements, such as replicate weights and plausible values (PVs). This eliminates the need to manually implement complex statistical formulas.
  • Convenience: the Rrepest() function provides a clear and user-friendly syntax for calculating descriptive statistics, correlations, linear regressions, group comparisons, and generating output tables.
  • Flexibility: the package supports the analysis of data from multiple international educational studies.
  • Support: it is actively used and continuously developed by the OECD.
  • Speed: the package utilizes parallel computing, distributing tasks across multiple CPU cores, allowing for up to twice as fast data processing compared to alternative solutions - especially when working with large datasets.

1.2 Limitations of Rrepest

Important limitations of the Rrepest package
  • The package supports only basic types of analyses: means, frequency distributions, group comparisons, correlations, and linear and logistic regressions. It does not support such features as estimating skill levels (benchmarks), percentiles, or complex modeling (e.g., multilevel models).
  • Logistic regression performed using the est("log", ...) parameter is only supported for OECD studies (e.g., PISA, TALIS).
  • In linear regression performed using the Rrepest() function, all predictors are treated as continuous variables – even if they have been recoded as factors.
  • If SPSS value labels remain in the dataset, column names in the output tables will reflect category labels and will be sorted alphabetically, which may reduce the readability of results.

1.3 Supported Studies

Study (Organization) Supported by Rrepest
PISA (OECD)
PIAAC (OECD)
TALIS (OECD)
SSES (OECD)
TIMSS (IEA)
PIRLS (IEA)
ICCS (IEA)
ICILS (IEA)

2 Installing Packages and Preparing Data for Analysis

To get started, you need to install the package (this step is required only once):

# Install the Rrepest package
install.packages("Rrepest")

# Install additional packages for working with data
install.packages(c("haven", "tidyverse"))


# haven: a package for importing .sav, .sas7bdat, and .dta files into R  
# tidyverse: a collection of packages that streamline data analysis, 
# including dplyr, purrr, readr, stringr, and others  

We present several best practices that can help you work more effectively with the Rrepest package below.


Best Practices When Using the Rrepest Package:
  • Standardize variable names: Convert all variable names to lowercase (e.g., names(df) <- tolower(names(df))) to avoid issues with case sensitivity in the Rrepest() function. Some of its arguments (e.g., over) are case-sensitive with respect to the variable names provided as values.

  • Check your data: Verify the structure of the dataset and the presence of missing values:

    • Data structure: glimpse(df), summary(df)
    • Number of observations per category: count(df, variable)
    • Number of missing values per column: e.g., colSums(is.na(df))
  • Reproducibility: Keep your code in scripts and document file paths and data transformations to ensure the reproducibility of your analyses.

  • Cross-check results: Compare your outputs with official OECD/IEA reports or the IEA IDB Analyzer tool to ensure alignment with the survey methodology.


3 Syntax and Key Features of the Rrepest() Function

Basic function syntax with its main arguments:

Rrepest(
  data = <data frame>,
  svy  = <name of the study ( "ALL"|"IALS"|"IELS"|"PIAAC"|
  "PIRLS"|"PISA"|"PISAOOS"|"SSES"|"TALIS" ("TALISSCH"/"TALISTCH")|
  "TALIS3S" ("TALISEC_STAFF"/ "TALISEC_LEADER")"|"TIMSS"| "ICCS" (ICCS_T)"| 
  "ICILS" | "SVY" (custom survey))>,
    est  = est(
    <"mean"|"var"|"std"|"quant"|"iqr"|"freq"|"corr"|"lm"|"log"|"cov"|"gen">,
    target = <variable>,
    regressor = <variable | vector of variables>  
    # the "regressor" argument is used for "lm" and "log" estimators;  
    # when using the "gen" estimator, you can specify a custom model  
    # without needing to provide "target" or "regressor"
  ),
  by   = <variable | vector of variables>,
  over = <variable | vector of variables>,
  test = TRUE/FALSE,
  n.pvs      = <number of plausible values (PVs)>,        
  cm.weights = c(<main weight>, <replicate weights_1:K>), 
  var.factor = <variance scaling factor>,       
  show_na    = TRUE/FALSE       
)

Key Arguments of the Rrepest() Function:

  • svy – the name of the study being used in the analysis.
  • est() – built-in estimators include "mean", "var", "std", "quant", "iqr", "freq", "corr", "lm", "log", "cov", and the general gen option for running custom R models.
  • target – a single variable or multiple PV imputations using the @ symbol syntax (e.g., PVREAD@, PVNUM@, ASMMAT0@). This syntax can also be used in by, over, and regressor arguments to specify PV variables.
  • by – specifies grouping variable(s) to be used in estimates/tables.
  • over + test=TRUE – combined group analysis with statistical tests for differences between levels of the variable specified in the over argument.
  • n.pvs – number of plausible values used when averaging results across imputations (by default, Rrepest() uses 5 PVs).
  • cm.weights – the main weight and a set of replication weights (e.g., totwgt, jr1:jrK). Setting this parameter is necessary for some analyses where it’s not possible to use the weights included by default in the Rrepest() function.
  • var.factor – a variance scaling factor based on the sample design. Setting this parameter is necessary for some analyses that do not use the weights included by default in the Rrepest() function.
  • show_na – if set to TRUE, missing values will be treated as a separate category in frequency tables.
by vs. over
  • by – performs the analysis separately for each group, treating them as independent datasets. It does not allow for the direct testing of differences between groups.
  • over combined with test=TRUE – analyzes all groups jointly and automatically computes differences between them, along with standard errors.

3.1 Data Structure in International Educational Studies

To ensure that Rrepest functions properly, key variables in the dataset—such as plausible values (PVs) and analytical weights—must follow specific naming conventions that match those used in a given study cycle.

3.1.1 Variable Names for Plausible Values (PVs)

Plausible values (PVs) are sets of multiple imputations of latent (unobservable) traits—such as mathematical ability—that cannot be directly measured. They are used to better estimate measurement error. The Rrepest() function recognizes PVs through variable names that include the @ symbol (e.g., pv@math, pv@read, pv@civ). The function assumes that multiple imputations are available for each latent trait (e.g., pv1math, …, pvKmath, where K is the number of PVs provided in a given study).

Setting the Number of Plausible Values (PVs)
  • By default, the Rrepest() function assumes that 5 imputations are available.

  • If the study includes a different number of plausible values (for example, there are 10 PVs in PISA cycles from 2015 onward), you should explicitly specify the correct number using the n.pvs argument (e.g., n.pvs = 10).

3.1.2 Names of Analytical Weight Variables

Analytical weights are another type of key variable in international educational studies. For the Rrepest() function to work correctly, the names of these variables must match the expected format for a given study. Each study includes a main (total/final) weight variable (e.g., w_fstuwt) and a set of replicate weights (e.g., w_fsturwt1, ..., w_fsturwt80). The names and number of replicate weights differ between studies. In most cases, the Rrepest() function automatically identifies the correct weight variables in the dataset based on the study name specified in the svy argument.

Example Default Analytical Weight Settings in the Rrepest() function
  • PISA – main weight: w_fstuwt; 80 replicate weights: w_fsturwt1, ..., w_fsturwt80 (student-level weights).
  • ICCS/ICILS – main weight: totwgts; 75 replicate weights: srwgt1, ..., srwgt75 (student-level weights).
Important
  • For studies such as TIMSS and PIRLS (as well as other studies using the Jackknife Repeated Replication (JRR) weighting scheme), replicate weights must be manually recalculated and added to the dataset. Instructions for doing so are provided in the practical section of this guide.

  • Some studies differentiate study designs based on the analyzed respondent groups. For example, in the TALIS study, in the svy argument, we can specify "TALISTCH" if we are analyzing teacher data, or "TALISSCH" if we are analyzing school data. Similarly, in the ICCS study, we can specify the study name in the svy argument with the suffix _T (i.e., "ICCS_T") if we are analyzing teacher data. Depending on the value assigned to the svy argument, Rrepest will automatically apply the appropriate analytical weights.

  • The methodology of TIMSS and PIRLS changed for study cycles conducted after 2011. In cycles prior to and including 2011, 150 replicate weights and one plausible value were used, whereas in subsequent cycles, 250 replicate weights and five plausible values were implemented.


4 Downloading Data

Data and documentation (e.g., codebooks, questionnaires and guides) for the studies can be accessed through the following websites:

In the case of OECD studies, you may be required to complete a short registration form before downloading the datasets.


4.1 Data Files Structure

Before starting the analysis, data from the international studies must be imported into the R environment, which requires an understanding of the complex structure of the files containing them. This structure differs between OECD and IEA studies.

4.1.1 IEA Data Structure

IEA datasets are typically divided into a large number of files, grouped by country, grade level, and the specific assessment instrument used. For example, after downloading the TIMSS 2023 Grade 4 data, you may find over 500 files in the extracted folder.

File Naming Conventions in IEA Studies, using the TIMSS study as an example

Example: the asapolm8.sav file contains data from Polish students from the TIMSS 2023 study for grade 4 in .sav format.

Where:

  • asa: student responses and achievement scores (Grade 4)
  • pol: country code (Poland)
  • m8: study cycle (here: cycle 8, conducted in 2023)

The first letter in the file name indicates the grade level:

  • a – Grade 4
  • b – Grade 8

Subsequent letters specify the data type:

  • asa/bsa – student results and plausible values (PVs)
  • asp/bsp – process data (e.g., response times)
  • ash – home questionnaire data
  • asg/bsg – student questionnaire data
  • acg/bcg – school questionnaire data
  • atg/btg – teacher questionnaire data

4.1.2 OECD Data Structure

In OECD studies, data from all participating countries are typically combined into a single, large file, rather than being split by country. The files are organized by study cycle and data type.

File Naming Conventions in OECD Studies, using the PISA study as an example

Example: the file CY08MSP_STU_QQQ.sav contains student questionnaire data from all countries participating in the PISA 2022 study.

Where:

  • CY08 – study cycle (here: cycle 8, conducted in 2022)
  • MSP – Main Study
  • STU_QQQ – student questionnaire data

Other file identifiers include:

  • SCH_QQQ – school questionnaire data
  • TCH_QQQ – teacher questionnaire data
  • STU_COG – student cognitive test results (e.g., reading, mathematics, science, etc.)
  • STU_FLT – financial literacy test results
  • STU_ICT – ICT familiarity questionnaire
  • STU_WBQ – student well-being questionnaire

5 PISA 2022 Data Analysis

5.1 Preparing the Data


# Step 1: Specify the folder containing the SPSS files
dir_pisa_2022 <- "C:/DATA/PISA 2022/Data_SPSS"

# Step 2: Load the student questionnaire data (STU_QQQ)
data_students <- read_sav(paste0(dir_pisa_2022, "/CY08MSP_STU_QQQ.sav"))

# Step 3: Select countries for the analysis (Australia, Canada, Peru, Poland) 
# and filter the dataset
countries <- c("AUS", "CAN", "PER", "POL")
pisa_2022 <- filter(data_students, CNT %in% countries)

# Step 4: Convert variable names to lowercase
names(pisa_2022) <- tolower(names(pisa_2022))

5.2 Example 1 – Mean and Standard Deviation of Reading Scores by Country and Gender

Analysis_01 <- Rrepest(
  data = pisa_2022,
  svy  = "PISA",
  est  = est(c("mean", "std"), target = "pv@read"),
  by   = c("cnt", "st004d01t"),
  n.pvs = 10 # `Rrepest()` uses 5 PVs by default. For PISA 2022, 
              # set `n.pvs = 10` to reflect the correct number 
              # of plausible values.
)
print(Analysis_01)

5.3 Example 2 – Frequency Table of Students’ Gender

Analysis_02 <- Rrepest(
  data = pisa_2022,
  svy  = "PISA",
  est  = est("freq", target = "st004d01t"),
  by   = "cnt",
  show_na = TRUE
)
print(Analysis_02, width = 200)

5.4 Example 3 – Linear Regression: The Effect of Socio-Economic Status on Student Performance in Mathematics

Analysis_03 <- Rrepest(
  data = pisa_2022,
  svy  = "PISA",
  by   = "cnt",
  est  = est("lm", target = "pv@math", regressor = "escs"),
  n.pvs = 10,
  show_na = TRUE
)
print(Analysis_03, width = 200)
An Example Code for Improving the Readability of the Output Table
Table_03<- Analysis_03 %>%
  transmute(
    Country = cnt,
    Intercept_Estimate   = rowMeans(select(., starts_with("b.reg_pv") & 
                           ends_with("intercept"))),
    Intercept_SE         = rowMeans(select(., starts_with("se.reg_pv") & 
                           ends_with("intercept"))),
    Intercept_t          = Intercept_Estimate / Intercept_SE,
    
    ESCS_Estimate        = rowMeans(select(., starts_with("b.reg_pv") & 
                           ends_with("escs"))),
    ESCS_SE              = rowMeans(select(., starts_with("se.reg_pv") & 
                           ends_with("escs"))),
    ESCS_t               = ESCS_Estimate / ESCS_SE,
    
    R2_Estimate          = rowMeans(select(., starts_with("b.reg_pv") & 
                           ends_with("rsqr"))),
    R2_SE                = rowMeans(select(., starts_with("se.reg_pv") & 
                           ends_with("rsqr"))),
    R2_t                 = R2_Estimate / R2_SE
  )

make_reg_table <- function(row) {
  data.frame(
    Estimate    = c(row$Intercept_Estimate,
                    row$ESCS_Estimate,
                    row$R2_Estimate),
    `Std. Error` = c(row$Intercept_SE,
                     row$ESCS_SE,
                     row$R2_SE),
    `t value`    = c(row$Intercept_t,
                     row$ESCS_t,
                     row$R2_t),
    row.names = c("(Intercept)", "ESCS", "R-squared")
  )
}

final_table_03 <- setNames(
  lapply(1:nrow(Table_03), 
         function(i) make_reg_table(Table_03[i, ])),
  Table_03$Country
)


print(final_table_03)

5.5 Example 4 - Logistic Regression: The Impact of Family Socioeconomic Status on Student Achievement in Mathematics (Obtaining a Score of at least 626)


# To implement logistic regression, we first need to create a new binary 
# variable taking the value "1" or "0", depending on whether a given 
# student achieved a score equal to or higher than 626 points. A separate 
# variable is created for each PV imputation.

for (i in 1:10) {
  varname <- paste0("pv", i, "math")
  newvar  <- paste0("y", i)
  pisa_2022[[newvar]] <- ifelse(pisa_2022[[varname]] >= 626, 1, 0)
}

Analysis_04 <- Rrepest(
  data = pisa_2022,
  svy  = "PISA",
  est  = est("log", target = "y@", regressor = "escs"),
  n.pvs = 10,
  by   = "cnt"
)
print(Analysis_04, width = 200)
An Example Code for Improving the Readability of the Output Table

Table_04 <- Analysis_04 %>%
  transmute(
    Country   = cnt,
    Coef      = rowMeans(select(., starts_with("b.log")  & 
                ends_with("intercept"))),
    Std_Error = rowMeans(select(., starts_with("se.log") & 
                ends_with("intercept"))),
    t_value   = round(Coef / Std_Error, 2),
    ESCS      = rowMeans(select(., starts_with("b.log")  & 
                ends_with("escs"))),
    ESCS_SE   = rowMeans(select(., starts_with("se.log") & 
                ends_with("escs"))),
    ESCS_t    = round(ESCS / ESCS_SE, 2),
    OR        = round(exp(Coef), 2),
    CI95low   = round(exp(Coef - 1.96 * `Std_Error`), 2),
    CI95up    = round(exp(Coef + 1.96 * `Std_Error`), 2),
    OR_ESCS   = round(exp(ESCS), 2),
    CI95low_ESCS = round(exp(ESCS - 1.96 * `ESCS_SE`), 2),
    CI95up_ESCS  = round(exp(ESCS + 1.96 * `ESCS_SE`), 2)
  )

make_logreg_table <- function(row) {
  data.frame(
    Coef        = c(row$Coef, row$ESCS),
    `Std. Error`= c(row$Std_Error, row$ESCS_SE),
    `t value`   = c(row$t_value, row$ESCS_t),
    OR          = c(row$OR, row$OR_ESCS),
    CI95low     = c(row$CI95low, row$CI95low_ESCS),
    CI95up      = c(row$CI95up,  row$CI95up_ESCS),
    row.names = c("(Intercept)", "ESCS")
  )
}

final_table_04 <- setNames(lapply(1:nrow(Table_04), 
                           function(i) make_logreg_table(Table_04[i, ])),
                           Table_04$Country)

print(final_table_04)

6 TIMSS 2023 (Grade 4) Data Analysis

6.1 Step-by-Step Data Preparation

Preparing TIMSS data for analysis with the Rrepest() function is a multi-step process. This is due to both the structure of the data files and the specific methodological requirements of the TIMSS study.

6.1.1 Merging the Data

# Step 1: Specify the folder containing the SPSS files
dir_timss_2023 <- "C:/DATA/TIMSS2023_IDB_SPSS_G4/2_Data Files/SPSS Data"

# Step 2: Create lists of files for selected countries (Armenia, Australia, 
# Bahrain, Poland) and specific datasets:

# Student achievement files (ASA), containing test scores
timss_2023_asa_files <- list.files(dir_timss_2023,
  pattern = ".*ASA(POL|AUS|ARM|BHR).*\\.sav$", 
  ignore.case = TRUE, full.names = TRUE)
  
# Student context files (ASG), containing student questionnaire data
timss_2023_asg_files <- list.files(dir_timss_2023,
  pattern = ".*ASG(POL|AUS|ARM|BHR).*\\.sav$", 
  ignore.case = TRUE, full.names = TRUE)
  
# School context files (ACG), containing school questionnaire data
timss_2023_acg_files <- list.files(dir_timss_2023,
  pattern = ".*ACG(POL|AUS|ARM|BHR).*\\.sav$", 
  ignore.case = TRUE, full.names = TRUE)

# Home context files (ASH), containing parent questionnaire data
timss_2023_ash_files <- list.files(dir_timss_2023,
  pattern = ".*ASH(POL|AUS|ARM|BHR).*\\.sav$", 
  ignore.case = TRUE, full.names = TRUE)

# Step 3: Merge questionnaire data from selected countries
# using the previously created file lists

timss_2023_asa <- timss_2023_asa_files %>% map(read_sav) %>% 
bind_rows()
timss_2023_asg <- timss_2023_asg_files %>% map(read_sav) %>% 
bind_rows()
timss_2023_acg <- timss_2023_acg_files %>% map(read_sav) %>% 
bind_rows()
timss_2023_ash <- timss_2023_ash_files %>% map(read_sav) %>% 
bind_rows()
Important

When merging datasets from different questionnaires using the left_join() function, the "by" parameter should include all variables that are duplicated in the datasets being merged. Depending on the datasets, these may be variables such as school or student identifiers (e.g., “IDSCHOOL” and “IDSTUD”), but also variables containing PVs. Otherwise, the left_join() function will create duplicate columns for the overlapping variables, appending suffixes like .x and .y to the column names (e.g., “variable01.x”, “variable01.y”). This can lead to confusion and complicate your analysis later on.

# Step 4: Merging datasets from all questionnaires
timss_2023 <- left_join(
  timss_2023_asa,
  select(timss_2023_acg, -any_of(names(timss_2023_asa)), 
                                 CTY, 
                                 IDSCHOOL),
  by = c("CTY", "IDSCHOOL")
) %>% left_join(
  select(timss_2023_asg, -any_of(names(timss_2023_asa)), 
                                 CTY, 
                                 IDSCHOOL, 
                                 IDSTUD),
  by = c("CTY", "IDSCHOOL", "IDSTUD")
) %>% left_join(
  select(timss_2023_ash, -any_of(names(timss_2023_asa)), 
                                CTY, 
                                IDCNTRY, 
                                IDSCHOOL, 
                                IDSTUD),
  by = c("CTY", "IDCNTRY", "IDSCHOOL", "IDSTUD")
)

6.1.2 Creation of 250 replicate weights using the Jackknife Repeated Replication (JRR) method applied in the TIMSS 2023 study

# Step 5: Creation of 250 replicate weights using the JRR method
for (i in 1:125) {
  j <- 125 + i
  timss_2023 <- timss_2023 %>%
    mutate(!!paste0("jr", i) := case_when(
      JKZONE == i & JKREP == 1 ~ 2 * TOTWGT,
      JKZONE == i & JKREP == 0 ~ 0,
      TRUE ~ TOTWGT
    )) %>%
    mutate(!!paste0("jr", j) := case_when(
      JKZONE == i & JKREP == 0 ~ 2 * TOTWGT,
      JKZONE == i & JKREP == 1 ~ 0,
      TRUE ~ TOTWGT
    ))}

6.1.3 Converting variable names to lowercase

# Step 6: Convert variable names to lowercase
names(timss_2023) <- tolower(names(timss_2023))
Important

In TIMSS/PIRLS analyses you should use the custom weights parameter cm.weights = c("totwgt", paste0("jr", 1:250)), specifying the main weight totwgt and the previously created replicate weights (jr1, ..., jr250). To correctly calculate standard errors that account for the replicate weights, it is also necessary to specify the variance factor argument: var.factor = 125. The value of the variance factor depends on the sampling design applied. To ensure the validity of your analyses, make sure to review the documentation on the replicate weight construction method used in the given study cycle.

6.2 Example 1 – Mean of Mathematics Scores by Country and Gender

Analysis_01 <- Rrepest(
  data = timss_2023,
  svy  = "TIMSS",
  cm.weights = c("totwgt", paste0("jr", 1:250)),
  est  = est(c("mean", "std"), target = "asmmat0@"),
  by   = c("cty", "itsex"),
  var.factor = 125
)
print(Analysis_01)

6.3 Example 2 – Mean of Mathematics Scores by Country and Gender, Using the over Parameter and Showing the Difference in Scores Between Genders (Boys - Girls) in Each Country

Analysis_02 <- Rrepest(
  data = timss_2023,
  svy  = "TIMSS",
  cm.weights = c("totwgt", paste0("jr", 1:250)),
  est  = est("mean", target = "asmmat0@"),
  by   = "cty",
  over = "itsex",
  test = TRUE,
  var.factor = 125
)
print(Analysis_02, width = 200)

6.4 Example 3 – Frequency Table of Students’ Gender

Analysis_03 <- Rrepest(
  data = timss_2023,
  svy  = "TIMSS",
  cm.weights = c("totwgt", paste0("jr", 1:250)),
  est  = est("freq", "itsex"),
  by   = "cty",
  show_na = TRUE,
  var.factor = 125
)
print(Analysis_03)

6.5 Example 4 – Linear Regression: The Effect of Student Gender and Lack of Early Counting Activities on Students’ Mathematics Achievement

Analysis_04 <- Rrepest(
  data = timss_2023,
  svy  = "TIMSS",
  cm.weights = c("totwgt", paste0("jr", 1:250)),
  by   = "cty",
  est  = est("lm", target = "asmmat0@", regressor = c("itsex", "asbh01k")),
  var.factor = 125
)
print(Analysis_04)
An Example Code for Improving the Readability of the Output Table
Table_04 <- Analysis_04 %>%
  transmute(
    Country = cty,
    Intercept_Estimate = rowMeans(select(., starts_with("b.reg_asmmat") & 
                         ends_with("intercept"))),
    Intercept_SE       = rowMeans(select(., starts_with("se.reg_asmmat") & 
                         ends_with("intercept"))),
    Intercept_t        = Intercept_Estimate / Intercept_SE,
    ITSEX_Estimate     = rowMeans(select(., starts_with("b.reg_asmmat") & 
                         ends_with("itsex"))),
    ITSEX_SE           = rowMeans(select(., starts_with("se.reg_asmmat") & 
                         ends_with("itsex"))),
    ITSEX_t            = ITSEX_Estimate / ITSEX_SE,
    ASBH01K_Estimate   = rowMeans(select(., starts_with("b.reg_asmmat") & 
                         ends_with("asbh01k"))),
    ASBH01K_SE         = rowMeans(select(., starts_with("se.reg_asmmat") & 
                         ends_with("asbh01k"))),
    ASBH01K_t          = ASBH01K_Estimate / ASBH01K_SE,
    R2_Estimate        = rowMeans(select(., starts_with("b.reg_asmmat") & 
                         ends_with("rsqr"))),
    R2_SE              = rowMeans(select(., starts_with("se.reg_asmmat") & 
                          ends_with("rsqr"))),
    R2_t               = R2_Estimate / R2_SE
  )

make_reg_table <- function(row) {
  data.frame(
    Estimate     = c(row$Intercept_Estimate, 
                     row$ITSEX_Estimate, 
                     row$ASBH01K_Estimate, 
                     row$R2_Estimate),
    `Std. Error` = c(row$Intercept_SE,       
                     row$ITSEX_SE,       
                     row$ASBH01K_SE,
                     row$R2_SE),
    `t value`    = c(row$Intercept_t,
                     row$ITSEX_t,        
                     row$ASBH01K_t,
                     row$R2_t),
    row.names = c("(Intercept)", "ITSEX", "ASBH01K", "R-squared")
  )
}

final_table_04 <- setNames(lapply(1:nrow(Table_04), 
                           function(i) make_reg_table(Table_04[i, ])),
                           Table_04$Country)
                           
print(final_table_04)

6.6 Analyses of Mathematics Teachers’ Data (TIMSS 2023)

To analyze teacher data, we need to merge files with the prefix ATG (teacher context), containing data from the teacher questionnaire, with files prefixed AST (student-teacher linkage), because this dataset includes variables with main weights assigned to teachers (“TCHWGT” – all teachers, “MATWGT” – mathematics teachers, “SCIWGT” – science teachers).

6.6.1 Data Preparation and Construction of Replicate Weights for Teachers in Accordance with the Jackknife Repeated Replication (JRR) Method Used in the TIMSS 2023 Study

# Step 1: Create lists of files containing teacher and student-teacher linkage
# data for selected countries


# ATG files – teacher context 
# AST files - student-teacher linkage

timss_2023_atg_files <- list.files(dir_timss_2023,
                                   pattern = ".*ATG(POL|AUS|ARM|BHR).*\\.sav$", 
                                   ignore.case = TRUE, 
                                   full.names = TRUE)

timss_2023_ast_files <- list.files(dir_timss_2023,
                                   pattern = ".*AST(POL|AUS|ARM|BHR).*\\.sav$", 
                                   ignore.case = TRUE, 
                                   full.names = TRUE)

timss_2023_atg <- timss_2023_atg_files %>% map(read_sav) %>% bind_rows()
timss_2023_ast <- timss_2023_ast_files %>% map(read_sav) %>% bind_rows()

# Step 2: Merge the data from selected countries using the previously 
# created file lists

timss_2023_teacher <- left_join(
  timss_2023_atg,
  select(timss_2023_ast, -any_of(names(timss_2023_atg)), 
                                CTY, 
                                IDCNTRY, 
                                IDSCHOOL, 
                                IDTEACH, 
                                IDTEALIN),
  by = c("CTY", "IDCNTRY", "IDSCHOOL", "IDTEACH", "IDTEALIN")
)

# Step 3: Create 250 replicate weights with the JRR method, 
# using the main weight for mathematics teachers (MATWGT)

for (i in 1:125) {
  j <- 125 + i
  timss_2023_teacher <- timss_2023_teacher %>%
    mutate(!!paste0("jr", i) := case_when(
      JKZONE == i & JKREP == 1 ~ 2 * MATWGT,
      JKZONE == i & JKREP == 0 ~ 0,
      TRUE ~ MATWGT
    )) %>%
    mutate(!!paste0("jr", j) := case_when(
      JKZONE == i & JKREP == 0 ~ 2 * MATWGT,
      JKZONE == i & JKREP == 1 ~ 0,
      TRUE ~ MATWGT
    ))
}

# Step 4: Keep only mathematics teachers in the dataset for further analysis
# (remove all rows where the variable "MATWGT" has a missing value)

timss_2023_teacher <- timss_2023_teacher %>% filter(!is.na(MATWGT))

# Step 5: Convert variable names to lowercase
names(timss_2023_teacher) <- tolower(names(timss_2023_teacher))
Important

In analyzing mathematics teachers in the TIMSS 2023 study, you should use the custom weights parameter cm.weights = c("matwgt", paste0("jr", 1:250)), specifying the main weight matwgt and the previously created replicate weights (jr1, ..., jr250). To correctly calculate standard errors that account for the applied weights, you must also set the variance factor argument: var.factor = 125.

6.6.2 Example 5 – Percentage of Mathematics Teachers by Age Group

# recoding age into a categorical variable

timss_2023_teacher <- timss_2023_teacher %>%
  dplyr::mutate(
    atbg03 = factor(atbg03, levels = 1:6,
                    labels = c(
                    "Under 25",
                    "25–29",
                    "30–39",
                    "40–49",
                    "50–59",
                    "60 or more"))
  )

Analysis_05 <- Rrepest(
  data = timss_2023_teacher,
  svy  = "TIMSS",
  cm.weights = c("matwgt", paste0("jr", 1:250)),
  est  = est("freq", target = "atbg03"),
  by   = "cty",
  show_na = TRUE,
  var.factor = 125
)

print(Analysis_05)

6.6.3 Example 6 – Mean Years of Service for Mathematics Teachers

Analysis_06 <- Rrepest(
  data = timss_2023_teacher,
  svy  = "TIMSS",
  cm.weights = c("matwgt", paste0("jr", 1:250)),
  est  = est("mean", target = "atbg01"),
  by   = "cty",
  var.factor = 125
)
print(Analysis_06)

7 ICCS 2022 Data Analysis

Important

Some countries participating in the ICCS 2022 study conducted it for the first time using computer-based testing. In these countries, an additional component of the study was implemented in the form of paper-based questionnaires, the so-called “bridge study”. The data from the “bridge” sample are not included in the official ICCS 2022 reporting. Therefore, in our analyses, we exclude them by selecting only files with the suffix “C4” in their name.

7.1 Preparing the Data

# Step 1: Specify the folder containing the SPSS files
dir_iccs_2022 <- "C:/DATA/ICCS 2022/SPSS Data/ICCS_2022_G8"

# Step 2: Create lists of files for selected countries (Croatia, Poland, 
# Romania, Spain, Sweden) and specific datasets:

# Note: Exclude files from the "bridge" edition by selecting only files
# with the "C4" suffix in the filename


# Student Civic Knowledge files (ISA), containing scores relating 
# to civic competence

iccs_isa_files <- list.files(dir_iccs_2022,
  pattern = ".*ISA(POL|HRV|ESP|SWE|ROU)C4.*\\.sav$", 
  ignore.case = TRUE, full.names = TRUE)
  
# International Student Questionnaire files (ISG), containing student 
# questionnaire data

iccs_isg_files <- list.files(dir_iccs_2022,
  pattern = ".*ISG(POL|HRV|ESP|SWE|ROU)C4.*\\.sav$", 
  ignore.case = TRUE, full.names = TRUE)

# School Questionnaire files (ICG), containing school questionnaire data
iccs_icg_files <- list.files(dir_iccs_2022,
  pattern = ".*ICG(POL|HRV|ESP|SWE|ROU)C4.*\\.sav$", 
  ignore.case = TRUE, full.names = TRUE)

# Teacher Questionnaire files (ITG), containing teacher questionnaire data
iccs_itg_files <- list.files(dir_iccs_2022,
  pattern = ".*ITG(POL|HRV|ESP|SWE|ROU)C4.*\\.sav$", 
  ignore.case = TRUE, full.names = TRUE)

# Step 3: Merge questionnaire data from selected countries
# using the previously created file lists
iccs_2022_isa <- iccs_isa_files %>% map(read_sav) %>% bind_rows()
iccs_2022_isg <- iccs_isg_files %>% map(read_sav) %>% bind_rows()
iccs_2022_icg <- iccs_icg_files %>% map(read_sav) %>% bind_rows()
iccs_2022_itg <- iccs_itg_files %>% map(read_sav) %>% bind_rows()
Important

When merging datasets from the ICCS 2022 study containing student civic competence scores (files prefixed ISA) and student survey results (files prefixed ISG), we include plausible values, the main weight variable, and the student replicate weights) in the by argument of the left_join() function. This is necessary because these variables appear in both datasets.

# Step 4:  Merge student datasets (civic competence results and student 
# survey data) with the school questionnaire dataset

iccs_2022 <- left_join(
  iccs_2022_isa,
  iccs_2022_isg,
  by = c("COUNTRY", "IDSCHOOL", "IDSTUD", "std_GENDER",
         "TOTWGTS", paste0("PV", 1:5, "CIV"), paste0("SRWGT0", 1:9), 
         paste0("SRWGT", 10:75))
) %>% left_join(
  iccs_2022_icg,
  by = c("COUNTRY", "IDSCHOOL")
)
Important

The ICCS study allows us to treat the teacher dataset as independent from the student dataset and analyze teachers using the teacher weights.

# We do not merge the teacher data with other datasets because 
# it is not needed in our example analyses.
iccs_2022_teacher <- iccs_2022_itg

# Step5: Convert variable names to lowercase
names(iccs_2022) <- tolower(names(iccs_2022))
names(iccs_2022_teacher) <- tolower(names(iccs_2022_teacher))
Important

In the case of the ICCS study, the Rrepest() function expects that the replicate weight variables take names without leading zeros for replicates numbered from 1 to 9 (e.g., “srwgt1,…, srwgt9” or “trwgt1,…, trwgt9”). However, in the ICCS 2022 dataset provided by the IEA, the names of the replicate weight variables include leading zeros, appearing as “srwgt01,…, srwgt09”, “trwgt01,…, trwgt09”. These variables must be manually renamed for the function to work correctly.

# Step 6: Remove leading zeros from replicate weight variable names in the 
# student and teacher datasets
colnames(iccs_2022) <- gsub("^srwgt0([1-9])$", "srwgt\\1", colnames(iccs_2022))
colnames(iccs_2022_teacher) <- gsub(
                               "^trwgt0([1-9])$", 
                               "trwgt\\1", 
                               colnames(iccs_2022_teacher))

7.2 Example 1 – Mean Student Scores in Civic Knowledge and Understanding of Civic Issues, Broken Down by Gender

Analysis_01 <- Rrepest(
  data = iccs_2022,
  svy  = "ICCS",
  est  = est(c("mean", "std"), target = "pv@civ"),
  by   = c("country", "std_gender")
)
print(Analysis_01)

7.3 Example 2 - Linear Regression: The Effect Of Student Gender on Scores in Civic Knowledge and Understanding of Civic Issues

Analysis_02 <- Rrepest(
  data = iccs_2022,
  svy  = "ICCS",
  by   = "country",
  est  = est("lm", target = "pv@civ", regressor = c("std_gender"))
)
print(Analysis_02, width = 200)
An Example Code for Improving the Readability of the Output Table
Table_02 <- Analysis_02 %>%
  transmute(
    country,
    Intercept_Estimate = rowMeans(select(., starts_with("b.reg_pv@civ") & 
                         ends_with("intercept"))),
    Intercept_SE       = rowMeans(select(., starts_with("se.reg_pv@civ") & 
                         ends_with("intercept"))),
    Intercept_t        = Intercept_Estimate / Intercept_SE,
    gender_Estimate    = rowMeans(select(., starts_with("b.reg_pv@civ") & 
                         ends_with("std_gender"))),
    gender_SE          = rowMeans(select(., starts_with("se.reg_pv@civ") & 
                         ends_with("std_gender"))),
    gender_t           = gender_Estimate / gender_SE,
    R2_Estimate        = rowMeans(select(., starts_with("b.reg_pv@civ") & 
                         ends_with("rsqr"))),
    R2_SE              = rowMeans(select(., starts_with("se.reg_pv@civ") & 
                         ends_with("rsqr"))),
    R2_t               = R2_Estimate / R2_SE
  )

make_reg_table <- function(row) {
  data.frame(
    Estimate     = c(row$Intercept_Estimate, 
                    row$gender_Estimate, 
                    row$R2_Estimate),
    `Std. Error` = c(row$Intercept_SE,       
                     row$gender_SE,       
                     row$R2_SE),
    `t value`    = c(row$Intercept_t,        
                     row$gender_t,        
                     row$R2_t),
    row.names = c("(Intercept)", "gender", "R-squared")
  )
}

Table_02 <- setNames(lapply(1:nrow(Table_02), 
                     function(i) make_reg_table(Table_02[i, ])),
                     Table_02$country)

print(Table_02)

7.4 Example 3 – Student Participation in School Life According to School Principals

7.4.1 Example 3A - How Often Are Students Encouraged to Contribute to Activities Planning? Percentage of Students in Schools Where Principals Responded in a Specified Way

Analysis_03a <- Rrepest(
  data = iccs_2022,
  svy  = "ICCS",
  est  = est("freq", target = "ic4g03d"),
  by   = "country"
)
print(Analysis_03a, width = 200)

7.4.2 Example 3B – How Often Are Students Involved in Creating Rules and Regulations? Percentage of Students in Schools Where Principals Responded in a Specified Way

Analysis_03b <- Rrepest(
  data = iccs_2022,
  svy  = "ICCS",
  est  = est("freq", target = "ic4g03b"),
  by   = "country"
)
print(Analysis_03b, width = 200)

7.5 Example 4 - Teachers’ Responses to the Question: “How Important Are the Following Behaviors for Being a Good Citizen in Adult Life: Engaging in Political Discussions”

Important

In order for the Rrepest() function to automatically retrieve the weights assigned to teachers, set the argument ‘svy = “ICCS_T”’. If we set the argument ‘svy = “ICCS”’, it is necessary to specify the custom weights using the argument: cm.weights = c("totwgtt", paste0("trwgt", 1:75)).

Analysis_04 <- Rrepest(
  data = iccs_2022_teacher,
  svy  = "ICCS_T",
  est  = est("freq", target = "it4g17e"),
  by   = "country"
)
print(Analysis_04, width = 300)

8 TALIS 2018 Data Analysis

In the case of the TALIS study, analyses can be conducted at two independent levels:

  • TALISTCH – Teachers (default weights: TCHWGT as the main weight and 100 replicate weights TRWGT1, ..., TRWGT100).

  • TALISSCH – School principals/Schools (default weights: SCHWGT as the main weight and 100 replicate weights SRWGT1, ..., SRWGT100).

In order for the Rrepest() function to use the correct weight variables, it is necessary to set the svy argument correctly: svy = "TALISTCH" for analyses performed on the teachers’ dataset, or svy = "TALISSCH" for analyses performed on the school dataset (data from the school principals’ questionnaire).

8.1 Preparing the Data


# Step 1: Specify the folder containing the SPSS files
dir_talis_2018 <- "C:/DATA/TALIS 2018/SPSS Data/SPSS_2018_international"

# Step 2: Load the datasets:

# Teacher Data – Files with the Prefix BTG:
talis_2018_tch <- read_sav(file.path(dir_talis_2018, "BTGINTT3.sav"))

# School Data – Files with the Prefix BCG:
talis_2018_sch <- read_sav(file.path(dir_talis_2018, "BCGINTT3.sav"))

# Step 3: Select countries for the analysis (Colombia, Czech Republic,
# Japan, Kazakhstan) and filter the dataset

countries <- c("JPN", "CZE", "COL", "KAZ")

talis_2018_tch <- talis_2018_tch %>% filter(CNTRY %in% countries)
talis_2018_sch <- talis_2018_sch %>% filter(CNTRY %in% countries)

# Step 4: Convert variable names to lowercase
names(talis_2018_tch) <- tolower(names(talis_2018_tch))
names(talis_2018_sch) <- tolower(names(talis_2018_sch))

8.2 Analyses (TALISTCH – Teacher Level)

8.2.1 Example 1 - Frequency Table of Teachers’ Gender

Analysis_01 <- Rrepest(
  data = talis_2018_tch,
  svy  = "TALISTCH",
  est  = est("freq", target = "tt3g01"),
  by   = "cntry"
)
print(Analysis_01)

8.2.2 Example 2 - Mean Number of Hours Teachers Spend Weekly on Communication and Cooperation with Parents

Analysis_02 <- Rrepest(
  data = talis_2018_tch,
  svy  = "TALISTCH",
  est  = est(c("mean","std"), target = "tt3g18h"),
  by   = "cntry"
)
print(Analysis_02)

8.2.3 Example 3 - Mean Number of Hours Teachers Spend Weekly on Professional Development Activities

Analysis_03 <- Rrepest(
  data = talis_2018_tch,
  svy  = "TALISTCH",
  est  = est(c("mean","std"), target = "tt3g18g"),
  by   = "cntry"
)
print(Analysis_03)

8.2.4 Example 4 - Linear Regression: The Effect of the Opportunity to Actively Participate in School Decisions on Teachers’ Overall Job Satisfaction

Analysis_04 <- Rrepest(
  data = talis_2018_tch,
  svy  = "TALISTCH",
  by   = "cntry",
  est  = est("lm", target = "t3jobsa", regressor = c("tt3g48a"))
)
print(Analysis_04, width = 200)
An Example Code for Improving the Readability of the Output Table
Table_04 <- Analysis_04 %>%
  transmute(
    Country = cntry,
    Intercept_Estimate = rowMeans(select(., starts_with("b.reg_t3jobsa") & 
                         ends_with("intercept"))),
    Intercept_SE       = rowMeans(select(., starts_with("se.reg_t3jobsa") & 
                         ends_with("intercept"))),
    Intercept_t        = Intercept_Estimate / Intercept_SE,
    tt3g48a_Estimate   = rowMeans(select(., starts_with("b.reg_t3jobsa") & 
                         ends_with("tt3g48a"))),
    tt3g48a_SE         = rowMeans(select(., starts_with("se.reg_t3jobsa") & 
                         ends_with("tt3g48a"))),
    tt3g48a_t          = tt3g48a_Estimate / tt3g48a_SE,
    R2_Estimate        = rowMeans(select(., starts_with("b.reg_t3jobsa") & 
                         ends_with("rsqr"))),
    R2_SE              = rowMeans(select(., starts_with("se.reg_t3jobsa") & 
                         ends_with("rsqr"))),
    R2_t               = R2_Estimate / R2_SE
  )

make_reg_table <- function(row) {
  data.frame(
    Estimate     = c(row$Intercept_Estimate, 
                     row$tt3g48a_Estimate, 
                     row$R2_Estimate),
    `Std. Error` = c(row$Intercept_SE,
                     row$tt3g48a_SE,      
                     row$R2_SE),
    `t value`    = c(row$Intercept_t,        
                     row$tt3g48a_t,        
                     row$R2_t),
    row.names    = c("(Intercept)", "tt3g48a", "R-squared")
  )
}

Final_04 <- setNames(lapply(1:nrow(Table_04), 
                    function(i) make_reg_table(Table_04[i, ])), 
                    Table_04$Country)

print(Final_04)

8.3 Analyses (TALISSCH – School Level)

8.3.1 Example 5 - Mean Number of Years of Work Experience as a Principal at the Current School

Analysis_05 <- Rrepest(
  data = talis_2018_sch,
  svy  = "TALISSCH",
  est  = est(c("mean","std"), target = "tc3g04a"),
  by   = "cntry"
)
print(Analysis_05)

8.3.2 Example 6 - Percentage of School Principals Who Reported Collaborating with Teachers to Solve Classroom Discipline Problems

Analysis_06 <- Rrepest(
  data = talis_2018_sch,
  svy  = "TALISSCH",
  est  = est("freq", target = "tc3g22a"),
  by   = "cntry"
)
print(Analysis_06, width = 200)

9 Summary

The Rrepest package can be considered a convenient alternative to the IEA IDB Analyzer program for conducting analyses of data from international educational surveys, especially for those working primarily in the R environment. Rrepest enables the automation of many repetitive procedures, ensuring support for multiple studies and compliance of the conducted analyses with OECD and IEA research methodology. At the same time, it should be noted that the package does not cover the full range of analyses available, for example, in the IEA IDB Analyzer or in the intsvy package.

We encourage you to use the package in your own research projects, test its functionality in the analysis of data from international educational research, and to submit comments and improvement proposals directly to the package authors via the GitLab repository.

10 Bibliography

  • Avvisati, F., & Keslair, F. (2014). REPEST: Stata module to run estimations with weighted replicate samples and plausible values (Statistical Software Components S457918). Boston College Department of Economics. (Revised December 11, 2024).

  • Ilizaliturri, R., Avvisati, F., & Keslair, F. (2025). Rrepest: An analyzer of international large-scale assessments in education (Version 1.5.4) [R package]. CRAN. https://CRAN.R-project.org/package=Rrepest