NCES Data R Project - EdSurvey

Due to the scope and complexity of National Center for Education Statistics (NCES) datasets, researchers often must use several different software tools to access, clean, and analyze the data. In response, AIR developed specialized tools to streamline the process. These tools take advantage of advances in computing and reflect the trend in higher education away from commercial statistical software and toward open-source packages.

One of these tools, EdSurvey, is an R statistical package tailored to processing and analyzing large-scale education data efficiently, using procedures that account for the complex sample survey design and the use of plausible values.
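To make the idea of plausible values concrete, the following base R sketch pools a statistic across plausible values using the standard multiple-imputation combining rules. All numbers are invented for illustration, and this is a generic sketch rather than EdSurvey's exact variance estimator, which also handles replicate weights and other design details.

```r
# Hypothetical estimates of a mean score, one per plausible value
# (illustrative numbers only, not real NCES results).
est <- c(275.1, 274.6, 275.4, 274.9, 275.0)

# Hypothetical sampling variances of those estimates, e.g. from a
# replication method such as the jackknife.
samp_var <- c(1.21, 1.18, 1.25, 1.20, 1.22)

M <- length(est)

# Point estimate: average over the plausible values.
pooled_est <- mean(est)

# Sampling (within) variance: average of the per-value sampling variances.
within_var <- mean(samp_var)

# Imputation (between) variance: variance of the estimates across values.
between_var <- var(est)

# Total variance combines both sources of uncertainty.
total_var <- within_var + (1 + 1 / M) * between_var
pooled_se <- sqrt(total_var)

print(c(estimate = pooled_est, se = pooled_se))
```

The point of the sketch is that a single plausible value understates uncertainty: the between-value term adds the measurement error that comes from not observing proficiency directly.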


The Challenge

Analyzing NCES data was costly, complex, inaccessible, and burdensome for researchers, requiring access to expensive and heavily programming-based software packages. Researchers tended to spend more time and effort loading and cleaning the data than analyzing the data itself.

Our Role

NCES asked AIR to create an affordable set of tools that would streamline the multi-step data preparation process, scale to the changing needs of the research community, and ensure the data were accessible and reusable. Our researchers developed—and continue to maintain—a complete open-source solutions infrastructure consisting of primary and supplementary packages built in R:

  • EdSurvey: A one-stop shop for downloading, processing, manipulating, and analyzing survey data. EdSurvey is an R statistical package tailored to connect seamlessly with NCES data and to perform analyses that account for complex sample survey design and the use of plausible values. Supplementary packages include:
    • Dire: Used for direct estimation, a latent regression approach that estimates student proficiencies directly through a marginal maximum likelihood (MML) algorithm (Cohen & Jiang, 1999), conditioning on student item performance and contextual variables.
    • WeMix: Used for multilevel modeling of large-scale data with weights and plausible values.
    • wCorr: Used for weighted and unweighted correlations, including Pearson, Spearman, polychoric, and polyserial correlation.
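
As a small illustration of the supplementary packages, the sketch below computes unweighted and weighted correlations with wCorr's `weightedCorr` function. The data and weights are simulated here purely for demonstration (the variable names `x`, `y`, and `w` are hypothetical), and the wCorr package is assumed to be installed.

```r
library(wCorr)

# Simulated data for illustration only: two related numeric variables
# and a vector of positive sampling weights.
set.seed(1)
x <- rnorm(200)
y <- 0.6 * x + rnorm(200, sd = 0.8)
w <- runif(200, min = 0.5, max = 2)

# Unweighted Pearson correlation (weights default to 1 for every case).
r_unweighted <- weightedCorr(x, y, method = "Pearson")

# Weighted Pearson and Spearman correlations.
r_weighted <- weightedCorr(x, y, method = "Pearson", weights = w)
rho_weighted <- weightedCorr(x, y, method = "Spearman", weights = w)

print(c(unweighted = r_unweighted,
        weighted = r_weighted,
        spearman = rho_weighted))
```

With unit weights, the Pearson result should agree with base R's `cor(x, y)`, which is a quick way to sanity-check a weighted analysis.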

EdSurvey Version 3.1 supports data from the following sources:

  • National Assessment of Educational Progress (NAEP) - up to 2022 NAEP
  • NAEP Long Term Trend (LTT) - up to 2022 LTT
  • Trends in International Mathematics and Science Study (TIMSS) and TIMSS Advanced - up to 2019 TIMSS
  • Progress in International Reading Literacy Study (PIRLS) and ePIRLS - up to 2016 PIRLS
  • International Computer and Information Literacy Study (ICILS) - up to 2018 ICILS
  • International Civic and Citizenship Education Study (ICCS) - up to 2016 ICCS
  • 1999 Civic Education Study (CivEd)
  • Programme for International Student Assessment (PISA) - up to 2018 PISA
  • PISA Young Adult Follow-up Study
  • Programme for the International Assessment of Adult Competencies (PIAAC) - up to Cycle 1 - Rounds 1 to 3 (2017)
  • Teaching and Learning International Survey (TALIS) - up to 2018
  • Early Childhood Longitudinal Studies (ECLS-K: 1998, ECLS-K: 2011, ECLS-B)
  • Education Longitudinal Study of 2002 (ELS)
  • High School Longitudinal Study of 2009 (HSLS)
  • Beginning Teacher Longitudinal Study (BTLS)
  • Baccalaureate and Beyond Longitudinal Study (B&B)
  • The Beginning Postsecondary Students Longitudinal Study (BPS)
  • High School and Beyond (HS&B)
  • National Longitudinal Study of the High School Class of 1972 (NLS72)
  • The National Household Education Surveys Program (NHES)
  • The School Survey on Crime and Safety (SSOCS)

Outcome

AIR’s development of EdSurvey and ancillary R tools has supported NCES by bringing innovation to the analysis of complex data structures and has broadened the user base by expanding the awareness of these datasets to new audiences.

In 2018, Peggy Carr, then associate commissioner and now NCES Commissioner, stated that EdSurvey “represents a huge technological innovation in the analysis of survey data collected using a complex sample design” and that the “availability of EdSurvey substantially reduces the burden on the individual researcher” to analyze data collected with a complex sample design.

Since 2014, AIR has released six R packages, with over 100,000 downloads from CRAN. The team has also been nominated three times for the Bradley Hanson Award, a national-level award for contributions to educational measurement.

Considerations for Diversity, Equity, Inclusion, and Accessibility

EdSurvey and its ancillary R tools were intentionally designed and developed as an open-source suite to provide sophisticated analysis and visualization software at no cost to the end user, thus providing equitable access to NCES large-scale assessment data tools to a broader and more diverse audience.

Additional Resources

Installing and Loading EdSurvey

Unless you already have R version 3.5.0 or later, install the latest R version. Users also may want to install RStudio desktop, which has an interface that many find easier to follow.

Inside R, run the following command to install EdSurvey as well as its package dependencies:

install.packages("EdSurvey")

Once the package is successfully installed, EdSurvey can be loaded with the following command:

library(EdSurvey)
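
As a first step after loading the package, the following sketch reads the NAEP Primer sample file and explores its contents. It assumes the NAEPprimer data package was installed alongside EdSurvey as a dependency; the object names `naep_path` and `sdf` are arbitrary choices for this example.

```r
library(EdSurvey)

# Locate the NAEP Primer sample file distributed in the NAEPprimer package.
naep_path <- system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer")

# Read the file into an edsurvey.data.frame.
sdf <- readNAEP(naep_path)

# List the plausible-value scales and the weights available in the data.
showPlausibleValues(data = sdf)
showWeights(data = sdf)

# Search the codebook for variables whose name or label mentions "book".
searchSDF(string = "book", data = sdf)
```

The same pattern applies to the other supported studies, each of which has its own read function, although the file paths and arguments differ by study.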
 


Key Functions

The key functions of EdSurvey Version 3.1 include:

  • analysis of achievement levels and benchmarks for NAEP and international assessment data;
  • correlations, including Pearson, Spearman, polyserial, polychoric, and correlation between plausible values, with or without weights applied;
  • data exploration, including methods to better understand survey attributes and search for variables and levels in codebooks;
  • data manipulation, such as the subsetting and merging of data, as well as renaming and recoding variables;
  • data processing, including downloading publicly available data and reading data in R;
  • direct estimation, an alternative to the plausible values approach that estimates student scale scores using marginal maximum likelihood regression;
  • drawing plausible values, which enables users to merge new data onto NCES's existing data and then fit a marginal maximum likelihood model directly. Users can run any EdSurvey analytical function (e.g., summary tables, regressions) with the new plausible values, which expands how users can work with NCES data after merging in information not collected on the existing surveys;
  • gap analysis that compares the average, percentile, achievement level, or percentage of survey responses between two groups that potentially share members;
  • linear regression with or without plausible values as the dependent variable;
  • logistic regression that allows either a discrete variable or dichotomized plausible values as the dependent variable;
  • multilevel models that use weights at multiple levels and allow plausible values in the dependent variable;
  • multivariate regression that extends multiple linear regression to include models with multiple outcome variables;
  • percentile that calculates the percentiles of a numeric variable or plausible values;
  • NAEP linking error method that incorporates linking error in variance estimation for NAEP assessments during the transition year from paper-based to digitally based assessment;
  • quantile regression that fits a quantile regression model that uses weights and variance estimates appropriate for the data;
  • weight suggestion that assists researchers in deciding which weight to use for their analyses with ECLS-K:2011 data; and
  • summary statistics, including unweighted and weighted totals, conditional means, and the percentage of respondents in a category (conditional on an ancillary categorical variable or on the interactions of an arbitrary number of categorical variables), as well as estimation of scale score means based on plausible values.
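
A few of the functions listed above can be sketched with the NAEP Primer sample data. This example assumes EdSurvey and its NAEPprimer dependency are installed; the variables `dsex`, `b017451`, and `composite` come from the Primer file, and the object names are arbitrary.

```r
library(EdSurvey)

# Read the NAEP Primer sample file into an edsurvey.data.frame.
sdf <- readNAEP(system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))

# Summary statistics: weighted percentages and mean composite score by sex.
tab <- edsurveyTable(formula = composite ~ dsex, data = sdf)
print(tab)

# Linear regression with plausible values as the dependent variable.
fit <- lm.sdf(formula = composite ~ dsex + b017451, data = sdf)
summary(fit)

# Percentiles of the composite scale score.
pct <- percentile(variable = "composite", percentiles = c(25, 50, 75), data = sdf)
print(pct)

# Gap analysis: difference in mean composite score between two groups.
score_gap <- gap(variable = "composite", data = sdf,
                 groupA = dsex == "Male", groupB = dsex == "Female")
print(score_gap)
```

Each call uses the default weight and variance estimation method attached to the data, so the design-based adjustments do not need to be specified by hand.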

As the development of EdSurvey progresses, several additional functions, such as IRT and other statistical methods, will be added to the package.
 


Technical Papers

EdSurvey User's Guide

Analyzing NCES Data Using EdSurvey: A User's Guide is the first manual dedicated to introducing this R package to the education research community.

Book and Journal Publication

Bailey, P., Lee, M., Nguyen, T., & Zhang, T. (2020). Using EdSurvey to Analyse PIAAC Data. In Large-Scale Cognitive Assessment (pp. 209-237). Springer, Cham.

Data Set Specific Overviews

Documents that describe the analysis of specific survey data in the EdSurvey package include the following:

  • Using EdSurvey to Analyze ECLS-K:2011 Data (PDF) describes methods for analyzing NCES longitudinal data, using ECLS-K:2011 data in its examples. The vignette covers topics including preparing the R environment, downloading and processing the data, exploring and manipulating data, and running statistical analyses such as summary tables, correlations, and regression models.
  • Using EdSurvey to Analyze NCES Data: An Illustration of Analyzing NAEP Primer (PDF) describes the basics of using the EdSurvey package for analysis of NAEP data. This vignette covers an introduction to the EdSurvey package with topics such as preparing the R environment for processing, creating summary tables, calculating percentiles and achievement levels, running correlations, linear regression and logistic regression, and conducting gap analysis.
  • Using EdSurvey to Analyze TIMSS Data (PDF) describes the methods used in analysis of large-scale educational assessment programs such as Trends in International Mathematics and Science Study (TIMSS) using the EdSurvey package. The vignette covers topics such as preparing the R environment for processing, creating summary tables, running linear regression models, and correlating variables.
  • Using EdSurvey to Analyze NAEP Data With and Without Accommodations (PDF) provides an overview of the use of NAEP data with accommodations and describes methods used to analyze these data.

Task Specific Walkthroughs

Documents providing an overview of functions developed in the EdSurvey package include the following:

Methodology Resources

Documents that describe the statistical methodology used in the EdSurvey package include the following:


Contact and Bug Reports

Please report bugs and other issues on our GitHub repository at https://github.com/American-Institutes-for-Research/EdSurvey/issues.