Using Machine Learning to Predict Science Performance

Project Overview

The purpose of this project is to apply machine learning to identify the most important and potentially malleable predictors of science performance among a nationally representative sample of 15-year-old students in the United States, using data from the 2015 Program for International Student Assessment (PISA). Given the increasing importance of science education in a knowledge-based economy, the goal is to provide actionable insights for educators, policymakers, and researchers seeking to improve science education.

Data Source

This project used data from PISA 2015, a triennial international student assessment administered by the OECD. The dataset offers comprehensive information on student performance in science alongside rich contextual data on schools, teaching and learning processes, and student characteristics. The U.S. subsample consists of 5,712 students and is nationally representative of 15-year-old students. For those interested in reproducing this analysis, please consult the official ReadMe documentation and the accompanying Illustrative Merge Code provided by the National Center for Education Statistics (NCES).
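
As a starting point, the snippet below sketches how the U.S. subsample could be assembled in R. The file names follow the OECD PISA 2015 data release and the merge key is an assumption; both should be verified against the NCES ReadMe and Illustrative Merge Code.

```r
# Sketch of assembling the U.S. subsample. File names follow the OECD
# PISA 2015 release; verify paths and merge keys against the NCES ReadMe.
library(haven)   # read_sav() reads the SPSS (.sav) release files
library(dplyr)

stu <- read_sav("CY6_MS_CMB_STU_QQQ.sav")    # student questionnaire file
sch <- read_sav("CY6_MS_CMB_SCH_QQQ.sav")    # school questionnaire file

usa <- stu %>%
  filter(CNT == "USA") %>%                   # keep the U.S. subsample
  left_join(sch, by = c("CNT", "CNTSCHID"))  # merge on country and school ID
```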

Data Preparation

A total of 91 variables were selected for analysis. The selection process was guided by findings from large-scale reviews that emphasize the roles of motivation (Zhang & Bae, 2020), student characteristics (Kyriakides et al., 2018), teacher and school leadership practices (Hitt & Tucker, 2016), and instructional quality (Hattie, 2012). Moreover, Bronfenbrenner’s bioecological model of human development was used as a framework (Bronfenbrenner & Morris, 2007). A detailed description of all variables is available in the PISA 2015 Assessment and Analytical Framework.

Data Cleaning

The amount of missing data directly affects the validity of statistical conclusions. Although there is no universally agreed-upon standard in the literature for acceptable levels of missingness, Bennett (2001) suggests that analyses can become biased when more than 10% of the data are missing. Twenty-four variables with 100% missing data (those not administered to students in the United States) were excluded from the dataset. Initial inspection of the data indicated that certain factor variables were represented as numeric; these were recoded as factors. In addition, the FRPL (free or reduced-price lunch) variable was re-categorized to align with classifications commonly used in educational research, following National Center for Education Statistics (NCES) guidelines.
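
The sketch below illustrates these cleaning steps in R. The factor variable names and the FRPL thresholds are assumptions for illustration; the thresholds mirror the poverty bands NCES commonly reports.

```r
# Sketch of the cleaning steps. The factor variables and FRPL thresholds
# below are illustrative assumptions, not the project's exact choices.
library(dplyr)

miss_rate <- colMeans(is.na(usa))        # proportion missing per column
usa <- usa[, miss_rate < 1]              # drop the 24 all-missing variables

# Recode variables stored as numeric codes back into factors
factor_vars <- c("ST004D01T", "REPEAT")  # assumed examples (gender, repetition)
usa <- usa %>% mutate(across(all_of(factor_vars), as.factor))

# Re-bin FRPL into the poverty bands NCES commonly reports (assumed cutoffs)
usa$FRPL_cat <- cut(usa$FRPL, breaks = c(-Inf, 25, 50, 75, Inf),
                    labels = c("Low", "Mid-low", "Mid-high", "High"))
```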

The bar plots below show that the majority of variables have low levels of missing data, with most features missing less than 5% of observations.

Data Structure

A review of variable types revealed that 89.8% of the columns are continuous (numeric) and 10.2% are discrete (categorical). The plot below provides an interactive overview of variable types. Hovering over nodes reveals key metadata.

Data Exploration

Descriptive statistics revealed that most continuous variables are approximately normally distributed, with several showing mild to moderate skewness.

The histograms below provide a visual summary of the distribution of selected variables in the dataset.

The two correlation heatmaps below provide an overview of the relationships among the predictor variables (features) and science performance; warmer colors indicate stronger associations, while cooler tones reflect weaker ones. The first heatmap (Set 1) shows that most of the correlations are positive but modest in strength. The second heatmap (Set 2) highlights highly intercorrelated school-level variables, but their direct correlations with science performance appear relatively weak.
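
A hedged sketch of how such a heatmap can be produced is shown below; the corrplot package is an assumption (any heatmap tool works), and the split into Set 1 and Set 2 is omitted for brevity.

```r
# Sketch of one heatmap: Pearson correlations among the numeric variables,
# drawn with corrplot (an assumed package choice).
library(corrplot)

num_vars <- usa[sapply(usa, is.numeric)]
corr_mat <- cor(num_vars, use = "pairwise.complete.obs")
corrplot(corr_mat, method = "color", order = "hclust", tl.cex = 0.6)
```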

Data Splitting

The dataset was partitioned into training (n = 4,000) and test (n = 1,712) sets. The histogram below shows that the split preserved the distribution of science scores across both subsets. To prevent data leakage, all preprocessing steps were applied independently to each dataset.
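
A minimal sketch of the split using caret's stratified partitioning follows; the seed is an assumption, and `science` stands in for the science score column (e.g., a plausible value such as PV1SCIE).

```r
# Sketch of a stratified split with caret; p is chosen to yield roughly
# 4,000 training cases out of 5,712. The seed is an assumption.
library(caret)
set.seed(2015)

idx   <- createDataPartition(usa$science, p = 4000 / nrow(usa), list = FALSE)
train <- usa[idx, ]
test  <- usa[-idx, ]
```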

Feature Engineering: Training Data

To prepare the training data for modeling, a series of feature engineering steps were applied (see the code sketch after this list):

  • Missing values were imputed using the missForest package.
  • All numeric predictors were centered and scaled.
  • Categorical variables were transformed using one-hot encoding.
  • Features with near-zero variance were removed.
  • Highly correlated features were filtered out.
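
A hedged sketch of this pipeline, assuming the missForest and caret packages; the 0.90 correlation cutoff and other settings are illustrative, not the project's exact values.

```r
library(missForest)
library(caret)

# 1. Random-forest imputation of missing values
train_imp <- missForest(as.data.frame(train))$ximp
y_train   <- train_imp$science                      # outcome on original scale
x_train   <- subset(train_imp, select = -science)

# 2-3. One-hot encode factors, then center and scale
dmy      <- dummyVars(~ ., data = x_train, fullRank = TRUE)
x_ohe    <- as.data.frame(predict(dmy, x_train))
pp       <- preProcess(x_ohe, method = c("center", "scale"))
train_pp <- predict(pp, x_ohe)

# 4. Drop near-zero-variance features
nzv <- nearZeroVar(train_pp)
if (length(nzv) > 0) train_pp <- train_pp[, -nzv]

# 5. Filter highly correlated features (cutoff is an assumption)
high_cor <- findCorrelation(cor(train_pp), cutoff = 0.90)
if (length(high_cor) > 0) train_pp <- train_pp[, -high_cor]
```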

Feature Engineering: Testing Data

The same preprocessing steps applied to the training data were replicated on the testing data. As noted above, this was done to prevent data leakage.
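
One leak-free way to realize this is sketched below, reusing the encoder and scaler fitted on the training data in the previous step; the separate imputation of the test set is an assumption about the workflow.

```r
# Sketch: impute the test set separately, then reuse the encoder and scaler
# fitted on the training data so no test information leaks into training.
test_imp <- missForest(as.data.frame(test))$ximp
y_test   <- test_imp$science
x_test   <- subset(test_imp, select = -science)

test_pp <- predict(pp, as.data.frame(predict(dmy, x_test)))
test_pp <- test_pp[, colnames(train_pp)]   # keep the same filtered columns
```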

Model Training

Five machine learning models were trained on the preprocessed training dataset to predict science performance: Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), Gradient Boosting Machine (GBM), and Artificial Neural Network (ANN). Each model was evaluated on the test set using Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R-squared (R²), and Mean Squared Log Error (MSLE).
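
The four metrics can be computed as sketched below; caret supplies RMSE, MAE, and R², while MSLE is computed directly (the log1p formulation is an assumption about how MSLE was scored).

```r
# Sketch of the evaluation metrics for a vector of predictions and the
# observed test scores.
library(caret)

eval_metrics <- function(pred, obs) {
  c(RMSE = RMSE(pred, obs),
    MAE  = MAE(pred, obs),
    R2   = R2(pred, obs),
    MSLE = mean((log1p(pred) - log1p(obs))^2))  # log1p guards against zeros
}
```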

Decision Tree

The decision tree below shows that science scores are higher among students with strong epistemological beliefs, higher socioeconomic status (ESCS), and positive views on collaboration (CPSVALUE). Lower scores are linked to low study time (SMINS), grade repetition, and poor disciplinary climate.
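
A sketch of how such a tree can be fit and drawn with rpart and rpart.plot follows; the package choice and cp value are assumptions, and EPIST, ESCS, CPSVALUE, and SMINS are the PISA 2015 index names referenced above.

```r
# Sketch of fitting and drawing the regression tree; cp is an assumption.
library(rpart)
library(rpart.plot)

dt_fit <- rpart(science ~ ., data = cbind(train_pp, science = y_train),
                method = "anova", control = rpart.control(cp = 0.01))
rpart.plot(dt_fit, type = 2, extra = 101)  # node means, n, and % of sample
```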

Random Forest

This plot illustrates the relationship between the number of trees in the random forest model and the corresponding mean squared error (MSE). As the number of trees increases, the MSE declines sharply at first and then gradually stabilizes. This indicates improved model performance with more trees. Beyond approximately 150 trees, the error rate levels off, suggesting that additional trees provide minimal gains in predictive accuracy.
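
A sketch of how this curve can be generated with the randomForest package, whose plot method draws the cumulative out-of-bag MSE against the number of trees; the ntree value is an assumption.

```r
# Sketch of the error curve for a regression forest.
library(randomForest)

rf_fit <- randomForest(x = train_pp, y = y_train, ntree = 500)
plot(rf_fit, main = "OOB MSE vs. number of trees")
```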

Support Vector Machines

This plot displays the relationship between the cost parameter (C) and the Root Mean Squared Error (RMSE). The model achieves the lowest RMSE at a cost value of approximately 0.5, indicating optimal predictive performance; increasing the cost beyond this point results in higher RMSE. This analysis informed the selection of a moderate cost value that balances bias and variance.
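
A sketch of the cost search with caret's radial-basis SVM follows; the grid values and the fixed kernel width (sigma) are assumptions, not the project's exact settings.

```r
# Sketch of tuning the cost parameter with 5-fold cross-validation.
library(caret)

svm_fit <- train(x = train_pp, y = y_train, method = "svmRadial",
                 trControl = trainControl(method = "cv", number = 5),
                 tuneGrid = expand.grid(C = c(0.1, 0.25, 0.5, 1, 2, 4),
                                        sigma = 0.01))
plot(svm_fit)  # RMSE as a function of the cost parameter C
```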

Gradient Boosting Machine

The plot displays the squared error loss versus the number of iterations. The black line represents the training error, while the green line shows the validation error. Both errors decrease rapidly at first, but the validation error plateaus after around 200 iterations, while the training error continues to decline. This indicates the point at which additional iterations yield diminishing returns on validation performance.
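
A sketch of how these curves can be produced with the gbm package is below; train.fraction holds out part of the training data as a validation set, and all hyperparameter values shown are assumptions.

```r
# Sketch of the loss curves; gbm.perf() overlays the training and
# validation error and returns the best iteration.
library(gbm)

gbm_fit <- gbm(science ~ ., data = cbind(train_pp, science = y_train),
               distribution = "gaussian", n.trees = 1000,
               interaction.depth = 3, shrinkage = 0.05, train.fraction = 0.8)
best_iter <- gbm.perf(gbm_fit, method = "test")  # plots both error curves
```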

Artificial Neural Network

The neural network diagram below shows a multilayer perceptron model with 10 input variables (plotting all features would create a cluttered, less interpretable diagram), 5 hidden nodes, and one output node. The thickness of the connections represents the weight magnitude: thicker lines indicate stronger influence. Two bias nodes (B1 for the hidden layer and B2 for the output layer) are also included.
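
A sketch of how such a diagram can be drawn with nnet and NeuralNetTools follows; the ten inputs listed are an illustrative subset of PISA indices, not necessarily the ten shown in the figure, and the size and decay values are assumptions.

```r
# Sketch of fitting a small MLP and plotting its architecture.
library(nnet)
library(NeuralNetTools)

inputs <- c("EPIST", "ESCS", "CPSVALUE", "SMINS", "TMINS",
            "DISCLISCI", "JOYSCIE", "SCIEEFF", "INSTSCIE", "EMOSUPS")
nn_fit <- nnet(x = train_pp[, inputs], y = y_train, size = 5,
               linout = TRUE, decay = 0.1, maxit = 500)
plotnet(nn_fit)  # line width tracks weight magnitude; B1/B2 are bias nodes
```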

Model Assessment

The table below shows that the GBM model achieved the lowest RMSE and MSLE, along with the highest R². RF and SVM also performed well, whereas DT and ANN, while interpretable and flexible respectively, exhibited higher prediction errors and explained less variance in science performance.

Hyperparameter Tuning

To optimize the performance of the five machine learning algorithms, hyperparameters were fine-tuned using grid search with 5-fold cross-validation, as implemented in the caret package. The following hyperparameters were optimized for each model:

  • DT: the complexity parameter (cp)
  • RF: the number of variables randomly sampled at each split (mtry)
  • SVM: the cost (C) and kernel width (sigma)
  • GBM: the number of trees, interaction depth, learning rate, and minimum observations per node
  • ANN: the number of hidden units and weight decay
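
As an illustration, the sketch below shows the grid search for the GBM; the grid values are assumptions, and the other four models follow the same train() pattern with their own tuning grids.

```r
# Sketch of the caret grid search for the GBM.
library(caret)

ctrl <- trainControl(method = "cv", number = 5)
gbm_grid <- expand.grid(n.trees = c(200, 500, 1000),
                        interaction.depth = c(2, 3, 5),
                        shrinkage = c(0.01, 0.05, 0.1),
                        n.minobsinnode = c(10, 20))
gbm_tuned <- train(x = train_pp, y = y_train, method = "gbm",
                   trControl = ctrl, tuneGrid = gbm_grid, verbose = FALSE)
gbm_tuned$bestTune  # combination selected by cross-validated RMSE
```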

After tuning, GBM emerged as the best-performing model (see table below). However, the gain in predictive accuracy over the baseline was modest: R² improved from 0.496 to 0.504, a relative increase of approximately 1.6% in explained variance.

The plot below shows the top 20 most influential predictors based on the final GBM model.
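
A sketch of how this plot can be produced from the tuned caret model follows; varImp() wraps gbm's relative-influence measure.

```r
# Sketch: plot the 20 most influential predictors of the tuned GBM.
plot(varImp(gbm_tuned), top = 20)
```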

Summary and Insights

This study compared the performance of five machine learning algorithms (DT, RF, SVM, GBM, and ANN) in predicting science performance among U.S. 15-year-old students using the PISA 2015 dataset. The results revealed several important insights:

  • GBM emerged as the best-performing model, with the lowest RMSE and the highest R² among all models.
  • Random Forest and SVM also showed competitive performance, with moderate RMSE and solid R² values.
  • Decision Tree and ANN underperformed relative to other models.
  • 90% of the top predictors identified are malleable educational or psychological factors. Thus, the results provide actionable guidance for educators and policymakers aiming to boost science outcomes through targeted interventions.
  • Overall, these findings reinforce the value of using advanced machine learning techniques in educational research.

References

  • Bronfenbrenner, U., & Morris, P. A. (2007). The bioecological model of human development. In Handbook of child psychology (Vol. 1). Wiley. https://doi.org/10.1002/9780470147658.chpsy0114

  • Hattie, J. (2012). Visible learning for teachers: Maximizing impact on learning. Routledge. https://doi.org/10.4324/9780203181522

  • Hitt, D. H., & Tucker, P. D. (2016). Systematic review of key leader practices found to influence student achievement: A unified framework. Review of Educational Research, 86(2), 531–569. https://doi.org/10.3102/0034654315614911

  • Kyriakides, L., Creemers, B., & Charalambous, E. (2018). The impact of student characteristics on student achievement: A review of the literature. In Equity and quality dimensions in educational effectiveness (Policy Implications of Research in Education, Vol. 8). Springer. https://doi.org/10.1007/978-3-319-72066-1_2

  • Zhang, F., & Bae, C. L. (2020). Motivational factors that influence student science achievement: A systematic literature review of TIMSS studies. International Journal of Science Education, 42(17), 2921–2944. https://doi.org/10.1080/09500693.2020.1843083
