Machine learning for automated content analysis: characteristics of training data impact reliability

Author

R. Fussell
A. Mazrui
N.G. Holmes

Abstract

Natural language processing (NLP) has the capacity to increase the scale and efficiency of content analysis in Physics Education Research. One promise of this approach is the possibility of implementing coding schemes on large data sets taken from diverse contexts. Applying NLP has two main challenges, however. First, a large initial human-coded data set is needed for training, though it is not immediately clear how much training data are needed. Second, if new data are taken from a different context from the training data, automated coding may be impacted in unpredictable ways. In this study, we investigate the conditions necessary to address these two challenges for a survey question that probes students’ perspectives on the reliability of physics experimental results. We use neural networks in conjunction with Bag of Words embedding to perform automated coding of student responses for two binary codes, meaning each code is either present or absent in a response. We find that i) substantial agreement is consistently achieved for our data when the training set exceeds 600 responses, with 80-100 responses containing each code and ii) it is possible to perform automated coding using training data from a disparate context, but variation in code frequencies (outcome balances) across specific contexts can affect the reliability of coding. We offer suggestions for best practices in automated coding. Other smaller-scale investigations across a diverse range of coding scheme types and data contexts are needed to develop generalized principles. © 2022, American Association of Physics Teachers. All rights reserved.
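The abstract describes coding student responses with a neural network on top of a bag-of-words embedding and checking agreement against human coders. A minimal sketch of that kind of pipeline in Python is given below; the file name, column names, and hyperparameters are illustrative assumptions, not the authors' implementation, and the paper's actual model and evaluation may differ.

# Hypothetical sketch: bag-of-words + neural network for coding one binary code.
# Data file, column names, and hyperparameters are assumptions for illustration,
# not the setup used in the paper.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import cohen_kappa_score

# Human-coded responses: free-text answer plus a binary code (1 = code present).
df = pd.read_csv("coded_responses.csv")  # assumed columns: "response", "code_present"

X_train, X_test, y_train, y_test = train_test_split(
    df["response"], df["code_present"],
    test_size=0.2, random_state=0,
    stratify=df["code_present"],  # keep the code frequency (outcome balance) in both splits
)

# Bag-of-words embedding: each response becomes a vector of word counts.
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Small feed-forward neural network classifier for the binary code.
clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
clf.fit(X_train_bow, y_train)

# Compare machine codes with held-out human codes via Cohen's kappa;
# kappa between 0.61 and 0.80 is conventionally read as "substantial agreement".
kappa = cohen_kappa_score(y_test, clf.predict(X_test_bow))
print(f"Cohen's kappa (machine vs. human): {kappa:.2f}")

Stratifying the split preserves the code frequency (outcome balance) in the training and test sets, which is relevant because the abstract notes that variation in outcome balances across contexts can affect coding reliability.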

Date Published

2022
Conference Name

Physics Education Research Conference (PERC) 2022
URL

https://www.scopus.com/inward/record.uri?eid=2-s2.0-85140434478&doi=10.1119%2fperc.2022.pr.Fussell&partnerID=40&md5=1b7973aa5305906b4407acca498297b2

DOI

10.1119/perc.2022.pr.Fussell

Group (Lab)

Natasha Holmes Group

Funding Source

DUE-2000739
