Models to Measure Students’ Learning in Computer Science

As computer science becomes integrated into K-12 education systems worldwide, educators and researchers continuously search for effective methods to measure and understand students’ learning levels in this field. The challenge lies in developing reliable and comprehensive assessment models that accurately and discreetly gauge student learning. Teachers must assess learning to support students’ educational needs better. Similarly, students and parents expect schools to document students’ proficiency in computing and their practical application. Unlike conventional subjects such as math and science, very few relevant assessments are available for K-12 CS education. This article explores specific models used to measure knowledge in various CS contexts and then examines several examples of student learning indicators in computer science.

Randomized Controlled Trials and Measurement Techniques

An innovative approach to measuring student performance in computer science education involves evaluating the effectiveness of teaching parallel programming concepts. Research by Daleiden et al. (2020) focuses on assessing students’ understanding and application of these concepts.

The Token Accuracy Map (TAM) technique supplements traditional empirical analysis methods, such as timings, error counting, or compiler errors, which often need more depth in analyzing the cause of errors or providing detailed insights into specific problem areas encountered by students. The study applied TAM to examine student performance across two parallel programming paradigms: threads and process-oriented programming based on Communicating Sequential Processes (CSP), measuring programming accuracy through an automated process.

The TAM approach analyzes the accuracy of student-submitted code by comparing it against a reference solution using a token-based comparison. Each element of the code, or “token,” is compared to determine its correctness, and the results are aggregated to provide an overall accuracy score ranging from 0% to 100%. This scoring system reflects the percentage of correctness, allowing for a detailed examination of which students intuitively understand specific elements of different programming paradigms or are more likely to implement them correctly.

This approach extends error counts, offering insights into students’ mistakes at a granular level. Such detailed analysis enables researchers and educators to identify specific programming concepts requiring further clarification or alternative teaching approaches. Additionally, TAM can highlight the strengths and weaknesses of different programming paradigms from a learning perspective, thereby guiding curriculum development and instructional design.

Competence Structure Models in Informatics

Torsten et al. (2015) introduced a new model in their discussion aimed at developing a competence structure model for informatics with a focus on system comprehension and object-oriented modelling. This model, part of the MoKoM project (Modeling and Measurement of Competences in Computer Science Education), seeks to create a competence structure model that is both theoretically sound and empirically validated. The project’s goals include identifying essential competencies in the field, organizing them into a coherent framework, and devising assessments to measure them accurately. The study employed the Item Response Theory (IRT) evaluation methodology to construct the test instrument and analyze survey data.

The initial foundation of the competence model was based on theoretical concepts from international syllabi and curricula, such as the ACM’s “Model Curriculum for K-12 Computer Science” and expert papers on software development. This framework encompasses cognitive and non-cognitive skills pertinent to computer science, especially emphasizing system comprehension and object-oriented modelling.

The study further included conducting expert interviews using the Critical Incident Technique to validate the model’s applicability to real-world scenarios and its empirical accuracy. This method was instrumental in pinpointing and defining the critical competencies needed to apply and understand informatics systems. It also provided a detailed insight into student learning in informatics, identifying specific strengths and areas for improvement.


The limitation of this approach is its specificity, which may hinder scalability to broader contexts or different courses. Nonetheless, the findings indicate that detailed, granular measurements can offer valuable insights into the nature and types of students’ errors and uncover learning gaps. The resources mentioned subsequently propose a more general strategy for assessing learning in computer science.

Evidence-centred Design for High School Introductory CS Courses

Another method for evaluating student learning in computer science involves using Evidence-Centered Design (ECD). Newton et al. (2021) demonstrate the application of ECD to develop assessments that align with the curriculum of introductory high school computer science courses. ECD focuses on beginning with a clear definition of the knowledge, skills, and abilities students are expected to gain from their coursework, followed by creating assessments that directly evaluate these outcomes.

The approach entails specifying the domain-specific tasks that students should be capable of performing, identifying the evidence that would indicate their proficiency, and designing assessment tasks that would generate such evidence. The model further includes an analysis of assessment items for each instructional unit, considering their difficulty, discrimination index, and item type (e.g., multiple-choice, open-ended, etc.). This analysis aids in refining the assessments to gauge student competencies and understanding more accurately.

This model offers a more precise measurement of student learning by ensuring that assessments are closely linked to curriculum objectives and learning outcomes.

Other General Student Indicators

The Exploring Computer Science website, a premier resource for research on indicators of student learning in computer science, identifies several key metrics for understanding concepts within the field:

  • Student-Reported Increase in Knowledge of CS Concepts: Students are asked to self-assess their knowledge in problem-solving techniques, design, programming, data analysis, and robotics, rating their understanding before and after instruction.
  • Persistent Motivation in Computer Problem Solving: This self-reported measure uses a 5-point Likert scale to evaluate students’ determination to tackle computer science problems. Questions include, “Once I start working on a computer science problem or assignment, I find it hard to stop,” and “When a computer science problem arises that I can’t solve immediately, I stick with it until I find a solution.”
  • Student Engagement: This metric again relies on self-reporting to gauge a student’s interest in further pursuing computer science in their studies. It assesses enthusiasm and inclination towards the subject.
  • Use of CS Vocabulary: Through pre- and post-course surveys, students respond to the prompt: “What might it mean to think like a Computer Scientist?”. Responses are analyzed for the use of computer science-related keywords such as “analyze,” “problem-solving,” and “programming.” A positive correlation was found between CS vocabulary use and self-reported CS knowledge levels.

Comparing the Models

Each model discussed provides distinct benefits but converges on a shared objective: to gauge precisely students’ understanding of computer science. The Evidence-Centered Design (ECD) model is notable for its methodical alignment assessments with educational objectives, guaranteeing that evaluations accurately reflect the intended learning outcomes. Conversely, the randomized controlled trial and innovative measurement technique present a solid approach for empirically assessing the impact of instructional strategies on student learning achievements. Finally, the competence structure model offers an exhaustive framework for identifying and evaluating specific competencies within a particular field, like informatics, ensuring a thorough understanding of student abilities. As the field continues to evolve, so will our methods for measuring student success.


Daleiden, P., Stefik, A., Uesbeck, P. M., & Pedersen, J. (2020). Analysis of a Randomized Controlled Trial of Student Performance in Parallel Programming using a New Measurement Technique. ACM Transactions on Computing Education20(3), 1–28.

Magenheim, J., Schubert, S., & Schaper, N. (2015). Modelling and measurement of competencies in computer science education. KEYCIT 2014: key competencies in informatics and ICT7(1), 33-57.

Newton, S., Alemdar, M., Rutstein, D., Edwards, D., Helms, M., Hernandez, D., & Usselman, M. (2021). Utilizing Evidence-Centered Design to Develop Assessments: A High School Introductory Computer Science Course. Frontiers in Education6.

Potential of LLMs and Automated Text Analysis in Interpreting Student Course Feedback

Integrating Large Language Models (LLMs) with automated text analysis tools offers a novel approach to interpreting student course feedback. As educators and administrators strive to refine teaching methods and enhance learning experiences, leveraging AI’s capabilities could unlock more profound insights from student feedback. Traditionally seen as a vast collection of qualitative data filled with sentiments, preferences, and suggestions, this feedback can now be more effectively analyzed. This blog will explore how LLMs can be utilized to interpret and classify student feedback, highlighting workflows that could benefit most teachers.

The Advantages of LLMs in Feedback Interpretation

Bano et al. (2023) shed light on the capabilities of LLMs, such as ChatGPT, in analyzing qualitative data, including student feedback. Their research found a significant alignment between human and LLM classifications of Alexa voice assistant app reviews, demonstrating LLMs’ ability to understand and categorize feedback effectively. This indicates that LLMs can grasp the nuances of student feedback, especially when the data is rich in specific word choices and context related to course content or teaching methodologies.

LLMs excel at processing and interpreting large volumes of text, identifying patterns, and extracting themes from qualitative feedback. Their capacity for thematic analysis at scale can assist educators in identifying common concerns, praises, or suggestions within students’ comments, tasks that might be cumbersome and time-consuming through manual efforts.

Limitations and Challenges

Despite their advantages, LLMs have limitations. Linse (2017) highlights that fully understanding the subtleties of student feedback requires more than text analysis; it demands contextual understanding and an awareness of biases. LLMs might not accurately interpret outliers and statistical anomalies, often necessitating human intervention to identify root causes.

Kastrati et al. (2021) identify several challenges in analyzing student feedback sentiment. One major challenge is accurately identifying and interpreting figurative speech, such as sarcasm and irony, which can convey sentiments opposite to their literal meanings. Additionally, many feedback analysis techniques designed for specific domains may falter when applied to the varied contexts of educational feedback. Handling complex linguistic features, such as double negatives, unknown proper names, abbreviations, and words with multiple meanings commonly found in student feedback, presents further difficulties. Lastly, there is a risk that LLMs might inadvertently reinforce biases in their training data, leading to skewed feedback interpretations.

Tools and Workflows

According to ChatGPT (OpenAI, 2024), a suggested workflow for analyzing data from course feedback forms is summarized as follows:

  1. Data Collection: Utilize tools such as Google Forms or Microsoft Forms to design and distribute course feedback forms, emphasizing open-ended questions to gather qualitative feedback from students.
  2. Data Aggregation: Employ automation to compile feedback data into a single repository, like a Google Sheet or Microsoft Excel spreadsheet, simplifying the analysis process.
  3. Initial Thematic Analysis: Import the aggregated feedback into qualitative data analysis software such as NVivo or ATLAS.ti. Use the software’s coding capabilities to identify recurring themes or sentiments in the feedback.
  4. LLM-Assisted Analysis: Engage an LLM, like OpenAI’s GPT, to further analyze the identified themes, categorize comments, and potentially uncover new themes that were not initially evident. It’s crucial to review AI-generated themes for their accuracy and relevance.
  5. Quantitative Integration: Combine qualitative insights with quantitative data from the feedback forms (e.g., ratings) using tools like Microsoft Excel or Google Sheets. This integration offers a more holistic view of student feedback.
  6. Visualization and Presentation: Apply data visualization tools such as Google Charts or Tableau to create interactive dashboards or charts that present the findings of the qualitative analysis. Employing visual aids like word clouds for common themes, sentiment analysis graphs, and charts showing thematic distribution can render the data more engaging and comprehensible.

Case Study: Minecraft Education Lesson

ChatGPT’s recommended workflow was used to analyze feedback from a recent lesson on teaching functions in Minecraft Education.

Step 1: Data Collection

A Google Forms survey was distributed to students, which comprised three quantitative five-point Likert scale questions and three qualitative open-ended questions to gather comprehensive feedback.

MCE Questionnaire

Step 2: Data Aggregation

Using Google Forms’ export to CSV feature, all survey responses were consolidated into a single file, facilitating efficient data management.

Step 3: Initial Thematic Analysis

The survey data was then imported into atlas.ti, an online thematic analysis tool with AI capabilities, to generate initial codes from the quantitative data. This process revealed several major themes, providing valuable insights from the feedback.

Results of AI Coding

Step 4: Manual Verification and Analysis

Upon reviewing the survey data manually, the main themes identified by Atlas.ti were confirmed. Additionally, this manual step highlighted specific approaches students took to solve problems presented in the lesson. Generally, the AI-generated codes were quite accurate, but a closer analysis of the comments (like the ones below) shows even more insightful student suggestions.

AI Coding

Step 5: Quantitative Integration

With both qualitative and quantitative data at hand, we bypass the need for a separate step for quantitative integration.

Step 6: LLM-Assisted Analysis and Visualization

Next, themes were further analyzed using ChatGPT’s code interpreter feature. ChatGPT helped analyze the data and summarized the aggregated data very accurately. It even provided Python code for generating additional visualizations, enhancing the interpretation of the feedback.

Python pandas code

ChatGPT’s guidance facilitated the creation of insightful visualizations such as bar charts and word clouds.

bar chart of qualitative data
Word cloud output

Python offers a wealth of data visualization libraries for even more detailed analysis (

Best Practices for Using LLMs in Feedback Analysis

Research by Bano et al. (2023) and insights from Linse (2017) highlight the potential of LLMs and automated text analysis tools in interpreting student course feedback. Adopting best practices for integrating these technologies is critical for educators and administrators to make informed decisions that enhance teaching quality and the student learning experience, contributing to a more responsive and dynamic educational environment. Below are several recommendations:

  1. Educators or trained administrators must review AI-generated themes and categorizations to ensure alignment with the intended context and uncover nuances possibly missed by the AI. This step is vital for identifying subtleties and complexities that LLMs may not detect.
  2. Utilize insights from both AI and human analyses to inform changes in teaching practices or course content. Then, assess whether subsequent feedback reflects the effects of these changes, thereby establishing an iterative loop for continuous improvement.
  3. Offer guidance on using Student course evaluations constructively. This involves understanding the context of evaluations, looking beyond average scores to grasp the distribution, and considering student feedback as one of several measures for assessing and enhancing teaching quality.
  4. This process should act as part of a holistic teaching evaluation system, which should also encompass peer evaluations, self-assessments, and reviews of teaching materials. A comprehensive approach offers a more precise and balanced assessment of teaching effectiveness.


Bano, M., Didar Zowghi, & Whittle, J. (2023). Exploring Qualitative Research Using LLMs. ArXiv (Cornell University).

Linse, A. R. (2017). Interpreting and using student ratings data: Guidance for faculty serving as administrators and on evaluation committees. Studies in Educational Evaluation54, 94–106.

Kastrati, Z., Dalipi, F., Imran, A. S., Pireva Nuci, K., & Wani, M. A. (2021). Sentiment Analysis of Students’ Feedback with NLP and Deep Learning: A Systematic Mapping Study. Applied Sciences, 11(9), 3986.

OpenAI. (2024). ChatGPT (Feb 10, 2024) [Large language model].