Programming courses typically include assignments in which students write code to meet a given specification. In such courses, an autograder is an automated tool that assesses student submissions by running input/output tests. Autograders have existed since the early days of computer science as a field of study (Hollingsworth, 1960). More recently, with the growth of large online programming courses enrolling up to 500 students, autograders have gained popularity as an efficient means of grading programming assignments (Keuning et al., 2018). They are instrumental in student engagement (Iosup & Epema, 2014) and pivotal in providing students with constructive feedback (Keuning et al., 2018). However, like any educational technology, autograders have advantages and disadvantages that warrant consideration. This post explores the most significant pros and cons of using autograders for assessment in programming courses.
Several well-known proprietary programming autograders are currently available, including CodePost, CodeGrade, Codio, and Mimir. Each tool offers a wealth of academic programming resources, including built-in problems, user-friendly interfaces, flexible question setting, and code review capabilities. However, these companies charge institutions a substantial annual fee, ranging from $20,000 to $100,000 CAD for a typical school of 1,000 students. Additionally, each student is required to pay a monthly fee of between $10 and $50 CAD.
In my view, such pricing is excessive (and greedy) and contradicts the principles outlined in the computer science code of ethics, particularly when the software is intended to advance software development. As a result, many post-secondary institutions opt to develop and maintain autograders in-house, tailoring them to their specific preferences. This approach allows faculty to propose new features and enhancements, and students can also contribute suggestions for improvement.
Advantages of Autograders
One of the most compelling incentives for using an autograder is the significant time savings it offers instructors compared to manual grading. Studies indicate that autograders can assess assignments at least three to four times faster than human graders (Ihantola et al., 2010; Keuning et al., 2018). This substantial reduction in grading workload allows instructors to allocate more time to essential teaching tasks such as lesson planning, curriculum development, and providing student support and feedback. The time savings can be particularly substantial in large classes.
Autograders also benefit students by providing quicker feedback on their work. This is especially valuable in introductory programming classes, where receiving prompt results on smaller assignments can significantly enhance student learning and motivation (Keuning et al., 2018). Unlike human grading, which can take days or weeks, autograders can assess submissions within seconds or minutes and instantly inform students whether their code has passed or failed the test cases. This expedited feedback allows students to validate and refine their work much more rapidly than traditional grading methods permit.
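To make the pass/fail feedback loop concrete, here is a minimal sketch of an input/output autograder. It is only an illustration, not how any particular product works: the `add` assignment, the `TEST_CASES` list, and the `grade` function are all hypothetical names invented for this example.

```python
from typing import Any, Callable

# Hypothetical test cases for an assignment asking for an `add(a, b)` function.
# Each entry is (arguments, expected return value).
TEST_CASES = [
    ((1, 2), 3),
    ((-1, 1), 0),
    ((0, 0), 0),
]


def grade(student_fn: Callable[..., Any], test_cases=TEST_CASES) -> dict:
    """Run every test case against the submission and record pass/fail."""
    results = []
    for args, expected in test_cases:
        try:
            passed = student_fn(*args) == expected
        except Exception:
            # A crash in student code counts as a failed test,
            # rather than bringing down the grader itself.
            passed = False
        results.append(passed)
    return {"passed": sum(results), "total": len(results), "results": results}


def student_add(a, b):
    # A deliberately buggy submission, used only for illustration.
    return abs(a) + abs(b)


report = grade(student_add)
print(f"{report['passed']}/{report['total']} tests passed")
```

A real autograder would also need sandboxing and time limits, which this sketch leaves out, but the core loop that produces near-instant feedback is about this simple.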
A prevalent concern with human graders is the inconsistency in grading from one assignment to another, from one student to another, or even within a single assignment. Factors such as fatigue, emotional states, and biases can impact the quality of human grading, potentially leading to unfairness or errors. Autograders, by contrast, eliminate this subjectivity by applying uniform standards and tests to all submissions, ensuring consistent and equitable grading across the entire class, and thereby enhancing student satisfaction (Hagerer, 2021).
In courses that employ autograders, students quickly learn that their code must pass all of the autograder's test cases to earn full credit on an assignment. While the efficacy of test-driven development (TDD) as a software development methodology is debatable, this workflow gives students experience with a TDD-style process, in which they repeatedly run tests against their code to fix errors and reach the desired functionality (Wang et al., 2011). Essentially, autograders compel students to treat testing as an integral part of coding, rather than merely striving to meet the minimal functional requirements.
Disadvantages of Autograders
A significant drawback of autograders, frequently cited in the literature, is their inflexibility compared to human graders (Ihantola et al., 2010; Keuning et al., 2018; Wang et al., 2018). Autograders apply identical test cases to all submissions without exception. Consequently, creative solutions that meet the assignment requirements but deviate from the expected implementation or output format are marked incorrect. Even a minor discrepancy, such as a missing whitespace character, can be the difference between a pass and a fail. Human graders, by contrast, can exercise judgment to accommodate alternative approaches.
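The whitespace problem is easy to reproduce. The sketch below contrasts a strict string comparison with a more forgiving one; both function names and the sample output are assumptions made up for this example, and the lenient rule shown here is just one possible choice.

```python
def strict_match(actual: str, expected: str) -> bool:
    """Byte-for-byte comparison, as a rigid autograder might do."""
    return actual == expected


def lenient_match(actual: str, expected: str) -> bool:
    """Compare whitespace-separated tokens, so that extra spaces or a
    missing trailing newline do not cause a failure."""
    return actual.split() == expected.split()


expected_output = "Total: 42\n"
student_output = "Total: 42"  # correct answer, but no trailing newline

print(strict_match(student_output, expected_output))   # False
print(lenient_match(student_output, expected_output))  # True
```

Normalizing whitespace only goes so far: a submission that prints `Total:42` still fails the lenient check, because the tokens themselves differ. Deciding how much deviation to tolerate is exactly the kind of judgment a human grader makes without thinking about it.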
Most autograders assess only the functional correctness of student code, checking its output against a given set of tests. However, programming courses also aim to instill good coding practices, such as readability, modularization, adherence to naming conventions, coherent design, and appropriate commenting. Autograders do not adequately assess these crucial design and style aspects, which can lead students to neglect good design principles as long as their code passes the functionality tests.
Another concern is uniformity. Post-secondary institutions typically adopt autograders to give students a structured, consistent way to build their knowledge and track their progress across multiple courses. In larger institutions, however, where numerous faculty members teach diverse courses with varying requirements, achieving universal acceptance and use of a single autograder is difficult. Faculty members may prefer tools they are more comfortable with, and some may choose not to use an autograder at all. The result is inconsistent tool usage from one course to the next and a disjointed student experience.
Relying exclusively on autograders poses the risk of students learning to pass test cases without acquiring a deeper understanding of programming concepts and problem-solving skills. The emphasis on meeting the autograder’s criteria can lead students to adopt a procedural approach, focusing on achieving the correct output rather than understanding the underlying logic. Some might resort to a trial-and-error method, tweaking their program until it gains autograder approval. While this approach may secure the desired grades, it does not foster genuine understanding or long-term retention of knowledge. Baniassad et al. (2021) introduced a submission penalty at the University of British Columbia to discourage over-reliance on their in-house autograding tool. This adaptation exemplifies the flexibility of modifying tool requirements, a possibility uniquely available when the tool is developed in-house.
Finally, like any web-based software system, autograders can experience technical issues that lead to grading failures and student frustration. The UC Berkeley incident highlights the “single point of failure” risk, in which an autograder disruption blocks all grading. Unlike a distributed team of human graders, a centralized automated grader is vulnerable to a single technical problem. When it goes down, some students may miss deadlines through no fault of their own. Furthermore, if instructors refuse to make accommodations for autograder malfunctions, students can feel cheated and come to see the grading as unfairly disconnected from actual instruction. This speaks to larger concerns about over-reliance on algorithmic systems in education. Automated aids like autograders should not be the sole means of assessment.
The existing body of research on autograders underscores that they are not a panacea for replacing human graders entirely. Instead, to optimize their advantages and mitigate their limitations, autograders are most effective when thoughtfully integrated into a course assessment strategy, complemented by manual grading where it is most beneficial. Below are some best practices for incorporating autograders effectively:
- Employ autograders for basic functionality testing, while manually reviewing selected assignments for flexibility, creativity, and design.
- Utilize autograders to assess the correctness of core logic, and rely on human graders to evaluate structure, style, and readability.
- Complement autograder evaluations with human feedback on prevalent mistakes and areas requiring enhancement.
- Impose penalties for excessive submissions to discourage over-reliance on the autograder.
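As a rough illustration of the last practice, a submission penalty can be as simple as the sketch below. This is a hypothetical policy written for this post, not the regression-penalty scheme described by Baniassad et al. (2021); the function name, the five free submissions, the two-point deduction, and the 20-point cap are all made-up values.

```python
def apply_submission_penalty(
    raw_score: float,
    submissions: int,
    free_submissions: int = 5,       # assumed allowance before penalties start
    penalty_per_extra: float = 2.0,  # assumed points lost per extra submission
    max_penalty: float = 20.0,       # assumed cap on the total deduction
) -> float:
    """Return the score after deducting points for excessive submissions."""
    extra = max(0, submissions - free_submissions)
    penalty = min(max_penalty, extra * penalty_per_extra)
    # Never let the penalty push a score below zero.
    return max(0.0, raw_score - penalty)


print(apply_submission_penalty(90, submissions=3))   # within the allowance
print(apply_submission_penalty(90, submissions=8))   # three extra submissions
print(apply_submission_penalty(90, submissions=30))  # penalty hits the cap
```

Capping the deduction keeps the policy from being purely punitive: the goal is to nudge students to think before resubmitting, not to wipe out an otherwise correct solution.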
Proper integration of autograders aligns with technology integration frameworks like SAMR, enhancing existing processes without entirely transforming grading in programming courses. It also changes how students engage with programming, introducing a more gamified approach. Like any educational technology, autograders derive their value from strategic use within well-defined goals and contexts.
References
Hollingsworth, J. (1960). Automatic graders for programming classes. Communications of the ACM, 3(10), 528–529. https://doi.org/10.1145/367415.367422
Keuning, H., Jeuring, J., & Heeren, B. (2016). Towards a Systematic Review of Automated Feedback Generation for Programming Exercises. Proceedings of the 2016 ACM Conference on Innovation and Technology in Computer Science Education. https://doi.org/10.1145/2899415.2899422
Iosup, A., & Epema, D. (2014). An experience report on using gamification in technical higher education. Proceedings of the 45th ACM Technical Symposium on Computer Science Education – SIGCSE ’14. https://doi.org/10.1145/2538862.2538899
Ihantola, P., Ahoniemi, T., Karavirta, V., & Seppälä, O. (2010). Review of recent systems for automatic assessment of programming assignments. Proceedings of the 10th Koli Calling International Conference on Computing Education Research – Koli Calling ’10. https://doi.org/10.1145/1930464.1930480
Hagerer, G. (2021). An Analysis of Programming Course Evaluations Before and After the Introduction of an Autograder. ieeexplore.ieee.org.
Wang, T., Su, X., Ma, P., Wang, Y., & Wang, K. (2011). Ability-training-oriented automated assessment in introductory programming course. Computers & Education, 56(1), 220–226. https://doi.org/10.1016/j.compedu.2010.08.003
Baniassad, E., Zamprogno, L., Hall, B., & Holmes, R. (2021). STOP THE (AUTOGRADER) INSANITY: Regression Penalties to Deter Autograder Overreliance. Proceedings of the 52nd ACM Technical Symposium on Computer Science Education. https://doi.org/10.1145/3408877.3432430