CMPUT660F25 Classes and Topics

2025/09/03

Classes and topics of CMPUT660

Topic 1 2025-09-05

Please read the Road Ahead for Mining, and the World of Code (WOC) paper for Friday!

On Friday, Abram will present the Road Ahead for Mining, WOC, and FMWare

Road Ahead

World of Code

World of Code is a huge graph of version control data mined from Git: billions of commits and git blobs, all interlinked across forks.

The dataset is so large that you have to operate on VMs the maintainers make available to you. Because of these barriers to access, few researchers work with it, so any knowledge you extract is easier to publish. This is high risk, high reward.
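A minimal sketch of the kind of lookup WoC enables, using plain Python dicts in place of WoC's real on-disk key-value maps (the map names `c2b` and `b2c` mirror WoC's naming convention, but the data and the API here are hypothetical):

```python
# Hypothetical miniature of World of Code's commit<->blob maps.
# The real maps are huge on-disk key-value stores queried on WoC's servers.
c2b = {  # commit hash -> blob hashes it introduces
    "commitA": ["blob1", "blob2"],
    "commitB": ["blob2"],
}

# Invert the map to ask: which commits (possibly across forks) share a blob?
b2c = {}
for commit, blobs in c2b.items():
    for blob in blobs:
        b2c.setdefault(blob, []).append(commit)

print(b2c["blob2"])  # → ['commitA', 'commitB']
```

The same invert-and-join pattern, scaled up, is how WoC links blobs, commits, projects, and authors across billions of records.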

FMWare

MSR 2026 Mining Challenge

Just the proposals so far…

https://2026.msrconf.org/track/msr-2026-mining-challenge-proposals

MSR 2025 Mining Challenge

https://2025.msrconf.org/track/msr-2025-mining-challenge?#Call-for-Mining-Challenge-Papers

MSR 2024 Challenge:

https://2024.msrconf.org/track/msr-2024-mining-challenge?#Call-for-Mining-Challenge-Papers-

Intro to Stats

In class

Choose 2 challenge papers to present for next class.

Homework:

Everyone else: choose 6 papers to read, 3 long and 3 short.

Topic 2 2025-09-12

Kalvin Eng presents [Reading]

https://softwareprocess.es/homepage/papers/2025-eng2025msr-unreal/

Under the Blueprints: Parsing Unreal Engine’s Visual Scripting at Scale

Kalvin Eng and Abram Hindle

In Unreal Engine, a popular game engine for AAA (high budget, high profile) title video games, Blueprint Visual Scripting is a widely used tool for developing gameplay elements using visual node and edge-based source code. Despite its widespread adoption, there is limited research on the intersection of software engineering and Blueprint-based visual programming. This dataset aims to address this gap by providing parsed Blueprint graphs extracted from Unreal Engine’s binary UAsset files. We developed extractors and a custom parser to mine Blueprint graphs from 335,753 Blueprint UAsset files across 24,009 GitHub projects. By providing this dataset, we hope to encourage future research on the structure and usage of Unreal Engine Blueprints, and promote the development of tools–such as code smell detectors and language models for code completion–that can optimize visual programming practices within Unreal Engine.

Tayyib presents [Challenge Reading 1]

Tayyib will present:

Can ChatGPT Support Developers? An Empirical Evaluation of Large Language Models for Code Generation.

Who: Kailun Jin, Chung-Yu Wang, Hung Viet Pham, Hadi Hemmati
Track: MSR 2024 Mining Challenge
When: Mon 15 Apr 2024, 14:25 - 14:30, at Almada Negreiros. Mining Challenge Chair(s): Preetha Chatterjee, Fabio Palomba

Abstract

Large language models (LLMs) have demonstrated notable proficiency in code generation, with numerous prior studies showing their promising capabilities in various development scenarios. However, these studies mainly provide evaluation in research settings, which leaves a significant gap in understanding how effectively LLMs can support developers in real-world settings. To address this, we conducted an empirical analysis of conversations in DevGPT, a dataset collected from developers’ conversations with ChatGPT (captured with the Share Link feature on platforms such as GitHub). Our empirical findings indicate that the current practice of using LLM-generated code is typically limited to either demonstrating high-level concepts or providing examples in documentation, rather than being used as production-ready code. These findings indicate that there is much future work needed to improve LLMs in code generation before they can be integral parts of modern software development.

Link to Preprint https://arxiv.org/abs/2402.11702

Daniel presents [Challenge Reading 2]

Daniel will present:

On the Co-Occurrence of Refactoring of Test and Source Code

Abstract

Refactoring is a widespread practice that aims to help improve the quality of a software system without altering its external behaviour. In practice, developers can perform refactoring operations on test and source code. However, while prior work shows that refactoring source code brings many benefits, a limited number of studies empirically investigate refactoring of test code and whether it co-occurs with refactoring of source code. To examine those co-occurring refactorings, we conducted an empirical study of 60,465 commits spanning 77 open-source Java projects.

First, we quantitatively analyzed the commits from those projects to identify co-occurring refactoring commits (i.e., commits that contain refactorings performed on test and source code). Our results showed that on average 17.9% of refactoring commits are co-occurring refactoring commits, which is twice as many as test code-only refactoring commits. Also, we investigated the type of refactorings applied to test code in those co-occurring commits. We found Change Variable Type and Move Class are the most common refactorings. Second, we trained random forest classifiers to predict when refactoring test code should co-occur with refactoring source code using features extracted from the refactoring source code in ten selected projects. Our results showed that the classifier can accurately predict when test and source code refactoring co-occurs with AUC values between 0.67-0.92. Our analysis also showed that the most important features for our classifier are related to the refactoring size and developer refactoring experience.
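The AUC metric the authors report can be computed directly from classifier scores: it is the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. A small self-contained sketch (toy scores and labels, not the paper's data):

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: higher scores mostly, but not always, go to positives.
print(auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 1]))
```

An AUC of 0.5 means random ranking, 1.0 a perfect separation; the paper's 0.67-0.92 range sits between those extremes.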

Link to Preprint

World of Code

Remember to sign up for WOC:

Choose 6 readings:

Remember to choose 6 readings for the future, 3 long 3 short. Use the readings spreadsheet: https://docs.google.com/spreadsheets/u/1/d/12QvJxwsHIoka2k5yepRST27Wolfgz2kPaQu-e2S6E_c/edit?gid=0#gid=0

Concepts discussed:

Topic 3 2025-09-19

Zhou Yang presents Stealthy Backdoor Attack for Code Models

Zhou Yang, Bowen Xu, Jie M. Zhang, Hong Jin Kang, Jieke Shi, Junda He, and David Lo. 2024. Stealthy Backdoor Attack for Code Models. IEEE Trans. Softw. Eng. 50, 4 (April 2024), 721–741. https://doi.org/10.1109/TSE.2024.3361661

PDF

Abstract

Code models, such as CodeBERT and CodeT5, offer general-purpose representations of code and play a vital role in supporting downstream automated software engineering tasks. Most recently, code models were revealed to be vulnerable to backdoor attacks. A code model that is backdoor-attacked can behave normally on clean examples but will produce pre-defined malicious outputs on examples injected with triggers that activate the backdoors. Existing backdoor attacks on code models use unstealthy and easy-to-detect triggers. This paper aims to investigate the vulnerability of code models with stealthy backdoor attacks. To this end, we propose Afraidoor (Adversarial Feature as Adaptive Backdoor). Afraidoor achieves stealthiness by leveraging adversarial perturbations to inject adaptive triggers into different inputs. We apply Afraidoor to three widely adopted code models (CodeBERT, PLBART, and CodeT5) and two downstream tasks (code summarization and method name prediction). We evaluate three widely used defense methods and find that Afraidoor is more unlikely to be detected by the defense methods than by baseline methods. More specifically, when using spectral signature as defense, around 85% of adaptive triggers in Afraidoor bypass the detection in the defense process. By contrast, only less than 12% of the triggers from previous work bypass the defense. When the defense method is not applied, both Afraidoor and baselines have almost perfect attack success rates. However, once a defense is applied, the attack success rates of baselines decrease dramatically, while the success rate of Afraidoor remains high. Our finding exposes security weaknesses in code models under stealthy backdoor attacks and shows that state-of-the-art defense methods cannot provide sufficient protection. We call for more research efforts in understanding security threats to code models and developing more effective countermeasures.
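The spectral-signature defense mentioned in the abstract scores each training example by its squared projection onto the top singular direction of the mean-centered feature representations; poisoned examples tend to concentrate along that direction. A rough numpy sketch of just the scoring step (toy features, not the paper's models or data):

```python
import numpy as np

def spectral_scores(feats):
    """Outlier score per row: squared projection onto the top
    right-singular vector of the mean-centered feature matrix."""
    centered = feats - feats.mean(axis=0)
    # SVD gives the direction of maximum variance as vt[0].
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[0]) ** 2

rng = np.random.default_rng(0)
clean = rng.normal(size=(50, 8))
poisoned = rng.normal(size=(5, 8)) + 6.0  # shifted subpopulation
scores = spectral_scores(np.vstack([clean, poisoned]))
# The shifted (poisoned) rows should receive the largest scores.
print(scores[-5:].mean() > scores[:50].mean())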

Resources

Jainam presents [Challenge Reading 1]

Jainam will present: Zhang, Yue, et al. “Does Generative AI Generate Smells Related to Container Orchestration?: An Exploratory Study with Kubernetes Manifests.” (2024).

Abstract

Generative artificial intelligence (AI) technologies, such as ChatGPT have shown promise in solving software engineering problems. However, these technologies have also shown to be susceptible to generating software artifacts that contain quality issues. A systematic characterization of quality issues, such as smells in ChatGPT-generated artifacts can help in providing recommendations for practitioners who use generative AI for container orchestration. We conduct an empirical study with 98 Kubernetes manifests to quantify smells in manifests generated by ChatGPT. Our empirical study shows: (i) 35.8% of the 98 Kubernetes manifests generated include at least one instance of smell; (ii) two types of Kubernetes objects, namely Deployment and Service, are impacted by identified smells; and (iii) the most frequently occurring smell is unset CPU and memory requirements. Based on our findings, we recommend practitioners to apply quality assurance activities for ChatGPT-generated Kubernetes manifests prior to using these manifests for container orchestration.
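The most frequent smell the study reports, unset CPU and memory requirements, is straightforward to check for mechanically. A hedged sketch that walks a Deployment manifest represented as a Python dict (the field names follow the Kubernetes schema; the manifest itself is made up, and this is not the authors' tooling):

```python
def missing_resources(manifest):
    """Return names of containers with no CPU/memory requests or limits."""
    spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    flagged = []
    for c in spec.get("containers", []):
        res = c.get("resources") or {}
        if not res.get("requests") and not res.get("limits"):
            flagged.append(c["name"])
    return flagged

deployment = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "web", "image": "nginx"},  # smell: no resources set
        {"name": "db", "image": "postgres",
         "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}}},
    ]}}},
}
print(missing_resources(deployment))  # → ['web']
```

In practice the manifest would be loaded from YAML first; the smell check itself is just a walk over the parsed structure.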

Aron presents [Challenge Reading 2]

Aron will present: AI Writes, We Analyze: The ChatGPT Python Code Saga

Md Fazle Rabbi, Arifa Islam Champa, Minhaz F. Zibran, and Md Rakibul Islam. 2024. AI Writes, We Analyze: The ChatGPT Python Code Saga. In Proceedings of the 21st International Conference on Mining Software Repositories (MSR ‘24). Association for Computing Machinery, New York, NY, USA, 177–181. https://doi.org/10.1145/3643991.3645076

Abstract

In this study, we quantitatively analyze 1,756 AI-written Python code snippets in the DevGPT dataset and evaluate them for quality and security issues. We systematically distinguish the code snippets as either generated by ChatGPT from scratch (ChatGPT-generated) or modified user-provided code (ChatGPT-modified). The results reveal that ChatGPT-modified code more frequently displays quality issues compared to ChatGPT-generated code. The findings provide insights into the inherent limitations of AI-written code and emphasize the need for scrutiny before integrating such pieces of code into software systems.

Mining Challenge Finally Released

Agentic PRs.

2025-09-26

Umut presents Challenge 1

Umut presents: “Cheating Death: A Statistical Survival Analysis of Publicly Available Python Projects”

Ali RH, Parlett-Pelleriti C, Linstead E. Cheating Death: A Statistical Survival Analysis of Publicly Available Python Projects. In Proceedings of the 17th International Conference on Mining Software Repositories (MSR 2020); 2020 Jun 29 (pp. 6-10).

PDF

Abstract

We apply survival analysis methods to a dataset of publicly-available software projects in order to examine the attributes that might lead to their inactivity over time. We ran a Kaplan-Meier analysis and fit a Cox Proportional-Hazards model to a subset of Software Heritage Graph Dataset, consisting of 3052 popular Python projects hosted on GitLab/GitHub, Debian, and PyPI, over a period of 165 months. We show that projects with repositories on multiple hosting services, a timeline of publishing major releases, and a good network of developers, remain healthy over time and should be worthy of the effort put in by developers and contributors.
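The Kaplan-Meier estimator at the heart of the analysis is simple enough to compute by hand: at each event time, survival is multiplied by the fraction of at-risk projects that did not become inactive at that time. A toy sketch (made-up durations, not the paper's Software Heritage data; tie handling is the standard product-limit rule):

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time.

    durations: time until inactivity or censoring;
    observed: 1 if the project actually went inactive then, 0 if censored.
    """
    curve, surv = [], 1.0
    for t in sorted(set(d for d, o in zip(durations, observed) if o)):
        at_risk = sum(d >= t for d in durations)
        deaths = sum(d == t and o for d, o in zip(durations, observed))
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve

# Five toy projects; the last is still active at month 30 (censored).
print(kaplan_meier([5, 10, 10, 20, 30], [1, 1, 1, 1, 0]))
```

The Cox model then asks how covariates (hosting services, releases, developer network) shift the hazard underlying this curve.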

Imgyeong presents Challenge 2

Imgyeong presents: “Quality Assessment of ChatGPT Generated Code and their Use by Developers”

Siddiq ML, Roney L, Zhang J, Santos JC. Quality Assessment of ChatGPT Generated Code and their Use by Developers. In Proceedings of the 21st International Conference on Mining Software Repositories (MSR 2024); 2024 Apr 15 (pp. 152-156).

PDF

Abstract

The release of large language models (LLMs) like ChatGPT has revolutionized software development. Prior works explored ChatGPT’s generated response quality, the effectiveness of different prompting techniques, its performance in programming contests, etc. However, there is limited information regarding the practical usage of ChatGPT by software developers. This data mining challenge focuses on DevGPT, a curated dataset of developer-ChatGPT conversations encompassing prompts with ChatGPT’s responses, including code snippets. Our paper leverages this dataset to investigate (RQ1) whether ChatGPT generates Python & Java code with quality issues; (RQ2) whether ChatGPT-generated code is merged into a repository, and, if it does, to what extent developers change them; and (RQ3) what are the main use cases for ChatGPT besides code generation. We found that ChatGPT-generated code suffers from using undefined/unused variables and improper documentation. They also have security issues related to improper resources and exception management. Our results show that ChatGPT-generated codes are hardly merged, and they are significantly modified before merging. Based on an analysis of developers’ discussions and the developer-ChatGPT chats, we found that developers use ChatGPT for every stage of software development and leverage it to learn about new frameworks and development kits.

Other

2025-10-03

Zhouyiyang presents Paper 1

Zhouyiyang presents “SpecGen: Automated Generation of Formal Program Specifications via Large Language Models”

L. Ma, S. Liu, Y. Li, X. Xie and L. Bu, “SpecGen: Automated Generation of Formal Program Specifications via Large Language Models,” in 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), Ottawa, ON, Canada, 2025, pp. 16-28, doi: 10.1109/ICSE55347.2025.00129.

Abstract

Formal program specifications play a crucial role in various stages of software development. However, manually crafting formal program specifications is rather difficult, making the job time-consuming and labor-intensive. It is even more challenging to write specifications that correctly and comprehensively describe the semantics of complex programs. To reduce the burden on software developers, automated specification generation methods have emerged. However, existing methods usually rely on predefined templates or grammar, making them struggle to accurately describe the behavior and functionality of complex real-world programs. To tackle this challenge, we introduce SpecGen, a novel technique for formal program specification generation based on Large Language Models. Our key insight is to overcome the limitations of existing methods by leveraging the code comprehension capability of LLMs. The process of SpecGen consists of two phases. The first phase employs a conversational approach that guides the LLM to generate appropriate specifications for a given program. The second phase, designed for where the LLM fails to generate correct specifications, applies four mutation operators to the model-generated specifications and selects verifiable specifications from the mutated ones through a novel heuristic selection strategy. We evaluate SpecGen on two datasets, including the SV-COMP Java category benchmark and a manually constructed dataset. Experimental results demonstrate that SpecGen succeeds in generating verifiable specifications for 279 out of 385 programs, outperforming the existing purely LLM-based approaches and conventional specification generation tools like Houdini and Daikon. Further investigations on the quality of generated specifications indicate that SpecGen can comprehensively articulate the behaviors of the input program.

Quinn presents Paper 2

Quinn presents Bryksin T, Petukhov V, Alexin I, Prikhodko S, Shpilman A, Kovalenko V, Povarov N. Using Large-Scale Anomaly Detection on Code to Improve Kotlin Compiler. In: 2020 IEEE/ACM 17th International Conference on Mining Software Repositories (MSR); 2020 Oct 5-6; Seoul, Republic of Korea. ACM; 2020. p. 455–65. doi:10.1145/3379597.3387447.

Abstract

In this work, we apply anomaly detection to source code and bytecode to facilitate the development of a programming language and its compiler. We define anomaly as a code fragment that is different from typical code written in a particular programming language. Identifying such code fragments is beneficial to both language developers and end users, since anomalies may indicate potential issues with the compiler or with runtime performance. Moreover, anomalies could correspond to problems in language design. For this study, we choose Kotlin as the target programming language. We outline and discuss approaches to obtaining vector representations of source code and bytecode and to the detection of anomalies across vectorized code snippets. The paper presents a method that aims to detect two types of anomalies: syntax tree anomalies and so-called compiler-induced anomalies that arise only in the compiled bytecode. We describe several experiments that employ different combinations of vectorization and anomaly detection techniques and discuss types of detected anomalies and their usefulness for language developers. We demonstrate that the extracted anomalies and the underlying extraction technique provide additional value for language development.
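The paper experiments with several combinations of vectorization and anomaly detection; one simple illustrative choice for the detection step is a distance-based score over code vectors (this sketch is not the paper's method, just one plausible instance of the idea):

```python
import numpy as np

def knn_anomaly_scores(vecs, k=3):
    """Score each code vector by its mean distance to its k nearest
    neighbours; isolated vectors (anomalies) score high."""
    d = np.linalg.norm(vecs[:, None, :] - vecs[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # ignore self-distance
    nearest = np.sort(d, axis=1)[:, :k]   # k smallest distances per row
    return nearest.mean(axis=1)

rng = np.random.default_rng(1)
vecs = rng.normal(size=(30, 16))          # stand-ins for vectorized snippets
vecs[0] += 10.0                           # plant one anomalous vector
scores = knn_anomaly_scores(vecs)
print(int(scores.argmax()))  # → 0
```

In the paper's setting, high-scoring vectors would then be inspected by hand to decide whether they reflect compiler or language-design issues.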

Resources

2025-10-10

Tayyib presents Paper 1

Tayyib presents “The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering” by Hao Li, Haoxiang Zhang, and Ahmed E. Hassan.

Abstract

The future of software engineering–SE 3.0–is unfolding with the rise of AI teammates: autonomous, goal-driven systems collaborating with human developers. Among these, autonomous coding agents are especially transformative, now actively initiating, reviewing, and evolving code at scale. This paper introduces AIDev, the first large-scale dataset capturing how such agents operate in the wild. Spanning over 456,000 pull requests by five leading agents–OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code–across 61,000 repositories and 47,000 developers, AIDev provides an unprecedented empirical foundation for studying autonomous teammates in software development. Unlike prior work that has largely theorized the rise of AI-native software engineering, AIDev offers structured, open data to support research in benchmarking, agent readiness, optimization, collaboration modeling, and AI governance. The dataset includes rich metadata on PRs, authorship, review timelines, code changes, and integration outcomes–enabling exploration beyond synthetic benchmarks like SWE-bench. For instance, although agents often outperform humans in speed, their PRs are accepted less frequently, revealing a trust and utility gap. Furthermore, while agents accelerate code submission–one developer submitted as many PRs in three days as they had in three years–these are structurally simpler (via code complexity metrics). We envision AIDev as a living resource: extensible, analyzable, and ready for the SE and AI communities. Grounding SE 3.0 in real-world evidence, AIDev enables a new generation of research into AI-native workflows and supports building the next wave of symbiotic human-AI collaboration. The dataset is publicly available at this https URL.
Keywords: AI Agent, Agentic AI, Coding Agent, Agentic Coding, Software Engineering Agent
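The trust-gap observation (agent PRs accepted less often than human ones) boils down to an acceptance-rate comparison over PR metadata. A sketch with made-up records (the field names are illustrative, not AIDev's actual schema):

```python
from collections import defaultdict

# Hypothetical PR records: (author_type, was the PR merged?)
prs = [("agent", True), ("agent", False), ("agent", False),
       ("human", True), ("human", True), ("human", False)]

totals, merged = defaultdict(int), defaultdict(int)
for author_type, was_merged in prs:
    totals[author_type] += 1
    merged[author_type] += was_merged   # True counts as 1

for who in sorted(totals):
    rate = merged[who] / totals[who]
    print(f"{who}: {rate:.0%} of {totals[who]} PRs merged")
```

The real dataset supports the same comparison at the scale of 456,000 PRs, broken down per agent and per repository.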

Lukas presents Challenge 2

Lukas presents Amirreza Bagheri and Péter Hegedüs. 2022. Is refactoring always a good egg? exploring the interconnection between bugs and refactorings. In Proceedings of the 19th International Conference on Mining Software Repositories (MSR ‘22). Association for Computing Machinery, New York, NY, USA, 117–121. https://doi.org/10.1145/3524842.3528034

Abstract

Bug fixing and code refactoring are two distinct maintenance actions with different goals. While bug fixing is a corrective change that eliminates a defect from the program, refactoring targets improving the internal quality (i.e., maintainability) of a software system without changing its functionality. Best practices and common intuition suggest that these code actions should not be mixed in a single code change. Furthermore, as refactoring aims for improving quality without functional changes, we would expect that refactoring code changes will not be sources of bugs. Nonetheless, empirical studies show that none of the above hypotheses are necessarily true in practice. In this paper, we empirically investigate the interconnection between bug-related and refactoring code changes using the SmartSHARK dataset. Our goal is to explore how often bug fixes and refactorings co-occur in a single commit (tangled changes) and whether refactoring changes themselves might induce bugs into the system. We found that it is not uncommon to have tangled commits of bug fixes and refactorings; 21% of bug-fixing commits include at least one type of refactoring on average. What is even more shocking is that 54% of bug-inducing commits also contain code refactoring changes. For instance, 10% (652 occurrences) of the Change Variable Type refactorings in the dataset appear in bug-inducing commits that make up 7.9% of the total inducing commits.

Resources

2025-10-17

Daniel Presents Long Paper 1

Daniel presents An Empirical Study of End-user Programmers in the Computer Music Community by Gregory Burlet, Abram Hindle, MSR 2015

Abstract

Computer musicians are a community of end-user programmers who often use visual programming languages such as Max/MSP or Pure Data to realize their musical compo- sitions. This research study conducts a multifaceted analysis of the software development practices of computer musicians when programming in these visual music-oriented languages. A statistical analysis of project metadata harvested from software repositories hosted on GitHub reveals that in comparison to the general population of software developers, computer musicians’ repositories have less commits, less frequent commits, more commits on weekends, yet similar numbers of bug reports and similar numbers of contributing authors. Analysis of source code in these repositories reveals that the vast majority of code can be reconstructed from duplicate fragments. Finally, these results are corroborated by a survey of computer musicians and interviews with individuals in this end-user community. Based on this analysis and feedback from computer musicians we find that there are many avenues where software engineering can be applied to help aid this community of end-user programmers.

Aron Presents Long Paper 2

Aron presents Toufique Ahmed, Premkumar Devanbu, Christoph Treude, and Michael Pradel. 2025. Can LLMs Replace Manual Annotation of Software Engineering Artifacts?. In 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR). 526–538. doi:10.1109/MSR66628.2025.00086

Abstract

Experimental evaluations of software engineering innovations, e.g., tools and processes, often include human-subject studies as a component of a multi-pronged strategy to obtain greater generalizability of the findings. However, human-subject studies in our field are challenging, due to the cost and difficulty of finding and employing suitable subjects, ideally, professional programmers with varying degrees of experience. Meanwhile, large language models (LLMs) have recently started to demonstrate human-level performance in several areas. This paper explores the possibility of substituting costly human subjects with much cheaper LLM queries in evaluations of code and code-related artifacts. We study this idea by applying six state-of-the-art LLMs to ten annotation tasks from five datasets created by prior work, such as judging the accuracy of a natural language summary of a method or deciding whether a code change fixes a static analysis warning. Our results show that replacing some human annotation effort with LLMs can produce inter-rater agreements equal or close to human-rater agreement. To help decide when and how to use LLMs in human-subject studies, we propose model-model agreement as a predictor of whether a given task is suitable for LLMs at all, and model confidence as a means to select specific samples where LLMs can safely replace human annotators. Overall, our work is the first step toward mixed human-LLM evaluations in software engineering.
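Inter-rater agreement of the kind the paper compares between humans and LLMs is typically measured with a chance-corrected statistic such as Cohen's kappa. A minimal sketch for two raters (toy labels, not the paper's annotation data):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters' label lists."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    return (observed - expected) / (1 - expected)

human = [1, 1, 0, 0]
model = [1, 0, 0, 0]
print(cohens_kappa(human, model))  # → 0.5
```

Substituting an LLM for one of the raters and comparing kappa to the human-human baseline is the core of the paper's evaluation design.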

Resources

2025-10-24

Jainam presents GreenHub Farmer

Jainam presents Matalonga H, Cabral B, Castor F, Couto M, Pereira R, Sousa SM, et al. GreenHub Farmer: Real-world data for Android Energy Mining. 2018.

Abstract

As mobile devices are supporting more and more of our daily activities, it is vital to widen their battery up-time as much as possible. In fact, according to the Wall Street Journal, 9/10 users suffer from low battery anxiety. The goal of our work is to understand how Android usage, apps, operating systems, hardware and user habits influence battery lifespan. Our strategy is to collect anonymous raw data from devices all over the world, through a mobile app, build and analyze a large-scale dataset containing real-world, day-to-day data, representative of user practices. So far, the dataset we collected includes 12 million+ (anonymous) data samples, across 900+ device brands and 5,000+ models. And, it keeps growing. The data we collect, which is publicly available and by different channels, is sufficiently heterogeneous for supporting studies with a wide range of focuses and research goals, thus opening the opportunity to inform and reshape user habits, and even influence the development of both hardware and software for mobile devices.

Imgyeong presents Language Models in Software Development Tasks

Imgyeong presents Alizadeh N, Belchev B, Saurabh N, Kelbert P, Castor F. Language Models in Software Development Tasks: An Experimental Analysis of Energy and Accuracy. In: 2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR); 2025. p. 725. doi: 10.1109/MSR66628.2025.00109.

Abstract

The use of generative AI-based coding assistants like ChatGPT and Github Copilot is a reality in contemporary software development. Many of these tools are provided as remote APIs. Using third-party APIs raises data privacy and security concerns for client companies, which motivates the use of locally-deployed language models. In this study, we explore the trade-off between model accuracy and energy consumption, aiming to provide valuable insights to help developers make informed decisions when selecting a language model. We investigate the performance of 18 families of LLMs in typical software development tasks on two real-world infrastructures, a commodity GPU and a powerful AI-specific GPU. Given that deploying LLMs locally requires powerful infrastructure which might not be affordable for everyone, we consider both full-precision and quantized models. Our findings reveal that employing a big LLM with a higher energy budget does not always translate to significantly improved accuracy. Additionally, quantized versions of large models generally offer better efficiency and accuracy compared to full-precision versions of medium-sized ones. Apart from that, not a single model is suitable for all types of software development tasks.

Resources

2025-10-31

Assignment 2 presentations.

Resources

2025-11-07

Project Proposal Presentations

Zhouyiyang will present Mining Email Social Networks

Zhouyiyang will present “Mining Email Social Networks” (Proceedings of the 2006 International Workshop on Mining Software Repositories, MSR 2006, Shanghai, China, May 22-23, 2006) by Christian Bird, Alex Gourley, Premkumar T. Devanbu, Michael Gertz, and Anand Swaminathan.

From https://cabird.com/publications.html: Mining Email Social Networks (2006). Proceedings of the 2006 International Workshop on Mining Software Repositories, MSR 2006, Shanghai, China, May 22-23, 2006. Christian Bird, Alex Gourley, Premkumar T. Devanbu, Michael Gertz, Anand Swaminathan. Most Influential Paper Award (10 years).

TL;DR:

Analyzed Apache developer mailing-list email archives, resolved aliasing, constructed a reply-based social network and matched it to CVS commits to show that email activity and network centrality strongly correlate with source-code contributions and that developers occupy higher-status positions than non-developers.

Topic:

Mining email social networks in open-source software

Problem:

Communication and coordination in software projects are hard to observe in closed settings; the authors aim to use public mailing-list archives to study social interactions, relate them to development activity, and overcome practical challenges such as alias resolution and linking email identities to CVS accounts.

Approach:

Parsed ~101k messages from the Apache HTTP Server developer mailing list (1999 onwards), extracted reply relationships to build a directed social network, resolved aliases using an automated name/email similarity clustering followed by manual post-processing, matched email identities to CVS commit accounts, and computed network measures (in-/out-degree, betweenness) and Spearman/t-test correlations between email activity and source/document change activity.

Key Insights:

Email participation and reply-based in-/out-degree distributions are long-tailed (small-world/scale-free): a few people generate and attract most activity.
There is a very strong correlation between number of messages sent and number of distinct respondents (Spearman ≈ 0.97).
Among committers (n=73), message volume and social-centrality (especially betweenness) strongly correlate with source-code changes (Spearman ≈ 0.80 for messages vs source changes; betweenness ≈ 0.757 with source changes).
Developers have significantly higher centrality than non-developers (large, significant differences in betweenness, in-degree and out-degree); document changes correlate less strongly with social measures than source changes.

Implications:

Public mailing-list archives can be reliably mined (with careful alias resolution) to reveal coordination structures and identify key contributors; social-network metrics from email can serve researchers and project managers as proxies for developer status and activity, help detect brokers or bottlenecks, and guide interventions or further causal/time-series studies linking communication and code evolution.
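The core measurements in the approach (degrees of a directed reply graph, plus Spearman rank correlations between activity measures) can be sketched compactly. The reply data here is made up, and this simple Spearman implementation ignores tie correction, which the real analysis would handle:

```python
from collections import defaultdict

def spearman(xs, ys):
    """Spearman rank correlation (no tie handling, for illustration)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

# Directed reply edges: (replier, original poster) -- hypothetical archive.
replies = [("alice", "bob"), ("alice", "carol"), ("bob", "alice"),
           ("carol", "alice"), ("dave", "alice")]
out_deg, in_deg = defaultdict(int), defaultdict(int)
for src, dst in replies:
    out_deg[src] += 1
    in_deg[dst] += 1

people = ["alice", "bob", "carol", "dave"]
# How strongly does sending activity track being replied to?
print(spearman([out_deg[p] for p in people], [in_deg[p] for p in people]))
```

The paper computes the same kind of correlation between message counts, network centrality (including betweenness), and CVS change activity.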

Resources

2025-11-21

2 Readings

Lukas presents A Comprehensive Study of Autonomous Vehicle Bugs

Lukas will present Joshua Garcia, Yang Feng, Junjie Shen, Sumaya Almanee, Yuan Xia, and Qi Alfred Chen. 2020. A comprehensive study of autonomous vehicle bugs. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE ‘20). Association for Computing Machinery, New York, NY, USA, 385–396. https://doi.org/10.1145/3377811.3380397

Abstract

Self-driving cars, or Autonomous Vehicles (AVs), are increasingly becoming an integral part of our daily life. About 50 corporations are actively working on AVs, including large companies such as Google, Ford, and Intel. Some AVs are already operating on public roads, with at least one unfortunate fatality recently on record. As a result, understanding bugs in AVs is critical for ensuring their security, safety, robustness, and correctness. While previous studies have focused on a variety of domains (e.g., numerical software; machine learning; and error-handling, concurrency, and performance bugs) to investigate bug characteristics, AVs have not been studied in a similar manner. Recently, two software systems for AVs, Baidu Apollo and Autoware, have emerged as frontrunners in the open-source community and have been used by large companies and governments (e.g., Lincoln, Volvo, Ford, Intel, Hitachi, LG, and the US Department of Transportation). From these two leading AV software systems, this paper describes our investigation of 16,851 commits and 499 AV bugs and introduces our classification of those bugs into 13 root causes, 20 bug symptoms, and 18 categories of software components those bugs often affect. We identify 16 major findings from our study and draw broader lessons from them to guide the research community towards future directions in software bug detection, localization, and repair.

Umut presents Striking Gold in Software Repositories? An Econometric Study of Cryptocurrencies on GitHub

Umut will present A. Trockman, R. van Tonder and B. Vasilescu, “Striking Gold in Software Repositories? An Econometric Study of Cryptocurrencies on GitHub,” 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), Montreal, QC, Canada, 2019, pp. 181-185, doi: 10.1109/MSR.2019.00036.

Abstract

Cryptocurrencies have a significant open source development presence on GitHub. This presents a unique opportunity to observe their related developer effort and software growth. Individual cryptocurrency prices are partly driven by attractiveness, and we hypothesize that high-quality, actively-developed software is one of its influences. Thus, we report on a study of a panel data set containing nearly a year of daily observations of development activity, popularity, and market capitalization for over two hundred open source cryptocurrencies. We find that open source project popularity is associated with higher market capitalization, though development activity and quality assurance practices are insignificant variables in our models. Using Granger causality tests, we find no compelling evidence for a dynamic relation between market capitalization and metrics such as daily stars, forks, watchers, commits, contributors, and lines of code changed.
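A Granger causality test asks whether lagged values of one time series improve prediction of another beyond the target's own lags. The paper likely used a standard implementation (e.g. statsmodels' `grangercausalitytests`); the following self-contained sketch with one lag, on synthetic data where x leads y, is for illustration only:

```python
import numpy as np

def granger_f(y, x, lag=1):
    """F-statistic: do lagged x values help predict y beyond y's own lag?"""
    n = len(y)
    Y = y[lag:]
    ones = np.ones(n - lag)
    # Restricted model: y_t ~ const + y_{t-1}
    Xr = np.column_stack([ones, y[:-lag]])
    # Unrestricted model: adds x_{t-1}
    Xu = np.column_stack([ones, y[:-lag], x[:-lag]])
    rss = lambda X: np.sum((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]) ** 2)
    rss_r, rss_u = rss(Xr), rss(Xu)
    p = 1                    # number of added regressors
    k = Xu.shape[1]          # parameters in the unrestricted model
    return ((rss_r - rss_u) / p) / (rss_u / (n - lag - k))

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = np.zeros(300)
for t in range(1, 300):      # x leads y by one step
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

print(granger_f(y, x), granger_f(x, y))  # x->y should dominate y->x
```

A large F in one direction but not the other is the "dynamic relation" the paper tested for between market capitalization and the GitHub metrics, and failed to find.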

Resources

2025-11-28

2 Readings

Balreet presents XBIDetective: Leveraging Vision Language Models for Identifying Cross-Browser Visual Inconsistencies

Balreet will present “XBIDetective: Leveraging Vision Language Models for Identifying Cross-Browser Visual Inconsistencies” by Balreet Grewal, James Graham, Jeff Muizelaar, Jan Honza Odvarko, Suhaib Mujahid, Marco Castelluccio, and Cor-Paul Bezemer.

Abstract

Browser rendering bugs can be challenging to detect for browser developers, as they may be triggered by very specific conditions that are exhibited on only a very small subset of websites. Cross-browser inconsistencies (XBIs), variations in how a website is interpreted and displayed on different browsers, can be helpful guides to detect such rendering bugs. Although visual and Document Object Model (DOM)-based analysis techniques exist for detecting XBIs, they often struggle with dynamic and interactive elements. In this study, we discuss our industry experience with using vision language models (VLMs) to identify XBIs. We present the XBIDetective tool which automatically captures screenshots of a website in Mozilla Firefox and Google Chrome, and analyzes them with a VLM for XBIs. We evaluate XBIDetective’s performance with an off-the-shelf and a fine-tuned VLM on 1,052 websites. We show that XBIDetective can identify cross-browser discrepancies with 79% accuracy and detect dynamic elements and advertisements with 84% and 85% accuracy, respectively, when using the fine-tuned VLM. We discuss important lessons learned, and we present several potential practical use cases for XBIDetective, including automated regression testing, large-scale monitoring of websites, and rapid triaging of XBI bug reports.
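For contrast with the paper's VLM approach, the classic visual-analysis baseline it improves on is a pixel-level diff of the two browsers' screenshots. A toy sketch (the "screenshots" here are small 2D grayscale grids, not real browser captures):

```python
def pixel_diff_ratio(shot_a, shot_b):
    """Fraction of pixels that differ between two equal-sized grayscale screenshots."""
    assert len(shot_a) == len(shot_b) and len(shot_a[0]) == len(shot_b[0])
    total = len(shot_a) * len(shot_a[0])
    diffs = sum(
        1
        for row_a, row_b in zip(shot_a, shot_b)
        for pa, pb in zip(row_a, row_b)
        if pa != pb
    )
    return diffs / total

# Toy "screenshots": identical except a 2-pixel region (e.g., a shifted button).
firefox = [[0, 0, 0, 0], [0, 255, 255, 0], [0, 0, 0, 0]]
chrome  = [[0, 0, 0, 0], [0, 0, 255, 255], [0, 0, 0, 0]]
print(pixel_diff_ratio(firefox, chrome))  # 2 differing pixels out of 12
```

Thresholding this ratio flags XBIs but fires spuriously on ads, animations, and other dynamic content, which is exactly the weakness that motivates the paper's VLM-based detection.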

Akalanka presents Detecting and Fixing API Misuses of Data Science Libraries Using Large Language Models

Akalanka presents: Akalanka Galappaththi, Francisco Ribeiro, and Sarah Nadi. 2025. Detecting and Fixing API Misuses of Data Science Libraries Using Large Language Models. In Proceedings of the CASCON 2025 Conference. University of Alberta, Edmonton, Canada; New York University Abu Dhabi, Abu Dhabi, United Arab Emirates.

Abstract

Data science libraries, such as scikit-learn and pandas, specialize in processing and manipulating data. The data-centric nature of these libraries makes the detection of API misuse in them more challenging. This paper introduces DSCHECKER, an LLM-based approach designed for detecting and fixing API misuses of data science libraries. We identify two key pieces of information, API directives and data information, that may be beneficial for API misuse detection and fixing. Using three LLMs and misuses from five data science libraries, we experiment with various prompts. We find that incorporating API directives and data-specific details enhances DSCHECKER’s ability to detect and fix API misuses, with the best-performing model achieving a detection F1-score of 61.18% and fixing 51.28% of the misuses. Building on these results, we implement DSCHECKERagent which includes an adaptive function calling mechanism to access information on demand, simulating a real-world setting where information about the misuse is unknown in advance. We find that DSCHECKERagent achieves 48.65% detection F1-score and fixes 39.47% of the misuses, demonstrating the promise of LLM-based API misuse detection and fixing in real-world scenarios.
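The detection F1-scores reported above are the harmonic mean of precision and recall over flagged misuses. A minimal helper showing the computation (the confusion counts below are illustrative, not taken from the paper):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative: 8 misuses correctly flagged, 2 false alarms, 4 missed.
print(f1_score(8, 2, 4))  # precision 0.8, recall 2/3 -> F1 = 8/11
```

F1 is the standard choice here because misuse detection is class-imbalanced: plain accuracy would look high even for a detector that flags nothing.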

Resources

2025-12-05