CMPUT660F25 Assignment 1

Please note that the mining challenge is not released yet.

You can do the 2023 mining challenge and combine it with the 2026 mining challenge. This assignment will be updated once the new mining challenge is released.

https://2026.msrconf.org/committee/msr-2026-mining-challenge

1 Assignment 1

1.1 When: October 3rd 5pm

1.2 Who: Just you, with some consultation

1.3 Why: Get an introduction to sifting through Software Repositories

I want you to become comfortable with the data and the repositories that exist, in particular the dataset for assignment 2 and the project. You may use some of your results from here in the project. You need hands-on experience with this data. In many cases it is so large that you cannot process it at the last minute, so you should start the second you commit to taking the class.

1.4 What: We will use the MSR 2023 mining challenge data and tools and the World of Code dataset

https://conf.researchr.org/track/msr-2023/msr-2023-mining-challenge#Call-for-Mining-Challenge-Papers

Minimum requirement: Sample and describe World of Code GSSC data version U Dataset U.
Optimal requirement: Describe the entire World of Code GSSC data version U Dataset U.

You will have to sample this data, convert it to a format that you are comfortable with, and then walk through it. Scripting languages with regexes are great for this, but properly parsing XML is good too. SQLite is a great database format to use because it lets you transmit the entire database in one file.
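The sample-and-convert step can be sketched as follows. This is only a minimal illustration, assuming a hypothetical tab-separated sample of commit records (hash, author, Unix timestamp). The column layout, the table name `commits`, and the helper `load_commits` are assumptions for the example, not the actual challenge schema.

```python
import sqlite3

# Hypothetical sample rows: commit hash, author, Unix timestamp (tab-separated).
# The real challenge data will have a different layout; adjust the parsing.
SAMPLE_LINES = [
    "a1b2c3\tAlice <alice@example.com>\t1672531200",
    "d4e5f6\tBob <bob@example.com>\t1675209600",
    "0718aa\tAlice <alice@example.com>\t1677628800",
]


def load_commits(lines, db_path=":memory:"):
    """Parse tab-separated commit records and store them in a SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS commits ("
        "hash TEXT PRIMARY KEY, author TEXT, committed_at INTEGER)"
    )
    rows = []
    for line in lines:
        commit_hash, author, timestamp = line.rstrip("\n").split("\t")
        rows.append((commit_hash, author, int(timestamp)))
    # INSERT OR IGNORE makes reloading the same sample harmless.
    conn.executemany("INSERT OR IGNORE INTO commits VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


conn = load_commits(SAMPLE_LINES)
commit_count = conn.execute("SELECT COUNT(*) FROM commits").fetchone()[0]
author_count = conn.execute(
    "SELECT COUNT(DISTINCT author) FROM commits"
).fetchone()[0]
print(commit_count, author_count)
```

Passing a file name instead of `":memory:"` as `db_path` writes the whole database to a single file that can be shared.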

Consultation notes: You may share parsed databases. That is, if you have translated the challenge data into an RDBMS like PostgreSQL or SQLite, you are free to share the raw translated data (please cite each other). You can share tools. You cannot share the writeups. You must cite all your sources, resources, and people.

1.4.1 MSR 2023 Mining Challenge

The 2023 MSR Mining Challenge will be this dataset:

https://conf.researchr.org/track/msr-2023/msr-2023-mining-challenge#Call-for-Mining-Challenge-Papers

This is a lot of meta-data about commits.

1.4.2 World of Code [MSR 2023 Hackathon?]

World of Code is a huge network of version control commits from Git. It contains billions of commits and Git blobs, all interlinked across forks.

The dataset is so large that you have to work on the VMs made available to you. Because of the barriers to access, anything you learn is easy to publish. This is high risk and high reward.

Description of World of Code: https://mockus.org/papers/WoC_EMSE.pdf https://github.com/woc-hack/msr-challenge/blob/master/MSR_Challange.pdf
More details about World of Code and its REST API: https://github.com/woc-hack/msr-challenge
Tutorial on how to get access to World of Code: https://github.com/woc-hack/tutorial

1.5 Questions

1.5.1 Briefly describe the schemas of the data held within the challenge dataset

Question: What is not in the schema that you expected to be there?
Very briefly describe how you would get this information.
Question: What do you think are questions that are easy to answer with this dataset?
Question: What do you think are questions that are hard to answer with this dataset?

1.5.2 Size metrics

Question: What is the size of the dataset?
- number of entities (this includes issue reports and other provided metadata)
  - projects? commits? patches? What is in there?
- number of authors [people]
- number of files or blobs
- sizes of files or blobs
- vocabulary (number of unique tokens) of blobs or files
  - Might not be possible for WOC
- summary statistics of the entities and their sizes. These stats include:
  - number of lines
  - number of blocks
  - number of entities
- Essentially for each dataset and database can you give me summary statistics about them.

1.5.3 Question: Plot boxplots / histograms of entities

Please use boxplots, violin plots, or histograms to describe the distributions of the number, count, or properties of entities.
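One way to prepare histogram data is to bin values on a log scale before plotting. The sketch below assumes that sizes are non-negative integers and that they are heavy-tailed enough for power-of-two bins to be useful, which is an assumption about the data rather than a fact from the challenge. The helper `log2_histogram` and the sample sizes are hypothetical.

```python
from collections import Counter


def log2_histogram(sizes):
    """Count integer sizes per power-of-two bin; bin k holds [2**k, 2**(k+1))."""
    bins = Counter()
    for size in sizes:
        if size <= 0:
            bins[-1] += 1  # hypothetical convention: bin -1 collects empty items
        else:
            bins[size.bit_length() - 1] += 1
    return dict(sorted(bins.items()))


# Hypothetical blob sizes in bytes, for illustration only.
blob_sizes = [0, 1, 3, 100, 120, 130, 900, 1000, 1024, 5000, 70000]
histogram = log2_histogram(blob_sizes)
for k, count in histogram.items():
    label = "empty" if k < 0 else f"[{2**k}, {2**(k+1)})"
    print(f"{label:>16} {'#' * count}")
```

The resulting bin counts can be handed to whichever plotting tool you prefer; the text bars printed here are only a quick sanity check.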

1.5.4 Traceability

You should download some blobs, parse them and summarize them. You can use sampling!

Question: How many text blobs are there?
Question: How many foreign URLs are there?
What kinds of URLs appear in the texts available in the datasets?
- e.g. commit messages or issues
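A minimal sketch of URL extraction, assuming the texts are already available as Python strings. The regular expression is deliberately simple and will miss or mangle some URLs, and grouping by host name is just one possible way to decide which URLs count as foreign. The helper `url_hosts` and the sample messages are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Deliberately simple pattern: http(s) URLs up to whitespace, a quote,
# an angle bracket, or a closing bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def url_hosts(texts):
    """Return a Counter of host names for every URL found in the given texts."""
    hosts = Counter()
    for text in texts:
        for url in URL_PATTERN.findall(text):
            # Strip trailing sentence punctuation before parsing.
            host = urlparse(url.rstrip(".,;")).netloc.lower()
            if host:
                hosts[host] += 1
    return hosts


# Hypothetical commit messages, for illustration only.
messages = [
    "Fix crash, see https://github.com/example/repo/issues/12.",
    "Docs moved to http://example.org/docs (mirror: https://github.com/example/docs)",
    "No links here",
]
hosts = url_hosts(messages)
print(hosts.most_common())
```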
Question: Are there blobs, snippets, or entities that contain more than one language?
You can answer this for natural languages or computer languages.
Question: Please plot entities over time
For the important entities in the dataset please show how many appear over time.
- WOC can sample.
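Counting entities over time can be sketched by bucketing timestamps into months. This assumes each entity carries a Unix timestamp in seconds, which may not match how the datasets store time. The helper `counts_per_month` and the sample values are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def counts_per_month(timestamps):
    """Bucket Unix timestamps (seconds) into 'YYYY-MM' bins, in UTC."""
    months = Counter()
    for ts in timestamps:
        moment = datetime.fromtimestamp(ts, tz=timezone.utc)
        months[moment.strftime("%Y-%m")] += 1
    return dict(sorted(months.items()))


# Hypothetical commit times, for illustration only.
commit_times = [1672531200, 1672617600, 1675209600, 1677628800, 1677715200]
monthly = counts_per_month(commit_times)
print(monthly)
```

The sorted month-to-count mapping can be plotted directly as a time series.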

1.6 Notes

1.6.1 It’s too big, I’ll never finish

Consider sampling.
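When the data is too large to hold in memory or to count in advance, reservoir sampling is one option. The sketch below uses the classic single-pass approach (often called Algorithm R) to draw a uniform sample of fixed size from a stream. The helper name, the fixed seed, and the stream of identifiers are assumptions for illustration.

```python
import random


def reservoir_sample(items, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for index, item in enumerate(items):
        if index < k:
            sample.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = item
    return sample


# Hypothetical stream of commit identifiers, for illustration only.
stream = (f"commit-{i}" for i in range(100_000))
picked = reservoir_sample(stream, k=5)
print(picked)
```

If the stream has fewer than `k` items, the function simply returns all of them. Report the seed and sample size in your writeup so the sample can be reproduced.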

1.6.2 What do you mean by summary statistics

At the bare minimum we need Mean, Median, Standard Deviation, Variance.
- Skew, Kurtosis and other statistical moments are great too.
- A boxplot is even better.
  - Or a CDF/PDF plot
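These statistics can be computed with the Python standard library, as in the minimal sketch below. It assumes at least two numeric values, and it uses population (biased) moment estimators for skewness and excess kurtosis; other conventions exist and give slightly different numbers. The helper `summarize` and the sample data are hypothetical.

```python
import statistics


def summarize(values):
    """Return basic summary statistics plus moment-based skewness and kurtosis."""
    n = len(values)
    mean = statistics.fmean(values)
    pstdev = statistics.pstdev(values)
    # Population (biased) central moments; other conventions exist.
    third = sum((v - mean) ** 3 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "variance": statistics.variance(values),  # sample variance
        "skew": third / pstdev ** 3 if pstdev else 0.0,
        "excess_kurtosis": fourth / pstdev ** 4 - 3 if pstdev else 0.0,
    }


# Hypothetical lines-per-file counts, for illustration only.
lines_per_file = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize(lines_per_file)
for name, value in summary.items():
    print(f"{name:>16}: {value}")
```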

1.6.3 What format do you want?

I want a terse PDF document from you.

1.6.4 How?

Email and Canvas.
Submit it to Canvas and email it to me as well.
Use a subject like “[CMPUT660] Assignment 1 submission” in the email.

1.6.5 How many pages?

Given that you should be making plots and tables for almost every question, if it is less than 4 pages I doubt you’ve been thorough enough.

1.7 Rubric

1.7.1 General Rubric

10 Excellent
- Hits most of the excellent column of the rubric. Thorough, meets requirements.
8 Good
- Hits most of the good and excellent columns of the rubric. Thorough, meets requirements. Missing some components.
6 Satisfactory
- Hits most of the satisfactory and good columns of the rubric. Missing some portions, cursory. Not thorough.
5 Unsatisfactory
- Missing many components but there was sufficient effort to warrant some marks.
0 Failure
- Not enough effort displayed to suggest merit.

1.7.2 Detailed Rubric

Here’s a rubric of expectations:

660-assignment1-rubric.pdf

Author: Abram