CMPUT660F25 Assignment 1

2025/08/29

Please note that the mining challenge is not released yet.

You can do the 2023 mining challenge and combine it with the 2026 mining challenge. This assignment will be updated once the new mining challenge is released.

https://2026.msrconf.org/committee/msr-2026-mining-challenge

1 Assignment 1

1.1 When: October 3rd 5pm

1.2 Who: Just you, with some consultation

1.3 Why: Get an introduction to sifting through Software Repositories

I want you to become comfortable with the data and the repositories that exist, in particular the dataset for assignment 2 and the project. You may use some of your results from here in the project. You need hands-on experience with this data. In many cases it is so large that you cannot do the work at the last minute. You should start the moment you commit to taking the class.

1.4 What: We will use the MSR 2023 mining challenge data and tools and World of Code dataset

https://conf.researchr.org/track/msr-2023/msr-2023-mining-challenge#Call-for-Mining-Challenge-Papers

You will have to sample and convert this data to some format that you are comfortable with and then walk through it. Scripting languages with regexes are great for this, but properly parsing XML is good too. SQLite is a great database format to use because it allows you to transmit the entire database in one file.
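As a minimal sketch of the conversion step, the following loads a few JSON-lines records into SQLite and runs one query. The record fields (`sha`, `author`, `time`, `message`) and the `commits` table are assumptions made for illustration; the real challenge data will have its own schema.

```python
import json
import sqlite3

# Hypothetical sample records; the real challenge data will have its own fields.
SAMPLE_JSONL = """\
{"sha": "a1", "author": "alice", "time": "2021-03-04T10:00:00", "message": "Fix #12"}
{"sha": "b2", "author": "bob", "time": "2021-03-05T11:30:00", "message": "Add parser"}
{"sha": "c3", "author": "alice", "time": "2021-04-01T09:15:00", "message": "Update docs"}
"""


def load_commits(conn, lines):
    """Create a commits table (assumed schema) and insert one row per JSON line."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS commits ("
        "sha TEXT PRIMARY KEY, author TEXT, time TEXT, message TEXT)"
    )
    rows = [json.loads(line) for line in lines if line.strip()]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO commits VALUES (:sha, :author, :time, :message)",
            rows,
        )


# ":memory:" keeps the example self-contained; pass a file path instead
# (for example "sample.db") to get a single shareable database file.
conn = sqlite3.connect(":memory:")
load_commits(conn, SAMPLE_JSONL.splitlines())

commits_per_author = conn.execute(
    "SELECT author, COUNT(*) FROM commits GROUP BY author ORDER BY author"
).fetchall()
print(commits_per_author)
```

Once the data is in a table like this, most of the size questions below reduce to short SQL queries.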

Consultation notes: You may share parsed databases. That is, if you have translated the challenge data into an RDBMS like PostgreSQL or SQLite, you are free to share the raw translated data (please cite each other). You can share tools. You cannot share the writeups. You must cite all your sources, resources, and people.

1.4.1 MSR 2023 Mining Challenge

The 2023 MSR Mining Challenge will be this dataset:

This is a lot of meta-data about commits.

1.4.2 World of Code [MSR 2023 Hackathon?]

World of Code is a huge network of version control commits from Git. It contains billions of commits and Git blobs, all interlinked across forks.

The dataset is so large that you have to work on the VMs its maintainers make available to you. Because of these barriers to access, anything you learn from it is easier to publish. This is high risk and high reward.

1.5 Questions

1.5.1 Briefly describe the schemas of the data held within the challenge dataset

  1. Question: What is not in the schema that you expected to be there?
    Very briefly describe how you would get this information.

  2. Question: What do you think are questions that are easy to answer with this dataset?

  3. Question: What do you think are questions that are hard to answer with this dataset?

1.5.2 Size metrics

  1. Question: What is the size of the dataset?
    • number of entities (this includes issue reports and other provided metadata)

      • projects? commits? patches? What is in there?
    • number of authors [people]

    • number of files or blobs

    • sizes of files or blobs

    • vocabulary (number of unique tokens) of blobs or files

      • Might not be possible for WOC
    • summary statistics of the entities and their sizes. These stats include:

      • number of lines

      • number of blocks

      • number of entities

    • Essentially, for each dataset and database, give me summary statistics about them.
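The size metrics above can be sketched with the standard library alone. This is only an illustration: the three blob texts are made-up stand-ins for sampled data, and the `\w+` tokenizer is one simple assumption about what counts as a token.

```python
import re
import statistics


def summarize(values):
    """Return basic summary statistics for a list of at least two numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
    }


def vocabulary(texts):
    """Return the set of unique tokens, using a simple \\w+ tokenizer (an assumption)."""
    tokens = set()
    for text in texts:
        tokens.update(re.findall(r"\w+", text))
    return tokens


# Made-up blob contents standing in for a sample from the dataset.
blobs = [
    "int main() { return 0; }",
    "print('hi')\nprint('bye')\n",
    "x = 1\n",
]

line_counts = [len(text.splitlines()) for text in blobs]
byte_sizes = [len(text.encode("utf-8")) for text in blobs]

line_stats = summarize(line_counts)
size_stats = summarize(byte_sizes)
vocab = vocabulary(blobs)

print(line_stats)
print(size_stats)
print(len(vocab), "unique tokens")
```

The same `summarize` helper can be applied to any per-entity measurement (lines, bytes, commits per author), which keeps the tables in the writeup consistent.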

1.5.3 Question: Plot boxplots / histograms of entities

1.5.4 Traceability

You should download some blobs, parse them and summarize them. You can use sampling!
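One way to sample when the full set of blobs is too large to hold in memory is reservoir sampling, which draws a uniform sample of fixed size in a single pass. The sketch below assumes blob identifiers arrive as an iterable; the identifier format shown is made up.

```python
import random


def reservoir_sample(items, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Hypothetical stream of blob identifiers.
blob_ids = (f"blob-{n}" for n in range(100_000))
picked = reservoir_sample(blob_ids, k=10)
print(picked)
```

Recording the seed and the sample size in the writeup makes the sampled results reproducible.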

  1. Question: How many text blobs are there?

  2. Question: How many foreign URLs are there?
    In the available texts in the datasets, what kinds of URLs are there?

    • e.g. commit messages or issues
  3. Question: Are there blobs, snippets, or entities that contain more than one language?
    You can answer this for natural languages or computer languages.

  4. Question: Please plot entities over time
    For the important entities in the dataset please show how many appear over time.

    • For WOC you can sample.
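For the URL and time questions above, a rough sketch: extract URLs from free text with a regex, tally them by host, and bucket timestamps by month. The sample messages and timestamps are made up, the regex is deliberately simple and will miss some URL forms, and ISO 8601 timestamps are an assumption about the data.

```python
import re
from collections import Counter
from datetime import datetime
from urllib.parse import urlparse

# Simple pattern: http(s) scheme, then everything up to whitespace or a closing bracket.
URL_RE = re.compile(r"https?://[^\s<>)\]]+")


def url_hosts(texts):
    """Count URLs per host across an iterable of text fields."""
    hosts = Counter()
    for text in texts:
        for url in URL_RE.findall(text):
            # Strip punctuation that commonly trails a URL in prose.
            host = urlparse(url.rstrip(".,;\"'")).hostname
            if host:
                hosts[host] += 1
    return hosts


def counts_by_month(timestamps):
    """Count entities per YYYY-MM bucket, assuming ISO 8601 timestamp strings."""
    months = Counter()
    for ts in timestamps:
        months[datetime.fromisoformat(ts).strftime("%Y-%m")] += 1
    return months


# Made-up commit messages and timestamps standing in for sampled data.
messages = [
    "Fixes https://github.com/org/repo/issues/1.",
    "See http://stackoverflow.com/q/123 and https://github.com/org/repo",
    "no links here",
]
times = ["2021-03-04T10:00:00", "2021-03-05T11:30:00", "2021-04-01T09:15:00"]

hosts = url_hosts(messages)
monthly = counts_by_month(times)
print(hosts.most_common())
print(sorted(monthly.items()))
```

The per-host and per-month counters can be fed directly to whatever plotting tool you use for the figures.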

1.6 Notes

1.6.1 It’s too big, I’ll never finish

1.6.2 What do you mean by summary statistics

1.6.3 What format do you want?

1.6.4 How?

1.6.5 How many pages?

Given that you should be making plots and tables for almost every question, if it is less than 4 pages I doubt you’ve been thorough enough.

1.7 Rubric

1.7.1 General Rubric

1.7.2 Detailed Rubric

Here’s a rubric of expectations:

660-assignment1-rubric.pdf

Author: Abram