Author ORCID Identifier

Audris Mockus https://orcid.org/0000-0002-7987-7598

Rui Abreu https://orcid.org/0000-0003-3734-3157

Document Type

Article

Publication Date

2025

DOI

https://doi.org/10.1145/3722216

Abstract

Changing software is essential to add needed functionality and to fix problems, but changes may introduce defects that lead to outages. This motivates one of the oldest software quality control techniques: the code freeze, a temporary prevention of non-critical changes to the codebase. Despite its widespread use in practice, the research literature on it is scant. Historically, code freezes were used to improve software quality by preventing changes during the periods before software releases, but they significantly slow down development. To address this shortcoming, we develop and evaluate a family of code un-freeze (permitting changes) strategies tailored to different occasions and products at Meta. They are designed to un-freeze the maximum amount of code without compromising quality. The three primary dimensions of un-freezing are a) the exact timing of the code freezes and the reasoning behind it, b) the parts of the organization or the codebase to which the freeze is applied, and c) the method of screening code diffs during the freeze, with the aim of allowing low-risk diffs and preventing only the riskiest ones.

To operationalize the drivers of outages, we consider the entire network of interdependencies among different parts of the source code, the engineers who modify the code, code complexity, coordination dependencies, and authors’ expertise. Since the code freeze is a balancing act between reducing outages and allowing software development to proceed unimpeded, we evaluate each approach to code un-freeze with two measures: the fraction of flagged/gated changes, which measures overhead, and the fraction of all outage-causing changes contained within the set of flagged changes, which measures the ability of the code un-freeze to delay (or prevent) outages. We found that taking into account the risk posed by modifying individual files and the properties of the change allowed us to un-freeze two and 2.5 times more changes, respectively.
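The two evaluation measures described above can be illustrated with a minimal sketch. It is not the model used at Meta: the `Change` record, the `risk_score` field, the threshold rule, and the sample data are all hypothetical and serve only to show how the gated fraction (overhead) and the share of outage-causing changes that fall inside the gated set might be computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    change_id: str
    risk_score: float    # hypothetical model output in [0, 1]
    caused_outage: bool  # label assigned after the fact


def evaluate_gating(changes, threshold):
    """Return (gated_fraction, outage_capture_fraction).

    A change is treated as gated when its risk score is at or above
    the threshold (an assumed rule, chosen only for illustration).
    """
    if not changes:
        raise ValueError("need at least one change")
    gated = [c for c in changes if c.risk_score >= threshold]
    outage_changes = [c for c in changes if c.caused_outage]
    gated_fraction = len(gated) / len(changes)
    if not outage_changes:
        # Capture is undefined when no change caused an outage.
        return gated_fraction, float("nan")
    captured = sum(1 for c in outage_changes if c.risk_score >= threshold)
    return gated_fraction, captured / len(outage_changes)


# Made-up sample data: five changes, two of which caused outages.
sample = [
    Change("a", 0.9, True),
    Change("b", 0.8, False),
    Change("c", 0.3, True),
    Change("d", 0.2, False),
    Change("e", 0.1, False),
]
gated_fraction, capture = evaluate_gating(sample, threshold=0.5)
```

Lowering the threshold gates more changes (more overhead) and can only keep or raise the capture fraction, which is the trade-off the abstract describes.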

The change-level model is used by Meta in production. For example, during the winter 2023 code freeze, only 16% of changes were gated. Although 42% more changes landed (were integrated into the codebase) compared to the prior year, there was a 52% decrease in outages. This reduction meant less impact on users and less strain on engineers during the holiday period. The risk model has been enormously effective at allowing low-risk changes to proceed while gating high-risk changes and reducing outages.
