Orcid ID

https://orcid.org/0000-0003-4408-1183

Date of Award

5-2025

Degree Type

Dissertation

Degree Name

Doctor of Philosophy

Major

Computer Science

Major Professor

Audris Mockus

Committee Members

Jian Huang, Doowon Kim, Russell Zaretzki

Abstract

This dissertation investigates copy-based reuse in open source software (OSS) supply chains, emphasizing its identification, analysis, and potential impacts.

First, we develop a novel algorithm to identify copy-based reuse by detecting whole-file copying across the global OSS ecosystem. Leveraging the World of Code infrastructure, we generate a large-scale map of copy-based reuse instances, providing a foundation for future research and tool development to support reuse practices and mitigate associated risks.

Next, we analyze the prevalence, patterns, and motivations behind copy-based reuse. By integrating large-scale reuse detection with developer surveys, we find that copy-based reuse is widespread and varies by programming language, resource type, and project size. Popular projects drive substantial reuse activity, yet more than half of copied resources originate from small and medium-sized projects. Developers cite diverse motivations for copying code, including convenience and trust, while expressing a preference for package managers when feasible.

Our first case study examines the implications of copy-based reuse for OSS license compliance. We construct a copy-based code reuse network and quantify potential license noncompliance across the OSS ecosystem. Our analysis reveals that projects with permissive licenses, such as MIT and Apache, experience higher reuse rates, whereas copyleft licenses, like GPL, yield mixed effects. Alarmingly, 39.4% of reuse instances present a risk of noncompliance, particularly when license information is absent or ambiguous.

The second case study investigates the impact of copy-based reuse on LLM pretraining datasets. We propose an automated source code autocuration technique that utilizes OSS version histories to detect and filter outdated, buggy, and non-compliant code. Evaluating this approach on "The Stack" v2 dataset, we find that 17% of code samples have newer versions, with 17% of these updates addressing bugs, including known vulnerabilities (CVEs). Additionally, we identify serious compliance risks from misidentified blob origins, which introduce non-permissively licensed code into training datasets.

Collectively, this work provides novel insights and practical contributions to understanding and managing copy-based reuse in OSS supply chains. It offers foundational tools and datasets to advance research, informs policy on software licensing practices, and proposes methods to enhance the quality and compliance of AI model training datasets.

Recommended Citation

Jahanshahi, Mahmoud, "Copy-Based Reuse and its Implications in Open Source Software Supply Chains." PhD diss., University of Tennessee, 2025.
https://trace.tennessee.edu/utk_graddiss/12374

Included in

Artificial Intelligence and Robotics Commons, Data Science Commons, Software Engineering Commons
