Sourcegraph went dark

Towards the end of my mid-2019 job search, I was down to joining the Google Go team or Sourcegraph. Sourcegraph ultimately won due to cultural factors - the most important of which was the ability to build 100% in the open. All documents were public by default. Technical and product RFCs (and later PR/FAQs) were drafted, reviewed, and catalogued in a public Google Drive folder. All product implementation was done in public GitHub repositories.

Today, the sourcegraph/sourcegraph repository went private. This is the final cleaving blow, following many other smaller chops, to the culture that made Sourcegraph an attractive place to work. It’s a decision for a business from which I resigned, and in which I therefore have no voice. But I still lament the now-rocky accessibility of artifacts representing four years of genuine effort put into a product that I loved (and whose use I miss daily in my current role).

On the bright side, I’ve cemented my place on the insights leaderboard for the remainder of time.

Contributor leaderboard
Sourcegraph has made their future development repository private, but it seems they've left a public snapshot available at sourcegraph/sourcegraph-public-snapshot for the time being.

Keeping references alive

Over my tenure at Sourcegraph I’ve done a fair bit of writing for the engineering blog, which I’ve inlined into this website for stable reference. It’s interesting to see what people are trying to build and, for an engineer, how they’re trying to build it. Much of my writing used links into relevant public code as a reference.

All of these links are now broken.

There’s a common saying that cool URIs don’t change. In a related sense, I have the hot take that cool articles don’t suddenly start rotting links. At least one of these best practices is going to be broken here, and I can’t do anything about the first one. So I’ll attempt to preserve as much of the information in this writing as possible by moving these links into a repository under my influence.

I'm opting to bite the bullet now and move references to something completely under my control rather than kick the can down the road by referencing another repository that _could_ suddenly disappear at any time.

I had a feeling a while ago that this would become a risk, so I had already forked sourcegraph/sourcegraph into efritz/sourcegraph in preparation. Given the fork, it should be an easy enough job to do a global find-and-replace of one repository name with the other at this point and call it mission accomplished, right?
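
Concretely, the blind swap I had in mind would be something like this (a sketch, run from the root of wherever these article sources live, in the same grep-and-perl style I end up using later):

# Swap the old repository name for the fork in every article source file.
grep -rl 'github.com/sourcegraph/sourcegraph' . | \
xargs -I {} perl -i -pe 's|github.com/sourcegraph/sourcegraph|github.com/efritz/sourcegraph|g' {}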

Unfortunately, no. I had links to code on the main branch, but also links to pull requests and to commits within pull requests. Forks don’t inherit pull requests (problem #1). And commits not directly referenced by a branch of your fork are visible only as long as they’re part of the repository network (problem #2).

Non-local commit warning

I had wondered what happens to forks when a repository is deleted or changes visibility and found some calming information in the official GitHub documentation:

In other words, a public repository’s forks will remain public in their own separate repository network even after the upstream repository is made private. This allows the fork owners to continue to work and collaborate without interruption. […] If a public repository is made private and then deleted, its public forks will continue to exist in a separate network.

My fork will continue to exist (yay), but the source repository becoming inaccessible might take commits outside of the main branch with it. I need to ensure that these commits are part of the new repository network.

Scraping for relevant commits

Step one is to find all the commits I care about. I ran the following Go program to iterate through all of my pull requests on the source repository and write their payloads to disk for further processing.

package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"os"
	"time"

	"github.com/google/go-github/v63/github"
)

const (
	owner      = "sourcegraph"
	repo       = "sourcegraph"
	targetUser = "efritz"
	token      = "ghp_pls_dont_hax_me"
)

func main() {
	ctx := context.Background()

	if err := scrapePRs(ctx); err != nil {
		log.Fatalf("Error: %v", err)
	}
}

// scrapePRs pages through every pull request in the repository and writes the
// raw payload of each one authored by targetUser into the prs/ directory
// (which is assumed to already exist).
func scrapePRs(ctx context.Context) error {
	client := github.NewClient(nil).WithAuthToken(token)

	page := 1 // GitHub list pagination is 1-indexed
	for {
		fmt.Printf("Requesting page #%d...\n", page)

		prs, resp, err := client.PullRequests.List(
			ctx,
			owner,
			repo,
			&github.PullRequestListOptions{
				State: "all",
				ListOptions: github.ListOptions{
					Page:    page,
					PerPage: 100,
				},
			},
		)
		if err != nil {
			// If we were rate limited, sleep until the limit resets and
			// retry the same page.
			if resp != nil && !resp.Rate.Reset.Time.IsZero() {
				time.Sleep(time.Until(resp.Rate.Reset.Time))
				continue
			}

			return err
		}
		if len(prs) == 0 {
			break
		}

		for _, pr := range prs {
			// Only keep pull requests that I authored.
			if *pr.User.Login != targetUser {
				continue
			}

			fmt.Printf("Saving %d: %s\n", *pr.ID, *pr.Title)
			serialized, err := json.Marshal(pr)
			if err != nil {
				return err
			}
			filename := fmt.Sprintf("prs/%d.json", *pr.ID)
			if err := os.WriteFile(filename, serialized, 0777); err != nil {
				return err
			}
		}

		page++
	}

	return nil
}

This program yielded 2,645 files with pull request metadata. I then used jq to read these JSON payloads and extract data for subsequent steps.

for file in prs/*.json; do
	number=$(jq -r '.number' "$file")
	merge_commit_sha=$(jq -r '.merge_commit_sha // ""' "$file")

	echo "$number" >> pr_ids.txt
 	echo "$merge_commit_sha" >> commits.txt
	echo "$number $merge_commit_sha" >> replace_pairs.txt
done

This script creates three files:

  • pr_ids.txt is a flat list of pull request numbers, the identifiers used in URLs. Since the list endpoint returns only enough data to render a pull request list, we’ll need to fetch additional information (intermediate commits) for each pull request by its number.
  • commits.txt is a flat list of git SHAs that resulted from merging a PR into the target branch (not always main). These commits may or may not be in the forked repository network, depending on the merge target. These should be synced over.
  • replace_pairs.txt contains pairs of pull request number and merge commit SHA. These will later be used to mass replace /pull/{id} with /commit/{sha}. Since pull requests can’t be linked to directly anymore, I can at least link to the full pull request contents.

Next, I ran a second program (with the same preamble as the program above, plus a "strings" import) to list all the non-merge commits of each pull request. Based on the pants-on-head way I work, these will mostly be WIP commits, but sometimes I did a better job and (possibly) linked directly to them.

// extractCommits reads the pull request numbers scraped earlier and prints
// the SHA of every commit attached to each pull request. It shares the
// preamble above, with the addition of the "strings" import.
func extractCommits(ctx context.Context) error {
	contents, err := os.ReadFile("pr_ids.txt")
	if err != nil {
		return err
	}

	// Parse one pull request number per line.
	var ids []int
	for _, line := range strings.Split(string(contents), "\n") {
		if line == "" {
			continue
		}

		var id int
		_, _ = fmt.Sscanf(line, "%d", &id)
		ids = append(ids, id)
	}

	client := github.NewClient(nil).WithAuthToken(token)

	for _, id := range ids {
		for {
			commits, resp, err := client.PullRequests.ListCommits(
				ctx,
				owner,
				repo,
				id,
				&github.ListOptions{},
			)
			if err != nil {
				// If we were rate limited, sleep until the limit resets
				// and retry this pull request.
				if resp != nil && !resp.Rate.Reset.Time.IsZero() {
					time.Sleep(time.Until(resp.Rate.Reset.Time))
					continue
				}

				return err
			}

			for _, commit := range commits {
				fmt.Println(*commit.SHA)
			}

			break
		}
	}

	return nil
}

Running go run . >> commits.txt dumped these commits onto the end of the file and completed the set of git SHAs that need to be brought into the fork’s repository network for stable reference.
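
Since unmerged pull requests leave blank lines in commits.txt (the jq step writes an empty string when merge_commit_sha is missing), and the same SHA could in principle show up more than once, a quick cleanup pass is cheap insurance before fetching anything:

# Drop blank lines and duplicate SHAs from the combined list.
grep -v '^$' commits.txt | sort -u > commits.tmp && mv commits.tmp commits.txt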

Bringing commits into the new repository network

Given the warning above (“does not belong to any branch on this repository”), it should be sufficient to ensure that my fork has a branch containing each relevant SHA I’d like to retain access to.

Bash here does a good enough job since all we’re doing is a bunch of git operations in sequence.

#!/bin/bash

for SHA in $(cat commits.txt); do
    git fetch upstream $SHA             # Pull SHA from sg/sg
    git checkout -b "mirror/$SHA" $SHA  # Create reference in fork
    git push origin "mirror/$SHA"       # Push branch to efritz/sg
    git checkout main                   # Reset
    git branch -D "mirror/$SHA"         # Cleanup
done
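
One note on this approach: creating and deleting a branch for every SHA means a working tree checkout each time. A variant that avoids touching the working tree (assuming the same upstream and origin remotes as above) is to push each fetched object directly to a new ref on the fork; a sketch:

#!/bin/bash

for SHA in $(cat commits.txt); do
    git fetch upstream "$SHA"                      # Pull SHA from sg/sg
    git push origin "$SHA:refs/heads/mirror/$SHA"  # Create the branch remotely
done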

Rewriting references

At this point I should be safe and have some target to link to in my fork for each reference to a pull request or commit in the source repository. Now I just have to figure out how to automate the rewrite (there are at least 275 code references across 15 files, and I’m not doing that by hand).

Ironically, I used my own thing instead of Cody to figure out how to use xargs correctly for this task.

#!/bin/bash

sg_prefix='https://github.com/sourcegraph/sourcegraph'
fork_prefix='https://github.com/efritz/sourcegraph'

# Rewrite direct references to commits to the fork
grep -rl "${sg_prefix}/commit/" . | \
xargs -I {} perl -i -pe "s|${sg_prefix}/commit/|${fork_prefix}/commit/|g" {}

# Rewrite references to pull requests to their merge commit in the fork.
# The \b keeps /pull/123 from also matching the start of /pull/1234.
while IFS=' ' read -r id sha; do
    grep -rl "${sg_prefix}/pull/${id}" . | \
    xargs -I {} perl -i -pe "s|${sg_prefix}/pull/${id}\b|${fork_prefix}/commit/${sha}|g" {}
done < replace_pairs.txt
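
As a final sanity check, a grep for any leftover pull request or commit links into the old repository should now come back empty:

# Any pull request or commit links still pointing at the old repository?
grep -rnE 'github\.com/sourcegraph/sourcegraph/(pull|commit)/' .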

Now I think we can say mission accomplished, and I hope my dead-link detector stops throwing a fit after all these changes.