How to improve your open source code (6) – Version control

February 3, 2018

This post was written during my trip through Iceland and published much later than it was written.

Version control is an important aspect of developing software in general, and of developing open source software in particular.

Here are some thoughts about it.

Technology

First of all, technology-wise it doesn't matter which version control system one uses. For the sake of this article I'm using git as an example VCS, though others would do as well.

One important thing, at least in my opinion, is that the VCS has some basic functionality: mainly that it can be used in a distributed way and that it supports branching (two things I like to believe go hand in hand).

So I do not care whether one uses git, mercurial, or anything else. What matters most is that a (D)VCS is actually used.

Branching model

Branching predates git: as far as I can tell, bitkeeper already had such functionality before Linus Torvalds wrote git. Git, however, has revolutionised the way we do version control and brought branching to wider knowledge and use.

In my opinion it is really important how branching is done. There is not simply “the branching”; there are many ways to do it, and one might be better for a certain use case than another. There are known models such as feature branching, the gitflow branching model and a rebase-merge workflow. I don't want to explain each of them because others have done so way better than I ever could.

What I want to say is that branching is not only important, but also more flexible than you might guess. This is not necessarily a good thing – I'll show you in a minute. In my opinion, branching and developing a branching model for a project is like developing an API. Once it is set up properly, it may serve as a communication rule for a project, putting developers on the same page about how certain things have to be handled. Having a protocol on how to work on things is a good thing. If implemented properly, branching can improve the work of everyone as it is one point less to think about.

The bad thing about the flexibility of branching is that it can be done wrong. It's as simple as that: merging one branch into another when one is not supposed to creates overhead which might not be reversible. This has happened to the best communities (for example the kernel community) but also happens in small communities, often due to too little knowledge of the tools at hand.

To summarize: If an open source project gets to a certain size (both code-wise and contributor/community-wise) a branching model should be implemented. If there are rules that contributors agree upon, it can improve working speed and therefore overall happiness in the community. Because developers like to bikeshed, it could also worsen happiness, of course. Still, it is better than having no plan and chaos instead.

Hosting

I will not discuss specific hosting platforms in this article, but rather the how and why of hosting.

First of all, hosting the code somewhere with a way to show it in a web browser is a good way to improve the “open” part of open source code. Of course, tarball downloads and such suffice, but we are in the 21st century, so having a nice web interface is something one can expect.

Making the code browsable is often done via a VCS-specific web frontend, for example cgit for code version controlled with git. These web interfaces often also feature functionality to go back in time and view the history of a single file. Maybe this is not needed often, but it is nevertheless helpful when it is.

I personally do not care about comments on code in my web interfaces or even ways to register users on the site, but of course some people like that. There are web interfaces that feature such things, for example for the git VCS there are gitea, gogs, gitlab, ... and many more. And of course there are the closed providers github, bitbucket and others...

Making code public and contributions easy

Hosting helps a lot with enabling contributions from strangers. No doubt, github makes contributions ridiculously easy.

I don't want to reiterate what others have said better and most people already know. What I want to point out here is that open source does not mean “open contributions”. One is completely free to reject all contributions to one's code base.

I really want to stress this. Open source does indeed mean that everyone is able to view the code, which also enables them to copy it (though redistribution might be limited or forbidden, as only free software allows you – by definition – to redistribute and alter code) but not necessarily that one is allowed or welcome to send in changes, feature requests or the like.

So if you want people to contribute to your code and suggest changes, features or report bugs, you should somehow give them the opportunity to do so. Depending on how “open” you want to be with your development, you should either use a hosting platform (like github or bitbucket) or a slightly more “closed” variant, for example hosting your code on your own gitea instance. One step further, you'd host your code on a site where people are able to get it, maybe even with a “git clone”, but cannot send in pull requests, feature requests or open issues (for example a hosted git repository with a cgit interface). Issues and bug reports could still be handled via a mailing list, if desired.

In fact, that last bit is what I consider for my own project imag.

SemVer, Change Management, Release Management

As soon as your code is out there, you have to think about change and release management. In my opinion, these are topics closely related to source code version control as VCS often offer functionality to do releases in one form or another and are clearly involved in the process of change management.

First of all, I'd like to suggest you read the SemVer specification. It is not that long but will help you understand the next few paragraphs. So if you haven't read it already, go ahead and do so. Even if you don't apply SemVer to your projects it might open your eyes in one aspect or another.

But before we get into releases, we should first talk about change management, or better named for my points: Pull request management.

What I personally do with my PRs is merge them when they're ready. This approach is easy and has worked pretty well so far. From time to time I have changes in my working branches (as stated before, I use feature branches) which might conflict with other people's work. For the sake of contributor experience, I pause my PRs and wait until they are done with theirs. We will talk a lot about this in the next episode of this series, so I won't go into much detail. For now: This is a simple approach that works perfectly well so far for me and my (rather small) open source projects.

But as soon as one's project grows bigger, that approach might not do the job anymore. If there are too many changes in a short amount of time which have to be agreed on and merged, it might be time to think about an alternative approach.

There are two ways I would tackle this problem. I never experienced it in the “real world”/in my projects, so the following is just a write-up of my thoughts. Take everything from here on with a grain of salt.

The first approach I can think of is to assign certain subsystems to certain people. If the amount of changes has become too big, one could assume that the codebase has also become tremendous. If that is the case, sub-maintainers can handle certain subsystems and the project leader can then periodically merge all changes together. This requires, of course, at least two people who are interested in the subject and willing to contribute maintenance effort to the project.

If the latter is not the case or there are too few people around for this, one could consider a merge-window style approach, as known from Linus himself. New changes are pulled in only during a window that opens every other week, for example, and the rest of the time only bug fixes are merged into the project.
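To make the merge-window idea a bit more concrete, here is a minimal sketch of such a rule in Python. The cycle start date, the two-week cycle length, the three-day window and the change kinds are all assumptions I made up for illustration; a real project would pick its own values.

```python
from datetime import date

# Hypothetical policy values, chosen only for illustration.
CYCLE_START = date(2018, 1, 1)  # first day of the first cycle
CYCLE_DAYS = 14                 # one cycle lasts two weeks
WINDOW_DAYS = 3                 # the window is open for the first three days


def window_is_open(today: date) -> bool:
    """Return True if `today` falls inside the merge window of its cycle."""
    day_in_cycle = (today - CYCLE_START).days % CYCLE_DAYS
    return day_in_cycle < WINDOW_DAYS


def may_merge(kind: str, today: date) -> bool:
    """Bug fixes may always be merged; everything else waits for the window."""
    return kind == "bugfix" or window_is_open(today)
```

With these values, a feature pull request arriving on day ten of a cycle would simply wait for the next window, while a bug fix could still go in right away.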

These two approaches might come in handy some day if one is about to maintain a large code base alone (as in “as the only project owner”).

Now on to release management. In my opinion, releases should be done as soon as something works and from there on periodically. I myself made one mistake too often: pulling more things into one release than would have been good. For example the imag 0.2.0 release was over one year ago. 0.3.0 is almost ready, but not yet. I should've done more releases in between.

In my opinion, more releases with clear-cut edges are better than long release cycles. As soon as there is a new feature for users – release. User-facing fix – release. This might result in high numbers for versioning, but who cares?

This is where I want to throw SemVer in, to adjust my statement from the last paragraph with a “but”.

SemVer can be used to signal breaking changes in user-facing interfaces. This is a really good thing and therefore I think SemVer should be applied everywhere. SemVer also states that in the “0.y.z phase” everything is allowed to happen, including API breakage. This is where I want to adjust my statement from above: a lot of releases should be done in the 0.y.z phase, but they should stay within that phase. As soon as a library or program hits 1.0.0, changes should be applied carefully. One really does not want to end up with a program or library in version 127.0.0, right? That would also decrease a user's trust in the application, as one can expect breakage with every new release.
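The version-bumping rules can be sketched in a few lines of Python. This is only an illustration: the `Version` type, the `bump` function and the three change kinds are names I made up, and bumping the minor number for a breaking change during the 0.y.z phase is a common convention that I assume here, not something the specification prescribes.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        # Handles plain MAJOR.MINOR.PATCH only; pre-release and build
        # metadata from the full specification are left out of this sketch.
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def bump(version: Version, change: str) -> Version:
    """Return the next version for a 'breaking', 'feature' or 'fix' change."""
    if change == "fix":
        return Version(version.major, version.minor, version.patch + 1)
    if change == "feature" or (change == "breaking" and version.major == 0):
        # Assumption: in the 0.y.z phase a breaking change bumps the minor.
        return Version(version.major, version.minor + 1, 0)
    if change == "breaking":
        return Version(version.major + 1, 0, 0)
    raise ValueError(f"unknown change kind: {change}")
```

Under these assumptions, a breaking change on top of 0.2.0 yields 0.3.0, while the same kind of change on top of 1.4.2 yields 2.0.0, which is exactly why one wants to be careful after 1.0.0.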

So what I'd do, and actually plan to do with my projects, is release a number of zero-releases until I am confident that everything is all right and then go from there. For imag specifically I am not thinking about 1.0.0 because imag is far from ready, but for my other projects, especially toml-query, I am already thinking about 1.0.0.

Another point which popped into my head weeks after the initial draft of this article was: Do not plan the features of the next release with a release number! This might sound a bit odd, so let me explain. For example, you're planning three major features for the next release, which will then be 0.15.0. You're slowly getting to a point where the release becomes ready, and you might need three more weeks to finish it. Now a contributor steps up and opens a pull request with another feature, which is already completely implemented, tested and also documented in the pull request. The contributor needs this feature in your code as soon as possible and you also think that it might be a great idea to release it as soon as possible. After you merge the request, you release the source – as 0.15.0, even though your three features are not yet completed.

Two things come to mind in this scenario: First, if two of your three features are already completed, they might show up in 0.15.0 but one feature has to be moved to the next release. If these two features are ready, but not tested, you might end up with a buggy release and have to release 0.15.1 soonish – more effort for you. If you do not merge your features into the master branch of your project, but you have a 0.15.0-prepare branch or something like that, you end up with a rather ugly merge-mess later on, as 0.15.0 is already released and you cannot just rename a public branch.

So how to handle this properly? I came to the conclusion that release branches are the way to go here. In the scenario described above, you'd branch off of the previous release, most certainly 0.14.x, and create a new branch 0.15.0, into which the contributor's pull request would then be merged. As soon as the release is out, 0.15.0 will be tagged and merged back to the master branch.
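As a rough sketch of that release-branch flow, the following Python function only lists the git commands one could run; it does not execute anything. The tag names with a "v" prefix (so that tag and branch names do not collide), the branch name `contributor-feature` and the use of `--no-ff` merges are my assumptions for illustration, not a fixed recipe.

```python
from typing import List


def release_branch_plan(previous_tag: str, version: str, feature_branch: str) -> List[str]:
    """List the git commands for cutting `version` from `previous_tag`."""
    tag = f"v{version}"  # assumption: tags carry a "v" prefix, branches do not
    return [
        f"git checkout -b {version} {previous_tag}",  # branch off the previous release
        f"git merge --no-ff {feature_branch}",        # merge the contributor's work
        f"git tag -a {tag} -m 'Release {version}'",   # tag the release
        "git checkout master",
        f"git merge --no-ff {version}",               # merge the release back
    ]


if __name__ == "__main__":
    # Hypothetical names matching the scenario from the text.
    for command in release_branch_plan("v0.14.0", "0.15.0", "contributor-feature"):
        print(command)
```

Your own three half-finished features never touch the 0.15.0 branch in this flow, so they can stay on their feature branches until they are ready for whatever the next release turns out to be.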

My point here is: you'd still need to rename your next milestone or rewrite your issues for the next release. That's why I would not plan “0.15.0”, but simply “the next release” – because you'll never know whether your planned things will actually be in the next release or the release after. So lessen the effort for yourself here!

In the next article in this series I want to elaborate on how to make a contribution as pleasing as possible for the contributor. I guess I can talk a lot about that because I've contributed to a lot of projects already, including but not limited to linux, nixpkgs and nanoc.

tags: #open-source #programming #software #tools #rust

musicmatzes blog