musicmatzes blog

software

This post was written during my trip through Iceland and published much later than it was written.

Version Control is one important aspect when developing software as a whole, and especially when developing open source software.

Here are some thoughts about it.

Technology

First of all, technology-wise it doesn't matter which version control system one uses. For the sake of this article I'm using git here as an example VCS, though others would do as well.

One important thing, at least in my opinion, is that the VCS has some basic functionality: mainly that it can be used in a distributed way and has branching functionality (two things I like to believe go hand in hand).

So I do not care whether one uses git, mercurial, or anything else. Most important is that a (D)VCS is actually used.

Branching model

Branching is a method that came up before git was created; bitkeeper had such functionality (as far as I can tell) before Linus Torvalds wrote git. It is just that git revolutionised the way we do version control and brought branching to wider knowledge and use.

In my opinion it is really important how branching is done. There is not simply “the branching” but there are many ways to do branching, and one might be better for a certain use case than another. There are known models such as feature branching, the gitflow branching model and a rebase-merge workflow. I don't want to explain each of them because others have done so way better than I ever could.

What I want to point out is that branching is not only important, but also more flexible than you might guess. This is not necessarily a good thing – I'll show you in a minute. In my opinion, branching and developing a branching model for a project is like developing an API. Once it is set up properly, it may serve as a communication rule for a project, putting developers on the same page about how certain things have to be handled. Having a protocol on how to work on things is a good thing. If implemented properly, branching can improve everyone's work as it is one less thing to think about.

The bad thing about the flexibility of branching is that it can be done wrong. It's as simple as that: merging one branch into another when one is not supposed to do that creates overhead which might not be reversible. This has happened to the best communities (for example the kernel community) but also happens in small communities, often due to too little knowledge of the tools at hand.

To summarize: if an open source project gets to a certain size (both code-wise and contributor/community-wise), a branching model should be implemented. If there are rules that contributors agree upon, it can improve working speed and therefore overall happiness in the community. Because developers like to bikeshed, it could of course also worsen happiness. Still, it is better than no plan and chaos instead.

Hosting

I will not go into thoughts about hosting platforms in this article but rather focus on the how and why.

First of all, hosting the code somewhere with a way to show it in a web browser is a good way to improve the “open” part of open source code. Of course, tarball downloads and such suffice, but we are in the 21st century, so having a nice web interface is something one can expect.

Making the code browsable is often done via a VCS-specific web frontend, for example cgit for code version controlled with git. These web interfaces often also offer functionality to go back in time and view the history of a file. Maybe this is not needed often, but it is nevertheless helpful when needed.

I personally do not care about comments on code in my web interfaces or even ways to register users on the site, but of course some people like that. There are web interfaces that offer such features; for example, for the git VCS there are gitea, gogs, gitlab... and many more. And of course there are the closed providers github, bitbucket and others.

Making code public and contributions easy

Hosting helps a lot with enabling contributions from strangers. No doubt, github makes contributions ridiculously easy.

I don't want to reiterate what others have said better and most people already know. What I want to point out here is that open source does not mean “open contributions”. One is completely free to reject all contributions to one's code base.

I really want to stress this. Open source does indeed mean that everyone is able to view the code, which also enables them to copy it (though redistribution might be limited or forbidden, as only free software allows you – by definition – to redistribute and alter the code), but it does not necessarily mean that one is allowed or welcome to send in changes, feature requests or the like.

So if you want people to contribute to your code and suggest changes, features or report bugs, you should somehow give them the opportunity to do so. Depending on how “open” you want to be with your development, you should either use a hosting platform (like github or bitbucket) or a slightly more “closed” variant, for example hosting your code on your own gitea instance. One step further, you'd host your code on a site where people might be able to get it, maybe even with a “git clone”, but without the ability to send in pull requests, feature requests or open issues (for example a hosted git repository with a cgit interface). Issues and bug reports could still be handled via a mailinglist, if desired.

In fact, that last bit is what I consider for my own project imag.

SemVer, Change Management, Release Management

As soon as your code is out there, you have to think about change and release management. In my opinion, these are topics closely related to source code version control, as VCSs often offer functionality to do releases in one form or another and are clearly involved in the process of change management.

First of all, I'd like to suggest you read the SemVer specification. It is not that long but will help you understand the next few paragraphs. So if you haven't read it already, go ahead and do so. Even if you don't apply SemVer to your projects, it might open your eyes in one aspect or another.

But before we get into releases, we should first talk about change management, or, better named for my purposes: pull request management.

What I personally do with my PRs is merge them when they're ready. This approach is easy and has worked pretty well so far. From time to time I have changes in my working branches (as stated before, I use feature branches) which might conflict with other people's work. For the sake of contributor experience, I pause my PRs and wait until they are done with theirs. We will talk a lot about this in the next episode of this series, so I won't go into much detail. For now: this is a simple approach that works perfectly well so far for me and my (considerably small) open source projects.

But as soon as one's project grows bigger, that approach might not do the job anymore. If there are too many changes in a short amount of time which have to be agreed on and merged, it might be time to think about an alternative approach.

There are two ways I would tackle this problem. I never experienced it in the “real world”/in my projects, so the following is just a write-up of my thoughts. Take it with a grain of salt from here on.

The first approach I can think of is to assign certain subsystems to certain people. If the amount of changes has become too big, one could assume that the codebase has also become tremendous. If that is the case, sub-maintainers can handle certain subsystems and the project leader can then periodically merge all changes together. This requires, of course, at least two people who are interested in the subject and willing to contribute maintaining efforts to the project.

If the latter is not the case or there are too few people around for this, one could consider a merge-window style approach, as known from Linus himself. Changes are pulled in every other week, for example, and for the rest of the time, only bug fixes are merged into the project.

These two approaches might come in handy some day if one is about to maintain a large code base alone (as in “as the only project owner”).

Now on to release management. In my opinion, releases should be done as soon as something works and from there on periodically. I myself made one mistake too often: pulling more things into one release than would have been good. For example, the imag 0.2.0 release was over one year ago. 0.3.0 is almost ready, but not yet. I should've done more releases in between.

In my opinion, more releases with clear-cut edges are better than long release cycles. As soon as there is a new feature for users – release. A user-facing fix – release. This might result in high version numbers, but who cares?

This is where I want to throw SemVer in, to adjust my statement from the last paragraph with a “but”.

SemVer can be used to signal breaking changes in user-facing interfaces. This is a really good thing and therefore I think SemVer should be applied everywhere. SemVer also states that in the “0.y.z phase” everything is allowed to happen, including API breakage. This is where I want to adjust my statement from above: a lot of releases should be done in the 0.y.z phase – but they should stay within that scope. As soon as a library or program hits 1.0.0, changes should be applied carefully. One really does not want to end up with a program or library at version 127.0.0, right? That'd also decrease a user's trust in the application, as one could expect breakage with every new release.

So what I'd do and actually plan doing with my projects is releasing a number of zero-releases until I am confident that everything is all right and then go from there. For imag specifically I am not thinking about 1.0.0 because imag is far from ready, but for my other projects, especially toml-query, I think of 1.0.0 already.

Another point which popped into my head weeks after the initial draft of this article: do not plan the features of the next release under a release number! This might sound a bit odd, so let me explain. Say you're planning three major features for the next release, which will be 0.15.0. You're slowly getting to a point where the release becomes ready; you might need three more weeks to get it done. Now a contributor steps up and opens a pull request with another feature, which is already completely implemented, tested and documented in the pull request. The contributor needs this feature in your code as soon as possible, and you also think that it would be a great idea to release it as soon as possible. After you merge the request, you release the source – as 0.15.0, even though your three features are not yet completed.

Two things come to mind in this scenario: first, if two of your three features are already completed, they might ship in 0.15.0, but one feature has to be moved to the next release. If these two features are ready but not tested, you might end up with a buggy release and have to release 0.15.1 soonish – more effort for you. And if you do not merge your features into the master branch of your project but keep them in a 0.15.0-prepare branch or something like that, you end up with a rather ugly merge mess later on, as 0.15.0 is already released and you cannot just rename a public branch.

So how to handle this properly? I came to the conclusion that release branches are the way to go here. In the scenario described above, you'd branch off of the previous release, most certainly 0.14.x, and create a new branch 0.15.0, where the pull request of the contributor would be merged. As soon as the release is out, 0.15.0 is tagged and merged back into the master branch.

My point here is: you'd still need to rename your next milestone or rewrite your issues for the next release. That's why I would not plan “0.15.0”, but simply “the next release” – because you'll never know whether your planned things will actually land in the next release or the release after. So lessen the effort for yourself here!

Next

In the next article in this series I want to elaborate on how to make a contribution as pleasing as possible for the contributor. I guess I can talk a lot about that because I've contributed to a lot of projects already, including but not limited to linux, nixpkgs and nanoc.

tags: #open-source #programming #software #tools #rust

A common thing one has to do in Rust to leverage zero-cost abstractions and convenience is to extend types. By “extending” them I mean adding functionality (functions) to the type which make it more convenient to use and “improve” it for the domain at hand.

From time to time, one needs to extend generic types, which is a difficult thing to do. Especially when it comes to iterators, where extending is often tricky.

That's why I want to write down some notes on it.

How do filters work?

I have this crate called “filters” which offers nice functionality for building predicates with the builder pattern. A predicate is nothing more than a function which takes something by reference and returns a boolean for it:

fn(&self, some: &Thing) -> bool;

The function is now allowed to decide whether the Thing shall be used. This is useful when working with iterators:

let filter = LowerThan::new(5);
let v: Vec<_> = vec![1,2,3,4,5,6,7,8,9].iter().filter(|e| filter.filter(&e)).collect();

Filters can, as I mentioned, be built with the builder pattern:

let filter = HigherThan::new(5)
    .and(LowerThan::new(20))
    .and(UnequalTo::new(15));

let v: Vec<_> = vec![1,2,3,4,5,6,7,8,9].iter().filter(|e| filter.filter(&e)).collect();

One can also define one's own filters (in fact, the HigherThan, LowerThan and UnequalTo types do not ship with the crate) by implementing the Filter trait:

struct LowerThan(u64);
impl Filter<u64> for LowerThan {
  fn filter(&self, elem: &u64) -> bool {
    *elem < self.0
  }
}

What are we going to do?

What we're going to do is rather easy. We want functionality, implemented on all iterator types, which gives us the possibility to pass an object for filtering rather than writing down this nasty closure (as shown above):

iterator.filter(|e| filter.filter(&e))

We want to write:

iterator.filter_with(some_filter)

And we want to get back a nice iterator type.

Step 1: A type

First, we need a type representing an iterator which is filtered. This type, of course, has to be abstract over the actual iterator we want to filter (that's also how the Rust standard library does this under the hood). The type also has to be generic over the filter in use. So basically we write down an entirely generic type:

use std::marker::PhantomData;

// PhantomData<T> is needed because Rust requires every type parameter
// to show up in the struct's fields; T only appears in the bounds.
pub struct FilteredIterator<T, F, I>(F, I, PhantomData<T>)
    where F: Filter<T>,
          I: Iterator<Item = T>;

Here F is a filter over T and I is the iterator over T. The FilteredIterator holds both F and I (plus a PhantomData<T> marker, which is only there to satisfy the compiler, as T does not appear in any real field).

Step 2: Implement Iterator on this type

The next step is implementing Iterator on this new type, because we want to use this iterator in a chain like this, for example:

forest
    .filter_with(TreeWithColor::new(Color::Green))
    .map(|tree| tree.get_name())
    .enumerate()
    .map(|(i, tree_name)| {
      store_tree_number(tree_name, i)
    })
    .fold(Ok(()), |acc, e| {
      // ...
    })

we have to implement Iterator on it.

Because the type is generic, so is our iterator implementation.

impl<T, F, I> Iterator for FilteredIterator<T, F, I>
    where F: Filter<T>,
          I: Iterator<Item = T>
{
    type Item = T;

    fn next(&mut self) -> Option<Self::Item> {
        // ...
    }
}

The implementation is left out as a task for the reader. Hint: the only function one needs is Filter::filter() and a while let Some(e).
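If you get stuck, here is a minimal sketch following exactly that hint (assuming the PhantomData field from above as the third tuple element, so self.0 is the filter and self.1 the wrapped iterator):

impl<T, F, I> Iterator for FilteredIterator<T, F, I>
    where F: Filter<T>,
          I: Iterator<Item = T>
{
    type Item = T;

    fn next(&mut self) -> Option<Self::Item> {
        // pull elements from the wrapped iterator until one passes the filter
        while let Some(elem) = self.1.next() {
            if self.0.filter(&elem) {
                return Some(elem);
            }
        }
        None
    }
}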

Step 3: Transforming one iterator into a filtered iterator

Next, we want to be able to transform an existing iterator type into a filtered iterator. As I wrote above, we do this by providing a new trait. Because of the generic parameters our FilteredIterator type has, we need generic types in our trait as well:

pub trait FilterWith<T, F: Filter<T>> : Iterator<Item = T> + Sized {
    fn filter_with(self, f: F) -> FilteredIterator<T, F, Self>;
}

The trait provides only one function: filter_with(). This function takes self by value (consuming the iterator) and a Filter which is used in the FilteredIterator later.

Step 4: Extending all iterators

Last but not least, we implement this trait on all Iterators. Even more generics here, of course:

impl<I, T, F: Filter<T>> FilterWith<T, F> for I
    where I: Iterator<Item = T>
{
    fn filter_with(self, f: F) -> FilteredIterator<T, F, Self> {
        FilteredIterator(f, self, PhantomData)
    }
}

The actual implementation is trivial, sure. But the type signature is not, so I'll explain.

  1. I is the type we implement the FilterWith trait for. Because we want to implement it for all iterators, we have to use a generic type.
  2. I is, of course, generic itself. We want to be able to filter all iterators over all types. So, T is the Item of the Iterator.
  3. F is needed because FilterWith is generic over the provided filter: We can filter with any Filter we want. So we have to be generic over that as well.

The output of the filter_with function is a FilteredIterator which is generic over T and F – but wait! FilteredIterator is generic over three types!

That's true. But the third generic type is the iterator it encapsulates – the iterator it actually filters. Because of that, we return FilteredIterator<T, F, Self> here – we don't need a generic type as the third parameter because we know that the iterator which will be encapsulated has exactly the type the trait is currently being implemented for.

(I hope that explanation is easy to understand.)

Conclusion

Extending types in Rust is incredibly easy – if one can grok generics. I know that generics are not that easy to grok for a beginner – I had a hard time learning how to use them myself. But it is really worth it, as our test shows:

#[test]
fn test_filter_with() {
    struct Foo;
    impl Filter<u64> for Foo {
        fn filter(&self, u: &u64) -> bool {
            *u > 5
        }
    }
    let foo = Foo;

    let v : Vec<u64> = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
        .into_iter()
        .filter_with(foo)
        .collect();

    assert_eq!(v, vec![6, 7, 8, 9]);
}

tags: #tools #software #rust #open-source

After reading another post in the users.rust-lang.org forum where a crate author is looking for a maintainer for his crate, I started thinking. Do we need a “crates-team” in the rust community?

Incentive

The Rust community has a lot of great teams. Each of these teams has a different goal: The Core team, for example

is responsible for steering the design and development process, overseeing the introduction of new features, and ultimately making decisions for which there is no consensus [...]

(source)

Because people change, interests of people change and also because the life situations people are in change, a crate author might not be able to continue maintaining their crates. I, for one, have such a problem: I am about to go on a really long vacation from May 2018 until early 2019. Most of that time I will be off the grid, in a mobile home in North America. This is a one-time opportunity for me and I am really looking forward to that experience. But because of that I will have issues maintaining my crates during that time. I have 7 projects where I cannot continue development for about one year:

  • toml-query
  • filters
  • kairos
  • task-hookrs
  • git-dit (and libgitdit)
  • is-match

and of course imag, which contains over 40 crates at the time of writing. All of the above are ready to be used (imag is not) and are, in fact, used by others from what I know.

In my case, this is only temporary: I will be able (and am willing) to continue development on them in 2019. But people change, interests change, people get kids, a dog or some new hobby. The end result is always: they have less time (or no time at all anymore) to continue developing their crates.

So why not have a team available which takes over maintainership?

What a crates-team would do

The crates-team won't always be able to continue development of the crate, but it might be able to continue maintainership. That is: Merge pull requests which add new functionality (or even just discuss whether this new functionality would be a nice addition to the crate), update the dependencies from time to time, fix and close bugs in the codebase, update the documentation and document known bugs.

The team would consist of people from the community (so basically hobbyists) which are willing to maintain a few crates in their free time. The crates-team would, of course, not take over maintainership for all abandoned crates. It would rather discuss whether a crate shall be maintained by them and then do so (or not).

Some ideas on basic rules for the team:

  • Only the following things are in the scope of “maintaining” a crate in the crates-team:
    • Update the dependencies to the newest version if and only if updating to the newest version is as simple as updating the version number in the Cargo.toml file. If it is more complicated, the crates-team is allowed to reject any requests or reply to such requests with a variant of the phrase “You're welcome to file a PR for this!”
    • Updating the documentation of a crate
    • Publish new versions of the crate; the first newly published version should change the cargo meta-information maintenance = ... to passively-maintained.
    • Discuss and (if consensus is reached) merge pull requests which add new functionality to the crate
    • Merge pull requests which
      • Fix bugs
      • Update dependencies
      • Update documentation
  • A crate only gets maintained if at least 2 members of the crates-team are willing to maintain the crate (these two get rights to publish new versions)

More rules are possible, of course.

Discussion

I'd love to discuss the subject on either users.rust-lang.org or reddit!

Also, feel free to reach out via mail (the thing which looks like an 'a') (thisdomain) (dot) de.

tags: #tools #software #rust #open-source

So I started developing an importer for importing github issues into git-dit – the distributed issue tracker built on git.

Turns out it works well, though some things are not yet implemented:

  • Wrapping of text. This is difficult because quotations are wrapped, but the quotation character is not prepended to the new line – which results in broken formatting
  • Importing only issues. Right now, PRs are imported ... which is not exactly what we want. I really hope I can figure this out to actually attach PR comments to the actual commits of the PR. This would be really nice. Issues shall be imported without parent (orphaned) like git-dit wants it.
  • Mapping of github handles to real names and email addresses.
  • Mapping github labels to git-dit “trailers”.

Have a look at my importer tool here (just be warned: this is WIP and shouldn't be used right now)!

Or at git-dit itself here (I am co-author).

tags: #tools #software #rust #open-source #git #github

Inspired by the Call for Community Blogposts I want to summarize my experiences and thoughts on Rust in 2017 and what I am excited about for 2018.

Reflecting 2017

2017 was an amazing year for Rust. We got 8 releases of Rust itself! We got basic procedural macros allowing custom derive (also known as “macros 1.1”) in the first release last year (1.15.0). This made serde 1.0 possible, if I'm not mistaken? We got 103 stabilized APIs in 2017. This is incredible! Compile times improved and the tooling got so much better. I mean, it was awesome before. But now it is even better!

On a personal side, I got a lot better at programming Rust. I wrote about 37800 lines of Rust code in my main project imag and 17380 lines in other crates (authored and contributed, according to a bit of git-fooing around). Is that a lot? I don't know.

Hopes for 2018

Now lets talk about 2018. This year will be amazing, I am sure.

Language features

I am really excited about the “impl Trait” feature. Being able to return a trait from a function will reduce the imag codebase a lot, for example. We no longer need to define our own iterator helper types but can simply return impl Iterator<Item = Whatever>!
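To illustrate what I mean, here is a minimal sketch (the function and its names are made up for illustration) of how “impl Trait” in return position removes the need for a hand-written helper iterator type:

fn even_numbers_up_to(limit: u64) -> impl Iterator<Item = u64> {
    // no custom iterator struct needed – the adapter chain is returned directly
    (0..limit).filter(|n| n % 2 == 0)
}

fn main() {
    let v: Vec<u64> = even_numbers_up_to(10).collect();
    assert_eq!(v, vec![0, 2, 4, 6, 8]);
}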

I have no other hopes for the language itself, because what we have right now is really amazing and I honestly cannot think of ways it could be improved.

Ecosystem needs / Tooling enhancements

I'm still a bit concerned about cargo functionality for building workspace projects. From what I see, building two different crates in one workspace which share dependencies rebuilds the dependencies. This is not as intended, I guess, but that's what I see. I did not dive deep into this, so I might be wrong, though.

What I am thinking about for several weeks now is a cargo/rust tool for calculating code metrics. I think of things like documentation/code ratio, average function length, simple things... but also about cohesion and coupling metrics and other inter-module/inter-crate metrics.

Also, I tried to set up the rust language server for vim on my workstation and failed hard. I guess this is a packaging problem with my distro (NixOS), though. Either way, installing the rls with a stable toolchain would be nice!

Crates I am still missing / should be improved

There are some crates I would love to have which do not exist yet.

  • A (high level) email crate. There is the email crate, but it is mainly unstable and does not even have a 0.1.0 yet. There's also lettre_email, which is in 0.7.0, but it doesn't support parsing of emails.
  • I really hope rust-vobject (which is one of the crates I contributed to in 2017) will improve even more and be the defacto-standard crate for handling vcard and icalendar data.
  • I follow the development of Cursive and from what I see it is awesome. I really hope people start writing high-level objects for cursive (like a file explorer, a form builder, a text editor like thing, a tab helper and so on) so I have to do less work when implementing a TUI for imag. (To be fair, there are already some crates available).
  • I hope there will be some awesome crates for handling multi-media files and reading/writing their metadata. Especially audio formats and video formats are important to me with imag.
  • Rust bindings for pass would be awesome.
  • Markdown (and other formats, like asciidoc, restructured text, textile and maybe even bbcode) parsers and renderers should be written/improved
  • An API for IPFS or maybe even a protocol implementation
  • Qt bindings (yeah, I have high hopes for 2018)

There are possibly thousands more... But I won't list them all.

tags: #open-source #programming #software #rust

This post was written during my trip through Iceland and published much later than it was written.

In this and maybe also in the next few articles, we will focus on things around the code rather than on direct code properties. I hope that's okay.

Planning of an application or library is not easy, not at all. But how much planning do we actually do before writing code? And should we do more?

My thoughts on the subject.

What we've learned

Anyone who has studied computer science should know at least some UML diagram types like class diagrams, flow charts, module plans and use case diagrams. They are used in (let's call it) “normal” software development and in the professional world out there.

But when we develop open source software for our own needs and maybe for our friends, we often do that in our chambers at home. Class diagrams are often not drawn at all, and I can say that I have never seen a hobby programmer draw a use case diagram before writing the code of an application.

Why we don't use it

Why is that? Well, because open source software is often done as a hobby type of thing, there is often no need for planning ahead. A hobbyist is able to hold use cases, simple class diagrams and flow charts “in his mind” because he has great knowledge of the domain.

In fact, as he defines the domain in its entirety, he is stakeholder, project leader, software architect, programmer, tester and marketing guy at the same time. He knows what problems are about to be solved and can therefore adjust every aspect of the application to the needs at hand.

This holds true for small and medium sized applications or code bases, where the problem is of a certain complexity but not too big. Basically, one could say that in the open-source-programming-at-home world, every aspect of the domain has to fit into one head without much effort. With a bit of training, I believe, one can even get to a point where only a few aspects of the domain have to be in a person's mind at a time to be able to work on a solution.

But there is certainly a point where the effort needed to solve a specific problem explodes. One can still write software to solve the problem at hand, but not in reasonable time.

So why do we hobby programmers not use the planning tools we've learned about in university? Why don't we use diagrams to make things clearer and better documented, even before the real programming starts? The answer is quite simple: because it annoys the hell out of us. We don't like to plan ahead. We don't like to adjust plans as soon as we find out that a small aspect of our library could be changed to gain more flexibility and overall goodness. We don't like to check our plans before writing the next module until it works.

Coding is fun, planning is not.

But should we use these things

In my opinion, this is foolish. We really should use the things we learned in university to plan out software and of course also to document it. It would be such a huge improvement for everything to simply think a bit more about it before actually implementing it!

How we do it

What we do and why we do not use tools to plan ahead can be explained with one sentence: we program from the user interface down to the implementation, because the other way round is too complicated. Or, in other words: we program top-down because bottom-up needs planning and is therefore not that easy.

Of course, I'm speaking about the average case. I've programmed bottom-up before but, for me, it seems much more error prone than top-down does, especially without a plan.

Also, I do not say that top-down is not error prone. Not at all. When writing an API without an actual implementation in mind, one easily ends up sacrificing cleanness and speed at some points to keep the API nice, which is not always a good idea. So top-down is only good as long as we get it right.

Tooling

Tooling is one big problem in this context. We do not have a toolchain for planning just yet. At least I do not have one that I would like to use. Because we are really good at controlling (versioning, moving around, managing) our source code (for example with git, and to some extent github), we also want to be able to do this with charts and diagrams. But we also want the niceness of SVG-rendered graphics. We don't want to play around with layout all day long, but use tools to simply get the job done.

And there are no such tools available.

Sure, one can use graphviz to design such things, but then again we do not have a nice overview of what's going on while editing our work. One could use ascii-art to draw all those things, but hey... ascii-art. We are better than that, aren't we? We could render the ascii-art into SVG... though the tooling there is not yet as good as it should be. And even if it were, version controlling these things with git is (I fail to believe otherwise) painful.

Conclusion

Well, I can only conclude the obvious here. We need better tooling for the open source programming community to do their planning, if they need to. Clearly, one does not always have to (or want to) plan things before trying out. But when one does, the tooling should be there and be useful and help with the process.

Next

In the next episode we will talk about version control of open source software projects. I'm not going into details about git or other systems used, but rather on the style how they should be used so everyone is pleased with it. This might be strongly biased, but hey, isn't this whole article series biased?

tags: #open-source #programming #software #tools #rust

This post was written during my trip through Iceland and published much later than it was written.

What is a nice and good API? How is “nice” defined when it comes to library interfaces? That's a question I want to discuss in this post, and also how you can create a nice API in your open source library without studying a topic like software architecture or similar.

Definition of a “nice”/“easy to use” API

But first, we have to define what makes an API good. And that's not that easy because this topic is very biased.

For me, a good API is one where I can get the job done without thinking much about it. That means that there shouldn't be much setup code involved in my code just to use the library. So no factory hell if the only thing I want to have is the current time, for example. This also means that the API has to be decently high level, but without losing the ability to do fine-grained work if necessary. So for the most part, low level things (for example implementation details) are not interesting for me. But when I want to bit-fiddle around with the library, it should let me.

If a builder, factory or some other mechanism is necessary to produce objects in some way, the library should make clear (documentation-wise but also code-wise) why it is needed. There's no point in making the user call the tenth factory instantiation if it is not necessary; it also makes the user's codebase blow up in size and complexity.

The naming of things in the library should be good, appropriate and, for the most part, consistent. If a function on an object which returns the string representation of that object is named “to_string”, it should be named that way for all types from that library, not only some of them.

Statelessness

Calling functions of your API should always result in the same values for the same arguments. That does not mean that your API should be pure in the functional programming sense, but rather that the actions executed when calling a function should not result in some library-internal variables being set, changed or unset. This is easily achievable by letting the user of the API hold an object that carries the state, and having the functions of your API work based on that value. In short: your library should not have global variables.

This simple design pattern already results in easy to use APIs and a nice user experience.
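A minimal sketch of what I mean (all names are made up for illustration): instead of hidden globals, the caller owns a value that carries the state, and every function works on that value only.

pub struct Session {
    verbosity: u8,
}

impl Session {
    pub fn new(verbosity: u8) -> Session {
        Session { verbosity }
    }

    // Same arguments, same result – nothing is read from or written to
    // hidden library-internal state.
    pub fn greeting(&self, name: &str) -> String {
        if self.verbosity > 0 {
            format!("Hello, {}! (verbosity {})", name, self.verbosity)
        } else {
            format!("Hello, {}!", name)
        }
    }
}

fn main() {
    let session = Session::new(1);
    assert_eq!(session.greeting("Amy"), "Hello, Amy! (verbosity 1)");
}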

Error exposure

Good libraries don't hide errors. Indeed, it is even better if errors are exposed to the user as much as possible, because the user of the library knows best when and how to handle errors, even those from your library.

I'm also a big fan of lots of error cases. The more error cases (the better a user of a library can distinguish between different errors), the better. This way, you let the user decide where she doesn't need to distinguish between two almost-equal error cases and where it is better to handle them independently. If your library does not give that opportunity, the user has to write ugly spaghetti-code just to be able to tell what is going on. Of course, these things have to be documented properly.

Another thing that can come in handy is when your error types or your library expose functionality to translate error values into text which can be shown to a user of your library. Nothing is worse (from a user's point of view) than a “CallOnInconsistenStateObjectBuilderFactory on line 2832” error message shown in a user-facing interface (and trust me, I've seen such things already).
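A minimal sketch of what such an error type could look like in Rust (the type and its variants are made up for illustration): many distinguishable cases, plus a Display implementation that turns each case into text a user can actually read.

use std::fmt;

#[derive(Debug)]
pub enum VacationError {
    LocationNotFound(String),
    DatesOverlap,
    Io(std::io::Error),
}

impl fmt::Display for VacationError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            VacationError::LocationNotFound(name) =>
                write!(f, "the location '{}' could not be found", name),
            VacationError::DatesOverlap =>
                write!(f, "the selected dates overlap with an existing booking"),
            VacationError::Io(err) =>
                write!(f, "could not read the vacation data: {}", err),
        }
    }
}

fn main() {
    let err = VacationError::LocationNotFound("Reykjavik".into());
    // a user-facing interface can now show a readable message
    println!("{}", err);
}

The user of such a library can match on the variants she cares about, lump the others together, and still has a human-readable message for her user-facing interface.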

Completeness

Nothing is worse than an API that is not complete. I mean, don't get me wrong – sometimes one does not think of all the cases a library could be used for – and that's completely okay. But some things are too obvious to be left out. For example, if you provide functions to transform your time object from local time into GMT, why wouldn't you provide functions for converting it into UTC or EST? These matter, too!

The same goes for cleanup routines. In some languages it is necessary to include cleanup routines for your objects. If your library exposes alloc_vacation_location_obj(), it should also provide free_vacation_location_obj()! Sure, a user could use free(), but it is not nice API-wise. Even if your function does nothing more than call free(), it is better to provide a function (and if you want to include some more cleanup in your function later on, in a new version of your library, a user does not have to think about it that much when upgrading their dependencies).

Consistency

We played the naming game already, but it always comes back to us, right? Consistent naming is one of the most important things in an API. If allocating worked with functions prefixed with new_ all the time, it shouldn't be done with alloc_ this time. Also not in later versions of your library. Not even in a major version bump.

Even more important than naming is behaviour. A function that is named with some alloc prefix should only allocate, never initialize or do other fancy stuff (debugging output excluded here, if necessary).

Next

In the next episode we will talk about how one can plan an application.

tags: #open-source #programming #software #tools #rust

#matrix , #ipfs , #scuttlebutt and now #mastodon – We're living in awesome times! centralization < decentralization/federation < distribution! #lovefortech

(me, April 10, 2017, on mastodon)

The idea

With the rise of protocols like the matrix protocol, activitypub and others, decentralized social community platforms like matrix, mastodon and others gained power and were made real. I consider these platforms, especially mastodon and matrix, to be great steps into the future and am using both enthusiastically.

But can we do better? Can we do more distribution? I think so!

So far we have a twitter-like microblogging platform (mastodon), a chat platform (matrix) and facebook-like platforms (diaspora and friendica) which are federated (some form of decentralization). I think we can make a completely distributed social network platform a reality today.

Let me reiterate that: I think we can make a facebook/googleplus/etc clone which works without a central component, today. And I would even go one step further and state: all we need for this is IPFS (and related technology like IPLD and IPNS)!

This platform would feature personal profiles, publishing articles/posts/images/videos/voice messages/etc, instant messaging, following others, and all the things one would want in such a platform.

How would it work?

What do we need for this? Well, as stated before: not much! From what I can think of, we would need IPFS, some sort of public/private key functionality (which IPFS already has), a nice frontend framework, and that's basically it.

Let me tell you how I think such a platform would work.

The moment a user starts the application, the application would boot an IPFS node. The username and all other information about the profile are added to IPFS as structured data. If the profile changes because the user edits it, it is added to IPFS again, using IPLD to link to its previous version.

If a user adds a post to her profile, that post is added to IPFS as well and linked from the profile via IPLD. All other nodes are informed about the new content via pubsub and are free to pin the new content (the new profile version) or only cache it for a while (or to not care at all). The post itself could add a link to the IPNS hash of the profile under which the post is published. This way, a link from the post to the current version of the profile would always exist.

Because the profile always links to its previous version as well as to the post content, that would imply that the node run by the owner of the profile would always keep all data the user adds to the network. But as the data is only referenced by links, the user is free to drop published content at any point in time.

This means that basically each operation would “generate” a new profile, which is of course published as an IPNS name. Following others would be a matter of subscribing to their “pub” channel (as in “pubsub”) or their IPNS name.

Chat

A chat application using IPFS is already implemented with orbit, so that's a matter of integrating one application into another. Peer-to-Peer (or rather Profile-to-Profile) messaging is therefore no problem.

Data format

All the data would be saved in a structured format, for example Json (though the order of serialization is important, because of cryptographic hashes), Bson or any other data serialization format that is widely adopted.

Sidenote: As long as it is made clear that any client must support all formats, the format itself doesn't matter that much. For simplicity of this article, I stick to Json (and also because it is most widely known).

A Profile(-version) would look roughly like this (consider 'ipfs hash' to mean “some kind of IPLD link” in this context):

{
  "previous": [ "<ipfs hash>" ],
  "post": {
    "type": "<post type>",
    "nodes": ["<ipfs hash>"],
    "metadata": {
      "date": "2017-12-12T12:00:00+0200",
      "tags": [],
      "category": "kittens",
      "custom": {}
    }
  }
}

Let me explain:

  • The previous key would point to the previous profile version(s). It would only contain IPFS hashes (Why plural, see below in “Multi-Device Support”).
  • The post key would contain information about the post published with this profile version.
    • The type of the post could be “article”, “image”, “video”... normal stuff. But also “biography” for the biography shown on the profile or other things. Even “username” would be possible, for adding a user name to the profile.
    • The nodes key would point to an IPFS hash containing the actual payload; either the text of the article (only one hash then) or the ipfs hashes of the pictures, the video(s) or other binary content. Of course, posts could be formatted using Markdown, reStructured Text or whatever format one likes to use. It would be a clients job to render it properly.
    • The metadata field would contain plain meta information, like published date, tags, category and also custom metainformation as key-value pairs.

Maybe a version attribute for protocol version could be added as well. Of course, this should be considered an incomplete example, as I almost certainly forgot things here.
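As a minimal sketch, the same structure could be expressed as Rust types for (de)serialization with serde (assuming the serde crate with its derive feature; all type and field names are made up for illustration, and the “IPFS hash” links are plain Strings here):

use std::collections::BTreeMap;

use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize)]
pub struct ProfileVersion {
    // links to the previous profile version(s)
    pub previous: Vec<String>,
    pub post: Post,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct Post {
    #[serde(rename = "type")]
    pub post_type: String,
    // links to the actual payload
    pub nodes: Vec<String>,
    pub metadata: Metadata,
}

#[derive(Debug, Serialize, Deserialize)]
pub struct Metadata {
    pub date: String,
    pub tags: Vec<String>,
    pub category: String,
    // simplified: custom meta information as string key-value pairs
    pub custom: BTreeMap<String, String>,
}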

The idea of linking the previous version of a profile from each new version is very much blockchain-like, of course, with the difference that nobody needs to fetch the whole chain but only the latest version to get a profile. The more content a viewer of the profile wants to see, the more she needs to traverse the graph of profile versions (automatically caching the content for others). This would automatically result in older content being “forgotten” slowly (though the content would not be gone until the publisher itself and all other “pinners” drop it). Because the actual payload is not stored in the fetched data, the amount of data required to simply view a profile is rather small. A client could be configured to fetch all textual content of a profile, but not more than 10 versions, or one screen page, or something like that. The possibilities are endless here.

Federated component

One might think “If I go offline with my node, my posts are not accessible if nobody else is online having them”. And that's true.

That's why I would introduce a federated component, which would run a stripped-down version of the application.

As soon as another instance connects and a new post is announced via pubsub, the instance automatically pins or caches it. Of course, this would mean that all of these federated instances would pin all content, which is surely not nice. One (rather simple and maybe even stupid) option would be to roll a dice and make the chance that a post is pinned a 50-50 thing, or something like that. Also, posts which are pinned for a certain amount of time are most likely distributed well enough so the federated component nodes can drop them... maybe after 90 days, maybe after 10... Details!

Blockchain-Approaches

The fundamental problem with Blockchains is that every peer in the network hosts the complete content. Nobody benefits from that, especially if you think of a social network which should also work on mobile devices. With users loading up images, videos and other large blobs of data, a blockchain is the wrong approach.

That's why I think a social network on Ethereum, Bitcoin or any other cryptocurrency/blockchain is not an option at all.

IPLD

IPLD can be used not only to link posts and profiles, but also to link from content to content: from one post to another, from a post to an image, a video, a voice message... but also from a post to a git commit, an Ethereum transaction or any other IPLD-supported data structure.

One nice detail is that one does not have to traverse these links. If a user sees a post which links to other posts, for example, she does not have to fetch the linked posts to see the post itself, only if she wants to see the linked content. Caching nodes, on the other hand, can automatically traverse the whole graph and fetch all the content into their cache.

That makes an IPLD-based linking approach really beneficial.

Scuttlebutt

Scuttlebutt is a first step into the right direction. One can say what one wants about electron and the whole technology stack which is used in Scuttlebutt (and like or dislike the whole Javascript world), but so far Scuttlebutt seems like the first social network that is completely distributed.

I thought about whether it would be a great idea to port Scuttlebutt to use IPFS in the backend. From what I know right now, it would be a nice way of bringing IPFS and IPLD into the mix and therefore enhancing and extending the capabilities of Scuttlebutt itself.

I have no final conclusion on that thought, though.

Problems

There are several problems one has to think about when designing such a system.

Comments on Posts (and comments)

Consider you want to comment on a post. Of course you create new content, which links to the post you just commented on. But the person who wrote the original post does not automatically link to your comment, so she is neither able to find the comment (which could be solved via pubsub), nor are others able to find it.

The approach to this problem is simple: Notification about comments can be done via pubsub. And, if a user gets a notification about a new comment, she can approve it and automatically publish a new version of her post, with some added meta information:

  • A link to the comment
  • A link to the “old version” of the content in IPFS

Now, if a client fetches all posts of a profile, it resolves all entries to their newest version (so basically the one version that no newer version links back to) and only shows the latest versions.
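A minimal sketch of that resolution step, as plain Rust over in-memory data with hashes as Strings (all names are made up for illustration): the “latest” versions are exactly the ones no other fetched version links back to.

use std::collections::{HashMap, HashSet};

struct PostVersion {
    hash: String,
    // links to older versions of this post ("old version" links)
    previous: Vec<String>,
}

fn latest_versions(all: &HashMap<String, PostVersion>) -> Vec<&PostVersion> {
    // collect every hash that some other version points to as "older"
    let superseded: HashSet<&str> = all
        .values()
        .flat_map(|v| v.previous.iter().map(String::as_str))
        .collect();

    // a version is "latest" if no other version supersedes it
    all.values()
        .filter(|v| !superseded.contains(v.hash.as_str()))
        .collect()
}

A client would feed every version it fetched for a post into such a map and display only what latest_versions() returns.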

Comments on comments (and so on) would be possible with the exact same approach. That would, of course, cause a whole tree of comments to be rebuilt every time a new comment is added.

Maybe not the best idea in that regard.

Multi-Device Support

There are several problems regarding multi-device support.

Publishing content

Publishing from multiple devices with the same profile is possible – one just needs to import the private key for the signatures and the profile information to the other device.

Though, this needs some sort of merging mechanism if two posts are published from two (or more) devices at the same time / without the other devices being online to get notified of the new point of truth.

As creating two posts from two separate devices would create two new versions of the profile (because of the IPLD linking), two points of truth would suddenly exist, so a merging mechanism must be implemented to merge multiple points of truth for the profile. This could yield a rather large network of profile versions, but ultimately a DAG (Directed Acyclic Graph):

        Profile Init
             ^
             |
          Post A
             ^
             |
          Post B <----+
             ^        |
             |        |
  +-----> Post C    Post C'
  |          ^        ^
  |          |        |
Post D    Post D'   Post D''
  ^          ^        ^
  |          |        |
  |          +--------+
  |          |
  |       Post E
  |          ^
  |          |
  +----------+
             |
             |
          Post F

A scenario like the one above (each Post also represents a new version of the profile) would be easy to create with three devices:

  1. One starts using the network on a notebook
  2. Post A published from the notebook
  3. Post B published from the notebook
  4. Profile added on the workstation
  5. Post C published from the notebook while off of the internet
  6. Post C' published on the workstation
  7. Profile added to the mobile phone (from the notebook)
  8. Post D published from the mobile while off of the internet
  9. Post D' published from the notebook while off of the internet
  10. Post D'' published on the workstation
  11. Notebook comes back online, Post E published, merging the state from Post D'' from the workstation and Post D' from the notebook itself.
  12. Phone comes online, one of the devices is used to publish Post F, merging the state from Post D and Post E.

In this scenario, there would still be one problem, though: if the profile is published as an IPNS name, branching off of versions would be problematic. If C is published while C' is published, both devices would publish their version as an IPNS name. Now, first come, first served applies. And of course that is problematic, because every device would always see one of the posts, but no device could see the other. Only at E (in the above example), when the branches are merged, would both C and C' be visible (though D wouldn't be visible as long as it isn't merged into the chain). But how does a device discover that there are two “current” versions which have to be linked to the new post?

So, discoverability is an issue in this approach. Maybe someone can come up with a clean and easy solution that would work for netsplit and all those scenarios.

One idea would be that there is a profile key which is used to publish profile versions under an IPNS name, as well as a device key per device, which is used to announce profile versions under a separate IPNS name. That IPNS name could be added to the profile, so each other device can find it and fetch “current” versions from each device. Only the initial setup of a new device would then need to be done carefully.

Or, maybe, the whole approach is wrong and another approach would fit better for this kind of problem. I don't know.

Subscribing

Another issue with multi-device support would be subscribing. For example, if a user (let's call her Amy) subscribes to another user (let's call him Sheldon) on her notebook, this information needs to be stored somehow. And because Amy's machines do not necessarily sync with each other, her mobile phone may never know that following Sheldon is a thing now!

This problem could be solved by storing the “follow” information in her public profile. Although some users might not like everyone to know whom they follow. Cryptographic measures could be considered to fix the visibility issue.

But then, users may want to “categorize” their friends, store them in groups or whatever. This information would be stored in the public profile as well, which would create even more noise on the network. Also, because cryptography is hard and the information would be stored forever, this might not be an option, as some day the crypto might be broken and reveal all the things that were stored privately before.

Deleting profile versions

At some point, a user may want to remove a biography entry or a user name she once published. Because all the information is chained in a long chain of versions, one may think that deleting a node is not possible. But it is!

Consider the following (simple) graph of profile versions:

A<---B<---C<---D<---E

If the user now wants to delete node C in this graph, she simply drops it. Now, with E being the latest point of truth, one may think that finding B and A is not possible anymore. That's true. But why not work around this by creating a new profile version which links the remaining previous versions:

A<---B     <---D<---E<---F
      \                 /
       -----------------

Of course, D would now point to a node which does not exist. But that is not a problem. Indeed, it's a fundamental concept of the idea – that content may be unavailable.

F does not have to contain new content. It even should not, because dropping F because of its content would become harder this way. Also, creating new versions of the profile is simple and cheap.

Problems are hard in distributed environments

I do not claim to know the final solution to any of these problems. It's just that I think about them and would love to get an open conversation started on the whole subject of distributed social networks and the problems that come with them.

tags: #distributed #network #open-source #social #software

This post was written during my trip through Iceland and published much later than it was written.

When writing my last entry, I argued that we need decentralized issue tracking.

Here's why.

Why these things must be decentralized

Issue tracking, bug tracking and of course pull request tracking (which could all be named “issue tracking”, btw) must, in my opinion, be decentralized. That's not only because I think offline-first should be the way to go, even today in our always-connected and always-on(line) world. It's also because of redundancy and failure safety. Linus Torvalds himself once said:

I don't do backups, I publish and let the world mirror it

(paraphrased)

And that's true. We should not need to do backups; we should share all data, in my opinion. Even absolutely personal data. Yes, you read that right: me, the Facebook/Google/Twitter/Whatsapp/whatever-free living guy tells you that all data needs to be shared. But I also tell you: never ever unencrypted! And I do not mean transport encryption, I mean real encryption. Truly unbreakable is not possible, but practically unbreakable for at least 250 years should do. If the data is shared in a decentralized way, like IPFS or Filecoin try to do, we can (almost) be sure that if our hard drive fails, we don't lose the data. And of course, you can still do backups.

Now let's get back to the topic. If we decentralize issue tracking, we can make sure that the issues are around somewhere. If github closed down tomorrow, thousands, if not millions, of open source projects would lose their issues. And that's not only the current issues, but also the history of the issue tracking, which includes data on how a bug was closed, how a bug should be closed, which features were implemented why or how, and so on. If your self-hosted solution loses data, like gitlab did not long ago on their gitlab.com hosted instance, the data is gone forever. If we decentralize these things, more instances have to fail to bring the whole project down.

There's simple probability behind these things: the more instances a distributed system has, the more likely it is that some instance is down right now, but at the same time, the less likely it is that the whole system is down. And the latter likelihood does not shrink linearly with the number of instances, it shrinks exponentially. If every instance were independently offline 10% of the time, the chance that all of 20 instances are offline at once would be 0.1^20 – practically zero. That means that with 10, 15 or 20 instances you can be pretty sure that your data is alive somewhere even if your own instance fails.

Now think of projects with many contributors. Maybe not as big as the kernel, which has an exceptionally big community. Think of communities like the neovim one. The tmux project. The GNU Hurd (have you Hurd about that one?) or similar projects. If in one of these projects 30 or 40 developers are actively maintaining things, their repositories will never die. And if the repository holds the issue history as well, we get backup safety there for free. How awesome is that?

I guess I made my point.

tags: #open-source #programming #software #tools #git #github

This post was written during my trip through Iceland. It is part of a series on how to improve ones open-source code. Topics (will) contain programming style, habits, project planning, management and everything that relates to these topics. Suggestions welcome.

Please note that this is totally biased and might not represent the ideas of the broad community.

When it comes to hobby projects in the open source world, one often works alone on the codebase. There is no community around your tool or library that actively pushes towards a nice codebase. Anyway, you can have one. And what matters in this regard (not only, but also) is a nice modularization of your codebase.

Why modularization

It is not always beneficial to modularize your code. But if your program or library hits a certain size, it can be. Modularization helps not only with readability, but also with separation of concerns, which is an important topic. If you do not separate concerns, you end up with spaghetti code and code duplication, which leads to doubled effort – essentially more work for the same result. And we all know that nobody wants that, right? In the end we're all lazy hackers. Also, if you write code twice, thrice or maybe even more often, you end up with more bugs. And of course you also don't want that.

SoC

Now that the “why” is clear, the “how” is the next step to think about.

We talked about the separation of concerns in the last section already. And that's what it boils down to: you should separate concerns. Now, what is a concern? An example would be logging. To be fair, you should use a decent library for the logging in your code, but that library might need some initialization routines called and some configuration passed on. Maybe you even need to wrap some things to be able to have nice logging calls in your domain logic – you should clearly modularize the logging-related code and move the vast majority of it out of your main code base. Another concern would be file IO. Most of the time when you're doing file IO, you're duplicating things. You don't need to catch IO errors in every other function of your code – wrap these things and move them to another helper function. Maybe your file IO is more complex and you need to read and write multiple files with a similar structure all the time – a good idea would be to wrap these things in a module that handles only that. The concern of file IO is now encapsulated in a module, errors are handled there, and users of that module know what they're working with: an abstraction which either works or fails in a defined way, so that they don't have to re-write error handling in every other function but can simply forward errors.
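As a minimal sketch (module and function names are made up for illustration), the file IO concern could live in one small module that owns the error handling, while every caller just forwards its Result:

mod config_io {
    use std::fs;
    use std::io;
    use std::path::Path;

    // The whole file IO concern is concentrated here; callers only see a
    // small interface and forward io::Result instead of handling raw errors.
    pub fn read_config(path: &Path) -> io::Result<String> {
        fs::read_to_string(path)
    }

    pub fn write_config(path: &Path, contents: &str) -> io::Result<()> {
        fs::write(path, contents)
    }
}

fn main() -> std::io::Result<()> {
    use std::path::Path;

    config_io::write_config(Path::new("app.conf"), "verbose = true\n")?;
    let cfg = config_io::read_config(Path::new("app.conf"))?;
    println!("{}", cfg);
    Ok(())
}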

If you think about that for a moment, you'll notice that modularization is a multilevel thing. Once you have nicely defined modules for several concerns, you might build new and more high-level modules out of them. And that's also exactly what frameworks do most of the time: they combine libraries (which might already be at a certain level of abstraction) into a new, even more high-level library for another concern. For example, a web framework combines a database library, a templating library, a middleware library, a logging library, and authentication and authorization libraries into one big library, which is then – inconsistently – called a framework.

CCC

No, I'm not talking about the Chaos Computer Club here, but about Cross-Cutting Concerns. A cross-cutting concern is one that is used throughout your entire codebase. For example logging: you want logging calls in your data access layer, your business logic and your user interface logic.

So logging is a functionality all modules of your software want and need. But because it is cross-cutting, it is not possible to insert it as a layer in your stack of modules. So it is really important that your cross-cutting concerns are as loosely coupled to all other modules as possible. In the case of logging, this is particularly easy, because after the general setup of the logging functionality, which is done exactly once, you have a really small interface to that module. Most of the time, it is not more than four or five functions: one call for each level of verbosity (for example: debug, info, warn, error).
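A minimal sketch of such a loosely coupled cross-cutting module (all names are made up for illustration; a real project would of course delegate to a proper logging library internally):

mod logging {
    // one-time setup of the underlying logging library would go here
    pub fn init(verbosity: u8) {
        let _ = verbosity;
    }

    // the whole interface the rest of the codebase ever sees
    pub fn debug(msg: &str) { println!("[debug] {}", msg); }
    pub fn info(msg: &str)  { println!("[info ] {}", msg); }
    pub fn warn(msg: &str)  { eprintln!("[warn ] {}", msg); }
    pub fn error(msg: &str) { eprintln!("[error] {}", msg); }
}

fn main() {
    logging::init(1);
    logging::info("application started");
    logging::error("something went wrong");
}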

Naming things...

I know, I know... this bikeshedding topic again. I don't want to elaborate too extensively on this subject, though a few thoughts must be said.

First of all, module names should be like function names (which I explained in the last episode) – short, to the point and describing what a function, or in our case a module, does. And that's it. A module should be named after its domain. If the module does, for example, contain only some datatypes for your library, or only algorithms to calculate things, it should be named exactly that: “types” or “algo”.

A second thing I would clearly say is that naming modules is a one-word game. Module names should hardly ever contain more than one word – why would they? A domain is one word, as shown above.

Multiple libraries

What I really like to do is to have several libraries for one application, if the codebase hits a certain complexity. I like to take imag as an example. Imag is a collection of libraries which build on a core library and offer abstractions over it. This is a great separation of concerns and functionality. A user of the imag source can use some libraries to get a certain functionality and leave out others if their functionality is not needed. That's a great separation of concerns with loose coupling. But when it comes to this approach, another topic becomes important as well, and we'll talk about that in the next episode.

The naming restrictions I stated above apply to library names which are solely for separation inside of an application, too, though I would relax the one-word-name rule a bit here.

Next

In the next Episode of this blog series we will talk about another important subject which is related to modularization: API design.

tags: #open-source #programming #software