Automatically finding branches in Subversion repos

Subversion lacks a formal mechanism for branches and tags. These are distinct entities in many other source control systems, but in Subversion they are just a combination of convention and Subversion’s copy primitive.

This is awkward for new users of Subversion, who have to learn how to follow the convention, and it even puts a small burden on creators of new repos, who have to decide which pattern to follow. But it’s an ambiguously flawed feature, because it does have some benefit - it allows for mega-repos that contain many independent projects, each of which can be independently branched, and can even be cross-branched.

Of course, all this flexibility becomes a pain when you want to convert a Subversion repo to, say, Git. Converting the files themselves is trivial; the pain comes when you want to faithfully represent the old repo’s lines of development.

There are two typical patterns when using Subversion. The first is to have this top-level structure for a single-project repository:

/branches
/tags
/trunk

Your main line of development is in /trunk, you put branches as directories in /branches, and you put tags in /tags. Oh, right, forgot to mention - Subversion doesn’t have tagging either. There’s nothing to prevent you from committing to a copy made in the /tags hierarchy, thus turning it into a de facto branch, because it’s all convention. Hold that thought.
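To make the tag problem concrete, here is a minimal sketch of how you might spot tags that were committed to after they were created. It assumes you have already extracted the changed paths from the repo history as (revision, action, path, copied_from) tuples in revision order, with action letters in the style of svn log --verbose (A, M, D). The function name and the tuple shape are assumptions made for illustration, and it only handles the single-project layout shown above.

```python
def find_modified_tags(changes):
    """Return {tag_root: [revisions]} for tags touched after creation.

    `changes` is an iterable of (revision, action, path, copied_from)
    tuples in revision order. This shape is hypothetical; adapt it to
    however you read the history.
    """
    created = {}   # tag root -> revision in which it was copied into /tags
    modified = {}  # tag root -> later revisions that changed it
    for rev, action, path, copied_from in changes:
        parts = path.strip("/").split("/")
        if len(parts) < 2 or parts[0] != "tags":
            continue  # not inside the /tags hierarchy
        root = "/tags/" + parts[1]
        if len(parts) == 2 and action == "A" and copied_from is not None:
            # The tag itself being created by a copy.
            created[root] = rev
        elif root in created and rev > created[root]:
            # Any later change under the tag makes it a de facto branch.
            modified.setdefault(root, []).append(rev)
    return modified


history = [
    (10, "A", "/tags/1.0", "/trunk"),
    (11, "M", "/trunk/README", None),
    (12, "M", "/tags/1.0/README", None),
    (13, "A", "/tags/1.1", "/trunk"),
]
print(find_modified_tags(history))
```

This deliberately simplifies: a tag that is deleted and re-created, or one assembled from several partial copies, would need more careful handling.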

The other typical pattern is when Subversion is used to host a multi-project repository. In that case, you might have this:

/abdera
    /branches
    /tags
    /trunk
/accumulo
    /branches
    /tags
    /trunk
...
/httpd
    /branches
    /tags
    /trunk
...

As you might guess, this is from the Apache Software Foundation repo at https://svn.apache.org/repos/asf. Except, no, the reality is a lot messier.

Some projects, like ace or avro, have a handful of files and directories alongside the trunk and branches dirs.

KEYS
branches/
doap.rdf
releases/
sandbox/
site/
trunk/

The httpd (Apache) project has a host of nested projects, each of which has its own branches, tags and trunk. This is a subset, in the interests of space:

apreq/
docs-build/
flood/
httpd/
    branches/
    tags/
    trunk/
    vendor/
    win32-msi/
mod_fcgid/
mod_ftp/
    branches/
    tags/
    trunk/
mod_mbox/
mod_spdy/
mod_wombat/
sandbox/
site/
test/

And maven - well, maven is its own special child. This is a subset of the top level:

app-engine/
archetype/
maven-1/
maven-2/
maven-3/
project/
release/
resources/
retired/
scm/
shared/
site/
trunks/

Look, a trunks directory. Which is - empty? And the maven-1 dir has an empty trunks. We finally find source in /maven/maven-2/branches. And the mainline of development is really /maven/maven-3/trunk.

So, while the ASF repo is an example of the multi-project repo approach, it’s also an example of the heterogeneity of large Subversion repos. The approach up until now when converting a repo is that you have to describe the structure, which can be time-consuming, and you can get it wrong. But maybe that’s enough?
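As a rough illustration of what "describing the structure" can look like, here is a small sketch of a hand-written layout description and a matcher for it. The pattern syntax (a * standing for exactly one path segment), the LAYOUT table, and the function name are all assumptions made for this example, not the format of any particular conversion tool.

```python
# Hypothetical layout description for a multi-project repo:
# each pattern is matched against the leading segments of a path.
LAYOUT = [
    ("*/trunk", "trunk"),
    ("*/branches/*", "branch"),
    ("*/tags/*", "tag"),
]


def match_line(path, layout=LAYOUT):
    """Return (kind, root) for the line of development containing `path`,
    or None if no pattern in the description applies."""
    parts = [p for p in path.split("/") if p]
    for pattern, kind in layout:
        pat = pattern.split("/")
        if len(parts) >= len(pat) and all(
            p == "*" or p == q for p, q in zip(pat, parts)
        ):
            return kind, "/" + "/".join(parts[: len(pat)])
    return None


print(match_line("/accumulo/trunk/README"))
print(match_line("/abdera/branches/1.x/README"))
print(match_line("/httpd/httpd/trunk/README"))
```

The last call returns None: the description assumes one level of projects, so the nested httpd/httpd/trunk falls through. That is exactly the kind of mistake that is easy to make when the structure is written by hand.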

What about the fact that a Subversion copy operation can copy any subtree and can perform topologically absurd actions? For example, copying root to a point in the tree, or copying a point in the tree to a parent of that point?
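The two odd cases just mentioned can at least be detected by comparing the source and destination paths of a copy. The sketch below does that with plain path arithmetic; the function name and the labels it returns are made up for illustration.

```python
from pathlib import PurePosixPath


def copy_relation(src, dst):
    """Classify how a copy's source and destination paths relate."""
    s = PurePosixPath("/") / src.strip("/")
    d = PurePosixPath("/") / dst.strip("/")
    if s == d:
        return "same"
    if s in d.parents:
        # Source is an ancestor of destination, e.g. copying root
        # to a point inside the tree.
        return "into-self"
    if d in s.parents:
        # Destination is an ancestor of source, e.g. copying a
        # point in the tree to a parent of that point.
        return "onto-ancestor"
    return "disjoint"


print(copy_relation("/", "/a/b"))
print(copy_relation("/a/b/c", "/a"))
print(copy_relation("/trunk", "/branches/1.x"))
```

An ordinary branch or tag copy lands in the "disjoint" case; the other results are the ones a converter would need to treat specially.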

By the way, it is possible to compose a Git repo as a Subversion repo, it’s just bizarre and unwieldy. Take master and all branches, and add them as subtrees of a new root. This takes up almost no space in the Git repository, just like with Subversion. The difference is that unless you do selective checkout, you’re going to have a very large working folder. Selective checkout is the default behavior in Subversion.