Rewriting git history simply with git-filter-repo

In this post I describe how I used git-filter-repo in Docker to rewrite the history of a git repository to move files into a subfolder.

Background: rewriting git history

As a git user, I like to rebase. I like to make lots of small commits and tidy them up later using interactive rebase, and to rewrite my PRs to make them easier to understand (and review). I use git push origin --force-with-lease so much that I have it aliased as git pof.

What I don't do is rewrite the history of my main/master branch. There's a whole world of pain there, as other people will likely have started branches from the branch, and they can easily end up in a complete mess.

However, sometimes it makes sense.

I was working on a small side project the other day, when I realised it would really make sense for it to effectively be a "monorepo". So rather than having all the existing code in the root directory, I wanted to move it to a child directory.

So I started with a directory that looked like this:

[Image: directory before the changes]

And I wanted a directory that looked like this:

[Image: directory after the changes]

The notable points here are:

  • Everything has been moved to an engine subfolder
  • Except the .gitattributes and .gitignore files, which are still at the top level.

The simplest way to do this is to just move all the files, and create a new commit with the changes, job done. The downside to that is that while git itself is ok at tracking file moves (it sometimes gets things wrong), it can cause some other issues.

For example, if you're looking at a file on GitHub, and you want to see what it looks like at a particular commit, then you can use the branch selector to change it. However, if the file has moved, you'll get a 404. Not a great experience.

[Image: changing the branch for a file and getting a 404 in GitHub]

If the odd file has moved, that's not a big deal, but if literally every file has moved, that's not a great experience.
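For comparison, the simple "move everything and commit" approach described above might look roughly like this. It is a minimal sketch, not what I ended up doing: move_into_engine is a hypothetical helper name, and it assumes the repository already has at least one commit and a configured git identity.

```shell
# Sketch of the "just move everything" approach: one ordinary commit, no history rewrite.
# "move_into_engine" is a hypothetical helper; "$1" is the path to the repository.
move_into_engine() {
  repo="$1"
  mkdir -p "$repo/engine"
  # List the top-level entries of the last commit and move each one,
  # leaving the two dotfiles (and engine/ itself) where they are.
  git -C "$repo" ls-tree --name-only HEAD | while read -r entry; do
    case "$entry" in
      .gitignore|.gitattributes|engine) continue ;;
    esac
    git -C "$repo" mv "$entry" "engine/$entry"
  done
  git -C "$repo" commit -q -m "Move everything into engine/"
}

# Usage (path is a placeholder):
#   move_into_engine ~/code/my-project
```

After a commit like this, the new engine/ paths simply don't exist in any older commit, which is what produces the 404 described above.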

So what's the alternative? Rewriting history!

Rewriting history: the options

With rewriting history, we update the git branches to make it look like all the files were originally committed to the engine subfolder. There's no "sudden move". The history shows them as always having been in the engine folder.

This sort of wholesale rewriting of your main/master branch is definitely not advisable if you are sharing the repo publicly. You will likely break all sorts of people's work!

Normally when I'm rewriting history I use git rebase -i in combination with git reset HEAD~. This lets me squash commits together, pause to split them apart, reorder them, or remove them entirely. That's great for when you're massaging a PR, but it's really not designed for wholesale rewriting of an entire repository.

For those scenarios, git filter-branch is a better option. This is a complex git command that, frankly, scares me. I have used it on occasion, but the syntax is janky, you typically have to incorporate a lot of bash, it's often slow, and you could mess up your whole repository. Yay!

Just take a look at this Stack Overflow question, which is about a similar requirement but in reverse: moving from the engine folder to the root. One of the answers suggests running the following command:

git filter-branch -f --index-filter 'PATHS=`git ls-files -s | sed "s/^engine//"`; \
  GIT_INDEX_FILE=$GIT_INDEX_FILE.new; \
  echo -n "$PATHS" | \
  git update-index --index-info \
  && if [ -e "$GIT_INDEX_FILE.new" ]; \
    then mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"; \
  fi' -- --all

That's definitely something. Does it work? Probably. Would you want to write your own? Almost certainly not.

So instead of trying to figure out how to mangle git filter-branch to my liking, I decided to look at a suggestion I saw elsewhere: git-filter-repo.

"Installing" git-filter-repo using Docker

git-filter-repo isn't built in to git itself. In fact, it's a single Python file, but it's written to feel like a git plugin. And the really nice thing is that the API is so much nicer. That whole git filter-branch expression in the previous section could be rewritten with git-filter-repo to be something like this:

git filter-repo --path-rename engine/:  

I think you'll agree that's much clearer! The manual is also very good, with lots of examples.
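To make the old:new form of --path-rename concrete: engine/: means "replace the leading engine/ with nothing". The sketch below only imitates that effect on a list of path names using sed. It doesn't call git-filter-repo at all, and strip_engine_prefix and the example paths are made up for illustration.

```shell
# Imitate what "--path-rename engine/:" does to path names:
# a leading "engine/" is replaced with the (empty) new prefix.
# "strip_engine_prefix" is a hypothetical helper, not part of git-filter-repo.
strip_engine_prefix() {
  sed 's|^engine/||'
}

# Paths under engine/ lose the prefix; anything else passes through unchanged.
printf '%s\n' engine/src/app.js engine/.gitignore docs/readme.md | strip_engine_prefix
# prints:
#   src/app.js
#   .gitignore
#   docs/readme.md
```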

The only problem, from my point of view, is that git-filter-repo is a Python module. Python on Windows can be problematic (even the install instructions make that clear) and while you can install Python from the Microsoft Store, I really didn't want to go through that. Docker to the rescue!

This is such a great use case for Docker: I want to quickly try a tool, and I don't want to risk messing up my machine. Instead of installing Python, I'll run a Docker image that already has Python installed, map the drive to my project, and work inside the Docker container!

git-filter-repo requires Python 3.5+, so I searched for Python on Docker Hub and found the official images. At the time of writing, the python:3 image is a bullseye (Debian 11) image with Python 3.10 installed, which would do nicely.

I ran the following command from inside my app to pull and run the Docker image, to map the current directory to the /app directory inside the container, set the working directory to /app, and to start a bash shell.

docker run --rm -it -v ${PWD}:/app -w /app python:3 /bin/bash  

I now have a running Python container, but I don't have the git-filter-repo tool installed yet. The python:3 image uses Debian 11, and according to the git-filter-repo install instructions, I needed to use the "backports" repository to install via apt-get.

A repository in this context refers to the server containing all the packages used by apt for installation into a Linux machine. It is separate from the concept of a "git repository".

Unfortunately the backports repository isn't enabled by default in Debian 11, so I followed the instructions from the backports website to add it to the sources list, and installed the git-filter-repo package:

# Add the backports repo to sources.list
echo 'deb http://deb.debian.org/debian bullseye-backports main' > /etc/apt/sources.list.d/backports.list

# Update the list of available packages
apt-get update

# Install git-filter-repo, adding the required /bullseye-backports suffix
apt-get install -y git-filter-repo/bullseye-backports

The logs indicated this had installed correctly, so I was ready to take it for a spin!

Using git-filter-repo to move files into a subdirectory

My first attempt to use git-filter-repo wasn't very successful. I tried running:

git filter-repo --to-subdirectory-filter engine/  

which seemed like it would do most of what I wanted, but I was presented with the following:

> git filter-repo --to-subdirectory-filter engine/
Aborting: Refusing to destructively overwrite repo history since
this does not look like a fresh clone.
  (expected freshly packed repo)
Please operate on a fresh clone instead.  If you want to proceed
anyway, use --force.

This is very interesting! Rewriting history is obviously a very destructive process in which you can lose work, and git-filter-repo is doing its best to make sure you don't hurt yourself. As long as you have your work pushed to a remote git repository you should be fine, but to be safe, git-filter-repo requires you work in a fresh clone by default.
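A fresh clone can come from the remote, or from a copy you already have on disk. For the local case, a minimal sketch might look like the one below. fresh_clone is a hypothetical helper name and the paths are placeholders. git clone's --no-local flag makes git use its normal transport rather than copying or hard-linking files, which should give a repository that is packed the same way as one cloned from a remote.

```shell
# Hypothetical helper: make a fresh clone of the local repository at "$1" into "$2".
# --no-local tells git clone to use its normal transport instead of copying or
# hard-linking the object files, so the result is packed like a clone from a remote.
fresh_clone() {
  git clone -q --no-local "$1" "$2"
}

# Usage (paths are placeholders):
#   fresh_clone ~/code/my-project /tmp/my-project-fresh
#   cd /tmp/my-project-fresh
```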

This seemed very sensible to me, so I did as it asked, created a fresh clone, and tried again:

> git filter-repo --to-subdirectory-filter engine/

Parsed 24 commits
New history written in 2.37 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
HEAD is now at 547b073 Use alternate robots.txt
Enumerating objects: 375, done.
Counting objects: 100% (375/375), done.
Delta compression using up to 4 threads
Compressing objects: 100% (161/161), done.
Writing objects: 100% (375/375), done.
Total 375 (delta 189), reused 327 (delta 189), pack-reused 0
Completely finished after 6.32 seconds.

That's much better! As you can see from the logs, git-filter-repo was very busy, rewriting the commits. Taking a look at the results afterwards, everything except the .git folder had been moved to the engine subfolder:

[Image: all files have been moved to the engine subfolder]

and the history (shown with gitk here) shows that the original commits were all to the engine folder.

[Image: gitk shows the files were always committed to the engine folder]

This is almost exactly what I want, except I wanted the .gitignore and .gitattributes to remain at the top level.

I'll come back to those strange replace/* tags in the gitk image shortly.

The easiest way to fix the .gitignore location was more rewriting! I ran the following command to move the .gitignore and .gitattributes files back up to the root folder:

> git filter-repo \
    --path-rename engine/.gitattributes:.gitattributes \
    --path-rename engine/.gitignore:.gitignore

Parsed 24 commits
New history written in 1.35 seconds; now repacking/cleaning...
Repacking your repo and cleaning out old unneeded objects
HEAD is now at f554e31 Use alternate robots.txt
fatal: replace depth too high for object 8027f9f8670e3da4762099d39e733bcfa44fea39
fatal: failed to run pack-refs
Completely finished after 2.45 seconds.

That appeared to work, as I now had the folder structure I wanted. But there were two slightly worrying fatal error messages in the logs 🤔 On top of that, when I tried opening gitk I got the following error message:

Error reading commits: fatal: replace depth too high for object 8027f9f8670e3da4762099d39e733bcfa44fea39

That's a bit concerning 😟 Luckily, after a bit of Googling, I found I could fix the issue by running:

> git replace -d 8027f9f8670e3da4762099d39e733bcfa44fea39
Deleted replace ref '8027f9f8670e3da4762099d39e733bcfa44fea39'

After that, I could successfully open gitk, and could see that the .gitignore and .gitattributes files were again in the root, with everything else in the engine folder:

[Image: gitk shows the gitignore files in the root folder, with everything else in the engine subfolder]

So with that, my work was pretty much done. But that fatal error was bugging me, as were all those extraneous replace/ refs.

It took me a little while to work out what those refs even were, but eventually I pinned it down to a git feature called git-replace. That feature is worth a whole blog post on its own, so for now I'll just point you to the docs if you're interested, and I'll walk through the feature in a subsequent post.
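If you want to see (or clear out) the replace refs in your own clone, something like the sketch below should do it. git replace -l and git replace -d are standard git commands, and the second is the one I used above. The two wrapper functions are hypothetical helpers added purely for illustration.

```shell
# List the replace refs in the repository at "$1": one replaced object ID per line.
# "list_replace_refs" is a hypothetical wrapper around "git replace -l".
list_replace_refs() {
  git -C "$1" replace -l
}

# Hypothetical helper: delete every replace ref in the repository at "$1".
delete_all_replace_refs() {
  list_replace_refs "$1" | while read -r object_id; do
    git -C "$1" replace -d "$object_id" >/dev/null
  done
}

# Usage (path is a placeholder):
#   list_replace_refs /tmp/my-project-fresh
#   delete_all_replace_refs /tmp/my-project-fresh
```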

I decided to start again, and this time I told git-filter-repo I didn't need the extra replace/ references by passing --replace-refs delete-no-add:

# Move everything to the engine/ subfolder
git filter-repo --replace-refs delete-no-add --to-subdirectory-filter engine/

# Move .gitignore and .gitattributes back to the root
git filter-repo --replace-refs delete-no-add \
  --path-rename engine/.gitattributes:.gitattributes \
  --path-rename engine/.gitignore:.gitignore

This time there were no fatal errors in the logs, gitk opened without any errors, and all the replace/ references were gone. Success! With that I could exit the Docker container, double check everything was correct, and do a git push origin --force-with-lease of my newly rewritten repo!
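As part of that double check, one option is to list every path that appears anywhere in the rewritten history and flag anything that isn't under engine/ or one of the two root dotfiles. This is a rough sketch rather than something from my original session, and unexpected_paths is a hypothetical helper name. No output means the layout is as intended.

```shell
# Print every path that appears anywhere in history (all refs) but is neither
# under engine/ nor one of the two root-level dotfiles.
# "unexpected_paths" is a hypothetical helper; "$1" is the path to the rewritten clone.
unexpected_paths() {
  git -C "$1" -c core.quotePath=false log --all --format= --name-only \
    | sort -u \
    | grep -v -e '^$' -e '^engine/' -e '^\.gitignore$' -e '^\.gitattributes$' \
    || true   # grep exits non-zero when nothing is left, which is the good case
}

# Usage (path is a placeholder):
#   unexpected_paths /tmp/my-project-fresh
```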

All in all, I'm very impressed with git-filter-repo, and using it inside the Docker container is clean and painless, so I'd definitely recommend it!

Summary

In this post I described a scenario where I wanted to rewrite the history of a git repository to make it appear as though some files were originally created in a sub-folder instead of the root folder. I described how to run a python:3 Docker container, how to install git-filter-repo, and the commands required to move all the files except .gitattributes and .gitignore to an engine subfolder. To make it simpler, I've reproduced the main steps here:

  1. Create a fresh clone of your repository, and cd to the clone directory

# Clone my/repo to output_directory
git clone https://github.com/my/repo output_directory
cd output_directory

  2. Run a python:3 Docker container interactively, and install git-filter-repo inside it

# Run the Docker container
docker run --rm -it -v ${PWD}:/app -w /app python:3 /bin/bash

# Inside the container, install git-filter-repo
# Add the backports repo to sources.list
echo 'deb http://deb.debian.org/debian bullseye-backports main' > /etc/apt/sources.list.d/backports.list

# Update the list of available packages
apt-get update

# Install git-filter-repo, adding the required /bullseye-backports suffix
apt-get install -y git-filter-repo/bullseye-backports

  3. Run the git-filter-repo commands to move all the files to the engine subdirectory, and then move the .gitignore and .gitattributes files back. Don't create replace/ refs.

# Move everything to the engine/ subfolder
git filter-repo --replace-refs delete-no-add --to-subdirectory-filter engine/

# Move .gitignore and .gitattributes back to the root
git filter-repo --replace-refs delete-no-add \
  --path-rename engine/.gitattributes:.gitattributes \
  --path-rename engine/.gitignore:.gitignore