4. Git Concepts, Data Model and Commands

Version Control for Cell Clusters — A Git Journey in the Lab

Dr. X is a computational biologist working on a new single-cell RNA sequencing (scRNA-seq) pipeline. Like most data scientists, their workflow involves constant tweaking: preprocessing, quality control, normalization, clustering, and visualization.

They are tired of lost scripts, mysterious file versions like clustering_final_FINAL.R, and overwritten notebooks.

So, they decided to version-control their pipeline properly using Git.

1. git init - Starting a new experiment

They begin their project as they would a new experiment — setting up a clean workspace.

mkdir CellClusterFlow
cd CellClusterFlow
git init

Output

Initialized empty Git repository in /path/to/CellClusterFlow/.git/

This initialises an empty Git repository, like labeling an empty freezer box before adding samples. Every change from here on will be tracked.

What just happened? The .git directory

The git init command created a hidden .git directory — the heart of your repository.

ls -la

Output:

drwxr-xr-x  .git/

  • This .git folder is Git's "lab notebook" — it contains:
.git/
├── HEAD              # Points to current branch
├── config            # Repository settings
├── description       # Repository description (rarely used)
├── branches/         # (deprecated, legacy Git)
├── hooks/            # Scripts triggered by Git events
├── info/             # Additional repository information (excludes, attributes)
├── objects/          # Database of all file versions (commits, trees, blobs)
└── refs/             # Pointers to commits (branches, tags)
    ├── heads/        # Local branches
    └── tags/         # Version tags
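To make the role of HEAD concrete, here is a minimal sketch that creates a throwaway repository and reads the file directly. The temporary directory and the `-b main` option (available from Git 2.28) are assumptions made for the demo, not part of Dr. X's project.

```shell
# Throwaway repository (assumption: Git 2.28+ for the -b option)
demo=$(mktemp -d)
cd "$demo"
git init -q -b main

# HEAD is a plain text file that names the current branch
head_content=$(cat .git/HEAD)
echo "$head_content"        # ref: refs/heads/main

# No commits yet, so refs/heads/ is still empty (prints nothing)
ls -A .git/refs/heads
```

Reading these files is safe; editing them by hand is best avoided.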

Never delete .git!

Deleting this folder erases all version history. Your files remain, but Git forgets everything — like burning your lab notebook while keeping only today's samples.

Checking if a directory is a Git repository

    # If .git exists, you're in a Git repo
    ls -la .git

    # Or ask Git itself: run git status (covered next)
    # Outside a repo it prints: "fatal: not a git repository"
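For scripts, a quieter check is `git rev-parse --is-inside-work-tree`, which prints `true` inside a repository and fails outside one. The sketch below assumes a fresh temporary directory that is not nested inside another repository.

```shell
# Throwaway directory that is not a repository yet
demo=$(mktemp -d)
cd "$demo"

# Outside a repo the command fails; the error text is silenced here
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
  echo "already a repository"
else
  echo "not a repository yet"
fi

git init -q

# Inside a repo the same command prints "true"
inside=$(git rev-parse --is-inside-work-tree)
echo "$inside"
```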

Creating essential files

Dr. X sets up the initial project structure:

Create a mock file-set

touch README.md LICENSE data_preprocessing.py qc_filtering.py clustering.R visualization.ipynb

Files created:

CellClusterFlow/
├── .git/                    # Git's database (hidden)
├── README.md                # Project documentation
├── LICENSE                  # Usage terms
├── data_preprocessing.py    # Data loading and cleaning
├── qc_filtering.py          # Quality control filters
├── clustering.R             # Clustering algorithms
└── visualization.ipynb      # Plotting and figures

At this point, Git can see these files in the working directory, but isn't tracking them yet.

That's what git add will do next.

flowchart LR
  A[Empty directory] -->|git init| B["Repository with .git/ folder
  (Git's brain)"]
  B -->|create files| C["Working directory with files
  (untracked)"]

  style A fill:#f0f0f0,stroke:#333
  style B fill:#e6f7ff,stroke:#1890ff,stroke-width:3px
  style C fill:#fff7e6,stroke:#fa8c16

Working directory vs Repository

  • Working directory: The files you can see and edit
  • Repository (.git/): Git's internal database of all versions

Think of it like:

  • Working directory = Your lab bench (current experiment)
  • Repository = Your archived lab notebooks (all past experiments)

2. git status — Checking the lab bench

Just like checking which samples are unprocessed, they inspect their project’s state.

git status
Output
On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
    LICENSE
    README.md
    clustering.R
    data_preprocessing.py
    qc_filtering.py
    visualization.ipynb

nothing added to commit but untracked files present (use "git add" to track)

Git reports untracked files — nothing is committed yet. This command should become a habit; they run it before almost every operation.
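Once the long output feels familiar, `git status --short` gives the same information in one line per file. A minimal sketch, using a throwaway directory and two of the file names from above as placeholders:

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
touch README.md clustering.R

# --short prints one line per file; "??" marks an untracked file
short_status=$(git status --short)
echo "$short_status"
# ?? README.md
# ?? clustering.R
```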

flowchart LR
  WD["`Working Directory
  (modified/untracked files)`"] --> IDX[Index]
  IDX --> HEAD["`HEAD
  (committed snapshot)`"]
  style WD fill:#fffbe6,stroke:#333
  style IDX fill:#e6f7ff,stroke:#333
  style HEAD fill:#e6ffe6,stroke:#333

3. git add — Staging files for the record

Like labeling tubes before freezing them, they stage files to prepare for a permanent record.

git add README.md LICENSE
git add .

This copies the current content of the files from the working directory into the staging area, also known as the index. The files stay on disk where they are; only staged content will be committed.
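To see the staging area at work, the sketch below stages one of two files and asks Git what the next commit would contain. The throwaway directory is an assumption for the demo.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
touch README.md LICENSE

# Stage only one of the two files
git add README.md

# "A " = added to the index, "??" = still untracked
git status --short
# A  README.md
# ?? LICENSE

# List exactly what the next commit would contain
staged=$(git diff --cached --name-only)
echo "$staged"              # README.md
```

LICENSE stays on the lab bench until it is staged too.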

4. git commit — Recording the experiment

They capture their first snapshot.

git commit -m "Initial commit: basic pipeline structure"
Output
[main (root-commit) 40eb049] Initial commit: basic pipeline structure
 6 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 LICENSE
 create mode 100644 README.md
 create mode 100644 clustering.R
 create mode 100644 data_preprocessing.py
 create mode 100644 qc_filtering.py
 create mode 100644 visualization.ipynb

This is their first “frozen sample”: they can always return to it later.
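Under the hood, that snapshot is a commit object pointing to a tree (the directory listing), which in turn points to blobs (file contents). A minimal sketch for peeking at those objects, assuming a throwaway repository and a hypothetical demo identity:

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Dr. X"             # hypothetical demo identity
git config user.email "drx@example.org"  # hypothetical address
touch README.md
git add README.md
git commit -q -m "Initial commit: basic pipeline structure"

# A commit is an object that points to a tree (directory snapshot)
commit_type=$(git cat-file -t HEAD)
tree_type=$(git cat-file -t 'HEAD^{tree}')
echo "$commit_type $tree_type"           # commit tree

# The tree lists each file as a blob (file contents)
git ls-tree HEAD
```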

%%{init: {'theme':'base'}}%%
gitGraph
  commit id: "Initial commit"

5. git branch — Designing new experiments

To test a new normalisation method, they create a new branch — a safe environment to experiment without contaminating their main results.

git branch normalisation
git checkout normalisation

Branches in Git are like running parallel experiments in separate tubes. Each can evolve independently until they decide to merge the results.
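A branch is only a named pointer to a commit, so creating one is cheap. The sketch below checks that a freshly created branch points at the same commit as main. It assumes a throwaway repository, a hypothetical demo identity, and a reasonably recent Git (2.28+ for `-b`, 2.23+ for `git switch`).

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q -b main
git config user.name "Dr. X"             # hypothetical demo identity
git config user.email "drx@example.org"
touch README.md
git add README.md
git commit -q -m "Initial commit"

git branch normalisation

# Both names resolve to the same commit until one of them moves
main_tip=$(git rev-parse main)
branch_tip=$(git rev-parse normalisation)
[ "$main_tip" = "$branch_tip" ] && echo "same commit"

# 'git switch' is a branch-only alternative to 'git checkout'
git switch -q normalisation
current=$(git branch --show-current)
echo "$current"                          # normalisation
```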

%%{init: {'theme': 'base'}}%%
gitGraph
  commit id: "A"
  branch normalisation
  checkout normalisation
  commit id: "B"

6. git status, git add, and git commit again

After modifying data_preprocessing.py to include log-normalization, they check the progress:

git status
git add data_preprocessing.py
git commit -m "Add log-normalization step for scRNA-seq data"

Now, their new branch contains a reproducible change.
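To confirm that the change lives only on the new branch, Git can count the commits that one branch has and another lacks. A minimal sketch in a throwaway repository; the file content is a placeholder and the identity is hypothetical:

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q -b main                      # assumption: Git 2.28+
git config user.name "Dr. X"             # hypothetical demo identity
git config user.email "drx@example.org"
touch data_preprocessing.py
git add data_preprocessing.py
git commit -q -m "Initial commit"

git checkout -q -b normalisation
echo "# log-normalise counts" >> data_preprocessing.py
git add data_preprocessing.py
git commit -q -m "Add log-normalization step for scRNA-seq data"

# Commits reachable from normalisation but not from main
ahead=$(git rev-list --count main..normalisation)
echo "$ahead"                            # 1

# The same range, listed one commit per line
git log --oneline main..normalisation
```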

%%{init: {'theme': 'base'}}%%
gitGraph
  commit id: "A"
  branch normalisation
  checkout normalisation
  commit id: "Add log-normalization"

7. git log — Reviewing experimental history

Every scientist keeps lab notes; Git is no different.

git log --oneline --graph --decorate

It shows a compact graph of commits, one line each, labelled with branch and tag names: their computational “lab notebook”. (Drop --oneline to see authors and dates as well.)

%%{init: {'theme': 'base'}}%%
gitGraph
  commit id: "Initial commit"
  commit id: "Add log-normalization"

Useful git log variations

# Compact one-line format
git log --oneline

# Show last 5 commits
git log -5

# Show changes in each commit
git log -p

# Show commits by specific author
git log --author="Dr. X"

# Show commits in date range
git log --since="2 weeks ago"
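Two more variations that tend to be handy are a custom format that includes dates and a path filter that shows only the commits touching one file. The sketch below assumes a throwaway repository with two placeholder commits and a hypothetical identity.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Dr. X"             # hypothetical demo identity
git config user.email "drx@example.org"
touch README.md
git add README.md
git commit -q -m "Initial commit"
echo "# log-normalise counts" > data_preprocessing.py
git add data_preprocessing.py
git commit -q -m "Add log-normalization"

# %h = short hash, %ad = author date, %s = subject line
git log --date=short --pretty=format:'%h %ad %s'

# Only the commits that touched one file
file_history=$(git log --pretty=format:'%s' -- data_preprocessing.py)
echo "$file_history"                     # Add log-normalization
```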

8. git checkout — Switching branches

To review another experiment, Dr. X switches branches. (First make sure the current branch is normalisation.)

git branch

  main
* normalisation
  • The * marks the current branch

Switch back to main, then create a new branch named clustering-tweaks and check it out in one command:

# Switch back to main
git checkout main

# Create AND switch to a new branch in one command
git checkout -b clustering-tweaks

This is like changing which dataset or parameter set they are exploring — safely isolated.

%%{init: {'theme': 'base'}}%%
gitGraph
  commit id: "A"
  branch clustering-tweaks
  checkout clustering-tweaks
  commit id: "B"

Uncommitted changes

If you have uncommitted changes to files that differ on the target branch, Git refuses to switch and reports:

error: Your local changes to the following files would be overwritten by checkout

Solutions:

  1. Commit your changes: git commit -am "Work in progress" (-a stages modified tracked files first)
  2. Stash your changes: git stash (covered later)
  3. Discard your changes: git checkout -- <file> (be careful!)
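Because discarding cannot be undone, it helps to look before you leap. The sketch below shows how an uncommitted edit appears in `git status --porcelain` and what the discard option does to it. The repository, identity, and file content are placeholders for the demo.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Dr. X"             # hypothetical demo identity
git config user.email "drx@example.org"
echo "k <- 10" > clustering.R
git add clustering.R
git commit -q -m "Set cluster count"

echo "k <- 200" > clustering.R           # uncommitted edit

# Any output here means there is uncommitted work
git status --porcelain                   #  M clustering.R

# Discard the edit: the file returns to its last committed content
git checkout -- clustering.R
restored=$(cat clustering.R)
echo "$restored"                         # k <- 10
```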

9. git merge — Combining branches

Once testing is complete, Dr. X merges their branch back into main.

# Switch to the branch you want to merge INTO
git checkout main

# Merge the feature branch
git merge normalisation
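When main has not gained commits of its own since the branch was created, the merge is a "fast-forward": Git simply moves the main pointer forward. The sketch below makes that explicit with `--ff-only`, which refuses to merge in any other situation. The repository and identity are assumptions for the demo.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q -b main                      # assumption: Git 2.28+
git config user.name "Dr. X"             # hypothetical demo identity
git config user.email "drx@example.org"
touch data_preprocessing.py
git add data_preprocessing.py
git commit -q -m "Initial commit"

git checkout -q -b normalisation
echo "# log-normalise counts" >> data_preprocessing.py
git commit -q -am "Add log-normalization"

git checkout -q main
git merge -q --ff-only normalisation     # succeeds only if no merge commit is needed

# After a fast-forward, both branches point at the same commit
main_tip=$(git rev-parse main)
branch_tip=$(git rev-parse normalisation)
[ "$main_tip" = "$branch_tip" ] && echo "fast-forwarded"

# Branches whose work is already contained in the current branch
git branch --merged
```

If you would rather keep a visible record that a branch existed, `git merge --no-ff` creates a merge commit even when a fast-forward is possible.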

10. git rebase — Keeping history tidy

After a few days, main has moved ahead. Rather than add a merge commit and tangle the history, Dr. X rebases their branch so the history stays linear.

git checkout normalisation
git rebase main

This reapplies their commits on top of the latest base — as if they had started from the most recent code.
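The effect is easiest to see in a tiny repository where main and the branch have diverged. This sketch uses placeholder commits named after the letters in the diagrams, a hypothetical identity, and a throwaway directory.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q -b main                      # assumption: Git 2.28+
git config user.name "Dr. X"             # hypothetical demo identity
git config user.email "drx@example.org"
touch README.md
git add README.md && git commit -q -m "B: base"

git checkout -q -b normalisation
echo "log1p" > data_preprocessing.py
git add data_preprocessing.py && git commit -q -m "C: log-normalization"

git checkout -q main
echo "min_genes" > qc_filtering.py
git add qc_filtering.py && git commit -q -m "D: QC filter"

# Replay C on top of D
git checkout -q normalisation
git rebase -q main

history=$(git log --pretty=format:'%s')
echo "$history"
# C: log-normalization
# D: QC filter
# B: base
```

The replayed commit has the same message and changes as C but a new hash, which is why rebasing rewrites history.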

flowchart LR
  A((A)) --> B((B))
  B --> C((C))
  B --> D((D))
  D -.rebase.-> C'((C'))
  style C' fill:#fffbe6,stroke:#f90
Before rebase
%%{init: {'theme': 'base'}}%%
gitGraph
  commit id: "A"
  commit id: "B"
  branch normalisation
  checkout normalisation
  commit id: "C"
  checkout main
  commit id: "D"
  checkout normalisation
  merge main
After rebase

Golden rule of rebase

Never rebase commits that have been pushed to a shared repository! Rebase rewrites history, which can cause problems for collaborators.

  • Safe: Rebase your local, unpushed commits
  • Unsafe: Rebase commits others are building on

11. git stash — Pausing unfinished work

Midway through plotting, Dr. X gets a Slack message: “Can you quickly check that clustering bug?” Their notebook isn’t ready to commit, but they don't want to lose their changes.

git stash

Their work is saved safely on the “stash stack,” and their working directory is clean again.

After debugging, they bring their changes back:

git stash pop
flowchart LR
  WD[Modified files] -->|git stash| STASH["stash@{0}"]
  STASH -->|git stash pop| WD

Useful stash commands

# List all stashes
git stash list

# Apply stash without removing it
git stash apply

# Apply a specific stash
git stash apply stash@{1}

# Create a named stash
git stash push -m "WIP: adding UMAP plots"

# Delete all stashes
git stash clear
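A full round trip looks roughly like this. The sketch uses a small placeholder file instead of a real notebook, plus a hypothetical identity and a throwaway repository.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q -b main                      # assumption: Git 2.28+
git config user.name "Dr. X"             # hypothetical demo identity
git config user.email "drx@example.org"
echo "k <- 10" > clustering.R
git add clustering.R
git commit -q -m "Set cluster count"

echo "k <- 200" > clustering.R           # unfinished edit

git stash push -q -m "WIP: tuning k"
during=$(cat clustering.R)               # working directory is clean again
git stash list                           # stash@{0}: On main: WIP: tuning k

git stash pop -q
after=$(cat clustering.R)
echo "$during | $after"                  # k <- 10 | k <- 200
```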

12. git reset — Undoing a mistake

Oops — Dr. X accidentally committed a large matrix.mtx test file.

There are three types of reset:

--soft: Undo commit, keep changes staged

Use when: You want to recommit with a better message or add more files.

# Undo last commit, keep changes in staging area
git reset --soft HEAD^

--mixed (default): Undo commit, unstage changes

Use when: You want to redo the commit but need to modify files first.

# Undo last commit, move changes back to working directory
git reset HEAD^
# or
git reset --mixed HEAD^

--hard: Undo commit, delete changes

# Undo last commit and DELETE all changes (be careful!)
git reset --hard HEAD^
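For the matrix.mtx mishap, one possible repair is to undo the commit with --soft, unstage only the large file, ignore it, and commit again. This is a sketch under assumptions: the commit has not been pushed yet, the tiny matrix.mtx stands in for the large file, and the identity is hypothetical.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Dr. X"             # hypothetical demo identity
git config user.email "drx@example.org"
touch README.md
git add README.md
git commit -q -m "Initial commit"

# The mistake: a test matrix is committed along with a real change
echo "placeholder" > matrix.mtx          # stand-in for the large file
echo "# QC" > qc_filtering.py
git add matrix.mtx qc_filtering.py
git commit -q -m "Add QC filter"

# Undo the commit but keep everything staged
git reset -q --soft HEAD^

# Unstage only the matrix (Git 2.23+; older: git reset HEAD matrix.mtx)
git restore --staged matrix.mtx
echo "matrix.mtx" >> .gitignore
git add .gitignore
git commit -q -m "Add QC filter"

tracked=$(git ls-files)
echo "$tracked"
# .gitignore
# README.md
# qc_filtering.py
```

The matrix file is still on disk, but Git no longer tracks it.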

Common Questions

Why do I need the staging area? Can't I just commit directly?

The staging area gives you control! You can:

  • Commit only related changes together
  • Review changes before committing
  • Keep work-in-progress unstaged
  • Create clean, logical commit history

Git lets you skip the explicit staging step for files it already tracks (git commit -a), but understanding the staging area makes you a better Git user.

What's the difference between Git and GitHub?

  • Git = Version control software (runs on your computer)
  • GitHub = Website for hosting Git repositories (in the cloud)

Think of it like:

  • Git = Microsoft Word
  • GitHub = Google Docs

We'll cover GitHub in detail later!

Can I use Git for non-code projects?

Absolutely! Git works great for:

  • Documentation (Markdown, LaTeX)
  • Design files (if text-based)
  • Configuration files
  • Writing (books, articles)
  • Any text files that change over time

How much space does Git use?

Git is surprisingly efficient! It:

  • Compresses data
  • Packs similar file versions together as deltas (internally)
  • Removes duplicates

A repo of text files with years of history might only be a few MB.
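You can check both claims on any repository. The sketch below shows that two files with identical content share a single stored object, then prints Git's own storage summary. File names and contents are placeholders, and the identity is hypothetical.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Dr. X"             # hypothetical demo identity
git config user.email "drx@example.org"
echo "gene_1,0" > counts_a.csv
cp counts_a.csv counts_b.csv             # identical content
git add .
git commit -q -m "Add count tables"

# Identical content is stored once: both paths point to the same blob
blob_a=$(git rev-parse HEAD:counts_a.csv)
blob_b=$(git rev-parse HEAD:counts_b.csv)
[ "$blob_a" = "$blob_b" ] && echo "one blob, two names"

# Human-readable summary of how much space the object database uses
git count-objects -vH
```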