4. Git Concepts, Data Model and Commands
Version Control for Cell Clusters - A Git Journey in the Lab¶
Dr. X is a computational biologist working on a new single-cell RNA sequencing (scRNA-seq) pipeline. Like most data scientists, they tweak their workflow constantly: preprocessing, quality control, normalization, clustering, and visualization.
They are tired of lost scripts, mysterious file versions like clustering_final_FINAL.R, and overwritten notebooks.
So they decide to version-control their pipeline properly using Git.
1. git init - Starting a new experiment¶
They begin their project as they would a new experiment — setting up a clean workspace.
Running git init initialises an empty Git repository, like labeling an empty freezer box before adding samples. From here on, Git can track every change they choose to record.
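A minimal sketch of this first step. It assumes a throwaway scratch directory is acceptable for practice and borrows the folder name CellClusterFlow from the file tree shown later in this section.

```shell
# Work in a scratch directory so nothing important is touched (assumption for this sketch)
workdir="$(mktemp -d)"
mkdir "$workdir/CellClusterFlow"
cd "$workdir/CellClusterFlow"

# Create the empty repository; this adds the hidden .git directory
git init

# List everything, including hidden entries, to confirm .git exists
ls -a
```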
What just happened? The .git directory¶
The git init command created a hidden .git directory — the heart of your repository.
This .git folder is Git's "lab notebook". It contains:
.git/
├── HEAD # Points to current branch
├── config # Repository settings
├── description # Repository description (rarely used)
├── branches/ # (deprecated, legacy Git)
├── hooks/ # Scripts triggered by Git events
├── info/ # Additional repository information (excludes, attributes)
├── objects/ # Database of all file versions (commits, trees, blobs)
└── refs/ # Pointers to commits (branches, tags)
    ├── heads/ # Local branches
    └── tags/ # Version tags
Never delete .git!
Deleting this folder erases all version history. Your files remain, but Git forgets everything — like burning your lab notebook while keeping only today's samples.
Checking if a directory is a Git repository
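One way to run this check is with git rev-parse, sketched here in a scratch repository (the temporary path is an assumption of the example, not part of Dr. X's project).

```shell
# Scratch repository for the demonstration (the path is arbitrary)
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# Prints "true" inside a work tree; fails with an error outside any repository
inside="$(git rev-parse --is-inside-work-tree)"
echo "inside a Git work tree: $inside"

# Prints the location of the repository's .git directory
git rev-parse --git-dir
```

Running the same commands in a directory that is not under version control produces a "not a git repository" error instead.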
Creating essential files¶
Dr. X sets up the initial project structure:
Create a mock file-set
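A sketch of how such a mock file-set could be created. The files are empty placeholders, which is consistent with the "0 insertions" reported by the first commit later on; the scratch location is an assumption.

```shell
# Scratch copy of the project (directory name taken from the tree in this section)
workdir="$(mktemp -d)"
mkdir "$workdir/CellClusterFlow"
cd "$workdir/CellClusterFlow"
git init --quiet

# Create six empty placeholder files
touch README.md LICENSE
touch data_preprocessing.py qc_filtering.py
touch clustering.R visualization.ipynb

# Show the visible files (.git stays hidden)
ls
```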
Files created:
CellClusterFlow/
├── .git/ # Git's database (hidden)
├── README.md # Project documentation
├── LICENSE # Usage terms
├── data_preprocessing.py # Data loading and cleaning
├── qc_filtering.py # Quality control filters
├── clustering.R # Clustering algorithms
└── visualization.ipynb # Plotting and figures
At this point, the files exist in the working directory, but Git isn't tracking them yet; it will report them as untracked.
That's what git add will do next.
flowchart LR
A[Empty directory] -->|git init| B["Repository with .git/ folder
(Git's brain)"]
B -->|create files| C["Working directory with files
(untracked)"]
style A fill:#f0f0f0,stroke:#333
style B fill:#e6f7ff,stroke:#1890ff,stroke-width:3px
style C fill:#fff7e6,stroke:#fa8c16
Working directory vs Repository
- Working directory: The files you can see and edit
- Repository (.git/): Git's internal database of all versions
Think of it like:
- Working directory = Your lab bench (current experiment)
- Repository = Your archived lab notebooks (all past experiments)
2. git status — Checking the lab bench¶
Just like checking which samples are unprocessed, they inspect their project’s state with git status.
Git reports untracked files — nothing is committed yet. This command should become a habit; they run it before almost every operation.
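A small sketch of what this looks like in a fresh repository with two placeholder files (the scratch directory and the choice of files are assumptions for illustration).

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
touch README.md clustering.R

# Full, human-readable report: both files appear under "Untracked files"
git status

# Compact form: "??" marks untracked files
git status --short
```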
flowchart LR
WD["`Working Directory
(modified/untracked files)`"] --> IDX[Index]
IDX --> HEAD["`HEAD
(committed snapshot)`"]
style WD fill:#fffbe6,stroke:#333
style IDX fill:#e6f7ff,stroke:#333
style HEAD fill:#e6ffe6,stroke:#333
3. git add — Staging files for the record¶
Like labeling tubes before freezing them, they stage files to prepare for a permanent record.
git add copies the current contents of the chosen files from the working directory into the staging area, also known as the index. Only staged changes go into the next commit.
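Staging can be done one file at a time or for a whole directory. The sketch below uses three placeholder files in a scratch repository (both are assumptions of the example).

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
touch README.md LICENSE data_preprocessing.py

# Stage a single file
git add README.md

# Stage everything else in the current directory
git add .

# "A" in the first column means the file has been added to the index
git status --short
```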
4. git commit — Recording the experiment¶
They capture their first snapshot.
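A sketch of the commit step in a scratch repository. The commit message is the one that appears in the output shown in this section; the identity "Dr. X" with the address drx@example.org is a placeholder, and the branch is named main explicitly so the example does not depend on the local default.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
# Name the first branch "main" regardless of the local default
git symbolic-ref HEAD refs/heads/main
# Placeholder identity, stored only in this repository's config
git config user.name "Dr. X"
git config user.email "drx@example.org"

touch README.md LICENSE data_preprocessing.py qc_filtering.py clustering.R visualization.ipynb
git add .

# Record the staged snapshot with a descriptive message
git commit -m "Initial commit: basic pipeline structure"
```

The abbreviated hash Git prints will differ from run to run, because it depends on the author, the timestamp, and the content.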
Output
[main (root-commit) 40eb049] Initial commit: basic pipeline structure
6 files changed, 0 insertions(+), 0 deletions(-)
create mode 100644 LICENSE
create mode 100644 README.md
create mode 100644 clustering.R
create mode 100644 data_preprocessing.py
create mode 100644 qc_filtering.py
create mode 100644 visualization.ipynb
This is their first “frozen sample”; they can always return to it later.
%%{init: {'theme':'base'}}%%
gitGraph
commit id: "Initial commit"
5. git branch — Designing new experiments¶
To test a new normalisation method, they create a new branch — a safe environment to experiment without contaminating their main results.
Branches in Git are like running parallel experiments in separate tubes. Each can evolve independently until they decide to merge the results.
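Creating the branch and moving onto it might look like the following sketch. The branch name normalization comes from the diagrams in this section; the scratch repository, the empty commit "A", and the placeholder identity are assumptions.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git symbolic-ref HEAD refs/heads/main
git config user.name "Dr. X"
git config user.email "drx@example.org"
# A branch needs at least one commit to point at
git commit --quiet --allow-empty -m "A"

# Create the branch, then switch to it
git branch normalization
git checkout normalization

# Show which branch is checked out
git rev-parse --abbrev-ref HEAD
```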
%%{init: {'theme': 'base'}}%%
gitGraph
commit id: "A"
branch normalization
checkout normalization
commit id: "B"
6. git status, git add, and git commit again¶
After modifying data_preprocessing.py to include log-normalization, they check the progress:
git status
git add data_preprocessing.py
git commit -m "Add log-normalization step for scRNA-seq data"
Now, their new branch contains a reproducible change.
%%{init: {'theme': 'base'}}%%
gitGraph
commit id: "A"
branch normalization
checkout normalization
commit id: "Add log-normalization"
7. git log — Reviewing experimental history¶
Every scientist keeps lab notes; Git is no different.
git log shows a tidy, timestamped list of commits: their computational “lab notebook”.
%%{init: {'theme': 'base'}}%%
gitGraph
commit id: "Initial commit"
commit id: "Add log-normalization"
Useful git log variations
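A few commonly used variations, sketched on a scratch repository with two commits. The repository contents and the identity are placeholders; the flags are standard git log options.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Dr. X"
git config user.email "drx@example.org"
touch data_preprocessing.py
git add data_preprocessing.py
git commit --quiet -m "Initial commit: basic pipeline structure"
echo "# log-normalization goes here" >> data_preprocessing.py
git commit --quiet -am "Add log-normalization step for scRNA-seq data"

# One line per commit: abbreviated hash plus subject
git log --oneline

# ASCII graph of all branches, with branch and tag names attached
git log --oneline --graph --decorate --all

# Only the most recent commit, with a per-file change summary
git log -n 1 --stat

# Only commits that touched one particular file
git log --oneline -- data_preprocessing.py
```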
8. git checkout — Switching branches¶
To review another experiment, Dr. X switches branches. First, run git branch to make sure the current branch is normalization.
- The * indicates the current branch
Next, create a new branch named clustering-tweaks and check it out in one line with git checkout -b clustering-tweaks.
This is like changing which dataset or parameter set they are exploring — safely isolated.
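The two steps can be sketched as follows, in a scratch repository that already has main and normalization branches. The setup lines and the identity are assumptions made so the example runs on its own.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git symbolic-ref HEAD refs/heads/main
git config user.name "Dr. X"
git config user.email "drx@example.org"
git commit --quiet --allow-empty -m "A"
git checkout --quiet -b normalization

# List local branches; the asterisk marks the current one
git branch

# Create clustering-tweaks and switch to it in one step
git checkout -b clustering-tweaks

# The asterisk has moved to the new branch
git branch
```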
%%{init: {'theme': 'base'}}%%
gitGraph
commit id: "A"
branch clustering-tweaks
checkout clustering-tweaks
commit id: "B"
Uncommitted changes
If you have uncommitted changes, Git may prevent you from switching branches:
error: Your local changes to the following files would be overwritten by checkout
Solutions:
- Commit your changes: git add <file>, then git commit -m "Work in progress"
- Stash your changes: git stash (covered later)
- Discard your changes: git checkout -- <file> (be careful!)
9. git merge — Combining branches¶
Once testing is complete, Dr. X merges their branch back into main.
# Switch to the branch you want to merge INTO
git checkout main
# Merge the feature branch
git merge normalization
10. git rebase — Keeping history tidy¶
After a few days, main has moved ahead.
Rather than merging and adding an extra merge commit, Dr. X rebases their branch so the history stays linear.
Running git rebase main on the branch reapplies their commits on top of the latest main, as if they had started from the most recent code.
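The following sketch rebuilds the A, B, C, D situation from the diagrams in a scratch repository and then rebases. The file contents, the empty commits A and B, and the identity are placeholders chosen so the commits do not conflict.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git symbolic-ref HEAD refs/heads/main
git config user.name "Dr. X"
git config user.email "drx@example.org"

# main: commits A and B
git commit --quiet --allow-empty -m "A"
git commit --quiet --allow-empty -m "B"

# normalization: commit C
git checkout --quiet -b normalization
echo "log-normalize counts" > data_preprocessing.py
git add data_preprocessing.py
git commit --quiet -m "C"

# Meanwhile, main moves ahead with commit D
git checkout --quiet main
echo "filter low-quality cells" > qc_filtering.py
git add qc_filtering.py
git commit --quiet -m "D"

# Replay C on top of D
git checkout --quiet normalization
git rebase main

# History is now linear: C sits directly on top of D
git log --oneline --graph
```

After the rebase, C has a new hash: it is a new commit with the same changes but a different parent.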
flowchart LR
A((A)) --> B((B))
B --> C((C))
B --> D((D))
D -.rebase.-> C'((C'))
style C' fill:#fffbe6,stroke:#f90
Before rebase
%%{init: {'theme': 'base'}}%%
gitGraph
commit id: "A"
commit id: "B"
branch normalization
checkout normalization
commit id: "C"
checkout main
commit id: "D"
checkout normalization
merge main
After rebase
%%{init: {'theme': 'base'}}%%
gitGraph
commit id: "A"
commit id: "B"
commit id: "D"
branch normalization
checkout normalization
commit id: "C'"
Golden rule of rebase
Never rebase commits that have been pushed to a shared repository! Rebase rewrites history, which can cause problems for collaborators.
- Safe: Rebase your local, unpushed commits
- Unsafe: Rebase commits others are building on
11. git stash — Pausing unfinished work¶
Midway through plotting, Dr. X gets a Slack message: “Can you quickly check that clustering bug?” Their notebook isn’t ready to commit, but they don't want to lose their changes.
Running git stash saves their work safely on the “stash stack” and leaves their working directory clean again.
After debugging, they bring their changes back with git stash pop:
flowchart LR
WD[Modified files] -->|git stash| STASH["stash@{0}"]
STASH -->|git stash pop| WD
Useful stash commands
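A sketch of the stash round trip with the inspection commands that are most often useful. The notebook content is a placeholder and the repository is a scratch one.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Dr. X"
git config user.email "drx@example.org"
echo "first plot" > visualization.ipynb
git add visualization.ipynb
git commit --quiet -m "Add first plot"

# Unfinished work that is not ready to commit
echo "half-finished plot" >> visualization.ipynb

# Shelve the changes; the working directory is clean afterwards
git stash
git status --short

# Inspect what is on the stash stack
git stash list
git stash show

# Restore the most recent stash and remove it from the stack
git stash pop
git status --short
```

git stash apply restores a stash without removing it from the stack, and git stash drop deletes one without restoring it.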
12. git reset — Undoing a mistake¶
Oops — Dr. X accidentally committed a large matrix.mtx test file.
There are three types of reset:
--soft: Undo commit, keep changes staged
Use when: You want to recommit with a better message or add more files.
--mixed (default): Undo commit, unstage changes (they stay in the working directory)
Use when: You want to redo the commit but need to modify files first.
--hard: Undo commit, delete changes
Use when: You are sure the changes should be thrown away.
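The three modes can be compared side by side. This sketch repeats the accidental commit three times in a scratch repository and undoes it with each mode in turn; the small matrix.mtx placeholder stands in for the large file, and HEAD~1 means "the commit before the current one".

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Dr. X"
git config user.email "drx@example.org"
touch README.md
git add README.md
git commit --quiet -m "Initial commit"

# The mistake: a test matrix gets committed
echo "placeholder for a large matrix" > matrix.mtx
git add matrix.mtx
git commit --quiet -m "Add test matrix by mistake"

# --soft: the commit is gone, the file is still staged ("A")
git reset --soft HEAD~1
soft_state="$(git status --short)"
echo "after --soft:  $soft_state"

# Repeat the mistake, then --mixed: file kept on disk but unstaged ("??")
git commit --quiet -m "Add test matrix by mistake"
git reset --mixed HEAD~1
mixed_state="$(git status --short)"
echo "after --mixed: $mixed_state"

# Repeat once more, then --hard: the file is removed from disk as well
git add matrix.mtx
git commit --quiet -m "Add test matrix by mistake"
git reset --hard HEAD~1
ls
```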
Common Questions¶
Why do I need the staging area? Can't I just commit directly?
The staging area gives you control! You can:
- Commit only related changes together
- Review changes before committing
- Keep work-in-progress unstaged
- Create clean, logical commit history
Git lets you skip explicit staging for files it already tracks (git commit -a), but understanding the staging area makes you a better Git user.
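As an illustration of that control, the sketch below edits two files but commits only one of them. The parameter lines written into the files are invented placeholders, and the repository is a scratch one.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Dr. X"
git config user.email "drx@example.org"
touch qc_filtering.py clustering.R
git add .
git commit --quiet -m "Initial commit"

# Two unrelated edits (placeholder content)
echo "min_genes = 200" >> qc_filtering.py
echo "resolution <- 0.8" >> clustering.R

# Stage only the QC change and review what will be committed
git add qc_filtering.py
git diff --staged --name-only

# Commit the staged change; the clustering edit stays as work in progress
git commit --quiet -m "Set QC gene threshold"
git status --short
```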
What's the difference between Git and GitHub?
- Git = Version control software (runs on your computer)
- GitHub = Website for hosting Git repositories (in the cloud)
Think of it like:
- Git = Microsoft Word
- GitHub = Google Docs
We'll cover GitHub in detail later!
Can I use Git for non-code projects?
Absolutely! Git works great for:
- Documentation (Markdown, LaTeX)
- Design files (if text-based)
- Configuration files
- Writing (books, articles)
- Any text files that change over time
How much space does Git use?
Git is surprisingly efficient! It:
- Compresses data
- Keeps full snapshots, but stores similar versions as deltas when it packs objects
- Stores identical file contents only once
A repo with years of history might only be a few MB.
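The deduplication point can be checked directly. In this sketch two files with identical content (a made-up two-line CSV) get the same object ID, so Git stores their content once; git count-objects then reports how many loose objects exist and how much space they take.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Dr. X"
git config user.email "drx@example.org"

# Two files with identical content (placeholder data)
printf 'gene,count\nCD3E,12\n' > counts_a.csv
cp counts_a.csv counts_b.csv

# Identical content hashes to the same object ID
hash_a="$(git hash-object counts_a.csv)"
hash_b="$(git hash-object counts_b.csv)"
echo "$hash_a"
echo "$hash_b"

git add .
git commit --quiet -m "Add count tables"

# Three loose objects: one shared blob, one tree, one commit
git count-objects -v
```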
