Week 2 Tasks - Single-Cell RNA-seq Project
This week's focus is on data normalization and feature selection. These steps prepare the dataset for downstream clustering and annotation by reducing technical variation and focusing on informative genes.
Objectives
- Normalize raw gene expression counts
- Log-transform the data
- Identify and visualize highly variable genes (HVGs)
- Explore effects of normalization through visualizations (subsample if necessary)
- Document insights and prepare a summary report
Tasks
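1. Normalize and Log-Transform the Data
   - Normalize total counts per cell with scanpy's `sc.pp.normalize_total`.
   - Log-transform the data with `sc.pp.log1p`.
   - Store the raw counts using `adata.raw = adata`. Do this before normalizing; if it is set afterwards, `adata.raw` holds the log-normalized values rather than the raw counts.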
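To make the arithmetic behind this task concrete, here is a minimal NumPy sketch on a made-up 3-cell by 4-gene count matrix. It scales each cell to a common total and then applies log1p, which is the same idea that `sc.pp.normalize_total` followed by `sc.pp.log1p` applies to `adata.X`. The matrix values, the `target_sum` of 10,000, and all variable names are illustrative assumptions, not values from the project dataset.

```python
import numpy as np

# Toy raw counts: rows are cells, columns are genes (made-up values).
counts = np.array([
    [10, 0, 5, 85],
    [1, 1, 0, 8],
    [200, 50, 0, 250],
], dtype=float)

target_sum = 1e4  # assumed scaling target; choose what suits your data

# Keep an untouched copy of the raw counts before transforming anything.
raw_counts = counts.copy()

# Library size = total counts per cell.
library_size = counts.sum(axis=1, keepdims=True)

# Scale every cell so that its counts sum to target_sum.
normalized = counts / library_size * target_sum

# log1p(x) = log(1 + x), so zero counts stay at zero.
log_normalized = np.log1p(normalized)

print("library sizes:", library_size.ravel())
print("per-cell totals after scaling:", normalized.sum(axis=1))
```

After scaling, every cell has the same total, so differences in sequencing depth no longer dominate comparisons between cells.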
2. Identify Highly Variable Genes (HVGs)
   - Use `sc.pp.highly_variable_genes(adata, n_top_genes=2000)`.
   - Filter to retain only HVGs (flagged in `adata.var['highly_variable']`).
   - Plot the HVG selection with `sc.pl.highly_variable_genes`.
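The idea behind HVG selection can be sketched with plain NumPy: rank genes by how much they vary relative to their mean and keep the top few. The example below uses simulated data and a simple variance-to-mean ratio, which is a deliberately simplified stand-in and not the procedure scanpy uses internally. All names and sizes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated expression matrix: 500 cells x 50 genes (made-up data).
n_cells, n_genes = 500, 50
expr = rng.gamma(shape=2.0, scale=1.0, size=(n_cells, n_genes))

# Make the first 5 genes deliberately more variable across cells by
# scaling them with a per-cell factor that is either low or high.
expr[:, :5] *= rng.choice([0.2, 3.0], size=(n_cells, 1))

# Per-gene mean and variance across cells.
mean = expr.mean(axis=0)
var = expr.var(axis=0, ddof=1)

# Simplified dispersion: variance relative to mean (all means are > 0 here).
dispersion = var / mean

# Flag the n_top_genes genes with the highest dispersion.
n_top_genes = 5
top_idx = np.argsort(dispersion)[::-1][:n_top_genes]
highly_variable = np.zeros(n_genes, dtype=bool)
highly_variable[top_idx] = True

# Keep only the flagged genes (columns).
expr_hvg = expr[:, highly_variable]

print("selected genes:", np.flatnonzero(highly_variable))
print("shape after filtering:", expr_hvg.shape)
```

With an AnnData object, the analogous filtering step is `adata = adata[:, adata.var['highly_variable']]` once the flag has been computed.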
3. Subsample Cells for Visualization (only if computational constraints require it; otherwise use the entire dataset)
   - Randomly select ~10,000 cells using NumPy.
   - Create a new AnnData object for plotting: `adata_sub = adata[subset_idx].copy()`.
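One way to build `subset_idx` is sketched below. It assumes a hypothetical dataset of 50,000 cells (in practice, use `adata.n_obs`) and a fixed random seed so that the same cells are chosen on every run; both values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the subset is reproducible

n_cells = 50_000      # hypothetical dataset size; use adata.n_obs in practice
n_subsample = 10_000  # target from the task description

# Only subsample when the dataset is larger than the target.
if n_cells > n_subsample:
    # Draw without replacement so no cell is picked twice, then sort
    # to keep the original cell order.
    subset_idx = np.sort(rng.choice(n_cells, size=n_subsample, replace=False))
else:
    subset_idx = np.arange(n_cells)

print("cells kept:", subset_idx.size)
```

The resulting integer array can be passed directly to `adata[subset_idx].copy()` as described in the task.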
4. Plot Normalization Effects
   - Violin plots of `total_counts` and `n_genes_by_counts`.
   - Scatter plot (e.g., `total_counts` vs. `n_genes_by_counts`).
   - Plot top expressed genes (e.g., with `sc.pl.highest_expr_genes`).
5. Submit Work
   - Clean Jupyter Notebook (`.ipynb`) with markdown annotations.
   - 1-page summary report explaining what was done and observed.
   - Include visuals and key takeaways about normalization and HVG selection.
Time Estimate
- Total: ~15 hours
- Normalization & Log1p: 3-4 hrs
- HVG analysis: 3-4 hrs
- Visualization with subsampling: 3-4 hrs
- Documentation and reflection: 2-3 hrs