Data Cleaning & Feature Engineering Pipeline for TNBC PRRX1 RNA-Seq Dataset
High-throughput transcriptomics datasets generated by next-generation sequencing (RNA-Seq) are powerful but noisy. Raw count matrices are heavily right-skewed, differ in sequencing depth between samples, and contain many low-abundance background features. This guide walks step by step through the end-to-end pipeline we used to clean, normalize, and reshape the NCBI GEO GSE202769 breast cancer transcriptomic dataset into a machine learning-ready matrix.
Primary Project Links
- Kaggle Dataset: TNBC PRRX1 RNA-Seq (Cleaned & ML-Ready)
- GitHub Repository: mubashir1837/TNBC-PRRX1-RNA-Seq-Cleaned-ML-Ready
1. Biological Context: PRRX1 in Triple-Negative Breast Cancer
Triple-Negative Breast Cancer (TNBC) lacks estrogen receptor (ER) and progesterone receptor (PR) expression as well as HER2 amplification. Consequently, endocrine and HER2-targeted therapies are ineffective, and TNBC is among the most aggressive and metastasis-prone breast cancer subtypes.
The transcription factor PRRX1 (Paired Related Homeobox 1) is a master driver of Epithelial-Mesenchymal Transition (EMT), a developmental program that carcinoma cells hijack to lose cell-cell adhesion, invade blood vessels, acquire stem-like characteristics, and resist chemotherapy.
The dataset evaluated captures transcriptomic shifts in a complete factorial design ($4 \times 2 \times 2 \times 3 = 48$ samples) across:
- 4 Cellular Backgrounds: HCC3153, MFM223, SUM185, and EMG3.
- 2 Expression Constructs: Wild-Type PRRX1 (wt) vs. the Homeodomain Helix 3 mutant (dH3), a non-DNA-binding structural control.
- 2 Treatments: Doxycycline-induced PRRX1 expression (dox) vs. untreated controls (no_dox).
- 3 Temporal States: Day 7, Day 14, and Day 21.
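As a quick sanity check on the design above, the 48 conditions can be enumerated directly from the four factors. This is only an illustrative sketch: the variable names are ours, and the tuples are not the dataset's actual sample identifiers.

```python
from itertools import product

# Factor levels as listed in the design description
cell_lines = ["HCC3153", "MFM223", "SUM185", "EMG3"]
constructs = ["wt", "dH3"]
treatments = ["dox", "no_dox"]
timepoints = [7, 14, 21]

# Full factorial design: one tuple per experimental condition
design = list(product(cell_lines, constructs, treatments, timepoints))

print(len(design))   # 48
print(design[0])     # ('HCC3153', 'wt', 'dox', 7)
```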
2. The 5 Steps of the Cleaning & Engineering Pipeline
Step 1: Duplicate Index Resolution
Raw expression files frequently contain the same gene identifier more than once, typically because several transcript isoforms map to one gene or because annotations changed between releases. To resolve this, we aggregated duplicate rows in our gene-centric raw matrix by summing their read counts.
Why Summing? Read counts represent distinct cDNA fragments mapped to a gene locus, so summing is a sound choice both biologically and mathematically: it combines every fragment assigned to the gene and preserves the total signal for downstream normalization. Averaging, by contrast, would discard reads.
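A minimal sketch of this aggregation with pandas, assuming the raw matrix is a DataFrame with gene identifiers as the index and samples as columns. The gene names and counts below are toy values for illustration, not taken from GSE202769.

```python
import pandas as pd

# Toy raw matrix: "PRRX1" appears twice in the index (illustrative values only)
raw = pd.DataFrame(
    {"sample_A": [10, 5, 7, 0], "sample_B": [20, 1, 3, 4]},
    index=pd.Index(["PRRX1", "TP53", "PRRX1", "GAPDH"], name="gene"),
)

# Collapse duplicate gene identifiers by summing their read counts
deduplicated = raw.groupby(level=0).sum()

print(deduplicated)
```

After grouping, every gene identifier appears exactly once, and the two `PRRX1` rows are merged into a single row holding their summed counts.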
Step 2: Library Depth Normalization (CPM)
Sequencing runs produce different total read counts per sample. In this study, raw library sizes ranged from 33,000,000 to 52,000,000 reads. Comparing raw counts directly across samples is therefore invalid, because a higher count may simply reflect deeper sequencing rather than higher expression.
To align all samples on a uniform scale, we normalized raw counts to Counts Per Million (CPM) using the formula:
$$\mathrm{CPM}(i, j) = \frac{\mathrm{Count}(i, j)}{\mathrm{LibrarySize}(j)} \times 10^{6}$$
Where $i$ represents the gene index and $j$ represents the biological sample.
Step 3: Variance-Stabilizing Log Transformation
RNA-Seq expression values are heavily right-skewed, spanning several orders of magnitude (from zero to hundreds of thousands), and raw counts are commonly modeled with a negative binomial distribution in which variance grows with the mean. Methods that rely on distances or variances (e.g. SVM, k-NN, PCA), as well as gradient-trained neural networks, are sensitive to feature scale, so a handful of highly expressed genes can dominate heavily skewed inputs.
We applied a log2(CPM + 1) transformation:
$$\mathrm{Expression}(i, j) = \log_2\left(\mathrm{CPM}(i, j) + 1\right)$$
Adding a pseudocount of 1 maps zero expression to zero ($\log_2 1 = 0$) and avoids $\log_2 0$, which is undefined and would otherwise evaluate to negative infinity.
Step 4: Low-Expression Noise Filtering
A large share of annotated genes are unexpressed in a given experiment or detected only as background noise. Feeding thousands of near-constant, low-count gene columns into a model slows training, encourages overfitting, and worsens the "curse of dimensionality."
We filtered out genes with a cumulative raw count sum of less than 50 across all 48 samples.
- Total unique genes evaluated: 20,114
- Low-expression genes discarded: 3,987 (19.82%)
- High-confidence features retained: 16,127
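Steps 2 through 4 can be sketched together as a single function. This is a minimal illustration under two assumptions that the text does not state: library size is taken as the column sum of the deduplicated count matrix, and the name `normalize_and_filter` is hypothetical. The threshold follows the rule above (genes whose raw counts sum to less than 50 are dropped); the toy counts are made up.

```python
import numpy as np
import pandas as pd

def normalize_and_filter(counts: pd.DataFrame, min_total: int = 50) -> pd.DataFrame:
    """Turn a genes x samples raw count matrix into filtered log2(CPM + 1)."""
    # Step 2: scale each sample by its library size (assumed: column sum)
    library_sizes = counts.sum(axis=0)
    cpm = counts.div(library_sizes, axis=1) * 1e6
    # Step 3: log2 transform with a pseudocount of 1
    log_cpm = np.log2(cpm + 1)
    # Step 4: keep genes whose raw counts sum to at least min_total
    keep = counts.sum(axis=1) >= min_total
    return log_cpm.loc[keep]

# Toy matrix (illustrative values only)
counts = pd.DataFrame(
    {"s1": [400, 600, 0, 0], "s2": [100, 900, 0, 60]},
    index=["geneA", "geneB", "geneC", "geneD"],
)
result = normalize_and_filter(counts)
print(result)
```

In the toy matrix, `geneC` is removed because its total count is below the threshold, while `geneD` is retained and its zero count in `s1` maps to an expression value of zero.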
Step 5: Transposition & Relational Schema Construction
Standard genomic matrices place genes in rows and samples in columns. Machine learning pipelines expect the transpose (samples as rows, genes as columns) where each sample represents a separate data instance.
We transposed the filtered, normalized matrix and prepended five tidy metadata columns parsed from the sample headers:
| Label Column | Data Type | Biological Classes / Values |
|---|---|---|
| Cell_Line | Categorical | HCC3153, MFM223, SUM185, EMG3 |
| Construct | Categorical | wt (wild-type), dH3 (mutated structural control) |
| Treatment | Categorical | dox (active over-expression), no_dox (untreated control) |
| Induction | Binary | 1 (induced), 0 (uninduced control) |
| Timepoint | Numeric | 7, 14, 21 (days) |
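The transposition and metadata parsing can be sketched as follows. The header layout used here (`CellLine_Construct_Treatment_dN`, e.g. `HCC3153_wt_dox_d7`) is a hypothetical assumption for illustration, as are `HEADER_PATTERN` and `to_ml_matrix`; the real GEO sample headers may be formatted differently and would need their own pattern.

```python
import pandas as pd

# Hypothetical header layout: <Cell_Line>_<Construct>_<Treatment>_d<Timepoint>
HEADER_PATTERN = (
    r"^(?P<Cell_Line>[^_]+)_(?P<Construct>wt|dH3)"
    r"_(?P<Treatment>no_dox|dox)_d(?P<Timepoint>\d+)$"
)

def to_ml_matrix(expr: pd.DataFrame) -> pd.DataFrame:
    """Transpose genes x samples into samples x genes and prepend metadata."""
    samples = expr.T.rename_axis("Sample_ID")
    # Parse each sample header into named metadata fields
    meta = samples.index.to_series().str.extract(HEADER_PATTERN)
    meta["Induction"] = (meta["Treatment"] == "dox").astype(int)
    meta["Timepoint"] = meta["Timepoint"].astype(int)
    meta = meta[["Cell_Line", "Construct", "Treatment", "Induction", "Timepoint"]]
    return pd.concat([meta, samples], axis=1)

# Toy expression matrix with two samples (illustrative values only)
expr = pd.DataFrame(
    {"HCC3153_wt_dox_d7": [1.0, 2.0], "EMG3_dH3_no_dox_d21": [3.0, 4.0]},
    index=["geneA", "geneB"],
)
ml = to_ml_matrix(expr)
print(ml)
```

A regular expression is used instead of splitting on underscores because the `no_dox` label itself contains one. Headers that do not match the pattern produce missing values, and the integer conversion of `Timepoint` will then raise an error, which surfaces malformed headers early.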
3. Quality Control & Biological Validation
To validate our normalization pipeline, we performed four exploratory data analyses (EDA) as detailed in our plots:
- Library Sizes Check (`library_sizes.png`): Verified the distribution of library sizes, confirming that all 48 samples were sequenced deeply with no outliers. Because depth still varies between samples, CPM scaling remains necessary.
- PRRX1 Inducible Expression (`prrx1_induction_profile.png`): Plotted PRRX1 levels across experimental groups. PRRX1 expression rose sharply in all doxycycline-treated samples, confirming that the inducible expression system worked as intended.
- PCA Projection (`pca_expression.png`): Projects the 16,127 expression dimensions into 2D space. Principal components 1 and 2 separated the samples into 4 distinct clusters, indicating that cell-line identity accounts for the greatest variance, followed by PRRX1 activation state.
- Pearson Correlation Heatmap (`correlation_heatmap.png`): Pairwise correlations showed high technical reproducibility ($r > 0.90$ within replicates), indicating good experimental integrity.
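The PCA projection and the sample-to-sample Pearson correlation can be computed along these lines. This is a sketch on synthetic data, not the code that produced the plots above: the matrix is a random stand-in with two artificial groups, and all variable names are ours.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for a samples x genes log-expression matrix:
# 8 samples, 50 genes, two groups that differ by a per-gene shift
baseline = rng.normal(6.0, 2.0, size=50)
shift = rng.normal(0.0, 1.0, size=50)
group_a = baseline + rng.normal(0.0, 0.3, size=(4, 50))
group_b = baseline + shift + rng.normal(0.0, 0.3, size=(4, 50))
X = pd.DataFrame(
    np.vstack([group_a, group_b]),
    index=[f"sample_{i}" for i in range(8)],
)

# Project samples onto the first two principal components (PCA centers the data)
pcs = PCA(n_components=2).fit_transform(X)

# Sample-by-sample Pearson correlation matrix (rows are samples)
corr = np.corrcoef(X.to_numpy())

print(pcs.shape, corr.shape)
```

With the real matrix, `X` would be the gene columns of the ML-ready table, and `pcs` and `corr` would feed a scatter plot and a heatmap respectively.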
4. Copy-Paste Python Baseline Classifier
The following Python code uses our consolidated, machine learning-ready matrix to build a multi-class Random Forest classifier predicting cancer lineages:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# 1. Load the Consolidated ML Matrix (48 Samples x 16,132 Columns)
# Download this CSV from the Kaggle dataset link above
df = pd.read_csv("breast_cancer_ml_matrix.csv", index_col="Sample_ID")

# 2. Separate Metadata Labels from Genomics Features
metadata_columns = ["Cell_Line", "Construct", "Treatment", "Induction", "Timepoint"]
X = df.drop(columns=metadata_columns)  # 16,127 high-confidence gene features
y = df["Cell_Line"]  # Target label

# 3. Transform categorical classes to integers
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# 4. Perform stratified train/test split (75% / 25%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.25, random_state=42, stratify=y_encoded
)

# 5. Scale gene features (convert to Z-score)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 6. Train Random Forest Classifier
print("[*] Training Random Forest Classifier on 16,127 features...")
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_scaled, y_train)

# 7. Model Predictions & Evaluation
y_pred = clf.predict(X_test_scaled)
print(f"\n[+] Classification Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print("\n[+] Classification Report:")
print(classification_report(y_test, y_pred, target_names=le.classes_))
```
Conclusion
By resolving duplicate gene identifiers, normalizing library sizes, stabilizing variance via log transformation, and filtering unexpressed background noise, we turned a raw count file into a dataset ready for downstream ML models. Whether you are training neural networks for classification or modeling temporal progression, clean input remains the foundation of reliable genomics analysis.