
Scientific Reproducibility Checks: Checklist for Validating Reproducibility Claims

Introduction: why does reproducibility matter so much?

Folks, has this ever happened to you? A colleague says "Just run my code, it definitely works!", and you end up reinstalling everything three times, still hitting errors, and then asking around about every little thing. That is the reproducibility problem: the ability to get the same results again.

In the middle of the AI/ML boom, reproducibility is more than "does the code run". It is the foundation of science and engineering. In a 2016 Nature survey of more than 1,500 researchers, over 70% said they had tried and failed to reproduce another scientist's experiments. Software is not obviously better off: a project that cannot be rebuilt from its own instructions is a familiar sight.

So how do you make sure your own code is reproducible? Today Hải shares a detailed checklist for validating reproducibility claims, drawn from the hands-on experience of someone who has hauled around stacks of ML papers and reinstalled their code until his hair went gray.

1. Problem analysis: what is reproducibility and why is it hard?

1.1 Defining reproducibility in science and software

Reproducibility means: given the same input data, the same environment, and the same algorithm, repeated runs must produce the same result (or results within an agreed tolerance).

And in practice? Compare:

Aspect	In theory	In practice
Results	Identical	May differ between runs
Environment	Uniform	Differs from machine to machine (OS, libraries, hardware)
Time of run	Irrelevant	Can change results (clock-dependent code, updated dependencies)
Who runs it	Irrelevant	Can change results (undocumented manual steps)

1.2 Why reproducibility fails

🐛 Stochastic errors

import random

# Code A: reproducible
random.seed(42)
result = [random.randint(1, 100) for _ in range(10)]
print(result)  # Same list on every run

# Code B: non-reproducible
random.seed()  # Reseed from OS entropy, which is what a fresh process does by default
result = [random.randint(1, 100) for _ in range(10)]
print(result)  # Different list on every run
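
Seeding the global `random` module works, but any other code that draws from the same global stream shifts the sequence. A minimal sketch of an alternative, assuming you can pass a generator into your own functions (`sample_scores` is a hypothetical name used only for illustration):

```python
import random

def sample_scores(rng, n=10):
    """Draw n integers from the generator passed in, not from global state."""
    return [rng.randint(1, 100) for _ in range(n)]

# Two independent generators built from the same seed yield the same sequence,
# no matter what other code does with the global `random` module in between.
first = sample_scores(random.Random(42))
random.random()  # unrelated global draw; does not disturb the local generators
second = sample_scores(random.Random(42))
print(first == second)  # True
```

Passing the generator explicitly also makes it obvious which functions are stochastic, which helps when reviewing a reproducibility claim.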

⏱️ Timing issues

from datetime import datetime

# Code C: reproducible (reference time is fixed)
def process_data_fixed(data):
    start_time = datetime(2024, 1, 1, 0, 0, 0)
    # ... process data relative to start_time
    return [(start_time.isoformat(), item) for item in data]

# Code D: non-reproducible (reads the wall clock)
def process_data_live(data):
    start_time = datetime.now()
    # ... process data relative to start_time
    return [(start_time.isoformat(), item) for item in data]
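
Hard-coding a date is rarely practical in real code. A common middle ground, sketched here under the assumption that the clock value can be passed as an argument (`build_report` and its fields are hypothetical), is to inject the time so that production reads the clock once while tests and reruns pin it:

```python
from datetime import datetime, timezone

def build_report(data, now=None):
    """Summarize data; `now` is injectable so tests and reruns can pin it."""
    if now is None:
        now = datetime.now(timezone.utc)  # only the outermost caller reads the clock
    return {"generated_at": now.isoformat(), "count": len(data), "total": sum(data)}

fixed = datetime(2024, 1, 1, tzinfo=timezone.utc)
report_a = build_report([1, 2, 3], now=fixed)
report_b = build_report([1, 2, 3], now=fixed)
print(report_a == report_b)      # True
print(report_a["generated_at"])  # 2024-01-01T00:00:00+00:00
```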

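🌐 Dependencies (version drift)

# Environment A: NumPy 1.25.2
pip install numpy==1.25.2

# Environment B: NumPy 1.26.4
pip install numpy==1.26.4

A difference of a single minor version can be enough to change results, for example through changed defaults or different floating-point code paths, and ML pipelines are especially sensitive to this.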
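
One way to catch this kind of drift before a run is to compare what is installed against what was pinned. The sketch below uses the standard library's `importlib.metadata`; the function name `find_version_drift` and the fake lookup are assumptions made for illustration:

```python
from importlib import metadata

def find_version_drift(pins, get_version=metadata.version):
    """Return {package: (pinned, installed)} for every package that differs.

    `pins` maps package name -> expected version string. `get_version` is
    injectable so the check can be exercised without touching the environment.
    """
    drift = {}
    for name, pinned in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # not installed at all
        if installed != pinned:
            drift[name] = (pinned, installed)
    return drift

# Example with a fake lookup standing in for the real environment.
fake_env = {"numpy": "1.26.4", "pandas": "2.2.0"}

def fake_lookup(name):
    if name not in fake_env:
        raise metadata.PackageNotFoundError(name)
    return fake_env[name]

pins = {"numpy": "1.26.4", "pandas": "2.1.4", "scipy": "1.11.4"}
print(find_version_drift(pins, fake_lookup))
# {'pandas': ('2.1.4', '2.2.0'), 'scipy': ('1.11.4', None)}
```

Calling `find_version_drift(pins)` without the second argument checks the live environment, and an empty dict means every pinned package matches.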

2. A comprehensive checklist for validating reproducibility claims

2.1 Checklist for Environment Setup

2.1.1 Version Locking

✅ Required:
– A requirements.txt or pyproject.toml file with exact versions
– A Dockerfile or docker-compose.yml
– An environment.yml (if you use Conda)

❌ Avoid:
– "Install numpy" (no version)
– "Use the latest Python" (not specific)

A concrete example:

# requirements.txt - good
numpy==1.26.4
pandas==2.1.4
scikit-learn==1.3.2
tensorflow==2.15.0

# requirements.txt - bad
numpy
pandas
scikit-learn
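2.1.2 Containerization

Why Docker matters: a container image packages the OS libraries, the interpreter, and the Python packages together, so everyone runs the same userland. It does not cover everything: the host kernel, the GPU driver, and the hardware still come from the machine, and a floating tag such as python:3.11-slim can point to a different image after a rebuild.

# Dockerfile - reproducible
# For strict reproducibility, pin the base image by digest instead of a floating tag
FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    g++ \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy requirements first (so the dependency layer stays cached)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy source code
COPY . .

# Set environment variables
ENV PYTHONPATH=/app
ENV RANDOM_SEED=42
# PYTHONHASHSEED is read when the interpreter starts, so it belongs here
ENV PYTHONHASHSEED=42

# Default command
CMD ["python", "main.py"]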

2.2 Checklist for Data Management

2.2.1 Data Versioning

✅ Required:
– Dataset ID (for example: dataset-v1.2.csv)
– SHA-256 checksum of the data file
– Data provenance (where the data came from)

❌ Avoid:
– "Use last month's data"
– "Grab it from this link" (links die)

Checksum example:

# Create the checksum and store it
sha256sum dataset-v1.2.csv > checksums.txt
cat checksums.txt
# Output: 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08  dataset-v1.2.csv

# Verify the checksum
sha256sum -c checksums.txt
# Output: dataset-v1.2.csv: OK
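
The same check can run inside the pipeline, so that training refuses to start on the wrong data. A minimal sketch using only the standard library, assuming a manifest in `sha256sum` format (the helper names are hypothetical, and the demo file simply contains the bytes `test`, whose SHA-256 is the digest used in this article's examples):

```python
import hashlib
import os
import tempfile

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest_text, base_dir="."):
    """Check '<hex digest>  <filename>' lines; return {filename: True/False}."""
    results = {}
    for line in manifest_text.splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with a leading '*'
        results[name] = file_sha256(os.path.join(base_dir, name)) == expected
    return results

# Demo on a throwaway file whose content is the four bytes b"test".
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "dataset-v1.2.csv"), "wb") as f:
        f.write(b"test")
    manifest = (
        "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"
        "  dataset-v1.2.csv\n"
    )
    check = verify_manifest(manifest, tmp)
print(check)  # {'dataset-v1.2.csv': True}
```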

2.2.2 Data Preprocessing

Problem: the same dataset with different preprocessing gives different results.

# Reproducible preprocessing
# Assumes `data` is a pandas DataFrame with numeric columns only
def preprocess_data(data, seed=42):
    # Handle missing values
    data = data.fillna(data.mean())

    # Normalize (pandas uses the sample standard deviation, ddof=1)
    data = (data - data.mean()) / data.std()

    # Shuffle with a fixed seed; random_state keeps this independent of global RNG state
    data = data.sample(frac=1, random_state=seed)

    return data
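
A function that recomputes the mean and standard deviation from whatever it receives gives different scaling for the training set, the test set, and any later batch. A dependency-free sketch of the usual remedy, learning the parameters once and reusing them (`fit_scaler` and `apply_scaler` are hypothetical names, and a single column of floats stands in for a DataFrame):

```python
import statistics

def fit_scaler(train_values):
    """Learn imputation and scaling parameters from training data only."""
    observed = [v for v in train_values if v is not None]
    return {"mean": statistics.fmean(observed), "std": statistics.stdev(observed)}

def apply_scaler(values, params):
    """Impute missing entries with the stored mean, then standardize."""
    filled = [params["mean"] if v is None else v for v in values]
    return [(v - params["mean"]) / params["std"] for v in filled]

params = fit_scaler([1.0, 2.0, None, 3.0])     # mean 2.0, sample std 1.0
print(apply_scaler([1.0, None, 3.0], params))  # [-1.0, 0.0, 1.0]
print(apply_scaler([4.0], params))             # [2.0]
```

Storing `params` (for example as JSON next to the model) lets anyone reproduce the exact transformation later without access to the original training data.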

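2.3 Checklist for Algorithm Implementation

2.3.1 Randomness Control

Problem: ML algorithms have random components (weight initialization, data shuffling, dropout, and so on).

import os
import random

import numpy as np
import tensorflow as tf

# Set seeds for every library in use
def set_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    # Ask TensorFlow for deterministic op implementations (TF 2.8+);
    # older versions used the TF_DETERMINISTIC_OPS=1 environment variable for this
    tf.config.experimental.enable_op_determinism()
    # PYTHONHASHSEED only takes effect if set before the interpreter starts
    # (shell or Dockerfile); setting it here affects child processes only
    os.environ['PYTHONHASHSEED'] = str(seed)

# Usage (train_model and data are placeholders for your own code)
set_seeds(42)
model = train_model(data)  # Same result on every run, given the same hardware and library versions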
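
Setting a global seed also changes the random state for every other piece of code in the process. When that is undesirable, the seeding can be scoped. A small sketch for the standard library's `random` module only (NumPy and TensorFlow keep their own state and would need their own handling; `seeded` is a hypothetical helper):

```python
import random
from contextlib import contextmanager

@contextmanager
def seeded(seed):
    """Temporarily seed the global `random` module, then restore its old state."""
    saved = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(saved)

with seeded(42):
    first = [random.random() for _ in range(3)]
with seeded(42):
    second = [random.random() for _ in range(3)]
print(first == second)  # True
```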

2.3.2 Deterministic Operations

Problem: some operations are not deterministic by default.

# Non-deterministic by default (initial weights and shuffle order vary from run to run)
model.fit(data, epochs=10)

# Explicit, recorded configuration
model.fit(
    data,
    epochs=10,
    shuffle=False,  # No shuffling; with fixed seeds, shuffle=True is repeatable too
    batch_size=32,
    callbacks=[tf.keras.callbacks.EarlyStopping(
        restore_best_weights=True,
        patience=5
    )]
)
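
A large share of "non-deterministic operations" comes down to one fact: floating-point addition is not associative, so a parallel reduction that adds the same numbers in a different order can return a different result. The sketch below shows the effect in plain Python (an explicit loop is used instead of the built-in `sum`, whose float behavior differs between Python versions):

```python
import math

def naive_sum(xs):
    """Left-to-right floating-point accumulation."""
    total = 0.0
    for x in xs:
        total += x
    return total

values = [1e16, 1.0, -1e16]
reordered = [1e16, -1e16, 1.0]

print(naive_sum(values))     # 0.0  (the 1.0 is absorbed by 1e16 before it cancels)
print(naive_sum(reordered))  # 1.0
print(math.fsum(values), math.fsum(reordered))  # 1.0 1.0
```

This is why deterministic modes on GPUs usually trade some speed for a fixed reduction order, and why comparisons across devices need a tolerance.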

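2.4 Checklist for Experiment Tracking

2.4.1 Experiment Logging

✅ Required:
– Git commit hash
– Timestamp
– Hardware specs (CPU, RAM, GPU)
– Software versions
– Random seeds used
– Data versions

Logging example:

import hashlib
import json
import platform
import subprocess
from datetime import datetime, timezone

import numpy as np
import psutil

def run_or_unknown(cmd):
    """Return the command's stdout, or 'unknown' if it is missing or fails"""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return 'unknown'

def file_sha256(path):
    """Hash the file in 1 MiB chunks so large datasets need not fit in RAM"""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest()

def log_experiment_config(config):
    log = {
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'git_commit': run_or_unknown(['git', 'rev-parse', 'HEAD']),
        'python_version': platform.python_version(),
        'numpy_version': np.__version__,
        'random_seed': config['random_seed'],
        'hardware': {
            'cpu': platform.processor(),
            'ram': f"{psutil.virtual_memory().total / 1e9:.1f} GB",
            'gpu': run_or_unknown(['nvidia-smi', '--query-gpu=name', '--format=csv,noheader'])
        },
        'data_checksum': file_sha256(config['data_path'])
    }

    with open('experiment_log.json', 'w') as f:
        json.dump(log, f, indent=2)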
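
Writing a single JSON file with mode `'w'` keeps only the latest run, which makes it hard to compare a rerun against the run it is supposed to reproduce. A possible variation, sketched with hypothetical helper names, appends one JSON line per run instead:

```python
import json
import os
import tempfile
import uuid

def append_run(log_path, record):
    """Append one experiment record as a single JSON line and return its run id."""
    entry = {"run_id": uuid.uuid4().hex, **record}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry["run_id"]

def read_runs(log_path):
    """Load every logged run, oldest first."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "experiment_log.jsonl")
    append_run(path, {"random_seed": 42, "accuracy": 0.95})
    append_run(path, {"random_seed": 43, "accuracy": 0.94})
    runs = read_runs(path)
print(len(runs))  # 2
```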

2.4.2 Results Validation

✅ Required:
– Expected output
– Tolerance thresholds
– Statistical tests (if applicable)

Validation example:

import numpy as np

def validate_results(actual, expected, tolerance=1e-5):
    """
    Validate that a result reproduces the expected value.

    Args:
        actual: Observed result
        expected: Expected result
        tolerance: Maximum allowed absolute difference

    Returns:
        bool: True if the result is within tolerance
        str: "OK", or a message describing the mismatch
    """
    if isinstance(actual, (int, float)):
        if abs(actual - expected) > tolerance:
            return False, f"Difference: {abs(actual - expected):.6f} > {tolerance}"
        return True, "OK"

    if isinstance(actual, np.ndarray):
        # rtol=0 so that only the absolute tolerance applies, matching the message below
        if not np.allclose(actual, expected, rtol=0, atol=tolerance):
            diff = np.max(np.abs(actual - expected))
            return False, f"Max diff: {diff:.6f} > {tolerance}"
        return True, "OK"

    return False, "Unsupported data type"
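
Evaluation code often returns a dict of named metrics rather than a bare number or array. A sketch of the same tolerance idea for that case, using `math.isclose` from the standard library (`compare_metrics` and its defaults are assumptions for illustration):

```python
import math

def compare_metrics(actual, expected, rel_tol=0.0, abs_tol=1e-5):
    """Return a list of human-readable mismatches between two metric dicts."""
    problems = []
    for key in sorted(set(actual) | set(expected)):
        if key not in actual or key not in expected:
            problems.append(f"{key}: missing on one side")
        elif not math.isclose(actual[key], expected[key], rel_tol=rel_tol, abs_tol=abs_tol):
            problems.append(f"{key}: {actual[key]} vs {expected[key]}")
    return problems

print(compare_metrics({"accuracy": 0.950001, "loss": 0.12}, {"accuracy": 0.95, "loss": 0.12}))
# []
print(compare_metrics({"accuracy": 0.93}, {"accuracy": 0.95, "loss": 0.12}))
# ['accuracy: 0.93 vs 0.95', 'loss: missing on one side']
```

Returning every mismatch, instead of stopping at the first, makes a failed reproduction easier to diagnose.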

3. Technical use case: when an ML system needs reproducibility

3.1 Context

An AI startup has just built a breast cancer detection model with 95% accuracy. They need to deploy it to hospitals, but there is a problem: every rerun gives a different accuracy (anywhere from 90% to 98%).

3.2 The problem

Root causes:
1. Training data is shuffled randomly on every run
2. Model initialization differs between runs
3. GPU operations are not deterministic
4. The dataset is not under version control

3.3 Solution

3.3.1 Reproducible source code

import hashlib
import json
import os
import platform
import random
from datetime import datetime

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

class ReproducibleCancerDetector:
    def __init__(self, seed=42):
        self.seed = seed
        self.set_reproducible_environment()
        self.model = None

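    def set_reproducible_environment(self):
        """Seed every RNG in use and request deterministic TensorFlow ops"""
        np.random.seed(self.seed)
        random.seed(self.seed)
        tf.random.set_seed(self.seed)

        # Deterministic op implementations (TF 2.8+); this supersedes the
        # TF_DETERMINISTIC_OPS / TF_CUDNN_DETERMINISTIC environment variables
        tf.config.experimental.enable_op_determinism()
        # Only affects child processes; for this process, set PYTHONHASHSEED before launch
        os.environ['PYTHONHASHSEED'] = str(self.seed)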

    def load_data(self, data_path, test_size=0.2):
        """Load and preprocess data"""
        # Load data
        data = np.load(data_path)

        # Split data (with a fixed seed)
        X_train, X_test, y_train, y_test = train_test_split(
            data['images'], data['labels'],
            test_size=test_size,
            random_state=self.seed,
            stratify=data['labels']
        )

        return X_train, X_test, y_train, y_test

    def build_model(self):
        """Build ML model"""
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(256, 256, 3)),
            tf.keras.layers.MaxPooling2D((2, 2)),
            tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
            tf.keras.layers.MaxPooling2D((2, 2)),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(128, activation='relu'),
            tf.keras.layers.Dropout(0.5),
            tf.keras.layers.Dense(1, activation='sigmoid')
        ])

        return model

    def train(self, X_train, y_train, X_val, y_val, epochs=50):
        """Train model with reproducible settings"""
        # Build model
        self.model = self.build_model()

        # Compile model
        self.model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
            loss='binary_crossentropy',
            metrics=['accuracy']
        )

        # Train model
        history = self.model.fit(
            X_train, y_train,
            epochs=epochs,
            batch_size=32,
            validation_data=(X_val, y_val),
            shuffle=True,  # Still shuffles; the order is repeatable because the seed is fixed
            callbacks=[
                tf.keras.callbacks.EarlyStopping(
                    monitor='val_loss',
                    patience=10,
                    restore_best_weights=True
                )
            ]
        )

        return history

    def evaluate(self, X_test, y_test):
        """Evaluate model"""
        results = self.model.evaluate(X_test, y_test, verbose=0)
        return dict(zip(self.model.metrics_names, results))

    def save_model(self, path):
        """Save model with metadata"""
        # Save model
        self.model.save(path)

        # Save metadata
        metadata = {
            'seed': self.seed,
            'timestamp': datetime.now().isoformat(),
            'python_version': platform.python_version(),
            'tensorflow_version': tf.__version__,
            'numpy_version': np.__version__,
            'dataset_checksum': self.calculate_dataset_checksum()
        }

        with open(f"{path}/metadata.json", 'w') as f:
            json.dump(metadata, f, indent=2)

    def calculate_dataset_checksum(self, dataset_path='data/cancer_dataset_v1.2.npz'):
        """Calculate the SHA-256 checksum of the dataset file"""
        # The default matches the path used in the usage example below
        with open(dataset_path, 'rb') as f:
            return hashlib.sha256(f.read()).hexdigest()

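# Usage
detector = ReproducibleCancerDetector(seed=42)
X_train, X_test, y_train, y_test = detector.load_data('data/cancer_dataset_v1.2.npz')
# Carve a validation set out of the training data so early stopping never sees the test set
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)
history = detector.train(X_train, y_train, X_val, y_val, epochs=50)
results = detector.evaluate(X_test, y_test)
detector.save_model('models/cancer_detector_v1')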

3.3.2 Validation and Testing

import unittest

import numpy as np

class TestReproducibility(unittest.TestCase):
    def test_consistency(self):
        """Two training runs with the same seed should give matching results"""
        # load_test_data is a placeholder for your own fixture loader
        X_train, X_test, y_train, y_test = load_test_data()

        # Construct and train each model back to back: the seeds are set in
        # __init__, so building both first would leave model2 with used RNG state
        model1 = ReproducibleCancerDetector(seed=42)
        history1 = model1.train(X_train, y_train, X_test, y_test, epochs=10)

        model2 = ReproducibleCancerDetector(seed=42)
        history2 = model2.train(X_train, y_train, X_test, y_test, epochs=10)

        # Compare weights
        weights1 = model1.model.get_weights()
        weights2 = model2.model.get_weights()

        for w1, w2 in zip(weights1, weights2):
            np.testing.assert_array_almost_equal(w1, w2, decimal=5)

        # Compare accuracy
        acc1 = history1.history['accuracy'][-1]
        acc2 = history2.history['accuracy'][-1]

        self.assertAlmostEqual(acc1, acc2, places=5)

    def test_different_seed(self):
        """Different seeds should give different trained weights"""
        X_train, X_test, y_train, y_test = load_test_data()

        model1 = ReproducibleCancerDetector(seed=42)
        model1.train(X_train, y_train, X_test, y_test, epochs=10)

        model2 = ReproducibleCancerDetector(seed=43)
        model2.train(X_train, y_train, X_test, y_test, epochs=10)

        # Compare weights rather than accuracy: two seeds can land on the same accuracy by chance
        differs = any(
            not np.allclose(w1, w2)
            for w1, w2 in zip(model1.model.get_weights(), model2.model.get_weights())
        )
        self.assertTrue(differs)

if __name__ == '__main__':
    unittest.main()

4. Tools and libraries that support reproducibility

4.1 MLflow – Experiment management

import mlflow
import mlflow.sklearn

# Start tracking
mlflow.set_experiment("cancer_detection")

with mlflow.start_run(run_name="reproducible_training"):
    # Log parameters
    mlflow.log_param("random_seed", 42)
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("epochs", 50)

    # Train model (train_model and data are placeholders for your own code)
    model = train_model(data, seed=42)

    # Log metrics (illustrative constants; log the values your evaluation returns)
    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_metric("loss", 0.12)

    # Log model with the flavor that matches its type
    # (mlflow.sklearn for scikit-learn estimators, mlflow.tensorflow for Keras models)
    mlflow.sklearn.log_model(model, "cancer_detector")

    # Log artifacts
    mlflow.log_artifact("experiment_log.json")
    mlflow.log_artifact("requirements.txt")

4.2 DVC – Data Version Control

# dvc.yaml
# The params entries refer to keys in params.yaml (prepare.seed, train.epochs, ...)
stages:
  prepare:
    cmd: python prepare_data.py
    deps:
      - prepare_data.py
      - data/raw
    params:
      - prepare.seed
    outs:
      - data/processed

  train:
    cmd: python train.py
    deps:
      - train.py
      - data/processed
      - src/model.py
      - src/utils.py
    params:
      - prepare.seed
      - train.epochs
      - train.learning_rate
    outs:
      - models/cancer_detector
    metrics:
      - metrics/accuracy.json:
          cache: false
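
The idea behind a pipeline file like this is that a stage reruns only when something it depends on has changed. The following is a conceptual sketch of that idea, not DVC's actual implementation; the function names and the short stand-in hashes are assumptions:

```python
import hashlib
import json

def stage_fingerprint(cmd, dep_hashes, params):
    """Hash everything a stage depends on; any change yields a new fingerprint."""
    payload = json.dumps(
        {"cmd": cmd, "deps": dep_hashes, "params": params},
        sort_keys=True,  # key order must not affect the fingerprint
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def needs_rerun(previous_fingerprint, cmd, dep_hashes, params):
    """True when the stage inputs no longer match the recorded fingerprint."""
    return stage_fingerprint(cmd, dep_hashes, params) != previous_fingerprint

deps = {"src/model.py": "abc123"}  # stand-in for real content hashes
base = stage_fingerprint("python train.py", deps, {"epochs": 50, "seed": 42})
print(needs_rerun(base, "python train.py", deps, {"seed": 42, "epochs": 50}))  # False
print(needs_rerun(base, "python train.py", deps, {"seed": 43, "epochs": 50}))  # True
```

A seed that is tracked as a parameter takes part in this check; a seed that only appears in a comment or free-form metadata does not.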

4.3 Hydra – Configuration Management

# config.yaml
defaults:
  - _self_

seed: 42
data:
  path: "data/cancer_dataset_v1.2.npz"
  test_size: 0.2
train:
  epochs: 50
  batch_size: 32
  learning_rate: 0.001

# main.py
import hydra
from omegaconf import DictConfig

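# config_path is the directory holding the config; config_name is the file name without .yaml
@hydra.main(version_base=None, config_path=".", config_name="config")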
def main(cfg: DictConfig):
    print(f"Running with seed: {cfg.seed}")
    print(f"Data path: {cfg.data.path}")
    # ... training code

if __name__ == "__main__":
    main()
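
Configuration tools of this kind let a run be described as "base config plus a few overrides", for example `train.learning_rate=0.01` on the command line. The sketch below illustrates that merge with plain dicts; it is not Hydra's implementation, `apply_overrides` is a hypothetical name, and the simple type coercion would not handle booleans correctly:

```python
import copy

def apply_overrides(config, overrides):
    """Return a copy of `config` with 'a.b.c=value' overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            node = node[part]           # KeyError on unknown sections, by design
        if leaf not in node:
            raise KeyError(dotted_key)  # refuse to create new keys from typos
        node[leaf] = type(node[leaf])(raw_value)  # coerce to the existing value's type
    return result

base_cfg = {"seed": 42, "train": {"epochs": 50, "learning_rate": 0.001}}
run_cfg = apply_overrides(base_cfg, ["seed=7", "train.learning_rate=0.01"])
print(run_cfg)           # {'seed': 7, 'train': {'epochs': 50, 'learning_rate': 0.01}}
print(base_cfg["seed"])  # 42
```

Logging the resolved config for every run, not just the base file, is what makes the run repeatable later.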

5. Best practices for reproducibility

5.1 Documentation

✅ Required:
– A detailed README
– Installation guide
– Data acquisition instructions
– Expected results

❌ Avoid:
– "Just run main.py"
– "Install a few libraries"

Example README:

# Cancer Detector - Reproducible ML System

## Prerequisites

- Python 3.11 (the same version as the Docker base image)
- NVIDIA GPU (optional but recommended)

## Installation

    # 1. Clone repository
    git clone https://github.com/your-org/cancer-detector.git
    cd cancer-detector

    # 2. Create virtual environment
    python -m venv venv
    source venv/bin/activate  # Linux/Mac
    # venv\Scripts\activate  # Windows

    # 3. Install dependencies
    pip install -r requirements.txt

    # 4. Setup Docker (optional)
    docker build -t cancer-detector .

## Data Requirements

- Dataset: cancer_dataset_v1.2.npz
- Checksum: 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
- Source: Link to dataset

## Running

    # With Python
    python main.py --seed=42 --data_path=data/cancer_dataset_v1.2.npz

    # With Docker (arguments replace the image's default command, so pass the full command)
    docker run --rm -v "$(pwd)/data:/app/data" cancer-detector python main.py --seed=42

## Expected Results

- Accuracy: ~95% ± 1%
- Loss: ~0.15 ± 0.05
- Training time: ~10 minutes on GPU

## Troubleshooting

### Accuracy too low (< 90%)

- Check data checksum
- Verify random seed is set
- Ensure GPU is available (if required)

### Docker build fails

- Check that the Docker daemon is running
- Rebuild with `docker build --no-cache -t cancer-detector .` to rule out a stale cached layer

5.2 Testing

import numpy as np

# generate_test_data, run_pipeline, and calculate_checksum are placeholders for your own pipeline code

def test_reproducibility():
    """Test reproducibility of the whole pipeline"""
    # Setup
    seed = 42
    np.random.seed(seed)

    # Generate data
    data = generate_test_data(seed)

    # Run pipeline twice
    result1 = run_pipeline(data, seed)
    result2 = run_pipeline(data, seed)

    # Validate
    assert np.allclose(result1, result2, atol=1e-5), "Results not reproducible"

def test_different_seeds():
    """Check that different seeds actually change the result"""
    seed1, seed2 = 42, 43

    data = generate_test_data(seed1)

    result1 = run_pipeline(data, seed1)
    result2 = run_pipeline(data, seed2)

    assert not np.allclose(result1, result2, atol=1e-5), "Different seeds gave same results"

def test_data_integrity():
    """Test the integrity of the dataset"""
    expected_checksum = "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"
    actual_checksum = calculate_checksum("data/cancer_dataset_v1.2.npz")

    assert actual_checksum == expected_checksum, "Data checksum mismatch"

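6. The future of reproducibility: trends and challenges

6.1 Emerging Technologies

6.1.1 WebAssembly for Reproducible Computing

Problem: different Python environments give different results.

Approach: run the code on a WebAssembly runtime. Pyodide, for example, is CPython compiled to WebAssembly, so the interpreter and its numeric behavior are largely the same on every host. This narrows environment differences, but it does not remove randomness or clock reads from your own code.

# Runs unchanged under Pyodide (CPython on WebAssembly) or regular CPython
import math

def reproducible_function(data):
    # Pure function: no RNG, clock, or I/O, so the same input gives the same output
    # math.fsum returns a correctly rounded sum, independent of element order
    return math.fsum(data) / len(data)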

6.1.2 Federated Learning and Reproducibility

Challenge: data is distributed across many devices, so how do you reproduce a run?

Approach: simulation environments.

import numpy as np
import tensorflow as tf

class FederatedReproducibleSimulator:
    def __init__(self, num_clients=100, seed=42):
        self.num_clients = num_clients
        self.seed = seed
        self.set_seeds()

    def set_seeds(self):
        """Derive one seed per client from the experiment seed"""
        self.client_seeds = np.random.RandomState(self.seed).randint(
            0, 2**31 - 1, size=self.num_clients
        )

    def simulate_client(self, client_id, data):
        """Simulate local training on one client"""
        # Seeds Python, NumPy, and TensorFlow together (TF 2.7+)
        tf.keras.utils.set_random_seed(int(self.client_seeds[client_id]))

        # Local training (create_model is a placeholder returning a compiled Keras model)
        model = create_model()
        history = model.fit(
            data,
            epochs=5,
            batch_size=32,
            shuffle=True
        )

        return model.get_weights()

    def aggregate_results(self, client_results):
        """Aggregate results from all clients"""
        # Unweighted average, layer by layer (each client returns a list of arrays)
        return [np.mean(layer_weights, axis=0) for layer_weights in zip(*client_results)]
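
Two details of a simulator like this are worth spelling out. Drawing client seeds from a small range such as 0 to 10,000 makes collisions between a hundred clients fairly likely, and a plain mean ignores how much data each client holds. The sketch below, with hypothetical helper names and flat lists standing in for weight tensors, derives each seed from a hash and weights the average by sample count in the style of FedAvg:

```python
import hashlib

def client_seed(base_seed, client_id):
    """Derive a stable 32-bit seed for one client from the experiment seed."""
    digest = hashlib.sha256(f"{base_seed}:{client_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def weighted_average(client_weights, sample_counts):
    """FedAvg-style mean of flat weight vectors, weighted by sample count."""
    total = sum(sample_counts)
    length = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, sample_counts)) / total
        for i in range(length)
    ]

print(client_seed(42, 3) == client_seed(42, 3))            # True: same inputs, same seed
print(weighted_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))  # [2.5, 3.5]
```

Because each seed depends only on the experiment seed and the client id, adding or removing other clients does not change it.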

6.2 Challenges

6.2.1 Hardware Variability

Problem: CPU, GPU, and TPU can give different results for the same computation.

Approach: make each backend deterministic on its own, and compare across backends with a tolerance rather than expecting bitwise equality.

import torch

def set_deterministic_mode(hardware='gpu'):
    """Enable deterministic mode for the given hardware"""
    # Raise an error whenever an op has no deterministic implementation
    torch.use_deterministic_algorithms(True)
    if hardware == 'gpu':
        # cuDNN settings only matter on CUDA devices
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        # Some cuBLAS ops also need CUBLAS_WORKSPACE_CONFIG=:4096:8 set before launch
    elif hardware == 'tpu':
        # TPUs expose fewer options
        pass

6.2.2 Time-dependent Operations

Problem: learning rate schedules, early stopping, and similar logic change the outcome if they depend on wall-clock time or on noisy metrics.

Approach: fixed schedules keyed to the epoch number.

class FixedLearningRateScheduler:
    def __init__(self, initial_lr=0.001, decay_rate=0.95):
        self.initial_lr = initial_lr
        self.decay_rate = decay_rate

    def get_lr(self, epoch):
        """Get learning rate for epoch"""
        # Fixed exponential decay: depends only on the epoch number
        return self.initial_lr * (self.decay_rate ** epoch)

    def get_dropout_rate(self, layer_id):
        """Get dropout rate (fixed for reproducibility)"""
        # Fixed dropout rates
        return 0.5 if layer_id % 2 == 0 else 0.3

Summary: 3 core points about reproducibility

1. Reproducibility is the foundation of science and engineering. Without it we cannot trust results, reuse code, or build on earlier work.
2. A checklist is the key to success. Use a detailed checklist to validate reproducibility claims, covering environment setup, data management, algorithm implementation, and experiment tracking.
3. Tools and best practices are necessary. MLflow, DVC, Hydra, Docker, and a testing framework help build reproducible systems in a professional way.

Discussion question

Folks, have you ever run into a reproducibility problem? What was your experience? Any tools or techniques worth sharing?


Hải's AI assistant
Content directed by Hải; the AI assistant helped write out the details.