七三笔记

Multiclass classification


Cross-validation

Does the `stratify` parameter in the code below guarantee that every value of the label `y` appears at least once in the test set?

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, random_state=random_state, stratify=y
)
```

It tries to, but does not guarantee it.

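`stratify=y` splits the samples so that the training set and the test set each keep approximately the **class proportions of the original labels**. It does **not promise** that every label appears at least once in the test set. Whether a class gets "at least one" test sample depends on two factors:

1. The **count** of that label in the full dataset.
2. The **absolute size** of the test set (`test_size` times the total number of samples).

Some examples:

- The dataset has 1000 samples and a rare label has only 5 of them (0.5%). With `test_size=0.2` the test set holds 200 samples, and the label's proportional share is 5 × 0.2 = 1 sample, so a proportional split aims to put exactly one of them in the test set.
- If the rare label has only 2 samples, its proportional share is 2 × 0.2 = 0.4. Counts must be integers, so this share is rounded to either 0 or 1. When it is rounded to 0, the test set contains **no sample** of that class, and no error is raised.
- If the label has 10 or more samples and `test_size` is not extremely small, every class will usually receive at least one test sample. That is a consequence of the proportions, not a hard guarantee.

`train_test_split` performs the stratified split through `StratifiedShuffleSplit` with `n_splits=1`. That splitter raises a `ValueError` only in a few cases, for example when some class has fewer than 2 members, or when the test set would contain fewer samples than there are classes. It does not check afterwards that every class actually landed in the test set. (The message `The least populated class in y has only X members, which is less than n_splits=...` is a warning issued by `StratifiedKFold`, not an error from this code path.)

Conclusion: `stratify=y` does **not** force every label into the test set. It only keeps the **proportions approximately consistent**, and when a class has very few samples the test set can still end up with an empty class. If you need a hard "at least one per class" constraint, you have to check the result yourself, or top up the test set manually after the stratified split.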
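The "check it yourself" step could look like the minimal sketch below. The helper `missing_classes` and the 998/2 toy dataset are assumptions made for illustration, not part of scikit-learn or of the original question. The rare class has a proportional test share of only 0.4 samples, so it is a candidate for being rounded out of the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def missing_classes(y_all, y_subset):
    """Return the sorted labels that occur in y_all but not in y_subset.

    Hypothetical helper for this note; not a scikit-learn function.
    """
    all_labels = set(np.unique(y_all).tolist())
    subset_labels = set(np.unique(y_subset).tolist())
    return sorted(all_labels - subset_labels)


# Toy data (assumed): 998 samples of class 0 and only 2 samples of class 1.
y = np.array([0] * 998 + [1] * 2)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Class 1 has a proportional test share of 2 * 0.2 = 0.4 samples,
# so it may be rounded down to zero. Check instead of assuming.
absent = missing_classes(y, y_test)
if absent:
    print(f"Classes missing from the test set: {absent}")
else:
    print("Every class appears in the test set.")
```

If `absent` is not empty, possible reactions include moving one sample of each missing class from the training set to the test set, merging very rare classes, or collecting more data. Which one is appropriate depends on the task.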

```python
from sklearn.model_selection import StratifiedKFold
import numpy as np

# Example data
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([0, 0, 0, 1, 1, 1])  # 3 samples of class 0, 3 samples of class 1

# Create the StratifiedKFold object
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Generate the cross-validation splits
for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(f"Fold {fold + 1}:")
    print(f"  Train indices: {train_index}")
    print(f"  Test indices: {test_index}")
    print(f"  Train label distribution: {y[train_index]}")
    print(f"  Test label distribution: {y[test_index]}")
    print("-" * 40)
```

Output (separator lines omitted):

```
Fold 1:
  Train indices: [0 2 3 5]
  Test indices: [1 4]
  Train label distribution: [0 0 1 1]
  Test label distribution: [0 1]
Fold 2:
  Train indices: [0 1 4 5]
  Test indices: [2 3]
  Train label distribution: [0 0 1 1]
  Test label distribution: [0 1]
Fold 3:
  Train indices: [1 2 3 4]
  Test indices: [0 5]
  Train label distribution: [0 0 1 1]
  Test label distribution: [0 1]
```

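```python
from sklearn.model_selection import KFold, StratifiedKFold

# Reuses X and y from the previous example.

# Plain KFold: ignores the labels when forming folds
kf = KFold(n_splits=3, shuffle=True, random_state=42)
print("Plain KFold results:")
for fold, (train_index, test_index) in enumerate(kf.split(X, y)):
    print(f"Fold {fold + 1}: test labels {y[test_index]}")

print("\nStratifiedKFold results:")
# StratifiedKFold: uses the labels to balance each fold
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(f"Fold {fold + 1}: test labels {y[test_index]}")
```

Output as recorded in these notes. Plain `KFold` does not look at the labels, so its folds may be unbalanced with other data or another seed.

```
Plain KFold results:
Fold 1: test labels [0 1]
Fold 2: test labels [0 1]
Fold 3: test labels [0 1]   # may be unbalanced

StratifiedKFold results:
Fold 1: test labels [0 1]   # always balanced
Fold 2: test labels [0 1]
Fold 3: test labels [0 1]
```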
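To make the difference between the two splitters easier to see, here is a small sketch that turns shuffling off. The helper name `fold_test_labels` is an assumption for this note. Because the labels are sorted by class and unshuffled `KFold` cuts the data into consecutive blocks, two of its three test folds contain a single class, while `StratifiedKFold` still places one sample of each class in every test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])  # labels sorted by class


def fold_test_labels(splitter, X, y):
    """Collect the sorted test-set labels of every fold (illustrative helper)."""
    return [sorted(y[test_idx].tolist()) for _, test_idx in splitter.split(X, y)]


# Without shuffling, KFold takes consecutive blocks: [0, 1], [2, 3], [4, 5].
kfold_labels = fold_test_labels(KFold(n_splits=3), X, y)

# StratifiedKFold distributes each class evenly across the folds.
stratified_labels = fold_test_labels(StratifiedKFold(n_splits=3), X, y)

print("KFold:          ", kfold_labels)        # [[0, 0], [0, 1], [1, 1]]
print("StratifiedKFold:", stratified_labels)   # [[0, 1], [0, 1], [0, 1]]
```

Sorted labels are common in practice, for example after concatenating per-class files, which is one reason to shuffle or stratify before cross-validating.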

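```python
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Create an imbalanced dataset
X = np.random.randn(100, 5)
y = np.array([0] * 90 + [1] * 10)  # 90 samples of class 0, 10 samples of class 1

print("Original distribution:", Counter(y))

# With 90/10 labels and 5 folds, each test fold holds 18 samples of class 0
# and 2 samples of class 1, so the 10% share of class 1 is preserved.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
    train_dist = Counter(y[train_index])
    test_dist = Counter(y[test_index])
    print(f"Fold {fold + 1}:")
    print(f"  Train - class 0: {train_dist[0]}, class 1: {train_dist[1]}")
    print(f"  Test  - class 0: {test_dist[0]}, class 1: {test_dist[1]}")
    print(f"  Train share of class 1: {train_dist[1] / len(train_index):.2%}")
    print(f"  Test share of class 1: {test_dist[1] / len(test_index):.2%}")
    print("-" * 50)
```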

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Create an example dataset.
# n_informative=3 is needed because make_classification requires
# n_classes * n_clusters_per_class <= 2 ** n_informative (3 * 2 <= 8);
# the default n_informative=2 raises a ValueError for 3 classes.
X, y = make_classification(
    n_samples=1000,
    n_classes=3,
    n_informative=3,
    weights=[0.7, 0.2, 0.1],
    random_state=42,
)

# Cross-validate with StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the model
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)

    # Predict and evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    scores.append(accuracy)
    print(f"Fold {fold + 1} - accuracy: {accuracy:.4f}")

print(f"\nMean accuracy: {np.mean(scores):.4f} (±{np.std(scores):.4f})")
```
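The manual loop above can also be written more compactly with `cross_val_score`, which accepts a `StratifiedKFold` object through its `cv` argument. The sketch below is one possible variant, not the only way to do it. Two choices in it are assumptions for illustration: `n_informative=3`, which `make_classification` needs for three classes with the default two clusters per class, and the extra `balanced_accuracy` metric, added because plain accuracy can look good on imbalanced data just by favoring the majority class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced 3-class toy data (70% / 20% / 10%).
X, y = make_classification(
    n_samples=1000,
    n_classes=3,
    n_informative=3,
    weights=[0.7, 0.2, 0.1],
    random_state=42,
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# n_estimators=50 only keeps the example quick; it is not a tuned value.
model = RandomForestClassifier(n_estimators=50, random_state=42)

# Same folds, two different metrics.
accuracy = cross_val_score(model, X, y, cv=skf, scoring="accuracy")
balanced = cross_val_score(model, X, y, cv=skf, scoring="balanced_accuracy")

print(f"Accuracy:          {accuracy.mean():.4f} (±{accuracy.std():.4f})")
print(f"Balanced accuracy: {balanced.mean():.4f} (±{balanced.std():.4f})")
```

Because `skf` has a fixed `random_state`, both calls evaluate on the same five folds, so the two metrics are directly comparable.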

Main advantages

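- Preserves class proportions: the class mix in each fold approximately matches the full dataset.
- Reduces bias: it lowers the chance that a fold is missing an important class entirely.
- More reliable evaluation: on imbalanced data, the scores vary less from fold to fold.
- Suits small datasets: it helps every class be represented in both the training and test folds, provided each class has at least `n_splits` samples.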

When to use it

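- Classification problems, especially multiclass ones
- Imbalanced datasets
- Situations that need a stable estimate of model performance
- Small datasets

`StratifiedKFold` is a sensible default cross-validation splitter for classification problems, and it tends to give more reliable and stable performance estimates than an unstratified split.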
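For small datasets, stratification can only place every class in every test fold when each class has at least `n_splits` samples; scikit-learn warns when the rarest class has fewer members than `n_splits`. A pre-check along the following lines can catch this early. The helper `max_stratified_splits` and its policy of requiring at least 2 samples per class are assumptions made for this sketch, not scikit-learn behavior.

```python
from collections import Counter


def max_stratified_splits(y, requested_splits):
    """Cap n_splits at the size of the rarest class (hypothetical helper).

    y can be any iterable of hashable labels.
    """
    smallest = min(Counter(y).values())
    if smallest < 2:
        # Policy of this sketch: a class with a single sample cannot be
        # present in both the training and the test part of any fold.
        raise ValueError("each class needs at least 2 samples")
    return min(requested_splits, smallest)


# Toy labels (assumed): the rarest class has only 3 samples.
labels = [0] * 90 + [1] * 7 + [2] * 3
n_splits = max_stratified_splits(labels, requested_splits=5)
print(n_splits)  # 3
```

The returned value can then be passed as `n_splits` to `StratifiedKFold`. Lowering the number of folds is only one option; merging rare classes or gathering more data may be better, depending on the problem.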

