七三笔记

交叉熵使用示例交叉熵API说明 one hot标签计算交叉熵索引标签计算交叉熵批次损失计算交叉熵公式交叉熵权重 BCELoss 参考

交叉熵使用示例

文本类

 
文本拆分为单词，单词转化为索引，进而句子转化为索引向量
入模型时，需要将索引向量转化为矩阵，即每个单词化为一个向量
预测时，又需要将向量转化为索引，进而与字典中的单词对应

 
在索引转向量时，是随机的，
一个随机的向量反向对应索引时，偏差就比较大
可以在入模前就训练一批向量参数，使 索引 -- 向量参数 -- 索引 之间有一一对应关系

一个索引，比如index=2，对应的就是向量a，
如果得到一个向量a，就立即确定它唯一对应index=2

 
进一步简化，直接求解一个参数矩阵,func(index) ~ func(param_mat) ~ one hot向量 ~ 最大值的位置索引

模型输入的是index,param_mat即包含了index的信息，又包含了模型的规律，还能通过func转化为one hot向量

文本类交叉熵举例

 
import torch
from torch import nn

dict_count = 3 #有三个单词,索引为0，1，2
seq_len = 5
batch_size = 2


#`low` (inclusive) and :attr:`high` (exclusive).
data = torch.randint(low=0, high=dict_count, size=(batch_size,seq_len))
data
tensor([[0, 0, 2, 1, 2],
        [0, 2, 0, 0, 0]])

 
label = data
label.shape
torch.Size([2, 5])

#对X进行编码
embed = nn.Embedding(num_embeddings=dict_count, embedding_dim=dict_count, padding_idx=0)
X = embed(data)

现在求X与label之间的偏差

 
注意X与label的shape并不一样，交叉䊞能自动处理这一点，
但第1维必须是一样的，最多2维，高于2维的必须展平 
认为X[0]的向量，对应y[0]转化后的One hot向量

 
X = X.reshape(-1,dict_count)
label = label.reshape(-1)
print(X.shape,label.shape)  #torch.Size([10, 3]) torch.Size([10])

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(X,label)
loss
tensor(1.3767, grad_fn=NllLossBackward0)

交叉熵API说明

torch.nn.CrossEntropyLoss()

 
This criterion computes the cross entropy loss between input logits and target.
criterion  美 /kraɪˈtɪriən/ n.标准;(评判或作决定的)准则;原则

Init signature:
nn.CrossEntropyLoss(
    weight: Optional[torch.Tensor] = None,
    size_average=None,
    ignore_index: int = -100,
    reduce=None,
    reduction: str = 'mean',
    label_smoothing: float = 0.0,
) -> None

使用

 
loss_fn=nn.CrossEntropyLoss()
loss = loss_fn(input,target)

input:  输入
target: 标签

input

 
input没啥可注意的，重点是后面的target 

unnormalized  非归一化;非标准的

The `input` is expected to contain the unnormalized logits for each class 
(which do `not` need to be positive or sum to 1, in general).

`input` has to be a Tensor of size 
:math:`(C)` for unbatched input,
:math:`(minibatch, C)` or 
:math:`(minibatch, C, d_1, d_2, ..., d_K)` with :math:`K \geq 1` for the `K`-dimensional case. 
The last being useful for higher dimension inputs, 
such as computing cross entropy loss per-pixel for 2D images.

交叉熵的输入(input)的shape，可以是
[特征维度]，没有批次的维度，只有特征的维度，
言外之意就是一行数据，一个样本，
这个特征维度就是模型的最后一层的维度，也是标签的维度
比如，模型输出[0.1,0.9] 跟对应的标签[0,1]求损失，
这是标准的，也是直接按数学理论转换过来的求交叉熵的格式

[batch_size, C]，这说明交叉熵可以按批次求损失 

[batch_size,C,H,W] 这是典型的2维图像的格式，
直接将这个格式作为交叉熵的损失函数也是可以的

target

 
indices 英/ˈɪndɪsiːz/  美/ˈɪndɪsiːz/ （index 的复数形式之一）;索引;标志，迹象;

The `target` that this criterion expects should contain either:
- Class indices in the range :math:`[0, C)` where :math:`C` is the number of classes; 
假设特征数C=3,那么
for i in range(3):
    print(i)

输出：
0
1
2
意思是说，如果标签类别有3,那么标签的编码应该是[0,1,2] 

这就是后面的 索引标签计算交叉熵

loss = loss_fn(y_pred, y) y是1维列表

 
import os 
from tpf import pkl_save,pkl_load 
BASE_DIR = "/wks/datasets/hotel_reader"
file_path = os.path.join(BASE_DIR,'data_pkl/word.pkl')
X_train,y_train,X_test,y_test,words_set,word2idx,idx2word = pkl_load(file_path)

# 字典长度
dict_len = len(words_set)

# 序列长度
seq_len = 512

# 训练集
X_train1 = []
for x in X_train:
    temp = x + ["<PAD>"] * seq_len
    X_train1.append(temp[:seq_len])

# 测试集
X_test1 = []
for x in X_test:
    temp = x + ["<PAD>"] * seq_len
    X_test1.append(temp[:seq_len])


"""
    索引向量化
"""
# 训练集向量化

X_train2 = []
for x in X_train1:
    temp = []
    for word in x:
        idx = word2idx[word] if word in word2idx else word2idx["<UNK>"]
        temp.append(idx)
    X_train2.append(temp)
    

# 测试集向量化

X_test2 = []
for x in X_test1:
    temp = []
    for word in x:
        idx = word2idx[word] if word in word2idx else word2idx["<UNK>"]
        temp.append(idx)
    X_test2.append(temp)


"""
    构建数据集
"""
import numpy as np
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import torch
from torch import nn
from torch.nn import functional as F

class MyDataSet(Dataset):
    
    def __init__(self, X=X_train2, y=y_train):
        self.X = X
        self.y = y
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        x = self.X[idx]
        y = self.y[idx]
        
        return torch.tensor(data=x).long(), torch.tensor(data=y).long()

train_dataset = MyDataSet(X=X_train2, y=y_train)
train_dataloader = DataLoader(dataset=train_dataset, shuffle=True, batch_size=128)

test_dataset = MyDataSet(X=X_test2, y=y_test)
test_dataloader = DataLoader(dataset=test_dataset, shuffle=False, batch_size=256)
train_dataset[0][0][:7]

tensor([ 5004, 10135, 20586,  9530, 16808, 14513,  9530])

 
train_dataset[0][1]
1
    

for X,y in test_dataloader:
    print(X.shape,y.shape)
    break

torch.Size([256, 512]) torch.Size([256])


二分类问题中，loss = loss_fn(y_pred, y)， y是1维列表

标签y是long类型，类别是用整数表示的

 
"""
    构建数据集
"""
import numpy as np
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import torch
from torch import nn
from torch.nn import functional as F

class MyDataSet(Dataset):
    
    def __init__(self,X,y):
        """
        构建数据集
        """
        self.X = X
        self.y =  y.reshape(-1)
        # self.y = y.reshape(-1,1)
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        x = self.X[idx]
        y = self.y[idx]

        return torch.tensor(data=x).float(), torch.tensor(data=y).long()

one hot标签计算交叉熵

 
import torch
from torch import nn 
import numpy as np 

input = torch.tensor([0,0,1],dtype=torch.float32,requires_grad=True)
# target = torch.tensor(2,dtype=torch.long)  # 0.5514
target = torch.tensor([0,1,0],dtype=torch.float32)  # 1.5514
# target = torch.tensor(0,dtype=torch.long)             # 1.5514

loss = nn.CrossEntropyLoss()
output = loss(input, target)
output.backward()
output
tensor(1.5514, grad_fn=DivBackward1)

索引标签计算交叉熵

 
import torch
from torch import nn 
import numpy as np 

input = torch.tensor([0,0,1],dtype=torch.float32,requires_grad=True)
# target = torch.tensor(2,dtype=torch.long)  # 0.5514
# target = torch.tensor([0,1,0],dtype=torch.float32)  # 1.5514
target = torch.tensor(0,dtype=torch.long)             # 1.5514

loss = nn.CrossEntropyLoss()
output = loss(input, target)
output.backward()
output
tensor(1.5514, grad_fn=NllLossBackward0)

该例表明，交叉熵在输入的标签是整数时，默认将其转换为one hot标签，无须再手工转换，其效果与One Hot标签一致

批次损失计算

reduction: str = 'mean'

 
每个样本与标签都有一个损失，
如果输入的是批次，交叉熵默认情况下求出的损失是批次样本损失的均值

reduce参数已废弃，看reduction，其默认值为mean 

batch_size = N 



where 
:math:`x` is the input, 
:math:`y` is the target, 
:math:`w` is the weight,
:math:`C` is the number of classes, and 
:math:`N` spans the minibatch dimension

批次损失计算

 
import torch
from torch import nn 
import numpy as np 

# loss 1.0514
x = [[0,0,1],[0,0,1]]
y = [[0,1,0],[0,0,1]]

# 1.5514
x = [[0,0,1]]
y = [[0,1,0]]

# 0.5514
x = [[0,0,1]]
y = [[0,0,1]]

input = torch.tensor(x,dtype=torch.float32,requires_grad=True)
# target = torch.tensor(2,dtype=torch.long)   
target = torch.tensor(y,dtype=torch.float32)   
# target = torch.tensor(0,dtype=torch.long)            

loss = nn.CrossEntropyLoss()
output = loss(input, target)
output.backward()
output

(0.5514+1.5514)/2 = 1.0514

每个样本与标签都有一个损失，
如果输入的是批次，交叉熵默认情况下求出的损失是批次样本损失的均值

转成索引标签也一样

 
import torch
from torch import nn 
import numpy as np 

# loss 1.0514
x = [[0,0,1],[0,0,1]]
y = [[0,1,0],[0,0,1]]
y = [1,2]  #按one hot编码的方式，这时共有3个类别 

input = torch.tensor(x,dtype=torch.float32,requires_grad=True)
target = torch.tensor(y,dtype=torch.long)  

loss = nn.CrossEntropyLoss()
output = loss(input, target)
output.backward()
output
tensor(1.0514, grad_fn=NllLossBackward0)

交叉熵公式

借pytorch交叉熵的API文档推一下交叉熵的数学公式

 
Probabilities for each class; 
useful when labels beyond a single class per minibatch item are required, 
such as for blended labels, label smoothing, etc. 
The unreduced (i.e. with :attr:`reduction` set to ``'none'``) loss for this case can be described as:

 
.. math::
    \ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad
    l_n = - \sum_{c=1}^C w_c \log \frac{\exp(x_{n,c})}{\sum_{i=1}^C \exp(x_{n,i})} y_{n,c}

 
where 
:math:`x` is the input, 
:math:`y` is the target, 
:math:`w` is the weight,
:math:`C` is the number of classes, and 
:math:`N` spans the minibatch dimension as well as
:math:`d_1, ..., d_k` for the `K`-dimensional case.

 
上图就是softmax，对x中的每个元素进行指数运算，然后再除以指数的和，
其和为1,
它真正要表达的意思是，概率和为1，

比如，二分类问题，类型共有两个，[发生，不发生]，[是，否]，
总类别个数为C，这是全集，
一个事物不在这个类型，就必定在那个类别中

softmax处理的对象是模型的输出，为什么不对标签这么处理？
因为标签本身很固定，确切地说是你亲自编码的，它属于哪个类别是铁板钉钉的事，一算就知 
而模型的输出那可真是乱七八糟，五花八门，
你期望模型输出0.9，期望它接近你的标签1,
但实际上它可能输出的是-0.7,直接就违反了概率要大于1的理论基础啊 
怎么办？
常用的办法是用softmax处理一下，
将模型在各个类别上的差异，转化为符合概率理论的差异

 
xn是一个向量，指批次中第n个样本，它有C个维度（C是类别个数），

yn就是标签，也是一个向量，也有C个维度（C是类别个数），
特别地，如果对标签使用one hot编码（实际上，通常，就是这么做的）
yn中只有一个维度为1,其他维度皆为0

比如，二分类问题，类型共有两个，one hot编码是下面这样的 
发生用[1,0]向量表示，不发生用[0,1]向量表示
二者在距离上完全一样，没有差异 

0乘以任何数还是0,C维度的计算，只算标签为1的那个维度就可以了，
同时先不考虑权重的问题，上面的公式简化为：

 
如此，每个样本与标签的交叉熵损失，转化为模型输出的 信息量 
训练的目的转化为，让模型输出的信息量，越来越小，
小到很容易就分出是哪个类别时，就达到目的了

xn是一个向量，xnc指与yn标签1所在索引对应的x的值，
因为其他位置都是与0相乘，乘完就消失了
因此P(xn)就 只是 其中与标签1对应索引位置的x占xn的概率

交叉熵权重

 
It is useful when training a classification problem with `C` classes.
If provided, the optional argument :attr:`weight` should be a 1D `Tensor`
assigning weight to each of the classes.
This is particularly useful when you have an unbalanced training set.

torch.nn.BCELoss（Binary Cross-Entropy Loss）

 
用途：做单标签二分类（或多标签、多类别二分类）时，把 单个 sigmoid 输出 与 0/1 标签 做逐元素交叉熵计算。

前提：
网络最后一层 不加 softmax/sigmoid，
而是在 损失函数外部手动做 sigmoid（或直接让网络输出 logits，再配合 BCEWithLogitsLoss）。

标签 y 必须是 float 型的 0/1 张量，形状与网络输出完全一致。

单标签二分类（一个样本只有一个正/负标签）

 
import torch
import torch.nn as nn

# 假设 batch_size=4，输出维度=1（正类概率）
logits = torch.tensor([2.0, -1.0, 0.5, 3.0])       # 网络最后输出
labels = torch.tensor([1.0, 0.0, 0.0, 1.0])        # 0/1 标签

# 先把 logits 压缩到 (0,1)
probs = torch.sigmoid(logits)

# 实例化 BCELoss
criterion = nn.BCELoss()
loss = criterion(probs, labels)

print(loss.item())      # 0.36571353673934937

 
# 3 个样本，每个样本有 5 个可能的标签，可重叠
logits = torch.randn(3, 5)          # 网络输出
labels = torch.tensor([[1, 0, 1, 0, 0],
                        [0, 0, 0, 1, 1],
                        [1, 1, 0, 0, 0]], dtype=torch.float)

probs = torch.sigmoid(logits)       # 逐元素映射到 (0,1)
criterion = nn.BCELoss()
loss = criterion(probs, labels)
print(loss.item())   # 0.7811753749847412

 
二分类需要模型输出到1维向量上，标签也是1维 

多分类，模型输出与标签的维度都是2维

 
model = MyNet()              # 最后一层是 nn.Linear(..., 1) 或 nn.Linear(..., num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCELoss()

for x, y in loader:          # y: float 型的 0/1 张量
    optimizer.zero_grad()
    logits = model(x)        # 不 sigmoid
    loss = criterion(torch.sigmoid(logits), y)
    loss.backward()
    optimizer.step()

 

问题	                                  正确做法
网络最后一层直接接了 sigmoid，又用 BCELoss	去掉最后 sigmoid，或将损失函数改为 BCEWithLogitsLoss（自带数值稳定的 sigmoid）
标签是 LongTensor 或含 2,3,4…	           转成 FloatTensor 并确保只有 0/1
类别不平衡	                               使用 pos_weight 参数（BCEWithLogitsLoss 支持）

 
    一句话总结：
    当你需要 逐元素 比较 概率值 和 0/1 标签 时，用 BCELoss；若想让损失函数内部做 sigmoid 并提高数值稳定性，用 BCEWithLogitsLoss。

 
BCEWithLogitsLoss = Sigmoid + BCELoss 的合并实现，数值更稳定；
当正负样本极度不均衡时，可通过 pos_weight 给“正样本损失”加权，抵消类别失衡。

 
场景设定
1000 个样本，只有 20 个正样本（正:负 ≈ 1:49）。
网络对每个样本输出 单个 logit（二分类）。

 
计算 pos_weight
公式：pos_weight = (#负样本) / (#正样本)

pos_w = torch.tensor([(1000-20)/20])   # tensor([49.0])

代码示例

 
import torch
import torch.nn as nn

# 模拟 batch_size=4，每个样本 1 个 logit
logits = torch.tensor([ 1.5, -2.0, 0.3, -0.8 ])   # 网络输出
labels = torch.tensor([ 1.0,  0.0, 0.0,  1.0 ])   # 0/1 标签（float）

# 正样本极少，用 pos_weight 增加其损失权重
pos_weight = torch.tensor([49.0])                 # 49 = 980/20

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
loss = criterion(logits, labels)

print(loss.item())     # 17.058616638183594  比不加权重时大，迫使模型更关注正类

 
pos_weight 给“稀缺类别”的损失乘上 大于 1 的系数，
使模型在反向传播时 更关注少数类，从而缓解类别不平衡。

参考

七三笔记路线：学习，记录，分享