Bingwanxing's Blog

Diversity is essential to happiness


Data Process

1. After checking the data file, I found that the keyword and location columns contain NaN values. Therefore, I fill the NaNs with "unknown" placeholders.

data_train['keyword'] = data_train['keyword'].fillna('unknown_keyword')
data_train['location'] = data_train['location'].fillna('unknown_location')

2. I use Pandas to combine the keyword, location, and text columns of each row into a single string, and store these strings in a list.

corpus = data_train.apply(lambda row: f"keyword:{row['keyword']} | location:{row['location']} | text: {row['text']}", axis=1).tolist()

3. I use BertTokenizer to tokenize and preprocess the list.
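
Neither the tokenizer nor the test-set list corpuss used below is shown in this snippet; a minimal sketch of the assumed setup (mirroring the training-set preprocessing, with data_test as a hypothetical name for the test DataFrame) would be:

from transformers import BertTokenizer

# Assumed setup (not shown above): the same bert-base-uncased tokenizer used later,
# and the test set preprocessed exactly like the training set.
# data_test is a hypothetical name for the test DataFrame.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

data_test['keyword'] = data_test['keyword'].fillna('unknown_keyword')
data_test['location'] = data_test['location'].fillna('unknown_location')
corpuss = data_test.apply(lambda row: f"keyword:{row['keyword']} | location:{row['location']} | text: {row['text']}", axis=1).tolist()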

max_len = 0
for sent in corpus:
    # Tokenize the sentence and add the `[CLS]` and `[SEP]` special tokens
    input_ids = tokenizer.encode(sent, add_special_tokens=True)
    max_len = max(max_len, len(input_ids))

print('Max sentence length: ', max_len)

# The same length check for the test corpus (corpuss)
max_len = 0
for sent in corpuss:
    input_ids = tokenizer.encode(sent, add_special_tokens=True)
    max_len = max(max_len, len(input_ids))

print('Max sentence length: ', max_len)

input_ids = []
attention_masks = []

for sent in corpus:
    # Pad or truncate every sentence to 128 tokens and return PyTorch tensors
    X_bert = tokenizer.encode_plus(
        sent,
        add_special_tokens = True,
        max_length = 128,
        padding = 'max_length',
        truncation = True,
        return_attention_mask = True,
        return_tensors = 'pt',
    )
    input_ids.append(X_bert['input_ids'])
    attention_masks.append(X_bert['attention_mask'])

input_ids = torch.cat(input_ids,dim=0)
attention_masks = torch.cat(attention_masks,dim=0)
labels = torch.tensor(y, dtype=torch.long)  # y: the binary target labels (assumed to come from data_train['target'])

print('Original: ', corpus[0])
print('Token IDs:', input_ids[0])

input_idss = []
attention_maskss = []

for sent in corpuss:
    encoded_dicts = tokenizer.encode_plus(
                        sent,
                        add_special_tokens = True,
                        max_length = 128,
                        padding = 'max_length',
                        truncation = True,
                        return_attention_mask = True,
                        return_tensors = 'pt',
                   )
    input_idss.append(encoded_dicts['input_ids'])
    attention_maskss.append(encoded_dicts['attention_mask'])

input_idss = torch.cat(input_idss, dim=0)
attention_maskss = torch.cat(attention_maskss, dim=0)

4. I use a PyTorch TensorDataset to turn the tokenized corpus (train) and corpuss (test) into datasets for BERT.

from torch.utils.data import TensorDataset

dataset = TensorDataset(input_ids, attention_masks, labels)
datasets = TensorDataset(input_idss, attention_maskss)  # the test dataset doesn't include labels
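
The train_dataloader and test_dataloader used in the training and prediction sections aren't shown in this post; a minimal sketch of how they might be built (the batch size of 32 is an assumption) would be:

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 32  # assumed; the post does not state the batch size

# Shuffle the training set each epoch; keep the test set in its original order.
train_dataloader = DataLoader(dataset, sampler=RandomSampler(dataset), batch_size=batch_size)
test_dataloader = DataLoader(datasets, sampler=SequentialSampler(datasets), batch_size=batch_size)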

Model Train

I choose BERT and configure the optimizer and learning-rate scheduler as shown below.
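
The model construction isn't shown in the post; a minimal sketch of the assumed setup (bert-base-uncased with a two-class classification head) would be:

import torch
from transformers import BertForSequenceClassification

# Assumed model: bert-base-uncased with a 2-class head for disaster / non-disaster.
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.to(torch.device('cuda' if torch.cuda.is_available() else 'cpu'))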

optimizer = AdamW(model.parameters(),
                 lr = 2e-5,
                 eps = 1e-8
                 )

from transformers import get_linear_schedule_with_warmup

epochs = 2

total_steps = len(train_dataloader) * epochs

scheduler = get_linear_schedule_with_warmup(optimizer,
                                           num_warmup_steps = 0,
                                           num_training_steps = total_steps)

Then, I conduct training in mini-batches.
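
The loop below relies on random, numpy, time, and a format_time helper that aren't shown; a sketch of the assumed helpers (formatting elapsed seconds as hh:mm:ss is my guess at what format_time does) would be:

import datetime
import random
import time

import numpy as np
import torch

# Hypothetical helper: format a duration in seconds as hh:mm:ss.
def format_time(elapsed):
    return str(datetime.timedelta(seconds=int(round(elapsed))))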

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Store statistics such as training/validation loss, accuracy, and training time
training_stats = []

total_t0 = time.time()

for epoch_i in range(0, epochs):
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Track how long this epoch takes
    t0 = time.time()

    # Reset the total training loss for this epoch
    total_train_loss = 0

    # Put the model in training mode (this does not perform any training by itself)
    model.train()

    # Iterate over the training set in mini-batches
    for step, batch in enumerate(train_dataloader):

        # Print progress every 40 batches
        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        # print(b_input_ids.shape)  # should be (batch_size, max_length)
        # print(b_input_mask.shape) # same as above
        # print(b_labels.shape)     # should be (batch_size,)

        model.zero_grad()  # Clear gradients before each backward pass, since PyTorch accumulates them
    
        # Forward pass
        # loss, logits = model(b_input_ids, 
        #                          token_type_ids=None, 
        #                          attention_mask=b_input_mask, 
        #                          labels=b_labels)
    
        outputs = model(
            input_ids=b_input_ids,
            attention_mask=b_input_mask,
            labels=b_labels  # pass the labels so the model returns a loss
        )
        
        # Extract loss and logits from the outputs
        loss = outputs.loss
        logits = outputs.logits

        print("Loss returned by the model:", loss.item())

        # Accumulate the loss
        total_train_loss += loss.item()
    
        # Backward pass
        loss.backward()

        # Clip gradients to avoid exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update the parameters
        optimizer.step()

        # Update the learning rate
        scheduler.step()

    # Average training loss over the epoch
    avg_train_loss = total_train_loss / len(train_dataloader)

    # Training time for this epoch
    training_time = format_time(time.time() - t0)

Model Prediction

I switch the model to evaluation mode with model.eval() and generate the predictions.

model.eval()

# Tracking variables (true_labels stays empty here, since the test set has no labels)
predictions, true_labels = [], []

# Prediction loop
for batch in test_dataloader:
    # Move the batch to the GPU
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask = batch

    # No gradients are needed for inference
    with torch.no_grad():
        # Forward pass to get the logits
        outputs = model(b_input_ids, attention_mask=b_input_mask)

    logits = outputs.logits

    # Move the results back to the CPU
    batch_logits = logits.detach().cpu().numpy()

    # Store the predicted logits
    predictions.append(batch_logits)

flat_predictions = np.concatenate(predictions, axis=0)
predicted_labels = np.argmax(flat_predictions, axis=1)
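
To turn the predictions into a submission file, a hypothetical sketch (not in the original post; it assumes the test DataFrame data_test has an id column) would be:

import pandas as pd

# Hypothetical submission step: pair each test id with its predicted label and write a CSV.
submission = pd.DataFrame({
    'id': data_test['id'],      # assumes data_test has an 'id' column
    'target': predicted_labels,
})
submission.to_csv('submission.csv', index=False)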

Why?

This challenge is a classification problem, and BERT has these suitable features:
1. Superior Understanding of Context from Both Directions;
2. Effectiveness in Pre-training and Fine-tuning: all we need to do is add an untrained classification layer on top and then fine-tune the model for our classification task;
3. Deep Contextual Analysis;
4. Versatility Across Multiple Tasks.

Limitation

Long training time: I only trained for 2 epochs, but it took 10941.0 s.

TO DO

1. Increase the number of training epochs;
2. Try other models or methods;
3. Add a post about the errors I ran into.

Data Process

1. After checking the data file, I found that the keyword and location columns contain NaN values. Therefore, I fill the NaNs with "unknown" placeholders.

data_train['keyword'] = data_train['keyword'].fillna('unknown_keyword')
data_train['location'] = data_train['location'].fillna('unknown_location')

2. I use Pandas to combine the keyword, location, and text columns of each row into a single string, and store these strings in a list.

corpus = data_train.apply(lambda row: f"keyword:{row['keyword']} | location:{row['location']} | text: {row['text']}", axis=1).tolist()

3. I use BertTokenizer to tokenize and preprocess the list.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the whole corpus at once, padding/truncating and returning TensorFlow tensors
X_bert = tokenizer(
    corpus,
    padding=True,
    truncation=True,
    return_tensors='tf'
)

4. I use a TensorFlow Dataset to transform X_bert into a dataset usable by BERT.

import tensorflow as tf

# y: the binary target labels (assumed to come from data_train['target'])
dataset = tf.data.Dataset.from_tensor_slices((
    {"input_ids": X_bert["input_ids"], "attention_mask": X_bert["attention_mask"]},
    y
)).batch(32)

Model Train

I choose BERT and compile the model with

optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']
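
The post only lists the compile() arguments; a minimal sketch of one way to wire them up (a pooled-output head with a single sigmoid unit on top of TFBertModel, which is an assumption rather than my exact architecture) would be:

import tensorflow as tf
from transformers import TFBertModel

# Hypothetical architecture: BERT encoder + a single sigmoid unit for binary classification.
bert = TFBertModel.from_pretrained('bert-base-uncased')

input_ids = tf.keras.Input(shape=(None,), dtype=tf.int32, name='input_ids')
attention_mask = tf.keras.Input(shape=(None,), dtype=tf.int32, name='attention_mask')

pooled = bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
output = tf.keras.layers.Dense(1, activation='sigmoid')(pooled)

model = tf.keras.Model(
    inputs={'input_ids': input_ids, 'attention_mask': attention_mask},
    outputs=output,
)

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(dataset, epochs=3)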

Why?

This challenge is a classification problem, and BERT has these suitable features:
1. Superior Understanding of Context from Both Directions;
2. Effectiveness in Pre-training and Fine-tuning;
3. Deep Contextual Analysis;
4. Versatility Across Multiple Tasks.

Limitation

1. Long training time: I only trained for 3 epochs, but it took 12719.4 s.
2. Not-so-strong performance: the final score is only 0.57033. Perhaps this is because, for some simple text classification tasks such as short-text or binary classification, BERT may not be more effective than traditional machine learning models (e.g., Naive Bayes) or shallow neural networks (e.g., CNNs), which also have advantages in computing resources and training time.

TO DO

1. Increase the number of training epochs;
2. Try other models or methods;
3. Add a post about the errors I ran into.

This week focuses on paper reading and learning basic knowledge.

First, I summarised the differences between QAT and PTQ.

Quantization-Aware Training (QAT) Core Method

Simulated quantization during training, full-model fine-tuning, quantization of both weights and activations.

Post-Training Quantization (PTQ) Core Method

Quantization applied after training, in either static or dynamic form.
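
To make the PTQ side concrete, here is a minimal dynamic-quantization sketch in PyTorch (the toy model below is my own illustration, not taken from any of the papers):

import torch

# A small float32 model with Linear layers, standing in for a real network.
model_fp32 = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
)

# Dynamic PTQ: weights of the listed layer types are converted to int8 after training;
# activations are quantized on the fly at inference time.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)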

Here is a chart comparing them.

| Feature | Quantization-Aware Training (QAT) | Post-Training Quantization (PTQ) |
| --- | --- | --- |
| Training | Requires retraining with quantization | No retraining needed, applied after training |
| Use Case | Best for high-precision tasks on resource-constrained devices | Fast deployment, suited for simpler quantization tasks |
| Accuracy Loss | Minimal, close to floating-point accuracy | Potential for higher accuracy loss |
| Efficiency Gains | High efficiency on low-precision hardware | Also boosts efficiency, but less optimal for some models |
| Complexity | Higher complexity due to simulated quantization during training | Simpler implementation |

Secondly, I went over Transformer knowledge; my notes are in GoodNotes.

I always feel as though I were living on the open sea, under threat, and yet with an immense happiness in my heart. ————Camus

After last week's exploration, I got more practised. So I have basically finished the deployment on Android, which should be working, but I ran into a problem: every time I open the app containing Gemma-2-2b, the emulator crashes.

I think it's because the emulator's configuration isn't powerful enough. So there are two ways to solve it: 1) use a real phone; 2) fine-tune the model.

I choose the latter.

So the path became clear: my work now is to use some fine-tuning strategies to make the model work better.

First I tried LoRA, and I will try DV and pruning next week (no quantization, because the model is already 8-bit). (Now I have found this idea is wrong :d.)

And here's a summary of four ways to optimize LLMs:

Quantization

Quantization reduces the numerical precision of model parameters (i.e., how many bits each one uses), leading to reduced memory usage and faster computation, which is vital for deploying LLMs on devices with limited resources.

Trade-off: While quantization offers efficiency gains, a potential downside is a slight decrease in model accuracy. The impact on accuracy can vary depending on the specific technique and the task at hand.

Knowledge Distillation

This involves training a smaller “student” model to replicate the performance of a larger “teacher” model. The student model learns to mimic the outputs of the teacher model, achieving similar results with fewer parameters.

Pruning

Pruning involves removing parts of a model (such as neurons or entire layers) that contribute little to the output, effectively reducing the model size and improving computational efficiency without significantly affecting accuracy.

Tuning

Fine-tuning involves adapting pre-trained models to specific tasks, enhancing their effectiveness without extensive retraining.
Trade-off: it risks overfitting, where the model performs well on training data but poorly on unseen data.

Last Layer Tuning adjusts the final layer of the model, optimizing it for tasks closely related to the original training. The trade-off is that it might not capture complex patterns that require deeper model adjustments.

Adapter Layer Tuning integrates trainable modules between layers, tailoring the model's output. The trade-off is that it adds extra parameters to the model, which can increase the computational load.

Prefix Tuning adds tunable parameters at the start of the input sequence, directing the model's focus. The trade-off is that it requires careful design of the prefix to avoid introducing biases.

Low-Rank Adaptation (LoRA) employs low-rank matrices for efficient weight adaptation. The trade-off is that it can potentially limit the model's adaptability to highly complex tasks.
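
Since LoRA is the strategy I tried on Gemma-2-2b, here is a minimal sketch with the peft library (the rank, target modules, dtype, and any training details are assumptions, not my exact configuration):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Hypothetical LoRA setup: wrap the base model so that only small low-rank
# adapter matrices on the attention projections are trained.
base = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension (assumed)
    lora_alpha=16,                         # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (assumed)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable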

I feel that a god still lacks something so long as nothing exists to oppose him. ————Lucien

I tried some tutorials on how to deploy a model on Android, but I got lost. Here is the process.

At first, I wanted to convert the model to TFLite. It is not so easy.

Then I asked someone online for suggestions. He told me I could use the ONNX framework and so on, with no need to change the model format. So I started learning about ONNX and tried to use it, only to find the framework hard to get working.

Eventually, I tried to deploy some tiny models (e.g., computer-vision models) following online tutorials.

Anyway, this week's exploration was chaotic, but it gave me a lot of experience, and it was necessary.

There is no love of life without despair of life. ————Camus

Introduction:

The echoes of remote antiquity are faint, yet the joy of Eve and Adam tasting the forbidden fruit still stirs countless people;
time flows past, and I cannot see God,
yet my chest still feels the tremor of youth,
bathed in the waters of new birth:
“HELLO WORLD”

This blog will contain my code stuff and learning process…and maybe more.
