Data Processing
1. After checking the data file, I found that the keyword and location columns contain NaN values. Therefore, I fill the NaNs with "unknown" placeholders.
data_train['keyword'] = data_train['keyword'].fillna('unknown_keyword')
data_train['location'] = data_train['location'].fillna('unknown_location')
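For reference, the check mentioned in step 1 can be reproduced with a short sketch (the file name 'train.csv' is an assumption, not stated above):

import pandas as pd

data_train = pd.read_csv('train.csv')  # hypothetical path to the training file
print(data_train.isna().sum())         # 'keyword' and 'location' show nonzero NaN counts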
2. I use Pandas to combine the keyword, location, and text columns of each row into a single string, and store these strings in a list.
corpus = data_train.apply(lambda row: f"keyword: {row['keyword']} | location: {row['location']} | text: {row['text']}", axis=1).tolist()
3. I use BertTokenizer to tokenize and preprocess the corpus.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
X_bert = tokenizer(
    corpus,
    padding=True,
    truncation=True,
    return_tensors='tf'
)
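The tokenizer returns a dict-like BatchEncoding of TensorFlow tensors; a quick sanity check (a sketch) before building the dataset:

# X_bert behaves like a dict of tf.Tensor objects
print(X_bert.keys())              # input_ids, token_type_ids, attention_mask
print(X_bert['input_ids'].shape)  # (num_rows, max_sequence_length after padding/truncation)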
4. I use tf.data.Dataset to turn X_bert and the labels into a batched dataset for training BERT.
import tensorflow as tf

y = data_train['target'].values.astype('float32')  # 'target' is the label column in the competition's train.csv
dataset = tf.data.Dataset.from_tensor_slices((
    {"input_ids": X_bert["input_ids"], "attention_mask": X_bert["attention_mask"]},
    y
)).batch(32)
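As a design note that goes beyond the original pipeline: shuffling examples before batching and holding out a small validation split makes fine-tuning easier to monitor. A sketch, with n_val as a hypothetical split size:

# optional: shuffle before batching and hold out a validation split
n_val = 512  # hypothetical number of validation examples
full = tf.data.Dataset.from_tensor_slices((
    {"input_ids": X_bert["input_ids"], "attention_mask": X_bert["attention_mask"]},
    y
)).shuffle(buffer_size=len(y), seed=42)
val_ds = full.take(n_val).batch(32)
train_ds = full.skip(n_val).batch(32)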
Model Training
I choose BERT and compile the model with the following settings (a fine-tuning sketch follows):
optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']
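The write-up does not name the exact model class, so here is a minimal fine-tuning sketch assuming TFBertForSequenceClassification from Hugging Face transformers with a single-logit head. Because that head outputs raw logits rather than sigmoid probabilities, the loss and metric are configured with from_logits/threshold instead of the plain string forms above:

from transformers import TFBertForSequenceClassification
import tensorflow as tf

# assumption: a single-logit head for binary classification (num_labels=1)
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=1)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),     # a small LR is typical for fine-tuning
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),  # the head outputs logits, not probabilities
    metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)],   # threshold 0.0 on logits ~ 0.5 on probabilities
)
model.fit(dataset, epochs=3)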
Why?
This challenge is a classification problem, and BERT has features that suit it:
1. Superior Understanding of Context from Both Directions;
2. Effectiveness in Pre-training and Fine-tuning;
3. Deep Contextual Analysis;
4. Versatility Across Multiple Tasks.
Limitations
1. Long Training Time: I trained for only 3 epochs, yet it took 12,719.4 s (roughly 3.5 hours).
2. Not So Superior Performance: the final score is only 0.57033. Perhaps this is because, for some simple text classification tasks such as short-text or binary classification, BERT may not be more effective than traditional machine learning models (e.g., Naive Bayes) or shallow neural networks (e.g., CNNs), which also have advantages in computing resources and training time; see the baseline sketch below.
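To make that comparison concrete, here is a minimal baseline sketch (assuming scikit-learn is available; this was not part of my submission, and scoring='f1' assumes the leaderboard metric is F1):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# TF-IDF + Multinomial Naive Bayes: trains in seconds on this dataset
baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2), MultinomialNB())
scores = cross_val_score(baseline, data_train['text'], data_train['target'], cv=5, scoring='f1')
print(scores.mean())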
TO DO
1. Increase the number of training epochs;
2. Try other models or methods;
3. Write an error-analysis blog post.