After last week’s exploration, I got more practiced, and I have basically finished the deployment on Android. It should be working, but there is a problem: every time I open the app containing Gemma-2-2b, the emulator crashes.
I think it’s because the emulator’s resources are not enough, so there are two ways to solve it: 1) use a real phone; 2) finetune the model.
I chose the latter.
So the path became clear: now my work is to use some finetuning strategies to make the model work better.
First I tried LoRA, and I will try knowledge distillation and pruning next week. (No quantization, because the model is already 8-bit.) (I later found this idea was wrong :D.)
And here’s a summary of four ways to optimize LLMs:
Quantization
Quantization reduces the numerical precision of model parameters (for example, from 32-bit floats to 8-bit integers), leading to reduced memory usage and faster computation, which is vital for deploying LLMs on devices with limited resources.
Trade-off: While quantization offers efficiency gains, a potential downside is a slight decrease in model accuracy. The impact on accuracy can vary depending on the specific technique and the task at hand.
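To make this concrete, here is a minimal sketch of post-training dynamic quantization in PyTorch, using a toy network as a stand-in for a real LLM (the layer sizes are just examples, not my actual setup):

```python
import torch
import torch.nn as nn

# A toy network stands in for the LLM; the same idea applies at scale.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Convert Linear weights from float32 to int8; activations are
# quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, roughly 4x smaller weights
```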
Knowledge Distillation
This involves training a smaller “student” model to replicate the performance of a larger “teacher” model. The student model learns to mimic the outputs of the teacher model, achieving similar results with fewer parameters.
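As an illustration, here is a minimal sketch of the classic distillation loss, where the temperature T and mixing weight alpha are assumed hyperparameters:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions;
    # the T*T factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```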
Pruning
Pruning involves removing parts of a model (such as neurons or entire layers) that contribute little to the output, effectively reducing the model size and improving computational efficiency without significantly affecting accuracy.
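A minimal sketch of magnitude pruning with PyTorch’s built-in utilities (the layer and the 30% ratio are just examples):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)

# Zero out the 30% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent: drop the mask and keep the zeroed weights.
prune.remove(layer, "weight")
print(float((layer.weight == 0).float().mean()))  # ~0.3
```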
Tuning
Finetuning involves adapting pre-trained models to specific tasks, enhancing their effectiveness without expensive retraining:
Trade-off: It risks overfitting, where the model performs well on training data but poorly on unseen data.
Last Layer Tuning adjusts the final layer of the model, optimizing it for tasks closely related to the original training. The trade-off is that it might not capture complex patterns that require deeper model adjustments.
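A minimal sketch, assuming a Hugging Face classification model whose head attribute happens to be called classifier (the attribute name varies by architecture):

```python
import torch
from transformers import AutoModelForSequenceClassification

# Hypothetical base model, just for illustration.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Freeze every parameter, then unfreeze only the final classification head.
for param in model.parameters():
    param.requires_grad = False
for param in model.classifier.parameters():
    param.requires_grad = True

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```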
Adapter Layer Tuning integrates trainable modules between layers, tailoring the model’s output. The trade-off is that it adds extra parameters to the model, which can increase the computational load.
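A minimal sketch of one such bottleneck adapter module (the dimensions are assumptions; real adapter frameworks insert these into each transformer block):

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""

    def __init__(self, hidden_dim=768, bottleneck=64):  # sizes are assumptions
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_dim)

    def forward(self, x):
        # The residual connection keeps the frozen model's behavior as the
        # starting point; only the small adapter weights are trained.
        return x + self.up(self.act(self.down(x)))
```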
Prefix Tuning adds tunable parameters at the start of the input sequence, directing the model’s focus. The trade-off is that it requires careful design of the prefix to avoid introducing biases.
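A minimal sketch using the peft library’s prefix-tuning support (the base model and the number of virtual tokens are assumptions):

```python
from peft import PrefixTuningConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model

# Prepend 20 trainable "virtual tokens" to the model's attention states;
# the base model's own weights stay frozen.
config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)
model = get_peft_model(base, config)
model.print_trainable_parameters()
```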
Low Rank Adaptation (LoRA) employs low-rank matrices for efficient weight adaptation. The trade-off is that the low-rank constraint can limit the model’s adaptability to highly complex tasks.
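Since LoRA is what I actually tried, here is a minimal sketch using the peft library; the rank, alpha, and target module names are assumptions that depend on the base model’s architecture:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")

config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a tiny fraction is trainable
```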
I feel that a god still lacks something, so long as nothing exists in opposition to it. ————Lucien