python – Performance of fine-tuned Apple Foundation Model (.fmadapter) drops significantly after deployment to iOS/Swift app



I'm facing a challenging issue with a fine-tuned Apple Foundation Model and would appreciate any insights.

🎯 The Goal
I've fine-tuned an Apple Foundation Model on my custom dataset. The goal is to integrate this model into an iOS application for on-device inference.

✅ What Works: Training & Testing
The fine-tuning process went very well.

Model: Apple Foundation Model, version 26.0.0

Fine-tuning data: a custom dataset of 37 question-answer pairs in JSONL format

Environment: Python with PyTorch (Kaggle: Linux TPU machine)

Performance: during training and testing in my Python environment, the model's performance was excellent. It gave accurate and relevant answers on my validation set.

❌ The Problem: Poor Performance in the iOS App
After the successful testing phase, I exported the model to an .fmadapter file for use with Core ML. I loaded this model into my iOS project using Swift.
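
For context, this is roughly how I load the adapter on device. It's a minimal sketch: `qa_adapter` is a placeholder resource name, and I'm assuming the FoundationModels adapter-loading API here.

```swift
import FoundationModels

// Minimal sketch of how the fine-tuned adapter is loaded on device.
// "qa_adapter" is a placeholder for the actual resource name in my app bundle.
func makeAdapterSession() throws -> LanguageModelSession {
    guard let adapterURL = Bundle.main.url(forResource: "qa_adapter",
                                           withExtension: "fmadapter") else {
        // The adapter file must be bundled with the app (or downloaded) before use.
        throw CocoaError(.fileNoSuchFile)
    }
    let adapter = try SystemLanguageModel.Adapter(fileURL: adapterURL)
    let model = SystemLanguageModel(adapter: adapter)
    return LanguageModelSession(model: model)
}
```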

However, when I run inference inside the app, the model's performance is extremely poor. It gives nonsensical, irrelevant, or completely incorrect answers, even to the same prompts that worked perfectly during testing.

🤔 What I've Already Checked
I suspect there's a mismatch between my training environment and the on-device execution. Here's what I've investigated so far:

Prompt formatting: I tried to ensure that the input prompt format in my Swift code is identical to the one used during training (see the Swift sketch after this list).

Inference parameters: I've set parameters like temperature to a low value in Swift to get more deterministic results, but the output is still bad.

Quantization: I understand the .fmadapter conversion involves quantization, which could affect performance, but the drop seems too severe to be caused by this alone.
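
This is a sketch of the inference call, covering the first two checks above. The instructions string is a placeholder (in the app it mirrors the system prompt from my training JSONL), and I'm assuming the `respond(to:options:)` and `GenerationOptions` API shapes here.

```swift
import FoundationModels

// Sketch of the inference call, showing what I checked:
// the instructions/prompt format and the sampling parameters.
func answer(_ question: String, using model: SystemLanguageModel) async throws -> String {
    let session = LanguageModelSession(
        model: model,
        instructions: "Answer the user's question concisely."  // placeholder; mirrors training prompt
    )
    // Low temperature to make the output as deterministic as possible.
    let options = GenerationOptions(temperature: 0.1)
    let response = try await session.respond(to: question, options: options)
    return response.content
}
```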

❓ My Question
What could be causing this dramatic difference in performance between my testing environment and the on-device Core ML execution? Are there common mistakes related to prompt formatting, character encoding, or the .fmadapter conversion that I might be missing?

Any help or suggestions for debugging this would be greatly appreciated!
