Room: Exhibit Hall
Purpose: Inspired by the recent success of transfer learning with convolutional neural networks (CNNs), this work aims to classify ultrasound tongue images using a fine-tuned CNN, with the goal of facilitating ultrasound tongue image analysis for clinical linguists.
Methods: Recently, CNN-based tongue gesture classification has achieved state-of-the-art performance compared with shallow classifiers built on handcrafted features. However, training a CNN from scratch requires a very large labeled dataset, which is laborious to obtain. In our experiment, we explore fine-tuning a pre-trained model. First, we preprocess the images to fit the input of the network architecture. Then, by retraining the weights of the last three layers of the pre-trained model and treating the earlier layers as a fixed feature extractor, the resulting model can classify tongue gestures even with a small labeled dataset. This is motivated by the observation that the earlier layers of the network capture generic features, for example, edges in the image.
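The freeze-and-retrain procedure described above can be sketched in PyTorch. This is a minimal illustration, not the authors' code: the network below is a small VGG-style stand-in with made-up layer sizes (in practice one would load pre-trained VGGNet weights), and only the last three weight layers are left trainable.

```python
import torch
import torch.nn as nn

# Minimal VGG-style network (a stand-in for the VGGNet of the abstract;
# layer sizes are illustrative, not the authors' configuration).
# Input: 1-channel 64x64 ultrasound frame; output: 4 classes (/p, t, k, l/).
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # early layers: generic
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # edge-like features
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 4),
)

# Treat the earlier layers as a fixed feature extractor: freeze everything...
for p in model.parameters():
    p.requires_grad = False
# ...then unfreeze only the last three weight layers for retraining.
for layer in (model[7], model[9], model[11]):
    for p in layer.parameters():
        p.requires_grad = True

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
```

The training loop itself is standard supervised learning; the key point is that gradients flow only through the unfrozen layers, so a small labeled set suffices.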
Results: A set of 20,000 ultrasound tongue images from a single speaker (a native speaker of Chinese) was used; the images were labeled as exhibiting one of the following consonants: /p, t, k, l/. As a benchmark, PCA is used to extract features (the top 100 components), and a support vector machine (SVM) is used as the classifier. The accuracy of this approach is 86.7%. We compare the fine-tuned CNN and a CNN trained from scratch against this benchmark; both follow the same architecture (VGGNet). The fine-tuned CNN performed considerably better, achieving 97.2% accuracy, while the CNN trained from scratch achieved 91.3%.
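The PCA + SVM benchmark can be reproduced with a short scikit-learn pipeline. The sketch below uses randomly generated stand-in data (the real experiment uses 20,000 labeled ultrasound frames, which are not available here); array shapes and hyperparameters other than the 100 PCA components are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in: flattened 64x64 "images", 4 classes (/p, t, k, l/).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 64 * 64))
y = rng.integers(0, 4, size=400)

# Benchmark from the abstract: top-100 PCA components feeding an SVM.
baseline = make_pipeline(PCA(n_components=100), SVC())
baseline.fit(X, y)
pred = baseline.predict(X)
```

On the real data this shallow pipeline reached 86.7% accuracy, which both CNN variants surpass.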
Conclusion: This work demonstrates the potential of fine-tuning a pre-trained convolutional neural network for the tongue gesture classification task. With a limited labeled dataset, this kind of transfer learning method can provide superior performance.