Abstract
Skeleton action recognition has advanced considerably in recent years, but its integration with zero-shot learning remains relatively underexplored. Mainstream methods focus primarily on aligning textual and skeleton data, often neglecting semantic feature compensation for skeleton visual features. Semantic feature compensation uses multiple semantic features to fill the information gaps in skeleton visual data, providing different descriptive references for visual features through rich semantic information. To address this limitation, this paper proposes the Top Semantic Embedding (TSE) framework, which enriches the embedding of semantic features into visual features: through a scoring mechanism, the framework selects the best semantic-visual feature pairs to improve the model's learning performance. To further strengthen the connection between the semantic and visual modalities, a Dynamic Feature Matching (DFM) method is also proposed. By constructing a multimodal attention matrix between semantic and visual features, the DFM module enables visual features to adaptively match the most relevant semantic features, creating tighter connections between the two modalities. Experiments on three benchmark datasets show that the proposed method achieves accuracies of 82.63% on NTU-60, 77.23% on PKU-MMD, and 58.21% on NTU-120, improving on the state-of-the-art method SMIE by 4.65%, 8.08%, and 1.11%, respectively.
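To make the two mechanisms concrete, the PyTorch sketch below illustrates the general form of the ideas the abstract describes: a TSE-style scoring step that keeps the best semantic-visual feature pairs, and a DFM-style attention matrix that lets each visual feature adaptively attend to the most relevant semantic features. This is a minimal sketch under assumed shapes and names (top_semantic_embedding, dynamic_feature_matching, cosine scoring, and the residual fusion are all assumptions of this illustration), not the authors' implementation, which the paper itself defines.

# Minimal sketch of the two ideas described in the abstract.
# All function names, shapes, and fusion choices here are illustrative
# assumptions, not the authors' TSE/DFM implementation.
import torch
import torch.nn.functional as F


def top_semantic_embedding(visual, semantics, k=2):
    """TSE-style selection (sketch): score every semantic-visual pair by
    cosine similarity and keep the top-k semantic features per sample.

    visual:    (B, D) skeleton visual features
    semantics: (S, D) candidate semantic (text) features
    returns:   (B, k, D) the k best-matching semantic features per sample
    """
    scores = F.normalize(visual, dim=-1) @ F.normalize(semantics, dim=-1).T  # (B, S)
    topk = scores.topk(k, dim=-1).indices                                    # (B, k)
    return semantics[topk]                                                   # (B, k, D)


def dynamic_feature_matching(visual, semantics):
    """DFM-style matching (sketch): build a multimodal attention matrix
    between visual queries and semantic keys, then aggregate the semantic
    features so each visual feature adaptively attends to its most
    relevant semantics.

    visual: (B, D), semantics: (S, D); returns (B, D) compensated features
    """
    attn = torch.softmax(visual @ semantics.T / semantics.shape[-1] ** 0.5, dim=-1)  # (B, S)
    compensated = attn @ semantics                                                   # (B, D)
    return visual + compensated  # residual fusion: an assumption of this sketch


if __name__ == "__main__":
    B, S, D = 4, 10, 256
    vis, sem = torch.randn(B, D), torch.randn(S, D)
    print(top_semantic_embedding(vis, sem).shape)    # torch.Size([4, 2, 256])
    print(dynamic_feature_matching(vis, sem).shape)  # torch.Size([4, 256])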








Data Availability
Data will be made available on reasonable request.
References
Yue, G., Jiao, G., Li, C., Xiang, J.: When CNN meet with ViT: decision-level feature fusion for camouflaged object detection. Vis. Comput. 41(6), 3957–3972 (2025)
Yue, G., Jiao, G., Xiang, J.: Semi-supervised iterative learning network for camouflaged object detection. In: ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2025). IEEE
Wang, F., Jiao, G., Yue, G.: More observation leads to more clarity: Multi-view collaboration network for camouflaged object detection. Neurocomputing, 130433 (2025)
Fang, B., Wu, W., Liu, C., Zhou, Y., Song, Y., Wang, W., Shu, X., Ji, X., Wang, J.: UATVR: Uncertainty-adaptive text-video retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13723–13733 (2023)
Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3d object detection and tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11784–11793 (2021)
Cadena, C., Dick, A.R., Reid, I.D.: Multi-modal auto-encoders as joint estimators for robotics scene understanding. In: Robotics: Science and Systems, 5 (2016)
Wang, X., Fang, Z., Li, X., Li, X., Chen, C., Liu, M.: Skeleton-in-context: Unified skeleton sequence modeling with in-context learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2436–2446 (2024)
Patraucean, V., Smaira, L., Gupta, A., Recasens, A., Markeeva, L., Banarse, D., Koppula, S., Malinowski, M., Yang, Y., Doersch, C., et al.: Perception test: A diagnostic benchmark for multimodal video models. Adv. Neural Inf. Process. Syst. 36, 42748–42761 (2024)
Pi, R., Yao, L., Gao, J., Zhang, J., Zhang, T.: PerceptionGPT: Effectively fusing visual perception into LLM. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 27124–27133 (2024)
Noh, S., Bae, K., Bae, Y., Lee, B.-D.: H\(^3\)Net: Irregular posture detection by understanding human character and core structures. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5631–5641 (2024)
Cunico, F., Girella, F., Avogaro, A., Emporio, M., Giachetti, A., Cristani, M.: OO-dMVMT: A deep multi-view multi-task classification framework for real-time 3d hand gesture classification and segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2745–2754 (2023)
Li, J., Zhang, J., Schmidt, L., Ratner, A.J.: Characterizing the impacts of semi-supervised learning for weak supervision. Adv. Neural Inf. Process. Syst. 36 (2024)
Zhang, Z., Wang, X., Zhang, Z., Shen, G., Shen, S., Zhu, W.: Unsupervised graph neural architecture search with disentangled self-supervision. Adv. Neural Inf. Process. Syst. 36 (2024)
Guo, T., Liu, H., Chen, Z., Liu, M., Wang, T., Ding, R.: Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 762–770 (2022)
Hou, W., Chen, S., Chen, S., Hong, Z., Wang, Y., Feng, X., Khan, S., Khan, F.S., You, X.: Visual-augmented dynamic semantic prototype for generative zero-shot learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23627–23637 (2024)
Chen, S., Hou, W., Khan, S., Khan, F.S.: Progressive semantic-guided vision transformer for zero-shot learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23964–23974 (2024)
Gupta, P., Sharma, D., Sarvadevabhatla, R.K.: Syntactically guided generative embeddings for zero-shot skeleton action recognition. In: 2021 IEEE International Conference on Image Processing (ICIP), pp. 439–443 (2021). IEEE
Li, M.-Z., Jia, Z., Zhang, Z., Ma, Z., Wang, L.: Multi-semantic fusion model for generalized zero-shot skeleton-based action recognition. In: International Conference on Image and Graphics, pp. 68–80 (2023). Springer
Xu, H., Gao, Y., Li, J., Gao, X.: An information compensation framework for zero-shot skeleton-based action recognition. IEEE Trans. Multimed. (2025)
Zhu, A., Ke, Q., Gong, M., Bailey, J.: Part-aware unified representation of language and skeleton for zero-shot action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18761–18770 (2024)
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084 (2019)
Cao, Z., Simon, T., Wei, S.-E., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299 (2017)
Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finocchio, M., Blake, A., Cook, M., Moore, R.: Real-time human pose recognition in parts from single depth images. Commun. ACM 56(1), 116–124 (2013)
Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1110–1118 (2015)
Tang, S., Li, C., Zhang, P., Tang, R.: SwinLSTM: Improving spatiotemporal prediction accuracy using Swin Transformer and LSTM. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13470–13479 (2023)
Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Wang, R., Liu, J., Ke, Q., Peng, D., Lei, Y.: DEAR-Net: learning diversities for skeleton-based early action recognition. IEEE Trans. Multimed. 25, 1175–1189 (2021)
Wang, W., Chang, F., Liu, C., Li, G., Wang, B.: GA-Net: a guidance aware network for skeleton-based early activity recognition. IEEE Trans. Multimed. 25, 1061–1073 (2021)
Xin, W., Miao, Q., Liu, Y., Liu, R., Pun, C.-M., Shi, C.: Skeleton MixFormer: Multivariate topology representation for skeleton-based action recognition. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 2211–2220 (2023)
Liu, S., Zhang, Y., Li, W., Lin, Z., Jia, J.: Video-P2P: Video editing with cross-attention control. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8599–8608 (2024)
Lin, L., Zhang, J., Liu, J.: Actionlet-dependent contrastive learning for unsupervised skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2363–2372 (2023)
Xu, H., Gao, Y., Hui, Z., Li, J., Gao, X.: Language knowledge-assisted representation learning for skeleton-based action recognition. arXiv preprint arXiv:2305.12398 (2023)
Zhou, Y., Qiang, W., Rao, A., Lin, N., Su, B., Wang, J.: Zero-shot skeleton-based action recognition via mutual information estimation and maximization. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 5302–5310 (2023)
Ho, T.K.: The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20(8), 832–844 (1998)
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)
Hjelm, R.D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., Bengio, Y.: Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670 (2018)
Shahroudy, A., Liu, J., Ng, T.-T., Wang, G.: NTU RGB+D: A large scale dataset for 3d human activity analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010–1019 (2016)
Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.-Y., Kot, A.C.: NTU RGB+D 120: A large-scale benchmark for 3d human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 42(10), 2684–2701 (2019)
Liu, J., Song, S., Liu, C., Li, Y., Hu, Y.: A benchmark dataset and comparison study for multi-modal human action analytics. ACM Trans. Multimed. Comput. Commun. Appl. 16(2), 1–24 (2020)
Wray, M., Larlus, D., Csurka, G., Damen, D.: Fine-grained action retrieval through multiple parts-of-speech embeddings. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 450–459 (2019)
Hubert Tsai, Y.-H., Huang, L.-K., Salakhutdinov, R.: Learning robust visual-semantic embeddings. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3571–3580 (2017)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Adv. Neural Inf. Process. Syst. 26 (2013)
Schonfeld, E., Ebrahimi, S., Sinha, S., Darrell, T., Akata, Z.: Generalized zero- and few-shot learning via aligned variational autoencoders. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8247–8255 (2019)
Jasani, B., Mazagonwalla, A.: Skeleton based zero shot action recognition in joint pose-language semantic space. arXiv preprint arXiv:1911.11344 (2019)
Acknowledgements
This work was supported by the National Natural Science Foundation of China, the Key Research and Development Program of Hubei Province, the Young and Middle-aged Scientific and Technological Innovation Team Plan of Hubei Higher Education Institutions, and the High-Level Talent Project of Hubei University of Technology.
Funding
This research was funded by the National Natural Science Foundation of China (Grant Nos. 62376089, 62302153, 62302154), the Key Research and Development Program of Hubei Province, China (Grant No. 2023BEB024), the Young and Middle-aged Scientific and Technological Innovation Team Plan in Higher Education Institutions in Hubei Province, China (Grant No. T2023007), and the High-Level Talent Project at Hubei University of Technology (Grant No. XJ2022010901).
Author information
Contributions
All authors jointly conducted the primary research and contributed to the manuscript writing. All authors reviewed and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Additional information
Communicated by Haojie Li.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chen, H., Guo, S. & Cheng, F. TDZS: top semantic embedding and dynamic feature matching for zero-shot skeleton action recognition. Multimedia Systems 32, 36 (2026). https://doi.org/10.1007/s00530-025-02109-5

