Abstract
Skeleton action recognition has advanced considerably in recent years, but its integration with zero-shot learning remains relatively underexplored. Mainstream methods focus primarily on aligning textual and skeleton data, often neglecting semantic feature compensation for skeleton visual features. Semantic feature compensation uses multiple semantic features to fill the information gaps in skeleton visual data, providing different descriptive references for visual features through rich semantic information. To address this limitation, this paper proposes the Top Semantic Embedding (TSE) framework, which enriches the embedding of semantic features into visual features: through a scoring mechanism, the framework selects the best semantic-visual feature pairs to improve the model's learning performance. To further strengthen the connection between the semantic and visual modalities, a Dynamic Feature Matching (DFM) method is also proposed. By constructing a multimodal attention matrix between semantic and visual features, the DFM module enables visual features to adaptively match the most relevant semantic features, creating tighter connections between the two modalities. Experiments on three benchmark datasets show that the proposed method achieves accuracies of 82.63% on NTU-60, 77.23% on PKU-MMD, and 58.21% on NTU-120, improving on the state-of-the-art method SMIE by 4.65%, 8.08%, and 1.11%, respectively.
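To make the two mechanisms concrete, the PyTorch sketch below illustrates the general form of the ideas the abstract describes: a TSE-style scoring step that keeps the best semantic-visual feature pairs, and a DFM-style attention matrix that lets each visual feature adaptively attend to the most relevant semantic features. This is a minimal sketch under assumed shapes and names (top_semantic_embedding, dynamic_feature_matching, cosine scoring, and the residual fusion are all assumptions of this illustration), not the authors' implementation, which the paper itself defines.

# Minimal sketch of the two ideas described in the abstract.
# All function names, shapes, and fusion choices here are illustrative
# assumptions, not the authors' TSE/DFM implementation.
import torch
import torch.nn.functional as F


def top_semantic_embedding(visual, semantics, k=2):
    """TSE-style selection (sketch): score every semantic-visual pair by
    cosine similarity and keep the top-k semantic features per sample.

    visual:    (B, D) skeleton visual features
    semantics: (S, D) candidate semantic (text) features
    returns:   (B, k, D) the k best-matching semantic features per sample
    """
    scores = F.normalize(visual, dim=-1) @ F.normalize(semantics, dim=-1).T  # (B, S)
    topk = scores.topk(k, dim=-1).indices                                    # (B, k)
    return semantics[topk]                                                   # (B, k, D)


def dynamic_feature_matching(visual, semantics):
    """DFM-style matching (sketch): build a multimodal attention matrix
    between visual queries and semantic keys, then aggregate the semantic
    features so each visual feature adaptively attends to its most
    relevant semantics.

    visual: (B, D), semantics: (S, D); returns (B, D) compensated features
    """
    attn = torch.softmax(visual @ semantics.T / semantics.shape[-1] ** 0.5, dim=-1)  # (B, S)
    compensated = attn @ semantics                                                   # (B, D)
    return visual + compensated  # residual fusion: an assumption of this sketch


if __name__ == "__main__":
    B, S, D = 4, 10, 256
    vis, sem = torch.randn(B, D), torch.randn(S, D)
    print(top_semantic_embedding(vis, sem).shape)    # torch.Size([4, 2, 256])
    print(dynamic_feature_matching(vis, sem).shape)  # torch.Size([4, 256])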








Data Availability
Data will be made available on reasonable request.
References
Yue, G., Jiao, G., Li, C., Xiang, J.: When CNN meet with ViT: decision-level feature fusion for camouflaged object detection. Vis. Comput. 41(6), 3957–3972 (2025)
Yue, G., Jiao, G., Xiang, J.: Semi-supervised iterative learning network for camouflaged object detection. In: ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2025). IEEE
Wang, F., Jiao, G., Yue, G.: More observation leads to more clarity: Multi-view collaboration network for camouflaged object detection. Neurocomputing, 130433 (2025)
Fang, B., Wu, W., Liu, C., Zhou, Y., Song, Y., Wang, W., Shu, X., Ji, X., Wang, J.: UATVR: Uncertainty-adaptive text-video retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13723–13733 (2023)
Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3d object detection and tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11784–11793 (2021)
Cadena, C., Dick, A.R., Reid, I.D.: Multi-modal auto-encoders as joint estimators for robotics scene understanding. In: Robotics: Science and Systems, 5 (2016)
Wang, X., Fang, Z., Li, X., Li, X., Chen, C., Liu, M.: Skeleton-in-context: Unified skeleton sequence modeling with in-context learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2436–2446 (2024)
Patraucean, V., Smaira, L., Gupta, A., Recasens, A., Markeeva, L., Banarse, D., Koppula, S., Malinowski, M., Yang, Y., Doersch, C., et al.: Perception test: A diagnostic benchmark for multimodal video models. Adv. Neural Inf. Process. Syst. 36, 42748–42761 (2024)
Pi, R., Yao, L., Gao, J., Zhang, J., Zhang, T.: PerceptionGPT: Effectively fusing visual perception into LLM. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 27124–27133 (2024)
Noh, S., Bae, K., Bae, Y., Lee, B.-D.: H\(^3\)Net: Irregular posture detection by understanding human character and core structures. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5631–5641 (2024)
Cunico, F., Girella, F., Avogaro, A., Emporio, M., Giachetti, A., Cristani, M.: OO-dMVMT: A deep multi-view multi-task classification framework for real-time 3d hand gesture classification and segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2745–2754 (2023)
Li, J., Zhang, J., Schmidt, L., Ratner, A.J.: Characterizing the impacts of semi-supervised learning for weak supervision. Adv. Neural Inf. Process. Syst. 36 (2024)
Zhang, Z., Wang, X., Zhang, Z., Shen, G., Shen, S., Zhu, W.: Unsupervised graph neural architecture search with disentangled self-supervision. Adv. Neural Inf. Process. Syst. 36 (2024)
Guo, T., Liu, H., Chen, Z., Liu, M., Wang, T., Ding, R.: Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 762–770 (2022)
Hou, W., Chen, S., Chen, S., Hong, Z., Wang, Y., Feng, X., Khan, S., Khan, F.S., You, X.: Visual-augmented dynamic semantic prototype for generative zero-shot learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23627–23637 (2024)
Chen, S., Hou, W., Khan, S., Khan, F.S.: Progressive semantic-guided vision transformer for zero-shot learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23964–23974 (2024)
Gupta, P., Sharma, D., Sarvadevabhatla, R.K.: Syntactically guided generative embeddings for zero-shot skeleton action recognition. In: 2021 IEEE International Conference on Image Processing (ICIP), pp. 439–443 (2021). IEEE
Li, M.-Z., Jia, Z., Zhang, Z., Ma, Z., Wang, L.: Multi-semantic fusion model for generalized zero-shot skeleton-based action recognition. In: International Conference on Image and Graphics, pp. 68–80 (2023). Springer
Xu, H., Gao, Y., Li, J., Gao, X.: An information compensation framework for zero-shot skeleton-based action recognition. IEEE Trans. Multimed. (2025)
Zhu, A., Ke, Q., Gong, M., Bailey, J.: Part-aware unified representation of language and skeleton for zero-shot action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18761–18770 (2024)
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR
Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084 (2019)
Cao, Z., Simon, T., Wei, S.-E., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299 (2017)
Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finocchio, M., Blake, A., Cook, M., Moore, R.: Real-time human pose recognition in parts from single depth images. Commun. ACM 56(1), 116–124 (2013)
Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1110–1118 (2015)
Tang, S., Li, C., Zhang, P., Tang, R.: SwinLSTM: Improving spatiotemporal prediction accuracy using Swin Transformer and LSTM. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13470–13479 (2023)
Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Wang, R., Liu, J., Ke, Q., Peng, D., Lei, Y.: DEAR-Net: learning diversities for skeleton-based early action recognition. IEEE Trans. Multimed. 25, 1175–1189 (2021)
Wang, W., Chang, F., Liu, C., Li, G., Wang, B.: GA-Net: a guidance aware network for skeleton-based early activity recognition. IEEE Trans. Multimed. 25, 1061–1073 (2021)
Xin, W., Miao, Q., Liu, Y., Liu, R., Pun, C.-M., Shi, C.: Skeleton MixFormer: Multivariate topology representation for skeleton-based action recognition. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 2211–2220 (2023)
Liu, S., Zhang, Y., Li, W., Lin, Z., Jia, J.: Video-P2P: Video editing with cross-attention control. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8599–8608 (2024)
Lin, L., Zhang, J., Liu, J.: Actionlet-dependent contrastive learning for unsupervised skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2363–2372 (2023)
Xu, H., Gao, Y., Hui, Z., Li, J., Gao, X.: Language knowledge-assisted representation learning for skeleton-based action recognition. arXiv preprint arXiv:2305.12398 (2023)
Zhou, Y., Qiang, W., Rao, A., Lin, N., Su, B., Wang, J.: Zero-shot skeleton-based action recognition via mutual information estimation and maximization. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 5302–5310 (2023)
Ho, T.K.: The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20(8), 832–844 (1998)
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)
Hjelm, R.D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., Bengio, Y.: Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670 (2018)
Shahroudy, A., Liu, J., Ng, T.-T., Wang, G.: NTU RGB+D: A large scale dataset for 3d human activity analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010–1019 (2016)
Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.-Y., Kot, A.C.: NTU RGB+D 120: A large-scale benchmark for 3d human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 42(10), 2684–2701 (2019)
Liu, J., Song, S., Liu, C., Li, Y., Hu, Y.: A benchmark dataset and comparison study for multi-modal human action analytics. ACM Trans. Multimed. Comput. Commun. Appl. 16(2), 1–24 (2020)
Wray, M., Larlus, D., Csurka, G., Damen, D.: Fine-grained action retrieval through multiple parts-of-speech embeddings. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 450–459 (2019)
Hubert Tsai, Y.-H., Huang, L.-K., Salakhutdinov, R.: Learning robust visual-semantic embeddings. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3571–3580 (2017)
Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: A deep visual-semantic embedding model. Adv. Neural Inf. Process. Syst. 26 (2013)
Schonfeld, E., Ebrahimi, S., Sinha, S., Darrell, T., Akata, Z.: Generalized zero- and few-shot learning via aligned variational autoencoders. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8247–8255 (2019)
Jasani, B., Mazagonwalla, A.: Skeleton based zero shot action recognition in joint pose-language semantic space. arXiv preprint arXiv:1911.11344 (2019)
Acknowledgements
This work was supported by the National Natural Science Foundation of China, the Key Research and Development Program of Hubei Province, the Young and Middle-aged Scientific and Technological Innovation Team Plan of Hubei Higher Education Institutions, and the High-Level Talent Project of Hubei University of Technology.
Funding
This research was funded by the National Natural Science Foundation of China (Grant Nos. 62376089, 62302153, 62302154), the Key Research and Development Program of Hubei Province, China (Grant No. 2023BEB024), the Young and Middle-aged Scientific and Technological Innovation Team Plan in Higher Education Institutions in Hubei Province, China (Grant No. T2023007), and the High-Level Talent Project at Hubei University of Technology (Grant No. XJ2022010901).
Author information
Contributions
All authors jointly conducted the primary research and contributed to the manuscript writing. All authors reviewed and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest.
Additional information
Communicated by Haojie Li.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chen, H., Guo, S. & Cheng, F. TDZS: top semantic embedding and dynamic feature matching for zero-shot skeleton action recognition. Multimedia Systems 32, 36 (2026). https://doi.org/10.1007/s00530-025-02109-5

