TDZS: top semantic embedding and dynamic feature matching for zero-shot skeleton action recognition

  • Regular Paper
  • Published:
Multimedia Systems

Abstract

Skeleton action recognition has advanced considerably in recent years, but its integration with zero-shot learning remains relatively underexplored. Mainstream methods primarily focus on aligning textual and skeleton data, often neglecting semantic feature compensation for skeleton visual features. Semantic feature compensation uses multiple semantic features to fill the information gaps in skeleton visual representations, providing different descriptive references for visual features through rich semantic information. To address this limitation, this paper proposes the Top Semantic Embedding (TSE) framework to enrich the embedding of semantic features into visual features; through a scoring mechanism, the framework selects the best semantic-visual feature pairs to improve the model's learning performance. Additionally, to further strengthen the connection between the semantic and visual modalities, the Dynamic Feature Matching (DFM) method is proposed. By constructing a multimodal attention matrix between semantic and visual features, the DFM module enables visual features to adaptively match the most relevant semantic features, creating tighter connections between the two modalities. Experiments were conducted on three benchmark datasets: the proposed method achieves an accuracy of 82.63% on NTU-60, 77.23% on PKU-MMD, and 58.21% on NTU-120. These results demonstrate the effectiveness and strong performance of the proposed method; compared to the state-of-the-art method SMIE, it shows improvements of 4.65%, 8.08%, and 1.11% on the three datasets, respectively.
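
To make the two components concrete, the sketch below illustrates the general idea in PyTorch. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the use of cosine similarity as the TSE score, the softmax-weighted fusion of the selected pairs, and the single-head, projection-free attention in DFM are placeholders for details specified in the full paper.

```python
import torch
import torch.nn.functional as F

def top_semantic_embedding(visual, semantics, k=2):
    """TSE sketch: score every candidate semantic feature against each
    visual feature and fuse the k best-scoring semantic-visual pairs.
    Cosine similarity is assumed as the scoring mechanism.

    visual:    (B, D) skeleton visual features
    semantics: (M, D) candidate semantic features
    returns:   (B, D) fused embedding of the top-k semantics
    """
    v = F.normalize(visual, dim=-1)
    s = F.normalize(semantics, dim=-1)
    scores = v @ s.t()                        # (B, M) semantic-visual scores
    top = scores.topk(k, dim=-1)              # keep the k best pairs per sample
    weights = top.values.softmax(dim=-1)      # (B, k) normalized weights
    picked = semantics[top.indices]           # (B, k, D) selected semantics
    return (weights.unsqueeze(-1) * picked).sum(dim=1)

def dynamic_feature_matching(visual_tokens, semantic_tokens):
    """DFM sketch: a multimodal attention matrix lets each visual token
    adaptively attend to its most relevant semantic tokens.

    visual_tokens:   (B, Nv, D)
    semantic_tokens: (B, Ns, D)
    returns:         (B, Nv, D) semantics matched to each visual token
    """
    d = visual_tokens.size(-1)
    attn = visual_tokens @ semantic_tokens.transpose(-2, -1) / d ** 0.5
    attn = attn.softmax(dim=-1)               # (B, Nv, Ns) attention matrix
    return attn @ semantic_tokens

# Example: 4 samples, 5 candidate semantics, feature dim 256.
fused = top_semantic_embedding(torch.randn(4, 256), torch.randn(5, 256), k=2)
matched = dynamic_feature_matching(torch.randn(4, 20, 256), torch.randn(4, 8, 256))
```

In the actual TDZS model, learned projection layers and the paper's own scoring function would take the place of these placeholders.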


Data Availability

Data will be made available upon reasonable request.

References

  1. Yue, G., Jiao, G., Li, C., Xiang, J.: When cnn meet with vit: decision-level feature fusion for camouflaged object detection. Vis. Comput. 41(6), 3957–3972 (2025)

  2. Yue, G., Jiao, G., Xiang, J.: Semi-supervised iterative learning network for camouflaged object detection. In: ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2025). IEEE

  3. Wang, F., Jiao, G., Yue, G.: More observation leads to more clarity: Multi-view collaboration network for camouflaged object detection. Neurocomputing, 130433 (2025)

  4. Fang, B., Wu, W., Liu, C., Zhou, Y., Song, Y., Wang, W., Shu, X., Ji, X., Wang, J.: Uatvr: Uncertainty-adaptive text-video retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13723–13733 (2023)

  5. Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3d object detection and tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11784–11793 (2021)

  6. Cadena, C., Dick, A.R., Reid, I.D.: Multi-modal auto-encoders as joint estimators for robotics scene understanding. In: Robotics: Science and Systems, 5 (2016)

  7. Wang, X., Fang, Z., Li, X., Li, X., Chen, C., Liu, M.: Skeleton-in-context: Unified skeleton sequence modeling with in-context learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2436–2446 (2024)

  8. Patraucean, V., Smaira, L., Gupta, A., Recasens, A., Markeeva, L., Banarse, D., Koppula, S., Malinowski, M., Yang, Y., Doersch, C., et al.: Perception test: A diagnostic benchmark for multimodal video models. Adv. Neural Inf. Process. Syst. 36, 42748–42761 (2024)

  9. Pi, R., Yao, L., Gao, J., Zhang, J., Zhang, T.: Perceptiongpt: Effectively fusing visual perception into llm. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 27124–27133 (2024)

  10. Noh, S., Bae, K., Bae, Y., Lee, B.-D.: H\(^3\)Net: Irregular posture detection by understanding human character and core structures. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5631–5641 (2024)

  11. Cunico, F., Girella, F., Avogaro, A., Emporio, M., Giachetti, A., Cristani, M.: Oo-dmvmt: A deep multi-view multi-task classification framework for real-time 3d hand gesture classification and segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2745–2754 (2023)

  12. Li, J., Zhang, J., Schmidt, L., Ratner, A.J.: Characterizing the impacts of semi-supervised learning for weak supervision. Adv. Neural Inf. Process. Syst. 36 (2024)

  13. Zhang, Z., Wang, X., Zhang, Z., Shen, G., Shen, S., Zhu, W.: Unsupervised graph neural architecture search with disentangled self-supervision. Adv. Neural Inf. Process. Syst. 36 (2024)

  14. Guo, T., Liu, H., Chen, Z., Liu, M., Wang, T., Ding, R.: Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 762–770 (2022)

  15. Hou, W., Chen, S., Chen, S., Hong, Z., Wang, Y., Feng, X., Khan, S., Khan, F.S., You, X.: Visual-augmented dynamic semantic prototype for generative zero-shot learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23627–23637 (2024)

  16. Chen, S., Hou, W., Khan, S., Khan, F.S.: Progressive semantic-guided vision transformer for zero-shot learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23964–23974 (2024)

  17. Gupta, P., Sharma, D., Sarvadevabhatla, R.K.: Syntactically guided generative embeddings for zero-shot skeleton action recognition. In: 2021 IEEE International Conference on Image Processing (ICIP), pp. 439–443 (2021). IEEE

  18. Li, M.-Z., Jia, Z., Zhang, Z., Ma, Z., Wang, L.: Multi-semantic fusion model for generalized zero-shot skeleton-based action recognition. In: International Conference on Image and Graphics, pp. 68–80 (2023). Springer

  19. Xu, H., Gao, Y., Li, J., Gao, X.: An information compensation framework for zero-shot skeleton-based action recognition. IEEE Trans. Multimed. (2025)

  20. Zhu, A., Ke, Q., Gong, M., Bailey, J.: Part-aware unified representation of language and skeleton for zero-shot action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18761–18770 (2024)

  21. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PMLR

  22. Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084 (2019)

  23. Cao, Z., Simon, T., Wei, S.-E., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299 (2017)

  24. Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finocchio, M., Blake, A., Cook, M., Moore, R.: Real-time human pose recognition in parts from single depth images. Commun. ACM 56(1), 116–124 (2013)

  25. Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1110–1118 (2015)

  26. Tang, S., Li, C., Zhang, P., Tang, R.: Swinlstm: Improving spatiotemporal prediction accuracy using swin transformer and lstm. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13470–13479 (2023)

  27. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)

  28. Wang, R., Liu, J., Ke, Q., Peng, D., Lei, Y.: Dear-net: learning diversities for skeleton-based early action recognition. IEEE Trans. Multimed. 25, 1175–1189 (2021)

  29. Wang, W., Chang, F., Liu, C., Li, G., Wang, B.: Ga-net: a guidance aware network for skeleton-based early activity recognition. IEEE Trans. Multimed. 25, 1061–1073 (2021)

  30. Xin, W., Miao, Q., Liu, Y., Liu, R., Pun, C.-M., Shi, C.: Skeleton mixformer: Multivariate topology representation for skeleton-based action recognition. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 2211–2220 (2023)

  31. Liu, S., Zhang, Y., Li, W., Lin, Z., Jia, J.: Video-p2p: Video editing with cross-attention control. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8599–8608 (2024)

  32. Lin, L., Zhang, J., Liu, J.: Actionlet-dependent contrastive learning for unsupervised skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2363–2372 (2023)

  33. Xu, H., Gao, Y., Hui, Z., Li, J., Gao, X.: Language knowledge-assisted representation learning for skeleton-based action recognition. arXiv preprint arXiv:2305.12398 (2023)

  34. Zhou, Y., Qiang, W., Rao, A., Lin, N., Su, B., Wang, J.: Zero-shot skeleton-based action recognition via mutual information estimation and maximization. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 5302–5310 (2023)

  35. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20(8), 832–844 (1998)

  36. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)

  37. Hjelm, R.D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., Bengio, Y.: Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670 (2018)

  38. Shahroudy, A., Liu, J., Ng, T.-T., Wang, G.: Ntu rgb+d: A large scale dataset for 3d human activity analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010–1019 (2016)

  39. Liu, J., Shahroudy, A., Perez, M., Wang, G., Duan, L.-Y., Kot, A.C.: Ntu rgb+d 120: A large-scale benchmark for 3d human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 42(10), 2684–2701 (2019)

  40. Liu, J., Song, S., Liu, C., Li, Y., Hu, Y.: A benchmark dataset and comparison study for multi-modal human action analytics. ACM Trans. Multimed. Comput. Commun. Appl. 16(2), 1–24 (2020)

  41. Wray, M., Larlus, D., Csurka, G., Damen, D.: Fine-grained action retrieval through multiple parts-of-speech embeddings. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 450–459 (2019)

  42. Hubert Tsai, Y.-H., Huang, L.-K., Salakhutdinov, R.: Learning robust visual-semantic embeddings. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3571–3580 (2017)

  43. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: Devise: A deep visual-semantic embedding model. Adv. Neural Inf. Process. Syst. 26 (2013)

  44. Schonfeld, E., Ebrahimi, S., Sinha, S., Darrell, T., Akata, Z.: Generalized zero-and few-shot learning via aligned variational autoencoders. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8247–8255 (2019)

  45. Jasani, B., Mazagonwalla, A.: Skeleton based zero shot action recognition in joint pose-language semantic space. arXiv preprint arXiv:1911.11344 (2019)

Acknowledgements

This work was supported by the National Natural Science Foundation of China, the Key Research and Development Program of Hubei Province, the Young and Middle-aged Scientific and Technological Innovation Team Plan of Hubei Higher Education Institutions, and the High-Level Talent Project of Hubei University of Technology.

Funding

This research was funded by the National Natural Science Foundation of China (Grant Nos. 62376089, 62302153, 62302154), the Key Research and Development Program of Hubei Province, China (Grant No. 2023BEB024), the Young and Middle-aged Scientific and Technological Innovation Team Plan in Higher Education Institutions in Hubei Province, China (Grant No. T2023007), and the High-Level Talent Project at Hubei University of Technology (Grant No. XJ2022010901).

Author information

Authors and Affiliations

Authors

Contributions

All authors jointly conducted the primary research and contributed to the manuscript writing. All authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to Sheng Guo.

Ethics declarations

Conflict of interest

The authors declare that there are no conflicts of interest.

Additional information

Communicated by Haojie Li.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, H., Guo, S. & Cheng, F. TDZS: top semantic embedding and dynamic feature matching for zero-shot skeleton action recognition. Multimedia Systems 32, 36 (2026). https://doi.org/10.1007/s00530-025-02109-5

  • Received:

  • Accepted:

  • Published:

  • Version of record:

  • DOI: https://doi.org/10.1007/s00530-025-02109-5

Keywords