International Journal of Frontiers in Engineering Technology, 2024, 6(5); doi: 10.25236/IJFET.2024.060510.

Robust Reinforcement Learning for Robotic Manipulation Using Lite ViT and Saliency Maps

Author(s)

Hao Guo1,2, Xiulai Wang1, Ningling Ma1, Yutao Zhang1

Corresponding Author:
Xiulai Wang
Affiliation(s)

1Nanjing Jinling Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China

2School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150000, China

Abstract

The advancement of reinforcement learning (RL) has revolutionized autonomous robotic manipulation, enabling robots to handle complex tasks efficiently. This study presents an RL framework that integrates a Transformer-based visual encoder, leveraging lite Vision Transformers (ViTs) and saliency maps to improve generalization across multiple robotic manipulation tasks in the OpenAI Gym environment. Our approach employs a three-channel encoder to extract and integrate visual information from multiple perspectives, bolstering visual robustness and task performance. Additionally, we incorporate unsupervised learning, data augmentation, and the Soft Actor-Critic (SAC) algorithm to ensure data efficiency during training. Experimental results demonstrate that our method achieves superior success rates, particularly on the Push task, underscoring the efficacy of saliency maps in robotic vision. This research is relevant to applications requiring robust multi-task performance with limited training data.
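The three-channel fusion of saliency-weighted views described above can be sketched, very loosely, as follows. This is an illustrative stand-in only: `saliency_map` here is a simple gradient-magnitude proxy, not the paper's learned saliency, and the real encoder uses a Lite ViT backbone rather than raw pixel weighting.

```python
import numpy as np

def saliency_map(view: np.ndarray) -> np.ndarray:
    """Toy saliency: normalized gradient magnitude of a grayscale view.
    (A hypothetical stand-in for the learned saliency in the paper.)"""
    gy, gx = np.gradient(view.astype(np.float64))
    mag = np.hypot(gx, gy)
    return mag / (mag.max() + 1e-8)

def three_channel_encode(views: list) -> np.ndarray:
    """Fuse three camera perspectives into one observation tensor:
    each channel is a view weighted by its own saliency map."""
    assert len(views) == 3, "encoder expects three perspectives"
    weighted = [v * saliency_map(v) for v in views]
    return np.stack(weighted, axis=0)  # shape: (3, H, W)

# Three hypothetical 64x64 grayscale camera views.
views = [np.random.rand(64, 64) for _ in range(3)]
obs = three_channel_encode(views)
print(obs.shape)  # (3, 64, 64)
```

In the actual pipeline, a tensor like `obs` would be passed through the Lite ViT encoder before reaching the SAC policy; the sketch only shows how per-view saliency weighting could combine multiple perspectives into a single observation.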

Keywords

robotics; reinforcement learning; vision transformer; saliency maps

Cite This Paper

Hao Guo, Xiulai Wang, Ningling Ma, Yutao Zhang. Robust Reinforcement Learning for Robotic Manipulation Using Lite ViT and Saliency Maps. International Journal of Frontiers in Engineering Technology (2024), Vol. 6, Issue 5: 67-78. https://doi.org/10.25236/IJFET.2024.060510.
