Point cloud scene flow plays an important role in autonomous driving, but point clouds are unordered and unevenly distributed in density, and may be affected by occlusion in real scenes, which makes it difficult to improve estimation accuracy. To address these problems, this paper proposes a scene flow estimation method based on multiscale masked learning: the input point cloud is divided into irregular point patches, random masks and tokens are embedded, and the spatial geometry of the point cloud is modeled through an asymmetric encoder-decoder architecture. In the encoding stage, the model learns high-level latent features from the unmasked point cloud; in the decoding stage, the learned latent features and mask tokens are used to reconstruct the original point cloud. In addition, a multiscale masking strategy is adopted to keep the visible regions consistent while extracting features at different scales. Experimental results on the FlyingThings3D and KITTI datasets show that the proposed method achieves significant improvements over the baseline network on all evaluation metrics and, although trained in a self-supervised manner, outperforms current mainstream fully supervised and self-supervised methods.
Abstract: [Objective] Point cloud scene flow plays an important role in autonomous driving. However, improving the accuracy of scene flow estimation is difficult because of point cloud characteristics such as disorder and uneven density distribution. Most previously reported methods were trained on synthetic datasets, because accurate scene flow labels for real point clouds are difficult and costly to acquire, and they ignored complex situations such as occlusion in real scenes. To address these problems, a self-supervised scene flow estimation method based on multiscale masked autoencoders is proposed. [Methods] The proposed model divides the input point cloud into irregular point patches, performs large-ratio random masking and token embedding, and then models the spatial geometry of the point cloud through an asymmetric encoder-decoder architecture. In the encoding stage, the mask tokens are shifted to the input of the autoencoder's decoder so that position information is not leaked prematurely; in this way, the encoder can focus on learning high-level latent features from the unmasked point cloud. In the decoding stage, the learned latent features and mask tokens are used to reconstruct the original point cloud. In addition, the model fuses detail and global context information through a pyramid architecture and adopts a multiscale masking strategy to keep the visible regions consistent during feature extraction at different scales. [Results] Experiments were conducted on the FlyingThings3D and KITTI datasets, with the model trained in a self-supervised manner. When trained on FT3Do and tested on KITTIo, the proposed method outperforms all existing self-supervised and fully supervised methods despite using only one-tenth of the training instances used by other methods. When trained on FT3Ds and tested on KITTIs, all metrics improve significantly over the baseline; in particular, the EPE metric improves by 10.5%. In addition, single-scale and multiscale masked autoencoders were added to the baseline network for ablation experiments: the single-scale architecture improves EPE by 5.3% over the baseline network, and the multiscale architecture improves it by 10.5%. [Conclusions] The ablation results show that the pyramid architecture effectively integrates multiscale information and extracts rich geometric features. Comparative experiments with other methods demonstrate the superiority of the proposed approach. By randomly masking and reconstructing the original point cloud, the multiscale masked autoencoder extracts powerful features and thereby reduces the impact of point cloud disorder and occlusion on the accuracy of scene flow estimation.
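The abstract describes the masking pipeline only at a high level. Below is a minimal PyTorch sketch of the multiscale masking idea it outlines, assuming farthest-point sampling for patch centers, k-nearest-neighbour grouping for patches, and nearest-center lookup to transfer one shared random mask across scales; the function names, scale sizes, 0.75 mask ratio, and Chamfer reconstruction loss are illustrative assumptions, not the authors' implementation.

```python
import torch

def farthest_point_sample(xyz: torch.Tensor, n_centers: int) -> torch.Tensor:
    """Pick n_centers well-spread seed points from an (N, 3) cloud."""
    N = xyz.shape[0]
    centers = torch.zeros(n_centers, dtype=torch.long)
    dist = torch.full((N,), float("inf"))
    idx = int(torch.randint(0, N, (1,)))
    for i in range(n_centers):
        centers[i] = idx
        dist = torch.minimum(dist, ((xyz - xyz[idx]) ** 2).sum(-1))
        idx = int(dist.argmax())
    return centers

def group_patches(xyz, centers, k):
    """Gather the k nearest neighbours of each center into a local patch."""
    d = torch.cdist(xyz[centers], xyz)          # (P, N) pairwise distances
    knn = d.topk(k, largest=False).indices      # (P, k) neighbour indices
    return xyz[knn]                             # (P, k, 3) point patches

def multiscale_mask(xyz, scales=(64, 128), k=16, mask_ratio=0.75):
    """Return (visible, masked) patches per scale with a shared visible region."""
    coarse = farthest_point_sample(xyz, scales[0])
    # Draw the random mask once, on the coarsest set of patch centers.
    rank = torch.rand(scales[0]).argsort()
    keep = rank >= int(mask_ratio * scales[0])   # True = visible patch
    out = []
    for n in scales:
        centers = coarse if n == scales[0] else farthest_point_sample(xyz, n)
        # Each center inherits visibility from its nearest coarse center,
        # so the visible region stays geometrically consistent across scales.
        owner = torch.cdist(xyz[centers], xyz[coarse]).argmin(-1)
        vis = keep[owner]
        patches = group_patches(xyz, centers, k)
        out.append((patches[vis], patches[~vis]))
    return out

def chamfer(pred, target):
    """Symmetric Chamfer distance, a common point cloud reconstruction loss
    (an assumption here; the paper's exact loss may differ)."""
    d = torch.cdist(pred, target)
    return d.min(1).values.mean() + d.min(0).values.mean()

if __name__ == "__main__":
    xyz = torch.randn(2048, 3)
    for vis, masked in multiscale_mask(xyz):
        print(vis.shape, masked.shape)  # visible vs. masked patches per scale
```

In a full model along the lines the abstract describes, an encoder would embed only the visible patches, and a decoder equipped with learned mask tokens would predict the masked patches, supervised by a reconstruction loss such as chamfer(pred, masked).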
Basic Information:
DOI: 10.16791/j.cnki.sjg.2025.09.005
China Classification Code: TP18; U463.6
Citation Information:
[1] 项学智, 王茜, 王路, et al. Self-supervised point cloud scene flow estimation based on multiscale masked autoencoders[J]. Experimental Technology and Management, 2025, 42(09): 27-33. DOI: 10.16791/j.cnki.sjg.2025.09.005. (in Chinese)
Fund Information:
National Natural Science Foundation of China (62271160); Key Project of Higher Education Teaching Reform Research of Heilongjiang Province (SJGZB2024054); Undergraduate Teaching Reform Research Project of Harbin Engineering University (JG2023B0803); Graduate Teaching Reform Research Project of Harbin Engineering University (JG2022Y037); Special Fund for Stable Support of Basic Research in Characteristic Disciplines of Harbin Engineering University (KYWZ220240812); Fundamental Research Funds for the Central Universities (3072024LJ0803)