Dependability

Analysis of UMAP, the method for reducing the dimensionality of initial data in machine learning for the purpose of failure prediction in a motive power service

https://doi.org/10.21683/1729-2646-2022-22-4-53-62

Abstract

Aim. Feature transformation is a stage of machine learning application that significantly affects the quality of regression models. The paper aims to develop criteria for evaluating the quality of dimensionality reduction at the feature transformation stage and to adapt the UMAP method to the problem of predicting the number of days to failure for locomotives of JSC RZD.
Methods. Data transformation methods are divided into two groups: those that attempt to preserve the global data structure and those that attempt to preserve the distances between points. The paper examines in detail UMAP, a non-linear dimensionality reduction method whose low-dimensional data representation is based on transforming a nearest neighbour graph while retaining the data structure. The structure of the initial data manifold is examined using topological data analysis and fuzzy simplicial set construction methods.
Results. The analysis of UMAP theory, conducted in the Russian language for the first time, made it possible to identify, with justification, the three primary parameters of the method whose variation significantly affects the form of the data obtained from the transformation. In particular, this concerns the quality of class separation in two-dimensional space. In addition, the characteristics of the input set of parameters that affect UMAP results were identified. Practical results of applying UMAP were demonstrated. Intermediate results included a list of nearest neighbours and a weighted nearest neighbour graph. The fundamental result is a two-dimensional representation of the data (reduced from 44 initial dimensions) with class separation, which is confirmed both by calculation and visually.
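The two intermediate results named above, a nearest neighbour list and a weighted nearest neighbour graph, can be illustrated with a minimal sketch. It assumes the smooth nearest-neighbour distance normalisation and the fuzzy union described in the UMAP paper by McInnes et al.; it is not the authors' implementation. The data are random 44-dimensional points, and the names `K`, `nearest_neighbours`, `directed_weights` and `symmetrise` are hypothetical.

```python
import math
import random

K = 5  # number of nearest neighbours (illustrative value, an assumption)

def nearest_neighbours(points, k):
    """For each point, return its k nearest neighbours as (distance, index) pairs."""
    rows = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        rows.append(dists[:k])
    return rows

def directed_weights(rows, k, n_iter=64):
    """Directed edge weights exp(-max(0, d - rho_i) / sigma_i).

    rho_i is the distance to the nearest neighbour of point i; sigma_i is found
    by bisection so that the weights of point i sum to log2(k).
    """
    target = math.log2(k)
    weights = {}
    for i, row in enumerate(rows):
        rho = row[0][0]

        def total(sigma):
            return sum(math.exp(-max(0.0, d - rho) / sigma) for d, _ in row)

        lo, hi = 0.0, 1.0
        while total(hi) < target:  # widen the bracket until it contains sigma_i
            hi *= 2.0
        for _ in range(n_iter):    # bisection on sigma
            mid = (lo + hi) / 2.0
            if total(mid) < target:
                lo = mid
            else:
                hi = mid
        for d, j in row:
            weights[(i, j)] = math.exp(-max(0.0, d - rho) / hi)
    return weights

def symmetrise(weights):
    """Fuzzy union of the two directed weights: a + b - a * b."""
    graph = {}
    for (i, j), a in weights.items():
        b = weights.get((j, i), 0.0)
        graph[(min(i, j), max(i, j))] = a + b - a * b
    return graph

random.seed(0)
# Synthetic stand-in for the data set: 30 points with 44 features each.
points = [[random.random() for _ in range(44)] for _ in range(30)]

neighbours = nearest_neighbours(points, K)  # list of nearest neighbours
directed = directed_weights(neighbours, K)  # directed weighted graph
graph = symmetrise(directed)                # undirected weighted graph
```

In this sketch the nearest neighbour of every point receives weight 1, so no point is left disconnected from the graph; the low-dimensional layout step that follows in UMAP is not shown.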
Conclusions. UMAP was found to be an efficient and well-founded method of dimensionality reduction. By varying its parameters, data can be transformed so as to improve the quality of the data submitted to machine learning models according to the criterion of "evident class separation". The transformation is an intermediate stage of data preparation for applying a regression model, and class separation was performed in order to eliminate the possibility of gross regression errors.
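The abstract does not state which calculation is used to confirm class separation. As an illustration only, one common choice is the mean silhouette coefficient, sketched below for labelled two-dimensional points. The function name `silhouette` and the toy coordinates are assumptions, not data from the paper.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient; values close to 1 indicate well-separated classes."""
    scores = []
    for i, p in enumerate(points):
        # Group distances from point i to every other point by class label.
        by_class = {}
        for j, q in enumerate(points):
            if j != i:
                by_class.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_class.get(labels[i])
        others = [sum(d) / len(d) for c, d in by_class.items() if c != labels[i]]
        if not own or not others:
            scores.append(0.0)  # convention for a singleton class or a single class
            continue
        a = sum(own) / len(own)  # mean distance within the point's own class
        b = min(others)          # mean distance to the nearest other class
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy 2D embedding: two compact groups far apart.
embedding = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

separated = silhouette(embedding, [0, 0, 0, 1, 1, 1])  # labels match the groups
mixed = silhouette(embedding, [0, 1, 0, 1, 0, 1])      # labels cut across the groups
```

Here `separated` is noticeably higher than `mixed`, which is the kind of numerical check that could accompany a visual inspection of the two-dimensional plot.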

About the Authors

O. B. Pronevich
JSC NIIAS
Russian Federation

Olga B. Pronevich, Candidate of Engineering, Project Manager, Unit for Problem Definition, Deployment and Support of System-Level Designs, Division for Risk Management of Complex Technical Systems 

Moscow



A. P. Klokova
JSC NIIAS
Russian Federation

Anna P. Klokova, Postgraduate Student, Russian University of Transport RUT (MIIT), Specialist, Division for Risk Management of Complex Technical Systems 

Moscow



For citations:


Pronevich O.B., Klokova A.P. Analysis of UMAP, the method for reducing the dimensionality of initial data in machine learning for the purpose of failure prediction in a motive power service. Dependability. 2022;22(4):53-62. (In Russ.) https://doi.org/10.21683/1729-2646-2022-22-4-53-62

This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1729-2646 (Print)
ISSN 2500-3909 (Online)