Data Preparation for Data Mining - P14

Data Preparation for Data Mining - P14: Ever since the Sumerian and Elam peoples living in the Tigris and Euphrates River basin some 5500 years ago invented data collection using dried mud tablets marked with tax records, people have been trying to understand the meaning of, and get use from, collected data. More directly, they have been trying to determine how to use the information in that data to improve their lives and achieve their objectives.

... data. Problem 3: High variance or noise obscures the underlying relationship between input and output.

Turning first to the reason "The data set simply does not contain sufficient information to define the relationship to the accuracy required": this is not essentially a problem with the data sets, input and output. It may be a problem for the miner, but if sufficient data exists to form a multivariably representative sample, there is nothing that can be done to fix such data. If the data on hand simply does not define the relationship as needed, the only possible answer is to get other data that does.

A miner always needs to keep clearly in mind that the solution to a problem lies in the problem domain, not in the data. In other words, a business may need more profit, more customers, less overhead, or some other business solution. The business does not need a better model, except as a means to an end. There is no reason to think that the answer has to be wrung from the data at hand. If the answer isn't there, look elsewhere. The survey helps the miner produce the best possible model from the data that is on hand, and to know how good a model is possible from that data before modeling starts.

But perhaps there are problems with the data itself. Possible problems mainly stem from three sources: one, the relationship between input and output is very complex; two, data describing some part of the range of the relationship is sparse; three, variance is very high, leading to poor definition of the manifold.
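As a rough illustration of the second and third sources (sparse subranges and high variance), the sketch below splits one input variable into equal-width bins and reports the instance count and output variance in each. This is not the book's survey procedure. The function name `bin_diagnostics`, the equal-width binning, and the `min_count` threshold are all assumptions made for the example, and within-bin variance mixes genuine noise with the trend inside the bin, so it is only a crude indicator.

```python
import statistics


def bin_diagnostics(xs, ys, n_bins=5, min_count=10):
    """Hypothetical helper: per equal-width bin of the input, report the
    instance count, the population variance of the output, and a flag
    marking bins with fewer than `min_count` instances as sparse."""
    lo, hi = min(xs), max(xs)
    # Fall back to a width of 1.0 when every input value is identical.
    width = (hi - lo) / n_bins or 1.0
    bins = [[] for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        # The maximum input value is clamped into the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        bins[idx].append(y)

    report = []
    for i, outputs in enumerate(bins):
        # Variance is undefined for bins with fewer than two instances.
        variance = statistics.pvariance(outputs) if len(outputs) > 1 else None
        report.append({
            "range": (lo + i * width, lo + (i + 1) * width),
            "count": len(outputs),
            "variance": variance,
            "sparse": len(outputs) < min_count,
        })
    return report


# Illustrative data (made up): dense coverage of [0, 1), almost nothing above it.
xs = [i / 100 for i in range(100)] + [4.2, 4.6, 5.0]
ys = [2 * x for x in xs]

report = bin_diagnostics(xs, ys, n_bins=5, min_count=10)
for row in report:
    print(row)
```

On this made-up data, only the first bin is well populated; the rest are flagged as sparse, which is the kind of subrange where more data, or other data, would be needed.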
The information-analytic part of the survey will point to parts of the multivariable manifold, to variables, and/or subranges of variables where entropy (uncertainty) is high, but does not identify the exact problem in that area. Remedying and alleviating the three basic problems has been thoroughly discussed throughout the previous chapters. For example, if sparsity of some particular system state is a problem, Chapter 10, in part, discusses ways of multiplying or enhancing particular features of
