tailieunhanh - Data Preparation for Data Mining- P13

Ever since the Sumerian and Elam peoples living in the Tigris and Euphrates River basin some 5,500 years ago invented data collection using dried mud tablets marked with tax records, people have been trying to understand the meaning of, and get use from, collected data. More directly, they have been trying to determine how to use the information in that data to improve their lives and achieve their objectives.

... system that can affect the outcome. The more of them there are, the more likely it is that, purely by happenstance, some particular but actually meaningless pattern will show up. The number of variables in a data set, or the number of weights in a neural network, all represent things that can change. So, yet again, high-dimensionality problems turn up, this time expressed as degrees of freedom. Fortunately, for the purposes of data preparation, a definition of degrees of freedom is not needed since, in any case, this is a problem previously encountered in many guises. Much discussion, particularly in this chapter, has been about reducing the dimensionality/combinatorial explosion problem (which is degrees of freedom in disguise) by reducing dimensionality.

Nonetheless, a data set always has some dimensionality, for if it does not, there is no data set! And having some particular dimensionality, or number of degrees of freedom, implies some particular chance that spurious patterns will turn up. It also has implications about how much data is needed to ensure that any spurious patterns are swamped by valid, real-world patterns. The difficulty is that the calculations are not exact because several needed measures, such as the number of significant system states, while definable in theory, seem impossible to pin down in practice. Also, each modeling tool introduces its own degrees of freedom (weights in a neural network, for example), which may be unknown to the miner.
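The claim that more degrees of freedom make meaningless patterns more likely can be illustrated with a small experiment. The following is a minimal sketch, not a method from the text: it assumes Pearson correlation as the measure of "pattern", uses purely random Gaussian noise for both the variables and the target, and the function names, the seed, and the sample sizes are all hypothetical choices made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_spurious_correlation(n_variables, n_instances, seed=0):
    """Largest |correlation| between a noise target and any noise variable.

    Every value is random, so any correlation found is spurious by
    construction. With a fixed seed the variables are drawn in the same
    order, so a run with more variables extends a run with fewer.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0.0, 1.0) for _ in range(n_instances)]
    best = 0.0
    for _ in range(n_variables):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_instances)]
        best = max(best, abs(pearson(noise, target)))
    return best


if __name__ == "__main__":
    # Same 30 instances, increasing numbers of meaningless variables.
    for n_variables in (2, 20, 200):
        strength = strongest_spurious_correlation(n_variables, n_instances=30)
        print(f"{n_variables:>4} variables: strongest spurious |r| = {strength:.3f}")
```

Because the larger runs contain the smaller ones as a prefix, the strongest spurious correlation can only stay the same or grow as variables are added, which is the effect the passage describes. Raising `n_instances` while holding the number of variables fixed would be the natural way to explore how more data tends to swamp such patterns.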
The ideal, if the miner has access to software that can make the measurements (such as data surveying software), requires use of a multivariable sample determined to be representative to a suitable degree of confidence. Failing that, as a rule of thumb for the minimum amount of data to accept for mining (as opposed to data preparation), use at least twice the number of instances required for a data preparation representative sample. The key is to have enough representative instances of data to swamp the spurious patterns. Each ...
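The rule of thumb above amounts to a one-line calculation. The sketch below assumes the size of the data preparation representative sample has already been determined by some other means; the function name and the figure of 5,000 instances are hypothetical and chosen only to show the arithmetic.

```python
def minimum_mining_instances(prep_sample_size, factor=2):
    """Rule-of-thumb floor on instances to accept for mining.

    prep_sample_size: instances needed for a representative data
    preparation sample (assumed to be known already).
    factor: the multiplier; the text suggests "at least twice".
    """
    if prep_sample_size <= 0:
        raise ValueError("prep_sample_size must be positive")
    return factor * prep_sample_size


if __name__ == "__main__":
    # Hypothetical example: preparation needed 5,000 representative instances.
    print(minimum_mining_instances(5000))
```

The result is a lower bound only: the text says "at least twice", so a data set larger than this is acceptable, and the bound says nothing about whether the instances are themselves representative.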
