Data Preparation for Data Mining - P4

Ever since the Sumerian and Elam peoples living in the Tigris and Euphrates River basin some 5500 years ago invented data collection using dried mud tablets marked with tax records, people have been trying to understand the meaning of, and get use from, collected data. More directly, they have been trying to determine how to use the information in that data to improve their lives and achieve their objectives.

... distortion of the original signal. Somehow, a modeling tool must deal with the noise in the data. Each modeling tool has a different way of expressing the nature of the relationships that it finds between variables. But however it is expressed, some of the relationship between variables exists because of the true measurement, and some part is made up of the relationship caused by the noise. It is very hard, if not impossible, to determine precisely which part comes from the underlying measurement and which from the noise. However, in order to discover the true underlying relationship between the variables, it is vital to find some way of estimating which is relationship and which is noise.

One problem with noise is that there is no consistent, detectable pattern to it. If there were, it could be easily detected and removed. So there is an unavoidable component in the training set that should not be characterized by the modeling tool. There are ways to minimize the impact of noise that are discussed later, but there always remains some irreducible minimum. In fact, as discussed later, there are even circumstances when it is advantageous to add noise to some portion of the training set, although this deliberately added noise is very carefully constructed.

Ideally, a modeling tool will learn to characterize the underlying relationships inside the data set without learning the noise.
If, for example, the tool is learning to make predictions of the value of some variable, it should learn to predict the true value rather than some distorted value. During training, there comes a point at which the model has learned the underlying relationships as well as is possible. Anything further learned from this point will be the noise. Learning noise will make predictions from data inside the training set better. In any two subsets of data drawn from an identical source, the underlying relationship will be the same. The noise, on the other hand, not representing the underlying relationship, has a ...
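
The effect of learning noise can be illustrated with a small experiment. The sketch below is not taken from the book; it assumes a hypothetical source in which the true relationship is y = 2x + 1 and the noise is Gaussian, and it compares two illustrative models: a least-squares line, which can only capture the underlying relationship, and a nearest-neighbor "memorizer", which reproduces every training value, noise included. All function names and parameters are invented for the illustration.

```python
import random


def make_subset(n, rng, noise_sd=1.0):
    """Draw n (x, y) pairs from one assumed source: y = 2x + 1 plus Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, noise_sd)) for x in xs]


def fit_line(data):
    """Fit an ordinary least-squares line and return it as a predictor function."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(data):
    """Return a predictor that answers with the y of the closest training x.

    On the training set it returns each point's own noisy y,
    so it has learned the noise as well as the relationship.
    """
    return lambda x: min(data, key=lambda point: abs(point[0] - x))[1]


def mse(predict, data):
    """Mean squared error of a predictor over a data set."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)         # fixed seed so the run is repeatable
train = make_subset(200, rng)  # subset used for training
other = make_subset(200, rng)  # second subset from the same source

line = fit_line(train)
memo = fit_memorizer(train)

line_train, line_other = mse(line, train), mse(line, other)
memo_train, memo_other = mse(memo, train), mse(memo, other)

print(f"line      train={line_train:.3f}  other={line_other:.3f}")
print(f"memorizer train={memo_train:.3f}  other={memo_other:.3f}")
```

Under these assumptions, the memorizer scores better than the line inside the training set, because it has learned that subset's noise, while on the second subset its error is larger than the line's. The line's error stays at roughly the same level on both subsets, since the underlying relationship is the same in each.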
