tailieunhanh - Data Preparation for Data Mining- P5

Ever since the Sumerian and Elam peoples living in the Tigris and Euphrates River basin some 5500 years ago invented data collection using dried mud tablets marked with tax records, people have been trying to understand the meaning of, and get use from, collected data. More directly, they have been trying to determine how to use the information in that data to improve their lives and achieve their objectives.

is possible for the output. Usually, the level of detail in the input streams needs to be at least one level of aggregation more detailed than the required level of detail in the output. Knowing the granularity available in the data allows the miner to assess the level of inference or prediction that the data could potentially support. It is only potential support because there are many other factors that will influence the quality of a model, but granularity is particularly important as it sets a lower bound on what is possible.

For instance, the marketing manager at FNBA is interested, in part, in the weekly variance of predicted approvals to actual approvals. To support this level of detail, the input stream requires at least daily approval information. With daily approval rates available, the miner will also be able to build inferential models when the manager wants to discover the reason for the changing trends.

There are cases where the rule of thumb does not hold, such as predicting Stock Keeping Unit (SKU) sales based on summaries from higher in the hierarchy chain. However, even when these exceptions do occur, the level of granularity still needs to be known.

Consistency

Inconsistent data can defeat any modeling technique until the inconsistency is discovered and corrected. A fundamental problem here is that different things may be represented by the same name in different systems, and the same thing may be represented by different names in different systems.
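The naming problem described above can be illustrated with a small sketch. It assumes a hand-maintained lookup table keyed by (system, field name). The system names `billing` and `service`, the field names, and the helper `to_canonical` are all hypothetical and do not come from the text. This is one possible way to make name collisions explicit, not a prescribed method.

```python
# Hypothetical mapping from (source system, field name) to one canonical name.
# All names here are invented for illustration.
CANONICAL_NAMES = {
    ("billing", "cust_no"): "customer_id",
    ("service", "account"): "customer_id",     # same thing, different name
    ("billing", "account"): "ledger_account",  # same name, different thing
}


def to_canonical(system, record):
    """Rename a record's fields to canonical names.

    Raises KeyError for any field without a mapping, so that an
    unreviewed label cannot slip into the prepared data silently.
    """
    renamed = {}
    for field, value in record.items():
        key = (system, field)
        if key not in CANONICAL_NAMES:
            raise KeyError(f"no canonical name for {system}.{field}")
        renamed[CANONICAL_NAMES[key]] = value
    return renamed


billing_row = to_canonical("billing", {"cust_no": 1001, "account": "4000-REV"})
service_row = to_canonical("service", {"account": 1001})
```

After the renaming, `account` from the hypothetical service system and `cust_no` from the billing system both land under `customer_id`, while the billing system's `account` stays separate as `ledger_account`.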
One data assay for a major metropolitan utility revealed that almost 90% of the data volume was, in fact, duplicate. However, it was highly inconsistent, and rationalization itself took a vast effort. The perspective with which a system of variables (mentioned in Chapter 2) is built has a huge effect on what is intended by the labels attached to the data. Each system is built for a specific purpose, almost certainly different from the purposes of other systems. Variable content, however labeled, is defined by the ...
