tailieunhanh - Data For Marketing Risk And Customer Relationship Management_3

Reference document 'Data For Marketing Risk And Customer Relationship Management_3', filed under business and marketing, business administration, for effective study, research, and work.

Cleaning the Data

I now have the complete data set for modeling. The next step is to examine the data for errors, outliers, and missing values. This is the most time-consuming, least exciting, and most important step in the data preparation process. Luckily, there are some effective techniques for managing this process. First, I describe some techniques for cleaning and repairing data for continuous variables. Then I repeat the process for categorical variables.

Continuous Variables

To perform data hygiene on continuous variables, PROC UNIVARIATE is a useful procedure. It provides a great deal of information about the distribution of the variable, including measures of central tendency, measures of spread, and the skewness, or degree of imbalance, of the data. For example, the following code will produce the output for examining the variable estimated income (inc_est).

    proc univariate data=... plot;
      weight smp_wgt;
      var inc_est;
    run;

The results from PROC UNIVARIATE for estimated income (inc_est) are shown in Figure [...]. The values are in thousands of dollars. There is a lot of information in this univariate analysis; I just look for a few key things. Notice the measures in bold. In the moments section, the mean seems reasonable at [...]. But looking a little further, I detect some data issues. Notice that the highest value in the extreme values is 660.

In Figure [...], the histogram and box plot provide a good visual analysis of the overall distribution and the extreme value, and give me another view of this one value. In the histogram, the bulk of the observations are near the bottom of the graph, with the single high value near the top. The box plot also shows the limited range for the bulk of the data. The box area represents the central 50% of the data. The distance to the extreme value is very apparent. This point may be considered an outlier.

Outliers and Data Errors

An outlier is a single or low-frequency occurrence of the value of a variable that is far from the mean
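
As a rough illustration outside SAS, the checks described in this section (a weighted mean, the central 50% box, and the extreme values) can be sketched in Python with only the standard library. This is not a reproduction of the PROC UNIVARIATE output. The income values and weights are made up, and only the extreme value of 660 comes from the text. The function names are hypothetical. The 1.5 x IQR fence is an assumed rule of thumb for "far from the bulk of the data", not a rule stated here. Only the mean is weighted in this sketch; the quartiles are computed on the unweighted values.

```python
import statistics


def weighted_mean(values, weights):
    """Mean of values where observation i counts weights[i] times."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def box_summary(values):
    """Quartiles and interquartile range; a box plot's box spans q1 to q3."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


def flag_outliers(values, k=1.5):
    """Values outside [q1 - k*iqr, q3 + k*iqr] (assumed rule, not from the text)."""
    box = box_summary(values)
    low = box["q1"] - k * box["iqr"]
    high = box["q3"] + k * box["iqr"]
    return [v for v in values if v < low or v > high]


# Hypothetical estimated incomes in thousands of dollars; only 660 is from the text.
inc_est = [32, 41, 45, 48, 52, 55, 58, 61, 67, 660]
# Hypothetical sampling weights, standing in for smp_wgt.
smp_wgt = [10, 10, 10, 10, 10, 1, 1, 1, 1, 1]

print("weighted mean:", weighted_mean(inc_est, smp_wgt))
print("box summary:", box_summary(inc_est))
print("highest values:", sorted(inc_est)[-3:])
print("flagged:", flag_outliers(inc_est))
```

With these made-up numbers, only the value 660 falls outside the fences. A flagged value still needs judgment: it may be a data error to repair or a legitimate but rare observation.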
