How to Handle Missing Values
- Understand the reason why data goes missing
- Missing at Random (MAR)
- Missing at random means that the propensity for a data point to be missing is unrelated to the missing value itself, but is related to some of the observed data
- Missing Completely at Random (MCAR)
- The fact that a certain value is missing has nothing to do with its hypothetical value or with the values of other variables
- Missing not at Random (MNAR)
- Two possible reasons
- the missing value depends on the hypothetical value
- e.g. People with high salaries generally do not want to reveal their incomes in surveys
- missing value is dependent on some other variable’s value
- e.g. Assume that females generally don’t want to reveal their ages; here the missing values in the age variable are influenced by the gender variable
- Handling Missing Data
- Deletion
- Deleting Rows (Listwise Deletion)
- Can produce biased parameters and estimates when the data are not missing completely at random
- Pairwise Deletion
- Keeps more of the data than listwise deletion, which increases statistical power
- end up with different numbers of observations contributing to different parts of your model, which can make interpretation difficult
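A minimal pandas sketch of the two deletion strategies on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with scattered missing values.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50_000, 62_000, np.nan, 48_000],
})

# Listwise deletion: drop every row that has at least one missing value.
listwise = df.dropna()

# Pairwise deletion: each statistic uses every row that is complete for
# the variables involved, so sample sizes differ across the analysis.
pairwise_corr = df.corr()  # pandas excludes NaNs pairwise by default
```

Note how `listwise` keeps only the two fully observed rows, while the pairwise correlation still uses all rows available for each pair of columns.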
- Deleting Columns
- Not Recommended
- Imputation is always a preferred choice over dropping variables
- Imputation
- Time-series Problem
- Data without Trend & without Seasonality
- Mean, Median and Mode
- Random Sample Imputation
- Data with Trend & without Seasonality
- Linear Interpolation
- Data with Trend & Seasonality
- Seasonal Adjustment + Interpolation
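Random sample imputation fills each gap with a value drawn at random from the observed values, preserving the original distribution. A minimal pandas sketch on a toy series assumed to have no trend or seasonality:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series([3.0, np.nan, 5.0, 4.0, np.nan, 3.5])

# Draw replacement values from the observed (non-missing) values.
observed = s.dropna().to_numpy()
filled = s.copy()
filled[filled.isna()] = rng.choice(observed, size=filled.isna().sum())
```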
- Specific Methods
- Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB)
- This is a common statistical approach to the analysis of longitudinal repeated-measures data where some follow-up observations may be missing.
- Longitudinal data track the same sample at different points in time.
- Both methods can introduce bias into the analysis and perform poorly when the data has a visible trend
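In pandas, LOCF and NOCB correspond directly to forward fill and backward fill:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

locf = s.ffill()  # Last Observation Carried Forward
nocb = s.bfill()  # Next Observation Carried Backward
```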
- Linear Interpolation
- This method works well for a time series with some trend but is not suitable for seasonal data
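For example, on a toy upward-trending series:

```python
import numpy as np
import pandas as pd

# Trending series with gaps; interpolation fills them along the trend line.
s = pd.Series([10.0, np.nan, 14.0, np.nan, np.nan, 20.0])
filled = s.interpolate(method="linear")
```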
- Seasonal Adjustment + Linear Interpolation
- This method works well for data with both trend and seasonality
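A minimal sketch of the idea, assuming the seasonal period is known and estimating the seasonal component with per-position means (a simplification of full seasonal decomposition):

```python
import numpy as np
import pandas as pd

period = 4  # assumed known seasonal period
s = pd.Series([10, 20, 30, 40, 12, np.nan, 32, 42, 14, 24, np.nan, 44],
              dtype=float)

# 1. Estimate the seasonal component as the mean at each position in the cycle.
pos = pd.Series(np.arange(len(s)) % period)
seasonal = pos.map(s.groupby(pos).mean())

# 2. Remove seasonality, interpolate the remainder linearly, re-add seasonality.
deseasonalized = s - seasonal
filled = deseasonalized.interpolate(method="linear") + seasonal
```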
- General Problem
- Categorical
- Make NA as level
- Missing values can be treated as a separate category by itself
- We can create another category for the missing values and use them as a different level
- This is the simplest method
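In pandas this is a one-liner; the label "Missing" is an arbitrary choice for the new level:

```python
import pandas as pd

color = pd.Series(["red", None, "blue", None, "red"])
# Treat missing values as a category of their own.
color_filled = color.fillna("Missing").astype("category")
```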
- Multiple Imputation
- Logistic Regression
- We create a predictive model to estimate values that will substitute for the missing data.
- In this case, we divide our data set into two sets:
- One set with no missing values for the variable (training)
- Another one with missing values (test).
- We can use methods like logistic regression and ANOVA for prediction
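The split-train-predict procedure can be sketched with scikit-learn on synthetic data (the column names `age`, `hours`, and `employed` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.uniform(18, 70, n),
    "hours": rng.uniform(0, 60, n),
})
# Hypothetical binary variable with some entries knocked out.
df["employed"] = (df["hours"] > 20).astype(float)
df.loc[rng.choice(n, 30, replace=False), "employed"] = np.nan

# "Training" set: rows where the variable is observed;
# "test" set: rows where it is missing.
train = df[df["employed"].notna()]
missing = df[df["employed"].isna()]

model = LogisticRegression().fit(train[["age", "hours"]], train["employed"])
df.loc[missing.index, "employed"] = model.predict(missing[["age", "hours"]])
```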
- Continuous
- Mean, Median, and Mode
- Multiple imputation
- Linear Regression
- Method
- Mean, Median, and Mode
- basic imputation method
- It takes no advantage of the time-series characteristics or of relationships between the variables
- It is very fast, but has clear disadvantages
- One disadvantage is that mean imputation reduces variance in the dataset
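For example, with pandas on a toy series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 8.0])

mean_filled = s.fillna(s.mean())      # observed mean is 3.75
median_filled = s.fillna(s.median())  # observed median is 3.0

# Mean imputation shrinks the variance of the data set --
# the drawback noted above.
```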
- Linear Regression
- To begin, several predictors of the variable with missing values are identified using a correlation matrix. The best predictors are selected and used as independent variables in a regression equation. The variable with missing data is used as the dependent variable.
- Cases with complete data for the predictor variables are used to generate the regression equation; the equation is then used to predict missing values for incomplete cases.
- In an iterative process, values for the missing variable are inserted and then all cases are used to predict the dependent variable.
- These steps are repeated until there is little difference between the predicted values from one step to the next, that is they converge.
- It “theoretically” provides good estimates for missing values.
- However, there are several disadvantages of this model which tend to outweigh the advantages.
- First, because the replaced values were predicted from other variables, they tend to fit together “too well”, so the standard error is deflated.
- One must also assume that there is a linear relationship between the variables used in the regression equation when there may not be one.
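scikit-learn's `IterativeImputer` implements this regress-and-iterate idea (it is still flagged experimental and must be explicitly enabled); a sketch on toy data where the second column is roughly twice the first:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, np.nan],
    [4.0, 7.9],
    [np.nan, 10.0],
])

# Regresses each incomplete column on the others, re-predicts the missing
# entries, and repeats until the predictions converge (or max_iter is hit).
imputer = IterativeImputer(estimator=LinearRegression(),
                           max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```

Since the columns are nearly linearly related, the imputed entries land close to the regression line (around 6 and 5 here).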
- Multiple Imputation
- Imputation
- Impute the missing entries of the incomplete data set m times
- Note that imputed values are drawn from a distribution
- Simulating random draws alone does not capture the uncertainty in the model parameters; a better approach is Markov Chain Monte Carlo (MCMC) simulation
- This step results in m complete data sets
- Analysis
- Analyze each of the m completed data sets.
- Pooling
- Integrate the m analysis results into a final result
- This is by far the most preferred method for imputation for the following reasons
- Easy to use
- No bias (if the imputation model is correct)
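The impute/analyze/pool steps can be sketched with scikit-learn's `IterativeImputer` using `sample_posterior=True`, so the m completed data sets differ; this is a simplified stand-in for a full MICE implementation, and the "analysis" here is just a column mean:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X[:, 2] += X[:, 0]                    # correlate columns so imputation helps
X[rng.random(100) < 0.2, 2] = np.nan  # ~20% missing in one column

m = 5
estimates = []
for seed in range(m):
    # sample_posterior=True draws imputed values from a distribution,
    # so each completed data set is different (as MICE requires).
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    X_complete = imp.fit_transform(X)          # Imputation step
    estimates.append(X_complete[:, 2].mean())  # Analysis step

pooled = np.mean(estimates)                    # Pooling step
```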
- KNN (K Nearest Neighbors)
- k neighbors are chosen based on some distance measure and their average is used as an imputation estimate
- Requires selecting the number of nearest neighbors and a distance metric
- Can predict both
- discrete attributes (the most frequent value among the k nearest neighbors)
- Hamming distance
- For each categorical attribute, it counts one if the value differs between the two points
- The Hamming distance is then equal to the number of attributes for which the value was different
- continuous attributes (the mean among the k nearest neighbors)
- Euclidean
- Manhattan
- Cosine
- Feature
- It is simple to understand and easy to implement
- Drawbacks
- It becomes time-consuming on large datasets because it searches the entire dataset for similar instances
- The accuracy of KNN can be severely degraded with high-dimensional data because there is little difference between the nearest and farthest neighbors
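scikit-learn's `KNNImputer` covers the continuous case: each missing entry is replaced by the mean of that feature over the k nearest neighbors, with distances computed on the features both rows observe (it does not implement the Hamming-distance variant for categorical data). A toy sketch:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [1.1, 2.1],
    [5.0, 6.0],
    [1.05, np.nan],
])

# The last row's two nearest neighbours are the first two rows,
# so its missing value becomes mean(2.0, 2.1) = 2.05.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```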
Reference: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4