How to Handle Missing Data


  • Understand the reason why data goes missing
    • Missing at Random (MAR)
      • Missing at random means that the propensity for a data point to be missing is not related to the missing data, but it is related to some of the observed data
    • Missing Completely at Random (MCAR)
      • The fact that a certain value is missing has nothing to do with its hypothetical value or with the values of other variables
    • Missing not at Random (MNAR)
      • Two possible reasons
        • The missing value depends on the hypothetical value itself
        • e.g. people with high salaries often decline to reveal their incomes in surveys
        • The missing value depends on some other variable’s value
        • e.g. suppose women are less likely to reveal their age; the missing values in the age variable are then driven by the gender variable
  • Handling Missing Data
    • Deletion
      • Deleting Rows (Listwise Deletion)
        • can produce biased parameters and estimates unless the data are MCAR
      • Pairwise Deletion
        • retains more data than listwise deletion, increasing the power of your analysis
        • end up with different numbers of observations contributing to different parts of your model, which can make interpretation difficult
      • Deleting Columns
        • Not Recommended
        • Imputation is always a preferred choice over dropping variables
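The three deletion strategies above can be sketched with pandas (the data frame below is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47],
    "salary": [50000, 62000, np.nan, 80000],
    "city":   ["A", "B", "B", np.nan],
})

# Listwise deletion: drop every row that has at least one missing value.
listwise = df.dropna()

# Pairwise deletion: each statistic uses all rows available for that pair
# of variables; pandas does this implicitly in .corr() / .cov().
pairwise_corr = df[["age", "salary"]].corr()

# Deleting a column (not recommended; imputation is usually preferred).
no_salary = df.drop(columns=["salary"])
```

Note how listwise deletion keeps only one of the four rows here, while the pairwise correlation still uses every row where both `age` and `salary` are present.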
    • Imputation
      • Time-series Problem
        • Data without Trend & without Seasonality
          • Mean, Median and Mode
          • Random
          • Sample Imputation
        • Data with Trend & without Seasonality
          • Linear Interpolation
        • Data with Trend & Seasonality
          • Seasonal Adjustment + Interpolation
        • Specific Methods
          • Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB)
            • This is a common statistical approach to the analysis of longitudinal repeated-measures data where some follow-up observations may be missing.
            • Longitudinal data track the same sample at different points in time.
            • Both these methods can introduce bias in analysis and perform poorly when data has a visible trend
          • Linear Interpolation
            • This method works well for a time series with some trend but is not suitable for seasonal data
          • Seasonal Adjustment + Linear Interpolation
            • This method works well for data with both trend and seasonality
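A minimal pandas sketch of LOCF, NOCB, and linear interpolation on a toy daily series (the values are illustrative):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=8, freq="D")
s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, 6.0, 7.0, np.nan], index=idx)

# LOCF: carry the last observed value forward.
locf = s.ffill()

# NOCB: carry the next observed value backward
# (a trailing NaN has no "next" observation and stays missing).
nocb = s.bfill()

# Linear interpolation: suits data with a trend but no seasonality.
linear = s.interpolate(method="linear")
```

For data with both trend and seasonality, a common recipe is to decompose the series (e.g. with `statsmodels.tsa.seasonal.seasonal_decompose`), interpolate the deseasonalized part, and add the seasonal component back.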
      • General Problem
        • Categorical
          • Treat NA as a level
            • Missing values are treated as a separate category: create a new level for them and use it like any other
            • This is the simplest method
          • Multiple Imputation
          • Logistic Regression
            • We create a predictive model to estimate values that will substitute for the missing data.
            • In this case, we divide our data set into two sets:
              • One set with no missing values for the variable (training)
              • Another one with missing values (test).
            • We can use methods like logistic regression and ANOVA for prediction
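Both categorical strategies can be sketched as follows; the data frame, column names, and values are made up for illustration, with scikit-learn's `LogisticRegression` standing in for the predictive model:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "income":   [30, 80, 45, 90, 28, 75, 40, 85],
    "owns_car": ["no", "yes", "no", "yes", np.nan, "yes", np.nan, "yes"],
})

# Option 1: treat NA as its own level.
df["owns_car_level"] = df["owns_car"].fillna("missing")

# Option 2: predictive imputation -- rows with the value observed form the
# training set, rows where it is missing form the "test" set to be predicted.
train = df[df["owns_car"].notna()]
test = df[df["owns_car"].isna()]
model = LogisticRegression().fit(train[["income"]], train["owns_car"])
df.loc[df["owns_car"].isna(), "owns_car"] = model.predict(test[["income"]])
```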
        • Continuous
          • Mean, Median, and Mode
          • Multiple imputation
          • Linear Regression
        • Method
          • Mean, Median, and Mode
            • a basic imputation method
            • takes no advantage of time-series characteristics or of relationships between the variables
            • It is very fast, but has clear disadvantages
              • One disadvantage is that mean imputation reduces variance in the dataset
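The variance-reduction effect is easy to demonstrate; the series below is illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, np.nan, 8.0, np.nan, 6.0])

# Fill the gaps with the mean of the observed values.
imputed = s.fillna(s.mean())

# The mean is preserved, but every imputed point sits exactly at the mean,
# so the sample variance shrinks relative to the observed values.
```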
          • Linear Regression
            • To begin, several predictors of the variable with missing values are identified using a correlation matrix. The best predictors are selected and used as independent variables in a regression equation. The variable with missing data is used as the dependent variable.
            • Cases with complete data for the predictor variables are used to generate the regression equation; the equation is then used to predict missing values for incomplete cases.
            • In an iterative process, values for the missing variable are inserted and then all cases are used to predict the dependent variable.
            • These steps are repeated until there is little difference between the predicted values from one step to the next, that is they converge.
            • It “theoretically” provides good estimates for missing values.
            • However, there are several disadvantages of this model which tend to outweigh the advantages.
              • First, because the replaced values were predicted from other variables they tend to fit together “too well” and so standard error is deflated.
              • One must also assume that there is a linear relationship between the variables used in the regression equation when there may not be one.
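scikit-learn's `IterativeImputer` implements this iterate-until-convergence scheme; a minimal sketch with a plain linear-regression estimator (the array is illustrative, with column 1 roughly twice column 0):

```python
import numpy as np
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

X = np.array([
    [1.0, 2.1], [2.0, 3.9], [3.0, np.nan],
    [4.0, 8.1], [5.0, np.nan], [6.0, 12.2],
])

# Regress the feature with missing values on the complete cases, predict
# the gaps, and repeat until successive imputations converge.
imputer = IterativeImputer(estimator=LinearRegression(), max_iter=10,
                           random_state=0)
X_filled = imputer.fit_transform(X)
```

Because the toy data follow an almost exact linear relationship, the imputed entries land close to 2x the first column.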
          • Multiple Imputation
            • Imputation
              • Impute the missing entries of the incomplete data sets m times
              • Note
                • that imputed values are drawn from a distribution.
                • Simple random draws don’t account for the uncertainty in the model parameters.
                • A better approach is to use Markov Chain Monte Carlo (MCMC) simulation. This step results in m complete data sets.
            • Analysis
              • Analyze each of the m completed data sets.
            • Pooling
              • Integrate the m analysis results into a final result
            • This is by far the most preferred method for imputation for the following reasons
              • Easy to use
              • No biases (if imputation model is correct)
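A sketch of the impute/analyze/pool cycle using scikit-learn's `IterativeImputer` with `sample_posterior=True`, so that each of the m completed datasets is a different draw; the data and the chosen analysis (a column mean) are illustrative:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[::7, 1] = np.nan  # punch holes in one column

m = 5
estimates = []
for seed in range(m):
    # sample_posterior=True draws imputed values from a predictive
    # distribution, so the m completed datasets differ from one another.
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    X_complete = imputer.fit_transform(X)
    # Analysis step: compute the statistic of interest on each dataset.
    estimates.append(X_complete[:, 1].mean())

# Pooling step: combine the m per-dataset results into one estimate.
pooled = float(np.mean(estimates))
```

Full pooling (Rubin's rules) also combines within- and between-imputation variance; the simple mean here covers only the point-estimate part.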
          • KNN (K Nearest Neighbors)
            • k neighbors are chosen based on some distance measure and their average is used as an imputation estimate
            • requires the selection of the number of nearest neighbors, and a distance metric
            • predict both
              • discrete attributes (the most frequent value among the k nearest neighbors)
                • Hamming distance
                  • For each categorical attribute, count 1 if the value differs between the two points
                  • The Hamming distance is then the number of attributes for which the values differ
              • continuous attributes (the mean among the k nearest neighbors)
                • Euclidean
                • Manhattan
                • Cosine
            • Feature
              • It is simple to understand and easy to implement
            • Drawbacks
              • It becomes time-consuming when analyzing large datasets because it searches for similar instances through the entire dataset
              • the accuracy of KNN can be severely degraded with high-dimensional data because there is little difference between the nearest and farthest neighbor
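scikit-learn's `KNNImputer` covers the continuous case (mean of the k nearest neighbours under a NaN-aware Euclidean distance); the array below is illustrative:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [2.0, 3.0],
    [3.0, np.nan],
    [4.0, 5.0],
    [8.0, 9.0],
])

# Each missing entry is replaced by the mean of that feature over the
# k nearest rows, with distances computed on the observed features only.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

`KNNImputer` handles numeric data only; the most-frequent-value variant for discrete attributes described above would need a custom implementation.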

Reference: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4