How to Handle Missing Data


  • Understand the reason why data goes missing
    • Missing at Random (MAR)
      • Missing at random means that the propensity for a data point to be missing is not related to the missing data, but it is related to some of the observed data
    • Missing Completely at Random (MCAR)
      • The fact that a certain value is missing has nothing to do with its hypothetical value or with the values of other variables
    • Missing not at Random (MNAR)
      • Two possible reasons
        • The missing value depends on the hypothetical value itself
        • e.g. people with high salaries often decline to reveal their incomes in surveys
        • The missing value depends on some other variable’s value
        • e.g. suppose women are less likely to reveal their age; the missing values in the age variable are then driven by the gender variable
  • Handling Missing Data
    • Deletion
      • Deleting Rows (Listwise Deletion)
        • can produce biased parameters and estimates unless the data are MCAR
      • Pairwise Deletion
        • retains more data than listwise deletion, increasing the power of your analysis
        • end up with different numbers of observations contributing to different parts of your model, which can make interpretation difficult
      • Deleting Columns
        • Not Recommended
        • Imputation is always a preferred choice over dropping variables
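The three deletion strategies above can be sketched with pandas (the data frame below is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47],
    "salary": [50000, 62000, np.nan, 80000],
    "city":   ["A", "B", "B", np.nan],
})

# Listwise deletion: drop every row that has at least one missing value.
listwise = df.dropna()

# Pairwise deletion: each statistic uses all rows available for that pair
# of variables; pandas does this implicitly in .corr() / .cov().
pairwise_corr = df[["age", "salary"]].corr()

# Deleting a column (not recommended; imputation is usually preferred).
no_salary = df.drop(columns=["salary"])
```

Note how listwise deletion keeps only one of the four rows here, while the pairwise correlation still uses every row where both `age` and `salary` are present.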
    • Imputation
      • Time-series Problem
        • Data without Trend & without Seasonality
          • Mean, Median and Mode
          • Random
          • Sample Imputation
        • Data with Trend & without Seasonality
          • Linear Interpolation
        • Data with Trend & Seasonality
          • Seasonal Adjustment + Interpolation
        • Specific Methods
          • Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB)
            • This is a common statistical approach to the analysis of longitudinal repeated-measures data where some follow-up observations may be missing.
            • Longitudinal data track the same sample at different points in time.
            • Both these methods can introduce bias in analysis and perform poorly when data has a visible trend
          • Linear Interpolation
            • This method works well for a time series with some trend but is not suitable for seasonal data
          • Seasonal Adjustment + Linear Interpolation
            • This method works well for data with both trend and seasonality
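A minimal pandas sketch of LOCF, NOCB, and linear interpolation on a toy daily series (the values are illustrative):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=8, freq="D")
s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, 6.0, 7.0, np.nan], index=idx)

# LOCF: carry the last observed value forward.
locf = s.ffill()

# NOCB: carry the next observed value backward
# (a trailing NaN has no "next" observation and stays missing).
nocb = s.bfill()

# Linear interpolation: suits data with a trend but no seasonality.
linear = s.interpolate(method="linear")
```

For data with both trend and seasonality, a common recipe is to decompose the series (e.g. with `statsmodels.tsa.seasonal.seasonal_decompose`), interpolate the deseasonalized part, and add the seasonal component back.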
      • General Problem
        • Categorical
          • Treat NA as a level
            • Missing values are treated as a separate category: create a new level for them and use it like any other
            • This is the simplest method
          • Multiple Imputation
          • Logistic Regression
            • We create a predictive model to estimate values that will substitute for the missing data.
            • In this case, we divide our data set into two sets:
              • One set with no missing values for the variable (training)
              • Another one with missing values (test).
            • We can use methods like logistic regression and ANOVA for prediction
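Both categorical strategies can be sketched as follows; the data frame, column names, and values are made up for illustration, with scikit-learn's `LogisticRegression` standing in for the predictive model:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "income":   [30, 80, 45, 90, 28, 75, 40, 85],
    "owns_car": ["no", "yes", "no", "yes", np.nan, "yes", np.nan, "yes"],
})

# Option 1: treat NA as its own level.
df["owns_car_level"] = df["owns_car"].fillna("missing")

# Option 2: predictive imputation -- rows with the value observed form the
# training set, rows where it is missing form the "test" set to be predicted.
train = df[df["owns_car"].notna()]
test = df[df["owns_car"].isna()]
model = LogisticRegression().fit(train[["income"]], train["owns_car"])
df.loc[df["owns_car"].isna(), "owns_car"] = model.predict(test[["income"]])
```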
        • Continuous
          • Mean, Median, and Mode
          • Multiple imputation
          • Linear Regression
        • Method
          • Mean, Median, and Mode
            • a basic imputation method
            • takes no advantage of time-series characteristics or of relationships between the variables
            • It is very fast, but has clear disadvantages
              • One disadvantage is that mean imputation reduces variance in the dataset
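The variance-reduction effect is easy to demonstrate; the series below is illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, np.nan, 8.0, np.nan, 6.0])

# Fill the gaps with the mean of the observed values.
imputed = s.fillna(s.mean())

# The mean is preserved, but every imputed point sits exactly at the mean,
# so the sample variance shrinks relative to the observed values.
```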
          • Linear Regression
            • To begin, several predictors of the variable with missing values are identified using a correlation matrix. The best predictors are selected and used as independent variables in a regression equation. The variable with missing data is used as the dependent variable.
            • Cases with complete data for the predictor variables are used to generate the regression equation; the equation is then used to predict missing values for incomplete cases.
            • In an iterative process, values for the missing variable are inserted and then all cases are used to predict the dependent variable.
            • These steps are repeated until there is little difference between the predicted values from one step to the next, that is they converge.
            • It “theoretically” provides good estimates for missing values.
            • However, there are several disadvantages of this model which tend to outweigh the advantages.
              • First, because the replaced values were predicted from other variables they tend to fit together “too well” and so standard error is deflated.
              • One must also assume that there is a linear relationship between the variables used in the regression equation when there may not be one.
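scikit-learn's `IterativeImputer` implements this iterate-until-convergence scheme; a minimal sketch with a plain linear-regression estimator (the array is illustrative, with column 1 roughly twice column 0):

```python
import numpy as np
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

X = np.array([
    [1.0, 2.1], [2.0, 3.9], [3.0, np.nan],
    [4.0, 8.1], [5.0, np.nan], [6.0, 12.2],
])

# Regress the feature with missing values on the complete cases, predict
# the gaps, and repeat until successive imputations converge.
imputer = IterativeImputer(estimator=LinearRegression(), max_iter=10,
                           random_state=0)
X_filled = imputer.fit_transform(X)
```

Because the toy data follow an almost exact linear relationship, the imputed entries land close to 2x the first column.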
          • Multiple Imputation
            • Imputation
              • Impute the missing entries of the incomplete data sets m times
              • Note
                • that imputed values are drawn from a distribution.
                • Simple random draws don’t account for the uncertainty in the model parameters.
                • A better approach is to use Markov Chain Monte Carlo (MCMC) simulation. This step results in m complete data sets.
            • Analysis
              • Analyze each of the m completed data sets.
            • Pooling
              • Integrate the m analysis results into a final result
            • This is by far the most preferred method for imputation for the following reasons
              • Easy to use
              • No biases (if imputation model is correct)
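A sketch of the impute/analyze/pool cycle using scikit-learn's `IterativeImputer` with `sample_posterior=True`, so that each of the m completed datasets is a different draw; the data and the chosen analysis (a column mean) are illustrative:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[::7, 1] = np.nan  # punch holes in one column

m = 5
estimates = []
for seed in range(m):
    # sample_posterior=True draws imputed values from a predictive
    # distribution, so the m completed datasets differ from one another.
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    X_complete = imputer.fit_transform(X)
    # Analysis step: compute the statistic of interest on each dataset.
    estimates.append(X_complete[:, 1].mean())

# Pooling step: combine the m per-dataset results into one estimate.
pooled = float(np.mean(estimates))
```

Full pooling (Rubin's rules) also combines within- and between-imputation variance; the simple mean here covers only the point-estimate part.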
          • KNN (K Nearest Neighbors)
            • k neighbors are chosen based on some distance measure and their average is used as an imputation estimate
            • requires the selection of the number of nearest neighbors, and a distance metric
            • predict both
              • discrete attributes (the most frequent value among the k nearest neighbors)
                • Hamming distance
                  • For each categorical attribute, count 1 if the value differs between the two points
                  • The Hamming distance is then the number of attributes for which the values differ
              • continuous attributes (the mean among the k nearest neighbors)
                • Euclidean
                • Manhattan
                • Cosine
            • Feature
              • It is simple to understand and easy to implement
            • Drawbacks
              • It becomes time-consuming when analyzing large datasets because it searches for similar instances through the entire dataset
              • the accuracy of KNN can be severely degraded with high-dimensional data because there is little difference between the nearest and farthest neighbor
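scikit-learn's `KNNImputer` covers the continuous case (mean of the k nearest neighbours under a NaN-aware Euclidean distance); the array below is illustrative:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [2.0, 3.0],
    [3.0, np.nan],
    [4.0, 5.0],
    [8.0, 9.0],
])

# Each missing entry is replaced by the mean of that feature over the
# k nearest rows, with distances computed on the observed features only.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

`KNNImputer` handles numeric data only; the most-frequent-value variant for discrete attributes described above would need a custom implementation.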

Reference: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4