Linear Regression Using Python Statsmodel Library
Sulyun Lee (sulyun-lee@uiowa.edu) | 2021-06-29 | https://sulyunlee.github.io/posts/blog-post-statsmodel_linreg

<p>The <code class="language-plaintext highlighter-rouge">statsmodels</code> package in Python provides various built-in functions for statistical analysis. Linear regression is a statistical model that finds linear relationships between feature variables and a target variable. A fitted linear regression model provides a coefficient for each feature variable that indicates the magnitude of its impact on the target variable, given the other variables, so it helps explain which feature variables are associated with the target variable. While the <code class="language-plaintext highlighter-rouge">sklearn</code> package in Python also provides a built-in linear regression function, it is less suitable for statistical analysis, since the confidence intervals and p-values of the feature coefficients are not easy to obtain from it.</p> <p>I implemented a script that fits a linear regression model using the <code class="language-plaintext highlighter-rouge">statsmodels</code> package. Before feeding the feature variables into a linear regression model, three things need to be done: 1) variance inflation factor analysis, 2) standardization, and 3) log-transformation. 
The following packages are needed:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import scale
</code></pre></div></div> <h2 id="variance-inflation-factor-vif-analysis">Variance inflation factor (VIF) analysis</h2> <p>Variance inflation factor analysis is required to remove possible multicollinearity problems. Multicollinearity means that two or more feature variables are correlated with each other, which violates the basic assumption of linear regression that the feature variables are independent. VIF measures the amount of multicollinearity of a feature variable with the other variables. Therefore, if the VIF value of a feature variable is large, that variable is highly correlated with the other variables and should be removed from the model.</p> <p>The formula of the VIF of a feature variable <img src="https://render.githubusercontent.com/render/math?math=i" /> is as follows:<br /> <img src="https://render.githubusercontent.com/render/math?math=VIF_{i} = \frac{1}{1-R_{i}^2}" /><br /> <img src="https://render.githubusercontent.com/render/math?math=R_{i}^2" /> is the coefficient of determination (also known as R-squared), which measures how well the variable can be explained by the other variables. 
The range of <img src="https://render.githubusercontent.com/render/math?math=R_{i}^2" /> is between 0.0 and 1.0. <img src="https://render.githubusercontent.com/render/math?math=R_{i}^2 = 0.0" /> means that the other variables fail to predict the variable <img src="https://render.githubusercontent.com/render/math?math=i" /> at all, whereas <img src="https://render.githubusercontent.com/render/math?math=R_{i}^2 = 1.0" /> means that the other variables predict the variable <img src="https://render.githubusercontent.com/render/math?math=i" /> perfectly. Therefore, if <img src="https://render.githubusercontent.com/render/math?math=R_{i}^2" /> is large, the other variables are highly correlated with the variable <img src="https://render.githubusercontent.com/render/math?math=i" />, giving a much larger VIF.</p> <p>In the following code, I used the <code class="language-plaintext highlighter-rouge">statsmodels</code> built-in function <code class="language-plaintext highlighter-rouge">variance_inflation_factor</code> to calculate the VIF of every feature variable and iteratively remove the variable with the largest VIF when it exceeds 10.0 (a common threshold). This is repeated until none of the remaining feature variables has a VIF over 10.0.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def vif(X, threshold=10.0):
    '''
    Inputs:
    - X: dataframe of features of size (N x f), where N is the number of
      instances and f is the number of feature variables.
    - threshold: cutoff of VIF values for removing feature variables with
      multicollinearity. The default is 10.0.
    Output: dataframe of features after removing high-VIF variables.
    '''
    dropped = True
    while dropped:
        dropped = False
        # Compute the VIF of every remaining feature variable.
        vif_values = [variance_inflation_factor(X.values, i)
                      for i in range(X.shape[1])]
        max_vif = max(vif_values)
        if max_vif &gt; threshold:
            # Drop the variable with the largest VIF and recompute.
            max_index = vif_values.index(max_vif)
            X = X.drop(X.columns[max_index], axis=1)
            dropped = True
    return X
</code></pre></div></div> <h2 id="standardization">Standardization</h2> <p>The next step is to standardize all feature variables. Some continuous feature variables span a wide range of values, while others have very small value ranges. This can be a problem because a feature variable with a wide value range may influence the model more than the other variables. For example, assume there are two feature variables, annual income and the number of family members, for predicting the annual household expense. The annual income can range from 0 to billions of dollars, but the number of family members is limited, mostly less than 10. If we use the raw feature values, the income feature may dominate the model. Therefore, we standardize the feature vectors so that they have similar ranges of values. There are many ways to standardize; I usually use mean-zero standardization, which shifts each variable to be centered at 0 and rescales it to unit standard deviation.</p> <p>In the following code, I used the <code class="language-plaintext highlighter-rouge">scale</code> built-in function from the <code class="language-plaintext highlighter-rouge">sklearn</code> package. 
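</p>

<p>For intuition, <code class="language-plaintext highlighter-rouge">scale</code> centers each column at zero and divides it by its (population) standard deviation. A minimal plain-numpy sketch of the same computation, on made-up numbers of my own, might look like:</p>

```python
import numpy as np

# Made-up feature matrix: two columns with very different scales
# (e.g. annual income vs. number of family members).
X = np.array([[50_000.0, 2.0],
              [80_000.0, 4.0],
              [120_000.0, 3.0],
              [65_000.0, 1.0]])

# Center each column at 0 and divide by its population standard deviation;
# this mirrors what sklearn.preprocessing.scale does by default.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
```

<p>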
The input <code class="language-plaintext highlighter-rouge">X</code> is the numpy array that contains all the feature variables.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>X_scaled = scale(X)
</code></pre></div></div> <h2 id="log-transformation">Log-transformation</h2> <p>The final step before fitting the linear regression model is to log-transform any highly skewed variables. This step is needed because the statistical tests of linear regression assume normally distributed errors, and highly skewed variables often lead to violations of this assumption and invalid statistical tests. In the following code, I used the natural log transformation to make right-skewed variables closer to a normal distribution. The input <code class="language-plaintext highlighter-rouge">skewed_X</code> is the numpy array that contains only the highly skewed feature variables.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>X_transformed = np.log(skewed_X + 1)  # add 1 to guard against zero values
</code></pre></div></div> <h2 id="linear-regression">Linear regression</h2> <p>After the above three steps are done, we can now fit the linear regression model. 
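</p>

<p>Putting the three preprocessing steps together, the overall flow might look like the sketch below. The column names, the sample data, and the choice to log-transform before standardizing (so the log input stays nonnegative) are my own illustrative assumptions, not part of the original script:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: income is right-skewed, family_size has a small range.
df = pd.DataFrame({
    "income": [30_000.0, 45_000.0, 60_000.0, 250_000.0, 1_200_000.0],
    "family_size": [1.0, 2.0, 3.0, 4.0, 2.0],
    "expense": [20_000.0, 30_000.0, 40_000.0, 90_000.0, 300_000.0],
})

features = ["income", "family_size"]

# 1) VIF filtering would run here, e.g. df[features] = vif(df[features]),
#    using a VIF helper like the one defined earlier.

# 2) Log-transform the skewed feature (adding 1 to guard against zeros).
df["income"] = np.log(df["income"] + 1)

# 3) Standardize the features to zero mean and unit variance.
df[features] = (df[features] - df[features].mean()) / df[features].std(ddof=0)
```

<p>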
The following code implements the linear regression model function.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def linear_regression(df, feature_names, target_name, write_filename):
    '''
    Inputs:
    - df: dataframe that contains both the feature and target variables.
    - feature_names: list of feature variable names.
    - target_name: string containing the target variable name.
    - write_filename: string filename for exporting the modeling result.
    Output: fitted linear regression model
    '''
    # Build an R-style formula, e.g. "y ~ x1+x2+x3".
    formula = "{} ~ {}".format(target_name, "+".join(feature_names))
    # smf is statsmodels.formula.api (see the imports above).
    model = smf.ols(formula=formula, data=df).fit()
    # Export the model summary to a csv file.
    with open("/results/{}.csv".format(write_filename), "w") as fh:
        fh.write(model.summary().as_csv())
    return model
</code></pre></div></div>
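<p>As a usage sketch (with synthetic data of my own, not from the post), fitting through the statsmodels formula API and reading off the coefficients, p-values, and confidence intervals, which are exactly the quantities the sklearn regressor does not readily expose, looks like this:</p>

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known linear relationship: y = 2*x1 - x2 + noise.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2.0 * df["x1"] - 1.0 * df["x2"] + rng.normal(scale=0.5, size=n)

model = smf.ols("y ~ x1 + x2", data=df).fit()

print(model.params)      # estimated coefficients (Intercept, x1, x2)
print(model.pvalues)     # per-coefficient p-values
print(model.conf_int())  # 95% confidence intervals
```

<p>The estimated coefficients should land close to the true values of 2 and -1, and their small p-values indicate that both features are significantly associated with the target.</p>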