工作时长和收入间的线性关系 Linear Regression

2023-06-21
3 min read

本文对KGSS 2021年的调查数据中的 Y RINCOM0收入变量 和 X WGWKHR工作时长间的线性关系进行统计检验

Data Source

  • 分析工具: Python statsmodels统计包
  • 数据来源: KGSS

Go statsmodel Document 📄 Go KGSS Data 📊

Method

  • 假设 Y:RINCOM0 X:WGWKHR
  • 使用Python包 statsmodels 中的 OLS 线性回归函数

Import Library

import statsmodels.api as sm
import statsmodels.graphics.regressionplots as sgr
import pandas as pd
import numpy as np

Load Data

df = pd.read_spss("2003-2021_KGSS_public_v3_20230210.sav", convert_categoricals=False)
df.describe()

YEARRESPIDFINALWTKGSSAGINSURVEYNSURVEFRQSEXAGEMARITALEMPLY...HHEMPLYHHWHYNOEHHDNOHOMPOPSEPAPOPUNRELATCOHAPREGIONURBANSAMPLEAB
count20841.00000020841.00000020841.00000020841.00000020841.00000020841.00000020841.00000020841.00000020841.00000020841.000000...20841.00000020841.00000020841.00000020841.00000020841.00000020841.00000020841.00000020841.00000020841.00000020841.000000
mean2010.04932630028.4078021.000000-0.720215-0.709707-0.9545611.54085745.6985752.1042661.424164...1.2090110.6134546.7097072.9002930.3456650.023319-0.9920833.4295380.829327-0.605345
std4.84124822688.7631260.4594711.2480610.8805240.5836040.49834016.9266721.6966780.498577...0.6869903.5174503.8407341.3031560.7674340.2410260.1501271.8242710.3762320.933411
min2003.000000101.0000000.203211-8.000000-8.000000-8.0000001.000000-8.000000-8.000000-8.000000...-8.000000-8.0000001.000000-8.000000-8.000000-8.000000-1.0000001.0000000.000000-1.000000
25%2006.00000011307.0000000.692861-1.000000-1.000000-1.0000001.00000033.0000001.0000001.000000...1.000000-1.0000004.0000002.0000000.0000000.000000-1.0000002.0000001.000000-1.000000
50%2010.00000025312.0000000.910562-1.000000-1.000000-1.0000002.00000044.0000001.0000001.000000...1.000000-1.0000007.0000003.0000000.0000000.000000-1.0000004.0000001.000000-1.000000
75%2013.00000053405.0000001.247668-1.000000-1.000000-1.0000002.00000058.0000003.0000002.000000...1.000000-1.00000010.0000004.0000000.0000000.000000-1.0000005.0000001.000000-1.000000
max2021.00000070124.0000004.5910922.0000002.00000010.0000002.00000099.0000006.0000002.000000...2.00000020.00000024.00000011.0000009.0000009.0000002.0000007.0000001.0000002.000000

8 rows × 3212 columns

提取2021年份的数据进行分析

df_2021 = df[df.YEAR==2021]
df_2021

YEARRESPIDFINALWTKGSSAGINKGSSINTRKGSSDIFFSURVEYNSURVEFRQSEXAGE...HHEMPLYHHWHYNOEHHDNOHOMPOPSEPAPOPUNRELATCOHAPREGIONURBANSAMPLEAB
196362021.0101.00.5670771.0-1-12.0-1.02.036.0...1.0-1.01.01.00.01.0-1.01.01.01.0
196372021.0102.00.3420962.0-1-12.0-1.02.063.0...1.0-1.02.01.00.00.0-1.01.01.02.0
196382021.0104.00.5888531.0-1-12.0-1.02.026.0...1.0-1.04.01.00.00.0-1.01.01.02.0
196392021.0105.00.689820-8.0-1-1-8.0-1.01.032.0...1.0-1.05.01.00.00.0-1.01.01.01.0
196402021.0106.00.850229-8.0-1-12.0-1.01.027.0...1.0-1.06.01.00.00.0-1.01.01.02.0
..................................................................
208362021.070104.01.0092922.0-1-12.0-1.02.035.0...1.0-1.04.04.00.00.0-1.04.01.02.0
208372021.070105.01.1072232.0-1-12.0-1.02.047.0...1.0-1.05.04.00.00.0-1.04.01.01.0
208382021.070106.01.6047232.0-1-12.0-1.01.058.0...1.0-1.06.04.00.00.0-1.04.01.02.0
208392021.070109.01.2555102.0-1-12.0-1.01.033.0...1.0-1.09.02.00.00.0-1.04.01.01.0
208402021.070111.01.107223-8.0-1-11.0-8.02.047.0...1.0-1.011.03.00.00.0-1.04.01.01.0

1205 rows × 3215 columns

Pre-process

WGWKHR

df_2021.WGWKHR.describe()
count    1205.000000
mean       13.727801
std        20.376129
min        -8.000000
25%        -1.000000
50%        -1.000000
75%        40.000000
max        80.000000
Name: WGWKHR, dtype: float64

统计异常值总和

sum(df_2021["WGWKHR"]==-8)
7

将异常值替换为空(np.nan)

WGWKHR = df_2021["WGWKHR"]
WGWKHR.replace((-8,-1), np.nan, inplace=True)
WGWKHR.describe()
/var/folders/pm/fx5ldf494tqd843k9mpzck740000gn/T/ipykernel_2294/3763845172.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  WGWKHR.replace((-8,-1), np.nan, inplace=True)





count    476.000000
mean      36.386555
std       14.185602
min        2.000000
25%       35.000000
50%       40.000000
75%       45.000000
max       80.000000
Name: WGWKHR, dtype: float64

RINCOM

RINCOM = df_2021["RINCOM0"]
RINCOM.describe()
count    1205.000000
mean      174.807469
std       211.070790
min        -8.000000
25%        -1.000000
50%       150.000000
75%       300.000000
max      2400.000000
Name: RINCOM0, dtype: float64

替换异常值

RINCOM.replace((-8,-1), np.nan, inplace=True)
RINCOM.describe()
/var/folders/pm/fx5ldf494tqd843k9mpzck740000gn/T/ipykernel_2294/3581387875.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  RINCOM.replace((-8,-1), np.nan, inplace=True)





count     730.000000
mean      289.547945
std       200.331463
min         0.000000
25%       190.000000
50%       250.000000
75%       350.000000
max      2400.000000
Name: RINCOM0, dtype: float64

收入数据方差过大,容易对线性模型产生影响,因此将RINCOMlog标准化

RINCOM = np.log(RINCOM).replace(-np.inf,0)
RINCOM.describe()

/Users/jmsu/anaconda3/envs/KGSS/lib/python3.10/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)





count    730.000000
mean       5.394227
std        1.001719
min        0.000000
25%        5.247024
50%        5.521461
75%        5.857933
max        7.783224
Name: RINCOM0, dtype: float64

Simaple Linear Regression (OLS)

建立模型,拟合并输出结果

model = sm.OLS(RINCOM, sm.add_constant(WGWKHR),missing='drop')
result = model.fit()
result.summary()
OLS Regression Results
Dep. Variable:RINCOM0R-squared:0.117
Model:OLSAdj. R-squared:0.115
Method:Least SquaresF-statistic:60.54
Date:Wed, 21 Jun 2023Prob (F-statistic):4.81e-14
Time:18:14:13Log-Likelihood:-464.72
No. Observations:461AIC:933.4
Df Residuals:459BIC:941.7
Df Model:1
Covariance Type:nonrobust
coefstd errtP>|t|[0.0250.975]
const4.84240.08656.0610.0004.6735.012
WGWKHR0.01720.0027.7810.0000.0130.021
Omnibus:306.448Durbin-Watson:1.707
Prob(Omnibus):0.000Jarque-Bera (JB):6747.756
Skew:-2.490Prob(JB):0.00
Kurtosis:21.069Cond. No.109.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

可以看出$R^2=0.117$ 说明模型的拟合程度不足,很好解释,因为很明显自变量工作时长不是影响因变量收入的唯一因素, 需要其他变量的介入,拟合多元线性模型以提高解释能力。

根据参数t检验结果,可以看出工作时长收入间的存在着显著的线性关系

画个回归拟合图:

sgr.plot_fit(result, 1)

output

Avatar

Jm Su

Life is Just for Fun and Love