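I have a dataset from the internet, and I wanted to try different normality tests on several of its columns. Interestingly, the different normality tests give me different results. They don't just differ by a few decimal places; the conclusions are completely different.
Here is my code.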
from pandas import read_csv
from scipy import stats  # needed for the stats.* calls below

# Load the dataset
url = "https://raw.githubusercontent.com/rashida048/Datasets/master/cars.csv"
data = read_csv(url)

# Columns to test
y_1 = 'HWY (Le/100 km)'
y_2 = 'HWY (kWh/100 km)'
y_3 = 'CITY (kWh/100 km)'
y_4 = '(km)'
m = data[y_1]
m_2 = data[y_2]
m_3 = data[y_3]
m_4 = data[y_4]
l = [m, m_2, m_3, m_4]
# Kolmogorov-Smirnov test for normality
for i in l:
    statistic, pvalue = stats.kstest(i, 'norm')
    print('statistic = %.2f, p = %.1f' % (statistic, pvalue))
    if pvalue > 0.05:
        print('Gaussian')
    else:
        print('Not Gaussian')
Output:
statistic = 0.98, p = 0.0
Not Gaussian
statistic = 1.00, p = 0.0
Not Gaussian
statistic = 1.00, p = 0.0
Not Gaussian
statistic = 1.00, p = 0.0
Not Gaussian
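A side note on the call above: `stats.kstest(i, 'norm')` compares the data against the standard normal distribution (mean 0, standard deviation 1), so a column measured in km or kWh/100 km would be rejected regardless of its shape. Below is a minimal sketch of that effect. It assumes synthetic data rather than the cars.csv columns; the `sample` array and its mean and spread are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for one column: normally distributed, but far from N(0, 1)
rng = np.random.default_rng(0)
sample = rng.normal(loc=20.0, scale=3.0, size=200)

# Raw values are compared against the standard normal N(0, 1)
raw_stat, raw_p = stats.kstest(sample, 'norm')

# Standardize with the sample mean and standard deviation before testing
z = (sample - sample.mean()) / sample.std(ddof=1)
std_stat, std_p = stats.kstest(z, 'norm')

print('raw:          statistic = %.2f, p = %.5f' % (raw_stat, raw_p))
print('standardized: statistic = %.2f, p = %.5f' % (std_stat, std_p))
```

Because the mean and standard deviation are estimated from the same data, the p-value of the standardized call should be read as approximate only.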
# normaltest (D'Agostino and Pearson's test)
for i in l:
    statistic, pvalue = stats.normaltest(i)
    print('statistic = %.2f, p = %.5f' % (statistic, pvalue))
    if pvalue > 0.05:
        print('Gaussian')
    else:
        print('Not Gaussian')
Output:
statistic = 3.12, p = 0.21050
Gaussian
statistic = 3.28, p = 0.19423
Gaussian
statistic = 70.15, p = 0.00000
Not Gaussian
statistic = 188.31, p = 0.00000
Not Gaussian
# Chi-square test
for i in l:
    statistic, pvalue = stats.chisquare(i)
    print('statistic = %.2f, p = %.5f' % (statistic, pvalue))
    if pvalue > 0.05:
        print('Gaussian')
    else:
        print('Not Gaussian')
Output:
statistic = 0.44, p = 1.00000
Gaussian
statistic = 3.73, p = 1.00000
Gaussian
statistic = 23.84, p = 0.99972
Gaussian
statistic = 4348.68, p = 0.00000
Not Gaussian
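A side note on the call above: `stats.chisquare(i)` treats the values it receives as observed category counts and, when no expected frequencies are given, compares them against equal counts in every category. Applied to raw measurements it therefore does not test normality. Below is a rough sketch of one way a chi-square goodness-of-fit check against a normal distribution could be set up. It assumes synthetic data and an arbitrary choice of 10 equal-probability bins; `sample`, `k`, and the binning scheme are illustrative assumptions, not something taken from the dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for one column (not the cars.csv data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=20.0, scale=3.0, size=200)

# Calling chisquare on the raw values compares them to equal "counts"
naive_stat, naive_p = stats.chisquare(sample)

# Fit a normal distribution using the sample mean and standard deviation
mu, sigma = sample.mean(), sample.std(ddof=1)

# Split the fitted normal into k bins of equal probability
k = 10
interior_edges = stats.norm.ppf(np.arange(1, k) / k, loc=mu, scale=sigma)

# Count how many observations fall into each bin
observed = np.bincount(np.searchsorted(interior_edges, sample), minlength=k)
expected = np.full(k, len(sample) / k)

# ddof=2 accounts for the two parameters estimated from the data
stat, p = stats.chisquare(observed, expected, ddof=2)

print('naive:  statistic = %.2f, p = %.5f' % (naive_stat, naive_p))
print('binned: statistic = %.2f, p = %.5f' % (stat, p))
```

The result of the binned version depends on the number of bins chosen, so it is best treated as a rough check rather than a definitive answer.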
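I'm still learning data science and everything behind it, but I'm confused about how to draw a conclusion when the tests give such different values. Do you just pick one method and stick with it? That can't be right, can it?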