Pandas
(Python Data Analysis Library)是基于NumPy
的一种工具,该工具是为了解决数据分析任务而创建的。Pandas 纳入了大量库和一些标准的数据模型,提供了高效地操作大型数据集所需的工具。pandas提供了大量能使我们快速便捷地处理数据的函数和方法。你很快就会发现,它是使Python成为强大而高效的数据分析环境的重要因素之一。
pandas主要是用来作数据分析和处理的,如果想从事机器学习或数据挖掘相关的工作和研究,跟数据能打上交道的行业都离不开这个pandas库。pandas库给我们提供了非常简洁的函数,从数据的读取、预处理、到最后的分析,用一些很简洁的函数就能表示很复杂的操作。
读取数据
'''
aaa.txt
--------------------
Year,t1,t2,t3,t4
1986,t11,t21,t31,23
1987,t12,t22,t32,121
1988,t13,t23,t33,123
1989,t14,t24,t34,125
--------------------
'''
import pandas
import re
#只要文件中的数据是以 "," 为分隔符的,都可以使用 pandas.read_csv 函数进行读取
info = pandas.read_csv('aaa.txt')
print ("Pandas数据流矩阵(核心类):\n", type(info))
print ("数表的类型:\n", info.dtypes)
print ("数表的列:\n", info.columns)
print ("数表的型状:\n", info.shape)
print ("数表的前3个样本:\n", info.head(3))
print ("数表的尾3个样本:\n", info.tail(3))
print ("取数表的第0个样本:\n", info.loc[0])
print ("取数表的第[2,3]个样本:\n", info.loc[2:3])
print ("只打印数表的某些行和某些列:\n", info.loc[2:3][["Year","t1"]] )
print ("只打印数表的某些列和某些行:\n", info[["Year","t1"]].loc[2:3] )
columnlist = info.columns.tolist()
print (type(columnlist), columnlist)
target_clolumns = []
target_pattern = re.compile(r't\d')
for c in target_columns:
if target_pattern.match(c):
target_clolumns.append(c)
print ("目标列为:\n", target_clolumns)
print ("目标列数据对应的数据为:\n", info[target_clolumns].head(3))
'''
Pandas数据流矩阵(核心类):
<class 'pandas.core.frame.DataFrame'>
数表的类型:
Year int64
t1 object
t2 object
t3 object
t4 int64
dtype: object
数表的列:
Index(['Year', 't1', 't2', 't3', 't4'], dtype='object')
数表的型状:
(4, 5)
数表的前3个样本:
Year t1 t2 t3 t4
0 1986 t11 t21 t31 23
1 1987 t12 t22 t32 121
2 1988 t13 t23 t33 123
数表的尾3个样本:
Year t1 t2 t3 t4
1 1987 t12 t22 t32 121
2 1988 t13 t23 t33 123
3 1989 t14 t24 t34 125
取数表的第0个样本:
Year 1986
t1 t11
t2 t21
t3 t31
t4 23
Name: 0, dtype: object
取数表的第[2,3]个样本:
Year t1 t2 t3 t4
2 1988 t13 t23 t33 123
3 1989 t14 t24 t34 125
只打印数表的某些行和某些列:
Year t1
2 1988 t13
3 1989 t14
只打印数表的某些列和某些行:
Year t1
2 1988 t13
3 1989 t14
<class 'list'> ['Year', 't1', 't2', 't3', 't4']
目标列为:
['t1', 't2', 't3', 't4']
目标列数据对应的数据为:
t1 t2 t3 t4
0 t11 t21 t31 23
1 t12 t22 t32 121
2 t13 t23 t33 123
'''
数据的预处理
#数据的预处理
print ("某一列都除以这一列的最大数:\n", info['t4']/info['t4'].max())
info.sort_values("t4", inplace=True, ascending=False) #注意排序会改变原来的dataframe
print ("在原来dataframe基础上按某一行进行排序, inplace=True表示在原来的dataframe位置替换而不copy生成一个新dataframe:\n", info)
print ("可以看到排序之后,我们的索引值变得无序了,这对于我们取数据是不利的,我们希望能对索引进行重新设定,如下:")
new_info = info.reset_index(drop=True)
print ("重新设定索引后的dataframe(之前的datarame 即info并没有被改变): \n", new_info)
'''
某一列都除以这一列的最大数:
0 0.184
1 0.968
2 0.984
3 1.000
Name: t4, dtype: float64
在原来dataframe基础上按某一行进行排序, inplace=True表示在原来的dataframe位置替换而不copy生成一个新dataframe:
Year t1 t2 t3 t4
3 1989 t14 t24 t34 125
2 1988 t13 t23 t33 123
1 1987 t12 t22 t32 121
0 1986 t11 t21 t31 23
可以看到排序之后,我们的索引值变得无序了,这对于我们取数据是不利的,我们希望能对索引进行重新设定,如下:
重新设定索引后的dataframe(之前的datarame 即info并没有被改变):
Year t1 t2 t3 t4
0 1989 t14 t24 t34 125
1 1988 t13 t23 t33 123
2 1987 t12 t22 t32 121
3 1986 t11 t21 t31 23
'''
泰坦尼克号乘客样本分析案例:
'''
titanic_train 数据示例:
----------------------
PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
----------------------
'''
import pandas as pd
import numpy as np
titanic_survival = pandas.read_csv('titanic_train.csv')
#定位数表中的元素
print ("第83个样本的Age: ", titanic_survival.loc[83, "Age"])
#缺失值(NA)的处理
age = titanic_survival["Age"]
age_is_null = pd.isnull(age)
age_null_true = age[age_is_null]
age_null_true_count = len(age_null_true)
print ("Age缺失值的个数: ", age_null_true_count)
#使用普通方法计算平均值
good_ages = titanic_survival["Age"][age_is_null==False]
mean_age = sum(good_ages) / len(good_ages)
print ("年龄的平均值(普通方法计算): ", mean_age)
#使用pandas的函数计算含有缺失值的列的平均数
print ("年龄的平均值(dataframe.mean方法计算): ", titanic_survival["Age"].mean())
# 计算1,2,3等仓船票的平均价格
passenger_classes = [1,2,3]
fares_by_class = {}
for this_class in passenger_classes:
pclass_rows = titanic_survival[titanic_survival["Pclass"]==this_class]
pclass_fares = pclass_rows["Fare"]
fare_for_class = pclass_fares.mean()
fares_by_class[this_class] = fare_for_class
print ("各等船舱船票的平均值为: ", fares_by_class)
# 分析船舱等级和是否获救之间的关系(类似于Excel的透视表分析功能)
# pivot_table(index表示基准列,values表示基准列的分析列,aggfunc表示采用的数据透视算法)
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
print ("各等级船舱的人获救的几率:\n", passenger_survival)
print ("由数据透视结果可知,船舱的等级越高,被获救的计几率也就越高!")
# 再分析船舱等级和年龄之间的关系
passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age", aggfunc=np.mean)
print ("各等级船舱的人的平均年龄:\n", passenger_age)
print ("由数据透视结果可知,船舱的等级越高,该船舱人的平均年龄就越大!")
# 再来分析不同登船地点的 售票总额以及被获救的人数
port_start = titanic_survival.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc=np.sum)
print ("不同登船地点的 售票总额以及被获救的人数如下:\n", port_start)
# 将匹配到的 axis=? 或 axis='columns' 对应的列的缺失值丢弃掉
# subset指定在哪些列中没有数据时才进行删除
# how有两个取值 'any'|'all',表示一行数据在何种方式下被舍弃(至少一个NA还是全部NA)
# thresh=3 表示一行中至少有3个NA时才将其丢弃
drop_na_columns = titanic_survival.dropna(axis=0, how='any', thresh=None)
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age", "Sex"])
print (id(new_titanic_survival), id(titanic_survival))
'''
第83个样本的Age: 28.0
Age缺失值的个数: 177
年龄的平均值(普通方法计算): 29.6991176471
年龄的平均值(dataframe.mean方法计算): 29.69911764705882
各等船舱船票的平均值为: {1: 84.15468749999992, 2: 20.66218315217391, 3: 13.675550101832997}
各等级船舱的人获救的几率:
Survived
Pclass
1 0.629630
2 0.472826
3 0.242363
由数据透视结果可知,船舱的等级越高,被获救的计几率也就越高!
各等级船舱的人的平均年龄:
Age
Pclass
1 38.233441
2 29.877630
3 25.140620
由数据透视结果可知,船舱的等级越高,该船舱人的平均年龄就越大!
不同登船地点的 售票总额以及被获救的人数如下:
Fare Survived
Embarked
C 10072.2962 93
Q 1022.2543 30
S 17439.3988 217
4560635328 4464760368
'''
自定义函数:
# 自定义一个函数
def hundredth_row(column):
return column.loc[99]
def not_null_count(column):
column_null_true = column[pd.isnull(column)]
return len(column_null_true)
def which_class(row):
pclass = row['Pclass']
if pd.isnull(pclass):
return 'Unknown'
elif pclass==1:
return 'First Class'
elif pclass==2:
return 'Second Class'
elif pclass==3:
return 'Third Class'
def is_minor(row):
if row["Age"] < 18:
return True
else:
return False
def generate_age_label(row):
age = row["Age"]
if pd.isnull(age):
return "Unknown"
elif age < 18:
return "Minor"
else:
return "Adult"
print ("使用自定义函数展示数据(默认axis为0表示列):\n", titanic_survival.apply(hundredth_row) )
print ("使用自定义函数计算得到的每一列的缺失值个数:\n", titanic_survival.apply(not_null_count))
print ("使用自定义函数变换一个cell的值(这里axis为1表示行):\n", titanic_survival.apply(which_class, axis=1).head(3))
print ("使用自定义函数判断一个样本是否是小孩:\n", titanic_survival.apply(is_minor, axis=1).head(3))
print ("使用自定义函数判断一个样本的年龄段:\n", titanic_survival.apply(generate_age_label, axis=1).head(3))
# 给titanic_survial这个dataframe加一个列age_labels,并透视分析该列与其他列之间的关系
age_labels = titanic_survival.apply(generate_age_label, axis=1)
titanic_survival['age_labels'] = age_labels
print ("新增age_labels列之后的titanic_survival:\n", titanic_survival.head(3))
age_group_survival = titanic_survival.pivot_table(index='age_labels', values="Survived", aggfunc=np.mean)
print ("不同年龄段人群获救的几率如下:\n", age_group_survival)
print ("有分析结果可知,小孩的获救几率比成年人的获救几率要大很多!")
'''
使用自定义函数展示数据(默认axis为0表示列):
PassengerId 100
Survived 0
Pclass 2
Name Kantor, Mr. Sinai
Sex male
Age 34
SibSp 1
Parch 0
Ticket 244367
Fare 26
Cabin NaN
Embarked S
age_labels Adult
dtype: object
使用自定义函数计算得到的每一列的缺失值个数:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
age_labels 0
dtype: int64
使用自定义函数变换一个cell的值(这里axis为1表示行):
0 Third Class
1 First Class
2 Third Class
dtype: object
使用自定义函数判断一个样本是否是小孩:
0 False
1 False
2 False
dtype: bool
使用自定义函数判断一个样本的年龄段:
0 Adult
1 Adult
2 Adult
dtype: object
新增age_labels列之后的titanic_survival:
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
Parch Ticket Fare Cabin Embarked age_labels
0 0 A/5 21171 7.2500 NaN S Adult
1 0 PC 17599 71.2833 C85 C Adult
2 0 STON/O2. 3101282 7.9250 NaN S Adult
不同年龄段人群获救的几率如下:
Survived
age_labels
Adult 0.381032
Minor 0.539823
Unknown 0.293785
有分析结果可知,小孩的获救几率比成年人的获救几率要大很多!
'''
pandas Series 的使用:
'''
fandango_score_comparison.csv数据示例:
------------------------------------
FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,Metacritic_norm,Metacritic_user_nom,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5,4.5,3.7,4.3,3.3,3.55,3.9,3.5,4.5,3.5,3.5,4,1330,271107,14846,0.5
Cinderella (2015),85,80,67,7.5,7.1,5,4.5,4.25,4,3.35,3.75,3.55,4.5,4,3.5,4,3.5,249,65709,12640,0.5
Ant-Man (2015),80,90,64,8.1,7.8,5,4.5,4,4.5,3.2,4.05,3.9,4,4.5,3,4,4,627,103660,12055,0.5
Do You Believe? (2015),18,84,22,4.7,5.4,5,4.5,0.9,4.2,1.1,2.35,2.7,1,4,1,2.5,2.5,31,3136,1793,0.5
------------------------------------
'''
import pandas as pd
import numpy as np
from pandas import Series
# 读取数据
fandango = pd.read_csv("fandango_score_comparison.csv")
series_film = fandango['FILM']
series_rt = fandango['RottenTomatoes']
print ("Series类型的数据,底层使用numpy的ndarray,两者之间可以混用: ", type(series_film))
print ("Series.values的类型就是ndarray: ", type(series_film.values))
print ("第[0,5)行series_film数据: \n", series_film[0:5])
print ("第[0,5)行series_rt数据: \n", series_rt[0:5])
# 自定义一个Series
film_names = series_film.values
rt_scores = series_rt.values
series_custom = Series(rt_scores, index=film_names)
print ("自定义的series_custom示例:\n", series_custom.head(5))
print ("取对应电影索引的数据: \n", series_custom[['Ant-Man (2015)', 'Do You Believe? (2015)']] )
print ("其实在padas中我们既可以使用自定义的索引,也可以继续使用之前的数字索引来取数据:\n", series_custom[[2,3]] )
print ("series_custom[[2,3]] 相当于 series_custom[2:4]:\n", series_custom[2:4])
print ("取出自定义Series的索引列:\n", series_custom.index)
# 按索引重新排序Series
original_index = series_custom.index.tolist()
sorted_index = sorted(original_index)
sorted_by_index = series_custom.reindex(sorted_index)
print ("重新按索引排序的Series(自定义方法): \n", sorted_by_index.head(5))
# 直接使用 pandas api 对Series排序
sc2 = series_custom.sort_index()
sc3 = series_custom.sort_values(ascending=False)
print ("重新按索引排序的Series(使用pandas api):\n", sc2[0:5])
print ("重新按值排序的Series:(使用pandas api)\n", sc3[0:5])
# pandas Series 和 numpy ndarray 混用
print ("两个Series对应元素值相加, 即 series_custom+series_custom: \n", np.add(series_custom, series_custom)[0:5] )
print ("对Series的值列进行函数计算:\n", np.sin(series_custom)[0:5] )
print ("取出满足一定条件的电影序列: \n", series_custom[(series_custom>80) | (series_custom<30)].head(5) )
'''
Series类型的数据,底层使用numpy的ndarray,两者之间可以混用: <class 'pandas.core.series.Series'>
Series.values的类型就是ndarray: <class 'numpy.ndarray'>
第[0,5)行series_film数据:
0 Avengers: Age of Ultron (2015)
1 Cinderella (2015)
2 Ant-Man (2015)
3 Do You Believe? (2015)
4 Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object
第[0,5)行series_rt数据:
0 74
1 85
2 80
3 18
4 14
Name: RottenTomatoes, dtype: int64
自定义的series_custom示例:
Avengers: Age of Ultron (2015) 74
Cinderella (2015) 85
Ant-Man (2015) 80
Do You Believe? (2015) 18
Hot Tub Time Machine 2 (2015) 14
dtype: int64
取对应电影索引的数据:
Ant-Man (2015) 80
Do You Believe? (2015) 18
dtype: int64
其实在padas中我们既可以使用自定义的索引,也可以继续使用之前的数字索引来取数据:
Ant-Man (2015) 80
Do You Believe? (2015) 18
dtype: int64
series_custom[[2,3]] 相当于 series_custom[2:4]:
Ant-Man (2015) 80
Do You Believe? (2015) 18
dtype: int64
取出自定义Series的索引列:
Index(['Avengers: Age of Ultron (2015)', 'Cinderella (2015)', 'Ant-Man (2015)',
'Do You Believe? (2015)', 'Hot Tub Time Machine 2 (2015)',
'The Water Diviner (2015)', 'Irrational Man (2015)', 'Top Five (2014)',
'Shaun the Sheep Movie (2015)', 'Love & Mercy (2015)',
...
'The Woman In Black 2 Angel of Death (2015)', 'Danny Collins (2015)',
'Spare Parts (2015)', 'Serena (2015)', 'Inside Out (2015)',
'Mr. Holmes (2015)', ''71 (2015)', 'Two Days, One Night (2014)',
'Gett: The Trial of Viviane Amsalem (2015)',
'Kumiko, The Treasure Hunter (2015)'],
dtype='object', length=146)
重新按索引排序的Series(自定义方法):
'71 (2015) 97
5 Flights Up (2015) 52
A Little Chaos (2015) 40
A Most Violent Year (2014) 90
About Elly (2015) 97
dtype: int64
重新按索引排序的Series(使用pandas api):
'71 (2015) 97
5 Flights Up (2015) 52
A Little Chaos (2015) 40
A Most Violent Year (2014) 90
About Elly (2015) 97
dtype: int64
重新按值排序的Series:(使用pandas api)
Gett: The Trial of Viviane Amsalem (2015) 100
Seymour: An Introduction (2015) 100
Selma (2014) 99
Phoenix (2015) 99
Song of the Sea (2014) 99
dtype: int64
两个Series对应元素值相加, 即 series_custom+series_custom:
Avengers: Age of Ultron (2015) 148
Cinderella (2015) 170
Ant-Man (2015) 160
Do You Believe? (2015) 36
Hot Tub Time Machine 2 (2015) 28
dtype: int64
对Series的值列进行函数计算:
Avengers: Age of Ultron (2015) -0.985146
Cinderella (2015) -0.176076
Ant-Man (2015) -0.993889
Do You Believe? (2015) -0.750987
Hot Tub Time Machine 2 (2015) 0.990607
dtype: float64
取出满足一定条件的电影序列:
Cinderella (2015) 85
Do You Believe? (2015) 18
Hot Tub Time Machine 2 (2015) 14
Top Five (2014) 86
Shaun the Sheep Movie (2015) 99
dtype: int64
'''