
八、Pandas的基本使用
发布日期:2021-05-08 02:34:50
浏览次数:22
分类:原创文章
本文共 25459 字,大约阅读时间需要 84 分钟。
点击标题即可获取文章源代码和笔记
4.1.0 概要
Pandas 基础处理 Pandas是什么?为什么用? 核心数据结构 DataFrame Panel Series 基本操作 运算 画图 文件的读取与存储 高级处理4.1Pandas介绍 4.1.1 Pandas介绍 - 数据处理工具 panel + data + analysis panel面板数据 - 计量经济学 三维数据 4.1.2 为什么使用Pandas 便捷的数据处理能力 读取文件方便 封装了Matplotlib、Numpy的画图和计算 4.1.3 DataFrame 结构:既有行索引,又有列索引的二维数组 属性: shape index columns values T 方法: head() tail() 3 DataFrame索引的设置 1)修改行列索引值 2)重设索引 3)设置新索引 2 Panel DataFrame的容器 3 Series 带索引的一维数组 属性 index values 总结: DataFrame是Series的容器 Panel是DataFrame的容器4.2 基本数据操作 4.2.1 索引操作 1)直接索引 先列后行 2)按名字索引 loc 3)按数字索引 iloc 4)组合索引 数字、名字 4.2.3 排序 对内容排序 dataframe series 对索引排序 dataframe series4.3 DataFrame运算 算术运算 逻辑运算 逻辑运算符 布尔索引 逻辑运算函数 query() isin() 统计运算 min max mean median var std np.argmax() np.argmin() 自定义运算 apply(func, axis=0)True func:自定义函数4.4 Pandas画图 sr.plot()4.5 文件读取与存储 4.5.1 CSV pd.read_csv(path) usecols= names= dataframe.to_csv(path) columns=[] index=False header=False 4.5.2 HDF5 hdf5 存储 3维数据的文件 key1 dataframe1二维数据 key2 dataframe2二维数据 pd.read_hdf(path, key=) df.to_hdf(path, key=) 4.5.3 JSON pd.read_json(path) orient="records" lines=True df.to_json(patn) orient="records" lines=True
4.1.3 DataFrame
import numpy as np# 创建一个符合正态分布的10个股票5天的涨跌幅数据stock_change = np.random.normal(0,1,(10,5))
stock_change
array([[ 0.77072465, 1.30408183, -0.44043464, 0.8900768 , -0.80947118], [ 0.92407994, 0.01646795, -1.26614793, 1.52393669, -0.85373051], [-1.68378051, 0.4302981 , 0.8069393 , 0.60557427, -0.03960376], [ 0.75708007, -0.39899325, 0.23027082, -0.89585658, -1.86590247], [-0.41516245, -1.31841546, 0.16256478, -0.67449097, -1.26234013], [-0.27687242, -0.74154521, -0.03755446, 1.24182603, -0.79444361], [-0.2549323 , -0.41034663, -1.85076521, -1.28663451, -0.28566877], [ 1.22453612, -1.60200055, -1.83171522, -0.85322799, -1.70950421], [ 2.00461483, 1.49338564, 0.33928513, -0.1776084 , -0.39698965], [ 0.2184662 , -0.03868143, -0.21432675, 0.00604093, 1.35011139]])
import pandas as pd pd.DataFrame(stock_change)
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 0.770725 | 1.304082 | -0.440435 | 0.890077 | -0.809471 |
1 | 0.924080 | 0.016468 | -1.266148 | 1.523937 | -0.853731 |
2 | -1.683781 | 0.430298 | 0.806939 | 0.605574 | -0.039604 |
3 | 0.757080 | -0.398993 | 0.230271 | -0.895857 | -1.865902 |
4 | -0.415162 | -1.318415 | 0.162565 | -0.674491 | -1.262340 |
5 | -0.276872 | -0.741545 | -0.037554 | 1.241826 | -0.794444 |
6 | -0.254932 | -0.410347 | -1.850765 | -1.286635 | -0.285669 |
7 | 1.224536 | -1.602001 | -1.831715 | -0.853228 | -1.709504 |
8 | 2.004615 | 1.493386 | 0.339285 | -0.177608 | -0.396990 |
9 | 0.218466 | -0.038681 | -0.214327 | 0.006041 | 1.350111 |
# 构造行索引序列stock_code = ['股票' + str(i) for i in range(stock_change.shape[0])]stock_code
['股票0', '股票1', '股票2', '股票3', '股票4', '股票5', '股票6', '股票7', '股票8', '股票9']
# 添加行索引data = pd.DataFrame(stock_change,index=stock_code)data
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
股票0 | 0.770725 | 1.304082 | -0.440435 | 0.890077 | -0.809471 |
股票1 | 0.924080 | 0.016468 | -1.266148 | 1.523937 | -0.853731 |
股票2 | -1.683781 | 0.430298 | 0.806939 | 0.605574 | -0.039604 |
股票3 | 0.757080 | -0.398993 | 0.230271 | -0.895857 | -1.865902 |
股票4 | -0.415162 | -1.318415 | 0.162565 | -0.674491 | -1.262340 |
股票5 | -0.276872 | -0.741545 | -0.037554 | 1.241826 | -0.794444 |
股票6 | -0.254932 | -0.410347 | -1.850765 | -1.286635 | -0.285669 |
股票7 | 1.224536 | -1.602001 | -1.831715 | -0.853228 | -1.709504 |
股票8 | 2.004615 | 1.493386 | 0.339285 | -0.177608 | -0.396990 |
股票9 | 0.218466 | -0.038681 | -0.214327 | 0.006041 | 1.350111 |
# 添加列索引date = pd.date_range(start="20200618",periods=5,freq="B") # start 开始时间, periods 间隔时间,freq 按照什么间隔 d w 5hdate
DatetimeIndex(['2020-06-18', '2020-06-19', '2020-06-22', '2020-06-23', '2020-06-24'], dtype='datetime64[ns]', freq='B')
# 添加列索引data = pd.DataFrame(stock_change,index=stock_code,columns=date) data
2020-06-18 | 2020-06-19 | 2020-06-22 | 2020-06-23 | 2020-06-24 | |
---|---|---|---|---|---|
股票0 | 0.770725 | 1.304082 | -0.440435 | 0.890077 | -0.809471 |
股票1 | 0.924080 | 0.016468 | -1.266148 | 1.523937 | -0.853731 |
股票2 | -1.683781 | 0.430298 | 0.806939 | 0.605574 | -0.039604 |
股票3 | 0.757080 | -0.398993 | 0.230271 | -0.895857 | -1.865902 |
股票4 | -0.415162 | -1.318415 | 0.162565 | -0.674491 | -1.262340 |
股票5 | -0.276872 | -0.741545 | -0.037554 | 1.241826 | -0.794444 |
股票6 | -0.254932 | -0.410347 | -1.850765 | -1.286635 | -0.285669 |
股票7 | 1.224536 | -1.602001 | -1.831715 | -0.853228 | -1.709504 |
股票8 | 2.004615 | 1.493386 | 0.339285 | -0.177608 | -0.396990 |
股票9 | 0.218466 | -0.038681 | -0.214327 | 0.006041 | 1.350111 |
DataFrame属性
data.shape
(10, 5)
data.index
Index(['股票0', '股票1', '股票2', '股票3', '股票4', '股票5', '股票6', '股票7', '股票8', '股票9'], dtype='object')
data.columns
DatetimeIndex(['2020-06-18', '2020-06-19', '2020-06-22', '2020-06-23', '2020-06-24'], dtype='datetime64[ns]', freq='B')
data.values
array([[ 0.77072465, 1.30408183, -0.44043464, 0.8900768 , -0.80947118], [ 0.92407994, 0.01646795, -1.26614793, 1.52393669, -0.85373051], [-1.68378051, 0.4302981 , 0.8069393 , 0.60557427, -0.03960376], [ 0.75708007, -0.39899325, 0.23027082, -0.89585658, -1.86590247], [-0.41516245, -1.31841546, 0.16256478, -0.67449097, -1.26234013], [-0.27687242, -0.74154521, -0.03755446, 1.24182603, -0.79444361], [-0.2549323 , -0.41034663, -1.85076521, -1.28663451, -0.28566877], [ 1.22453612, -1.60200055, -1.83171522, -0.85322799, -1.70950421], [ 2.00461483, 1.49338564, 0.33928513, -0.1776084 , -0.39698965], [ 0.2184662 , -0.03868143, -0.21432675, 0.00604093, 1.35011139]])
data.T
股票0 | 股票1 | 股票2 | 股票3 | 股票4 | 股票5 | 股票6 | 股票7 | 股票8 | 股票9 | |
---|---|---|---|---|---|---|---|---|---|---|
2020-06-18 | 0.770725 | 0.924080 | -1.683781 | 0.757080 | -0.415162 | -0.276872 | -0.254932 | 1.224536 | 2.004615 | 0.218466 |
2020-06-19 | 1.304082 | 0.016468 | 0.430298 | -0.398993 | -1.318415 | -0.741545 | -0.410347 | -1.602001 | 1.493386 | -0.038681 |
2020-06-22 | -0.440435 | -1.266148 | 0.806939 | 0.230271 | 0.162565 | -0.037554 | -1.850765 | -1.831715 | 0.339285 | -0.214327 |
2020-06-23 | 0.890077 | 1.523937 | 0.605574 | -0.895857 | -0.674491 | 1.241826 | -1.286635 | -0.853228 | -0.177608 | 0.006041 |
2020-06-24 | -0.809471 | -0.853731 | -0.039604 | -1.865902 | -1.262340 | -0.794444 | -0.285669 | -1.709504 | -0.396990 | 1.350111 |
DataFrame方法
data.head() # 返回前5行数据
2020-06-18 | 2020-06-19 | 2020-06-22 | 2020-06-23 | 2020-06-24 | |
---|---|---|---|---|---|
股票0 | 0.770725 | 1.304082 | -0.440435 | 0.890077 | -0.809471 |
股票1 | 0.924080 | 0.016468 | -1.266148 | 1.523937 | -0.853731 |
股票2 | -1.683781 | 0.430298 | 0.806939 | 0.605574 | -0.039604 |
股票3 | 0.757080 | -0.398993 | 0.230271 | -0.895857 | -1.865902 |
股票4 | -0.415162 | -1.318415 | 0.162565 | -0.674491 | -1.262340 |
data.tail() # 返回后5行数据
2020-06-18 | 2020-06-19 | 2020-06-22 | 2020-06-23 | 2020-06-24 | |
---|---|---|---|---|---|
股票5 | -0.276872 | -0.741545 | -0.037554 | 1.241826 | -0.794444 |
股票6 | -0.254932 | -0.410347 | -1.850765 | -1.286635 | -0.285669 |
股票7 | 1.224536 | -1.602001 | -1.831715 | -0.853228 | -1.709504 |
股票8 | 2.004615 | 1.493386 | 0.339285 | -0.177608 | -0.396990 |
股票9 | 0.218466 | -0.038681 | -0.214327 | 0.006041 | 1.350111 |
3 DataFrame索引的设置
- 修改行列索引值
data.index[2]
'股票2'
data.index[2] = "股票88"# 注意:单独修改每一列的索引是不行的,在DataFrame中,只能对索引进行整体的修改
---------------------------------------------------------------------------TypeError Traceback (most recent call last)<ipython-input-19-9e95917cc4d9> in <module>----> 1 data.index[2] = "股票88"D:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value) 3908 3909 def __setitem__(self, key, value):-> 3910 raise TypeError("Index does not support mutable operations") 3911 3912 def __getitem__(self, key):TypeError: Index does not support mutable operations
stock_ = ["股票_{}".format(i) for i in range(10)]
data.index = stock_
data.index
Index(['股票_0', '股票_1', '股票_2', '股票_3', '股票_4', '股票_5', '股票_6', '股票_7', '股票_8', '股票_9'], dtype='object')
重设索引
- reset_index(drop=False)
- 设置新的下标索引
- drop:默认为False,不删除原来索引,如果为True,删除原来的索引值
# 重置索引,drop=Falsedata.reset_index()
index | 2020-06-18 00:00:00 | 2020-06-19 00:00:00 | 2020-06-22 00:00:00 | 2020-06-23 00:00:00 | 2020-06-24 00:00:00 | |
---|---|---|---|---|---|---|
0 | 股票_0 | 0.770725 | 1.304082 | -0.440435 | 0.890077 | -0.809471 |
1 | 股票_1 | 0.924080 | 0.016468 | -1.266148 | 1.523937 | -0.853731 |
2 | 股票_2 | -1.683781 | 0.430298 | 0.806939 | 0.605574 | -0.039604 |
3 | 股票_3 | 0.757080 | -0.398993 | 0.230271 | -0.895857 | -1.865902 |
4 | 股票_4 | -0.415162 | -1.318415 | 0.162565 | -0.674491 | -1.262340 |
5 | 股票_5 | -0.276872 | -0.741545 | -0.037554 | 1.241826 | -0.794444 |
6 | 股票_6 | -0.254932 | -0.410347 | -1.850765 | -1.286635 | -0.285669 |
7 | 股票_7 | 1.224536 | -1.602001 | -1.831715 | -0.853228 | -1.709504 |
8 | 股票_8 | 2.004615 | 1.493386 | 0.339285 | -0.177608 | -0.396990 |
9 | 股票_9 | 0.218466 | -0.038681 | -0.214327 | 0.006041 | 1.350111 |
# 重置索引,drop=Truedata.reset_index(drop=True)
2020-06-18 | 2020-06-19 | 2020-06-22 | 2020-06-23 | 2020-06-24 | |
---|---|---|---|---|---|
0 | 0.770725 | 1.304082 | -0.440435 | 0.890077 | -0.809471 |
1 | 0.924080 | 0.016468 | -1.266148 | 1.523937 | -0.853731 |
2 | -1.683781 | 0.430298 | 0.806939 | 0.605574 | -0.039604 |
3 | 0.757080 | -0.398993 | 0.230271 | -0.895857 | -1.865902 |
4 | -0.415162 | -1.318415 | 0.162565 | -0.674491 | -1.262340 |
5 | -0.276872 | -0.741545 | -0.037554 | 1.241826 | -0.794444 |
6 | -0.254932 | -0.410347 | -1.850765 | -1.286635 | -0.285669 |
7 | 1.224536 | -1.602001 | -1.831715 | -0.853228 | -1.709504 |
8 | 2.004615 | 1.493386 | 0.339285 | -0.177608 | -0.396990 |
9 | 0.218466 | -0.038681 | -0.214327 | 0.006041 | 1.350111 |
以某列值设置为新的索引
- set_index(keys,drop=True)
- keys:列索引名或者列索引名称的列表
- drop:boolean,default True 当作新的索引,删除原来的索引列
设置新索引案例
- 1.创建
df = pd.DataFrame({ 'month':[1,4,7,10], 'year':[2012,2014,2013,2014], 'sale':[55,40,84,31]})df
month | year | sale | |
---|---|---|---|
0 | 1 | 2012 | 55 |
1 | 4 | 2014 | 40 |
2 | 7 | 2013 | 84 |
3 | 10 | 2014 | 31 |
- 2、以月份设置新的索引
df.set_index('month')
year | sale | |
---|---|---|
month | ||
1 | 2012 | 55 |
4 | 2014 | 40 |
7 | 2013 | 84 |
10 | 2014 | 31 |
-
- 设置多个索引,以年和月份
new_df = df.set_index(['year','month'])
new_df
sale | ||
---|---|---|
year | month | |
2012 | 1 | 55 |
2014 | 4 | 40 |
2013 | 7 | 84 |
2014 | 10 | 31 |
new_df.index
MultiIndex([(2012, 1), (2014, 4), (2013, 7), (2014, 10)], names=['year', 'month'])
4.1.4 MultiIndex 与 Panel的关系
1 Multilndex多级或分层索引对象。
- index属性
names: levels的名称
levels:每个level的元组值
new_df.index.names
FrozenList(['year', 'month'])
new_df.index.levels
FrozenList([[2012, 2013, 2014], [1, 4, 7, 10]])
2 Panel
p = pd.Panel()p# 新版本已移除该函数
D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version """Entry point for launching an IPython kernel.<pandas.__getattr__.<locals>.Panel at 0x203fd31ea08>
data
2020-06-18 | 2020-06-19 | 2020-06-22 | 2020-06-23 | 2020-06-24 | |
---|---|---|---|---|---|
股票_0 | 0.770725 | 1.304082 | -0.440435 | 0.890077 | -0.809471 |
股票_1 | 0.924080 | 0.016468 | -1.266148 | 1.523937 | -0.853731 |
股票_2 | -1.683781 | 0.430298 | 0.806939 | 0.605574 | -0.039604 |
股票_3 | 0.757080 | -0.398993 | 0.230271 | -0.895857 | -1.865902 |
股票_4 | -0.415162 | -1.318415 | 0.162565 | -0.674491 | -1.262340 |
股票_5 | -0.276872 | -0.741545 | -0.037554 | 1.241826 | -0.794444 |
股票_6 | -0.254932 | -0.410347 | -1.850765 | -1.286635 | -0.285669 |
股票_7 | 1.224536 | -1.602001 | -1.831715 | -0.853228 | -1.709504 |
股票_8 | 2.004615 | 1.493386 | 0.339285 | -0.177608 | -0.396990 |
股票_9 | 0.218466 | -0.038681 | -0.214327 | 0.006041 | 1.350111 |
Series
data.iloc[1,:] # 带索引的一维数组
2020-06-18 0.9240802020-06-19 0.0164682020-06-22 -1.2661482020-06-23 1.5239372020-06-24 -0.853731Freq: B, Name: 股票_1, dtype: float64
type(data.iloc[1,:])
pandas.core.series.Series
属性
data.iloc[1,:].index
DatetimeIndex(['2020-06-18', '2020-06-19', '2020-06-22', '2020-06-23', '2020-06-24'], dtype='datetime64[ns]', freq='B')
data.iloc[1,:].values
array([ 0.92407994, 0.01646795, -1.26614793, 1.52393669, -0.85373051])
1. 创建Series
通过已有数据创建
- 指定内容,默认索引
pd.Series(np.arange(10))
0 01 12 23 34 45 56 67 78 89 9dtype: int32
- 指定索引
pd.Series([6.7,5.6,3,10,2],index=[1,2,3,4,5])
1 6.72 5.63 3.04 10.05 2.0dtype: float64
- 通过字典数据创建
pd.Series({ 'red':100, 'blue':200, 'green':500, 'yellow':1000})
red 100blue 200green 500yellow 1000dtype: int64
总结
- DataFrame 是 Series的容器
- Panel 是 DataFrame的容器
4.2 基本数据操作
datas = pd.read_excel("./datas/szfj_baoan.xls")
datas
district | roomnum | hall | AREA | C_floor | floor_num | school | subway | per_price | |
---|---|---|---|---|---|---|---|---|---|
0 | baoan | 3 | 2 | 89.3 | middle | 31 | 0 | 0 | 7.0773 |
1 | baoan | 4 | 2 | 127.0 | high | 31 | 0 | 0 | 6.9291 |
2 | baoan | 1 | 1 | 28.0 | low | 39 | 0 | 0 | 3.9286 |
3 | baoan | 1 | 1 | 28.0 | middle | 30 | 0 | 0 | 3.3568 |
4 | baoan | 2 | 2 | 78.0 | middle | 8 | 1 | 1 | 5.0769 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1246 | baoan | 4 | 2 | 89.3 | low | 8 | 0 | 0 | 4.2553 |
1247 | baoan | 2 | 1 | 67.0 | middle | 30 | 0 | 0 | 3.8060 |
1248 | baoan | 2 | 2 | 67.4 | middle | 29 | 1 | 0 | 5.3412 |
1249 | baoan | 2 | 2 | 73.1 | low | 15 | 1 | 0 | 5.9508 |
1250 | baoan | 3 | 2 | 86.2 | middle | 32 | 0 | 1 | 4.5244 |
1251 rows × 9 columns
datas.columns
Index(['district', 'roomnum', 'hall', 'AREA', 'C_floor', 'floor_num', 'school', 'subway', 'per_price'], dtype='object')
# 删除列datas = datas.drop(columns=[ 'school','subway',],axis=0)
datas
district | roomnum | hall | AREA | C_floor | floor_num | per_price | |
---|---|---|---|---|---|---|---|
0 | baoan | 3 | 2 | 89.3 | middle | 31 | 7.0773 |
1 | baoan | 4 | 2 | 127.0 | high | 31 | 6.9291 |
2 | baoan | 1 | 1 | 28.0 | low | 39 | 3.9286 |
3 | baoan | 1 | 1 | 28.0 | middle | 30 | 3.3568 |
4 | baoan | 2 | 2 | 78.0 | middle | 8 | 5.0769 |
... | ... | ... | ... | ... | ... | ... | ... |
1246 | baoan | 4 | 2 | 89.3 | low | 8 | 4.2553 |
1247 | baoan | 2 | 1 | 67.0 | middle | 30 | 3.8060 |
1248 | baoan | 2 | 2 | 67.4 | middle | 29 | 5.3412 |
1249 | baoan | 2 | 2 | 73.1 | low | 15 | 5.9508 |
1250 | baoan | 3 | 2 | 86.2 | middle | 32 | 4.5244 |
1251 rows × 7 columns
4.2.1 索引操作
1.直接使用行列索引(先列后行)
datas["per_price"][0]
7.0773
2. 按名字索引(先行后列)
datas.loc[0]["per_price"]
7.0773
datas.loc[0,"per_price"]
7.0773
3.按数字索引
datas.iloc[0,6]
7.0773
# 通过索引值获取行名datas.index[0:4]
RangeIndex(start=0, stop=4, step=1)
datas.loc[datas.index[0:4],["district","roomnum"]]
district | roomnum | |
---|---|---|
0 | baoan | 3 |
1 | baoan | 4 |
2 | baoan | 1 |
3 | baoan | 1 |
# datas.columns.get_indexer() 通过列名获取索引值datas.columns.get_indexer(["district","roomnum"])
array([0, 1], dtype=int64)
datas.iloc[0:4,datas.columns.get_indexer(["district","roomnum"])]
district | roomnum | |
---|---|---|
0 | baoan | 3 |
1 | baoan | 4 |
2 | baoan | 1 |
3 | baoan | 1 |
4.2.2 赋值操作
# 直接修改原来的值datas["hall"] = 5
datas.head()
district | roomnum | hall | AREA | C_floor | floor_num | per_price | |
---|---|---|---|---|---|---|---|
0 | baoan | 3 | 5 | 89.3 | middle | 31 | 7.0773 |
1 | baoan | 4 | 5 | 127.0 | high | 31 | 6.9291 |
2 | baoan | 1 | 5 | 28.0 | low | 39 | 3.9286 |
3 | baoan | 1 | 5 | 28.0 | middle | 30 | 3.3568 |
4 | baoan | 2 | 5 | 78.0 | middle | 8 | 5.0769 |
# 或者datas.hall = 1
datas.head()
district | roomnum | hall | AREA | C_floor | floor_num | per_price | |
---|---|---|---|---|---|---|---|
0 | baoan | 3 | 1 | 89.3 | middle | 31 | 7.0773 |
1 | baoan | 4 | 1 | 127.0 | high | 31 | 6.9291 |
2 | baoan | 1 | 1 | 28.0 | low | 39 | 3.9286 |
3 | baoan | 1 | 1 | 28.0 | middle | 30 | 3.3568 |
4 | baoan | 2 | 1 | 78.0 | middle | 8 | 5.0769 |
datas.iloc[0,0] = "zzzz"
datas.head()
district | roomnum | hall | AREA | C_floor | floor_num | per_price | |
---|---|---|---|---|---|---|---|
0 | zzzz | 3 | 1 | 89.3 | middle | 31 | 7.0773 |
1 | baoan | 4 | 1 | 127.0 | high | 31 | 6.9291 |
2 | baoan | 1 | 1 | 28.0 | low | 39 | 3.9286 |
3 | baoan | 1 | 1 | 28.0 | middle | 30 | 3.3568 |
4 | baoan | 2 | 1 | 78.0 | middle | 8 | 5.0769 |
4.2.3 排序
# 对内容进行排序, ascending=False降序排列 ,默认为True升序排列datas.sort_values(by="per_price",ascending=False)
district | roomnum | hall | AREA | C_floor | floor_num | per_price | |
---|---|---|---|---|---|---|---|
917 | baoan | 4 | 1 | 93.59 | high | 28 | 21.9040 |
356 | baoan | 8 | 1 | 248.99 | low | 7 | 21.2860 |
576 | baoan | 1 | 1 | 21.95 | middle | 22 | 19.3622 |
296 | baoan | 4 | 1 | 93.59 | high | 28 | 19.2328 |
186 | baoan | 3 | 1 | 113.60 | middle | 31 | 16.5493 |
... | ... | ... | ... | ... | ... | ... | ... |
911 | baoan | 2 | 1 | 89.00 | middle | 16 | 1.6854 |
841 | baoan | 2 | 1 | 75.00 | high | 7 | 1.6667 |
1188 | baoan | 3 | 1 | 110.00 | middle | 33 | 1.5909 |
684 | baoan | 3 | 1 | 89.00 | middle | 26 | 1.2247 |
1047 | baoan | 3 | 1 | 98.90 | middle | 26 | 1.1931 |
1251 rows × 7 columns
datas.sort_values(by="per_price")
district | roomnum | hall | AREA | C_floor | floor_num | per_price | |
---|---|---|---|---|---|---|---|
1047 | baoan | 3 | 1 | 98.90 | middle | 26 | 1.1931 |
684 | baoan | 3 | 1 | 89.00 | middle | 26 | 1.2247 |
1188 | baoan | 3 | 1 | 110.00 | middle | 33 | 1.5909 |
841 | baoan | 2 | 1 | 75.00 | high | 7 | 1.6667 |
911 | baoan | 2 | 1 | 89.00 | middle | 16 | 1.6854 |
... | ... | ... | ... | ... | ... | ... | ... |
186 | baoan | 3 | 1 | 113.60 | middle | 31 | 16.5493 |
296 | baoan | 4 | 1 | 93.59 | high | 28 | 19.2328 |
576 | baoan | 1 | 1 | 21.95 | middle | 22 | 19.3622 |
356 | baoan | 8 | 1 | 248.99 | low | 7 | 21.2860 |
917 | baoan | 4 | 1 | 93.59 | high | 28 | 21.9040 |
1251 rows × 7 columns
# 按照多个字段进行排序# 先按照“district”字段的内容进行排序,如果值相同,再按照“per_price”字段的内容进行排序datas.sort_values(by=["district","per_price"])
district | roomnum | hall | AREA | C_floor | floor_num | per_price | |
---|---|---|---|---|---|---|---|
1047 | baoan | 3 | 1 | 98.90 | middle | 26 | 1.1931 |
684 | baoan | 3 | 1 | 89.00 | middle | 26 | 1.2247 |
1188 | baoan | 3 | 1 | 110.00 | middle | 33 | 1.5909 |
841 | baoan | 2 | 1 | 75.00 | high | 7 | 1.6667 |
911 | baoan | 2 | 1 | 89.00 | middle | 16 | 1.6854 |
... | ... | ... | ... | ... | ... | ... | ... |
296 | baoan | 4 | 1 | 93.59 | high | 28 | 19.2328 |
576 | baoan | 1 | 1 | 21.95 | middle | 22 | 19.3622 |
356 | baoan | 8 | 1 | 248.99 | low | 7 | 21.2860 |
917 | baoan | 4 | 1 | 93.59 | high | 28 | 21.9040 |
0 | zzzz | 3 | 1 | 89.30 | middle | 31 | 7.0773 |
1251 rows × 7 columns
# 按照行索引大小进行排序,默认从小到大排序datas.sort_index()
district | roomnum | hall | AREA | C_floor | floor_num | per_price | |
---|---|---|---|---|---|---|---|
0 | zzzz | 3 | 1 | 89.3 | middle | 31 | 7.0773 |
1 | baoan | 4 | 1 | 127.0 | high | 31 | 6.9291 |
2 | baoan | 1 | 1 | 28.0 | low | 39 | 3.9286 |
3 | baoan | 1 | 1 | 28.0 | middle | 30 | 3.3568 |
4 | baoan | 2 | 1 | 78.0 | middle | 8 | 5.0769 |
... | ... | ... | ... | ... | ... | ... | ... |
1246 | baoan | 4 | 1 | 89.3 | low | 8 | 4.2553 |
1247 | baoan | 2 | 1 | 67.0 | middle | 30 | 3.8060 |
1248 | baoan | 2 | 1 | 67.4 | middle | 29 | 5.3412 |
1249 | baoan | 2 | 1 | 73.1 | low | 15 | 5.9508 |
1250 | baoan | 3 | 1 | 86.2 | middle | 32 | 4.5244 |
1251 rows × 7 columns
sr = datas["per_price"]
sr
0 7.07731 6.92912 3.92863 3.35684 5.0769 ... 1246 4.25531247 3.80601248 5.34121249 5.95081250 4.5244Name: per_price, Length: 1251, dtype: float64
# 对Series类型的数据的内容进行排序sr.sort_values()
1047 1.1931684 1.22471188 1.5909841 1.6667911 1.6854 ... 186 16.5493296 19.2328576 19.3622356 21.2860917 21.9040Name: per_price, Length: 1251, dtype: float64
# 对Series类型的数据的索引进行排序sr.sort_index()
0 7.07731 6.92912 3.92863 3.35684 5.0769 ... 1246 4.25531247 3.80601248 5.34121249 5.95081250 4.5244Name: per_price, Length: 1251, dtype: float64
4.3 DataFrame运算
- 算术运算
# 对Series类型进行操作datas["roomnum"] + 3
0 61 72 43 44 5 ..1246 71247 51248 51249 51250 6Name: roomnum, Length: 1251, dtype: int64
datas["roomnum"].add(3).head()
0 61 72 43 44 5Name: roomnum, dtype: int64
datas.iloc[:,1:4]
roomnum | hall | AREA | |
---|---|---|---|
0 | 3 | 1 | 89.3 |
1 | 4 | 1 | 127.0 |
2 | 1 | 1 | 28.0 |
3 | 1 | 1 | 28.0 |
4 | 2 | 1 | 78.0 |
... | ... | ... | ... |
1246 | 4 | 1 | 89.3 |
1247 | 2 | 1 | 67.0 |
1248 | 2 | 1 | 67.4 |
1249 | 2 | 1 | 73.1 |
1250 | 3 | 1 | 86.2 |
1251 rows × 3 columns
# 对DataFrame类型进行操作datas.iloc[:,1:4] + 10
roomnum | hall | AREA | |
---|---|---|---|
0 | 13 | 11 | 99.3 |
1 | 14 | 11 | 137.0 |
2 | 11 | 11 | 38.0 |
3 | 11 | 11 | 38.0 |
4 | 12 | 11 | 88.0 |
... | ... | ... | ... |
1246 | 14 | 11 | 99.3 |
1247 | 12 | 11 | 77.0 |
1248 | 12 | 11 | 77.4 |
1249 | 12 | 11 | 83.1 |
1250 | 13 | 11 | 96.2 |
1251 rows × 3 columns
- 逻辑运算
# 逻辑判断的结果可以作为筛选的依据datas['AREA'] > 100
0 False1 True2 False3 False4 False ... 1246 False1247 False1248 False1249 False1250 FalseName: AREA, Length: 1251, dtype: bool
# 可以进行布尔索引datas[datas['AREA'] > 100]
district | roomnum | hall | AREA | C_floor | floor_num | per_price | |
---|---|---|---|---|---|---|---|
1 | baoan | 4 | 1 | 127.00 | high | 31 | 6.9291 |
5 | baoan | 4 | 1 | 125.17 | middle | 15 | 5.8161 |
16 | baoan | 3 | 1 | 151.00 | high | 20 | 4.9669 |
25 | baoan | 3 | 1 | 116.00 | high | 18 | 5.0000 |
26 | baoan | 5 | 1 | 151.25 | high | 30 | 7.6033 |
... | ... | ... | ... | ... | ... | ... | ... |
1232 | baoan | 5 | 1 | 127.17 | low | 24 | 5.1113 |
1238 | baoan | 4 | 1 | 130.74 | low | 30 | 13.0029 |
1239 | baoan | 3 | 1 | 102.10 | middle | 28 | 10.8717 |
1241 | baoan | 5 | 1 | 151.30 | high | 29 | 7.2703 |
1243 | baoan | 4 | 1 | 142.25 | high | 32 | 6.3269 |
322 rows × 7 columns
# 多个逻辑判断# 筛选面积大于100 并且 放假小于40000的数据(datas["AREA"]>100) & (datas["per_price"]< 40000)
0 False1 True2 False3 False4 False ... 1246 False1247 False1248 False1249 False1250 FalseLength: 1251, dtype: bool
# 布尔索引datas[(datas["AREA"]>100) & (datas["per_price"]< 40000)]
district | roomnum | hall | AREA | C_floor | floor_num | per_price | |
---|---|---|---|---|---|---|---|
1 | baoan | 4 | 1 | 127.00 | high | 31 | 6.9291 |
5 | baoan | 4 | 1 | 125.17 | middle | 15 | 5.8161 |
16 | baoan | 3 | 1 | 151.00 | high | 20 | 4.9669 |
25 | baoan | 3 | 1 | 116.00 | high | 18 | 5.0000 |
26 | baoan | 5 | 1 | 151.25 | high | 30 | 7.6033 |
... | ... | ... | ... | ... | ... | ... | ... |
1232 | baoan | 5 | 1 | 127.17 | low | 24 | 5.1113 |
1238 | baoan | 4 | 1 | 130.74 | low | 30 | 13.0029 |
1239 | baoan | 3 | 1 | 102.10 | middle | 28 | 10.8717 |
1241 | baoan | 5 | 1 | 151.30 | high | 29 | 7.2703 |
1243 | baoan | 4 | 1 | 142.25 | high | 32 | 6.3269 |
322 rows × 7 columns
逻辑运算函数
# 条件查询函数datas.query("AREA>100 & per_price<40000")
district | roomnum | hall | AREA | C_floor | floor_num | per_price | |
---|---|---|---|---|---|---|---|
1 | baoan | 4 | 1 | 127.00 | high | 31 | 6.9291 |
5 | baoan | 4 | 1 | 125.17 | middle | 15 | 5.8161 |
16 | baoan | 3 | 1 | 151.00 | high | 20 | 4.9669 |
25 | baoan | 3 | 1 | 116.00 | high | 18 | 5.0000 |
26 | baoan | 5 | 1 | 151.25 | high | 30 | 7.6033 |
... | ... | ... | ... | ... | ... | ... | ... |
1232 | baoan | 5 | 1 | 127.17 | low | 24 | 5.1113 |
1238 | baoan | 4 | 1 | 130.74 | low | 30 | 13.0029 |
1239 | baoan | 3 | 1 | 102.10 | middle | 28 | 10.8717 |
1241 | baoan | 5 | 1 | 151.30 | high | 29 | 7.2703 |
1243 | baoan | 4 | 1 | 142.25 | high | 32 | 6.3269 |
322 rows × 7 columns
datas["roomnum"].isin([4,5])
0 False1 True2 False3 False4 False ... 1246 True1247 False1248 False1249 False1250 FalseName: roomnum, Length: 1251, dtype: bool
# 可以指定值进行判断,从而进行筛选操作# 筛选出房间数量为4或者5的数据datas[datas["roomnum"].isin([4,5])]
district | roomnum | hall | AREA | C_floor | floor_num | per_price | |
---|---|---|---|---|---|---|---|
1 | baoan | 4 | 1 | 127.00 | high | 31 | 6.9291 |
5 | baoan | 4 | 1 | 125.17 | middle | 15 | 5.8161 |
26 | baoan | 5 | 1 | 151.25 | high | 30 | 7.6033 |
29 | baoan | 4 | 1 | 143.45 | middle | 25 | 6.9711 |
36 | baoan | 4 | 1 | 134.60 | middle | 32 | 9.1828 |
... | ... | ... | ... | ... | ... | ... | ... |
1232 | baoan | 5 | 1 | 127.17 | low | 24 | 5.1113 |
1238 | baoan | 4 | 1 | 130.74 | low | 30 | 13.0029 |
1241 | baoan | 5 | 1 | 151.30 | high | 29 | 7.2703 |
1243 | baoan | 4 | 1 | 142.25 | high | 32 | 6.3269 |
1246 | baoan | 4 | 1 | 89.30 | low | 8 | 4.2553 |
224 rows × 7 columns
- 统计运算
# 计算每一列的总数,均值,标准差,最小值,分位数,最大值等datas.describe()
roomnum | hall | AREA | floor_num | per_price | |
---|---|---|---|---|---|
count | 1251.000000 | 1251.0 | 1251.000000 | 1251.000000 | 1251.000000 |
mean | 2.906475 | 1.0 | 92.409976 | 24.598721 | 6.643429 |
std | 0.940663 | 0.0 | 37.798122 | 9.332119 | 2.435132 |
min | 1.000000 | 1.0 | 21.950000 | 1.000000 | 1.193100 |
25% | 2.000000 | 1.0 | 75.000000 | 17.000000 | 5.075850 |
50% | 3.000000 | 1.0 | 87.800000 | 28.000000 | 5.906800 |
75% | 3.000000 | 1.0 | 101.375000 | 31.000000 | 7.761950 |
max | 8.000000 | 1.0 | 352.900000 | 53.000000 | 21.904000 |
统计函数
# axis=0 求每一列的最大值 axis=1求每一行的最大值datas.max(axis=0)
district zzzzroomnum 8hall 1AREA 352.9C_floor middlefloor_num 53per_price 21.904dtype: object
# 方差datas.var(axis=0)
roomnum 0.884846hall 0.000000AREA 1428.698032floor_num 87.088446per_price 5.929870dtype: float64
# 标准差datas.std(axis=0)
roomnum 0.940663hall 0.000000AREA 37.798122floor_num 9.332119per_price 2.435132dtype: float64
datas.iloc[:,3]
0 89.31 127.02 28.03 28.04 78.0 ... 1246 89.31247 67.01248 67.41249 73.11250 86.2Name: AREA, Length: 1251, dtype: float64
# 求最大值所在的下标(索引)datas.iloc[:,3].idxmax(axis=0)
759
datas.iloc[759,3]
352.9
# 求最小值所在的下标(索引)datas.iloc[:,3].idxmin(axis=0)
576
datas.iloc[576,3]
21.95
累计统计函数
datas["per_price"]
0 7.07731 6.92912 3.92863 3.35684 5.0769 ... 1246 4.25531247 3.80601248 5.34121249 5.95081250 4.5244Name: per_price, Length: 1251, dtype: float64
# 累加datas["per_price"].cumsum()
0 7.07731 14.00642 17.93503 21.29184 26.3687 ... 1246 8291.30761247 8295.11361248 8300.45481249 8306.40561250 8310.9300Name: per_price, Length: 1251, dtype: float64
datas["per_price"].sort_index().cumsum().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x2039a3a3dc8>
import matplotlib.pyplot as pltdatas["per_price"].sort_index().cumsum().plot()plt.show()
- 自定义运算
# 自定义一个计算最大值-最小值的函数datas[["per_price"]].apply(lambda x : x.max()-x.min(),axis=0)
per_price 20.7109dtype: float64
4.4 Pandas画图
# 查看面积和房价之间的关系datas.plot(x="AREA",y="per_price",kind="scatter")
<matplotlib.axes._subplots.AxesSubplot at 0x203a343dec8>
# 查看楼层和房价之间的关系datas.plot(x="floor_num",y="per_price",kind="scatter")
<matplotlib.axes._subplots.AxesSubplot at 0x203a3a81bc8>
datas.plot(x="AREA",y="per_price",kind="barh")
<matplotlib.axes._subplots.AxesSubplot at 0x203a2147f08>
4.5 文件的读取与存储
1.读取csv文件 read_csv()
iris_data = pd.read_csv("./datas/iris.data.csv")
iris_data.head()
feature1 | feature2 | feature3 | feature4 | result | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
# usecols:指定读取的列名,列表形式iris_data1 = pd.read_csv("./datas/iris.data.csv",usecols=["feature1","feature2","result"])
iris_data1.head()
feature1 | feature2 | result | |
---|---|---|---|
0 | 5.1 | 3.5 | Iris-setosa |
1 | 4.9 | 3.0 | Iris-setosa |
2 | 4.7 | 3.2 | Iris-setosa |
3 | 4.6 | 3.1 | Iris-setosa |
4 | 5.0 | 3.6 | Iris-setosa |
iris_data2 = pd.read_csv("./datas/iris.data2.csv")iris_data2.head()
5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | |
---|---|---|---|---|---|
0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
# names:如果数据集本身没有列名,可以自己指定列名iris_data2 = pd.read_csv("./datas/iris.data2.csv",names=["feature1","feature2","feature3","feature4","result"])iris_data2.head()
feature1 | feature2 | feature3 | feature4 | result | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
datas.head(5)
district | roomnum | hall | AREA | C_floor | floor_num | per_price | |
---|---|---|---|---|---|---|---|
0 | zzzz | 3 | 1 | 89.3 | middle | 31 | 7.0773 |
1 | baoan | 4 | 1 | 127.0 | high | 31 | 6.9291 |
2 | baoan | 1 | 1 | 28.0 | low | 39 | 3.9286 |
3 | baoan | 1 | 1 | 28.0 | middle | 30 | 3.3568 |
4 | baoan | 2 | 1 | 78.0 | middle | 8 | 5.0769 |
# 保存per_price列的数据# 保存的时候index=False 去掉行索引# mode="a" 追加数据# header=False 不要重复追加列名datas[:-1].to_csv("./price_test",columns=['per_price'],index=False,mode="a",header=False)
# 读取,查看数据perice_test = pd.read_csv("./price_test")
perice_test
per_price | |
---|---|
0 | 7.0773 |
1 | 6.9291 |
2 | 3.9286 |
3 | 3.3568 |
4 | 5.0769 |
... | ... |
3746 | 6.1932 |
3747 | 4.2553 |
3748 | 3.806 |
3749 | 5.3412 |
3750 | 5.9508 |
3751 rows × 1 columns
发表评论
最新留言
很好
[***.229.124.182]2025年04月05日 13时34分21秒
关于作者

喝酒易醉,品茶养心,人生如梦,品茶悟道,何以解忧?唯有杜康!
-- 愿君每日到此一游!
推荐文章
Linux的软链接跟Windows快捷方式一样?
2021-05-08
使用第三方sdk,微信wechat扫码登录
2021-05-08
基于LabVIEW的入门指南
2021-05-08
“/”应用程序中的服务器错误。
2021-05-08
weblogic之cve-2015-4852
2021-05-08
Java注释
2021-05-08
水调歌头·1024
2021-05-08
C++ 函数重载
2021-05-08
Nginx的Gzip功能
2021-05-08
Azure Storage 系列(四)在.Net 上使用Table Storage
2021-05-08
abstract关键字的使用
2021-05-08
.NET微信网页开发之使用微信JS-SDK调用微信扫一扫功能
2021-05-08
解决Spirng注入时名称下的红色波浪线
2021-05-08
使用mybatis-generator生成底层
2021-05-08
Android APK 重签名
2021-05-08
Mybatis【3】-- Mybatis使用工具类读取配置文件以及从属性读取DB信息
2021-05-08
Mybatis【5】-- Mybatis多种增删改查那些你会了么?
2021-05-08
Mybatis【7】-- Mybatis如何知道增删改是否成功执行?
2021-05-08