八、Pandas的基本使用-白红宇的个人博客

八、Pandas的基本使用

发布日期：2021-05-08 02:34:50 浏览次数：22 分类：原创文章

本文共 25459 字，大约阅读时间需要 84 分钟。

点击标题即可获取文章源代码和笔记

4.1.0 概要

在这里插入图片描述

Pandas    基础处理        Pandas是什么？为什么用？        核心数据结构            DataFrame            Panel            Series        基本操作        运算        画图        文件的读取与存储    高级处理4.1Pandas介绍    4.1.1 Pandas介绍 - 数据处理工具        panel + data + analysis        panel面板数据 - 计量经济学 三维数据    4.1.2 为什么使用Pandas        便捷的数据处理能力        读取文件方便        封装了Matplotlib、Numpy的画图和计算    4.1.3 DataFrame        结构：既有行索引，又有列索引的二维数组        属性：            shape            index            columns            values            T        方法：            head()            tail()        3 DataFrame索引的设置            1）修改行列索引值            2）重设索引            3）设置新索引    2 Panel        DataFrame的容器    3 Series        带索引的一维数组        属性            index            values    总结：        DataFrame是Series的容器        Panel是DataFrame的容器4.2 基本数据操作    4.2.1 索引操作        1）直接索引            先列后行        2）按名字索引            loc        3）按数字索引            iloc        4）组合索引            数字、名字    4.2.3 排序        对内容排序            dataframe            series        对索引排序            dataframe            series4.3 DataFrame运算    算术运算    逻辑运算        逻辑运算符            布尔索引        逻辑运算函数            query()            isin()    统计运算        min max mean median var std        np.argmax()        np.argmin()    自定义运算        apply(func, axis=0)True            func:自定义函数4.4 Pandas画图    sr.plot()4.5 文件读取与存储    4.5.1 CSV        pd.read_csv(path)            usecols=            names=        dataframe.to_csv(path)            columns=[]            index=False            header=False    4.5.2 HDF5        hdf5 存储 3维数据的文件            key1 dataframe1二维数据            key2 dataframe2二维数据        pd.read_hdf(path, key=)        df.to_hdf(path, key=)    4.5.3 JSON        pd.read_json(path)            orient="records"            lines=True        df.to_json(patn)            orient="records"            lines=True

4.1.3 DataFrame

import numpy as np# 创建一个符合正态分布的10个股票5天的涨跌幅数据stock_change = np.random.normal(0,1,(10,5))

stock_change

array([[ 0.77072465,  1.30408183, -0.44043464,  0.8900768 , -0.80947118],       [ 0.92407994,  0.01646795, -1.26614793,  1.52393669, -0.85373051],       [-1.68378051,  0.4302981 ,  0.8069393 ,  0.60557427, -0.03960376],       [ 0.75708007, -0.39899325,  0.23027082, -0.89585658, -1.86590247],       [-0.41516245, -1.31841546,  0.16256478, -0.67449097, -1.26234013],       [-0.27687242, -0.74154521, -0.03755446,  1.24182603, -0.79444361],       [-0.2549323 , -0.41034663, -1.85076521, -1.28663451, -0.28566877],       [ 1.22453612, -1.60200055, -1.83171522, -0.85322799, -1.70950421],       [ 2.00461483,  1.49338564,  0.33928513, -0.1776084 , -0.39698965],       [ 0.2184662 , -0.03868143, -0.21432675,  0.00604093,  1.35011139]])

import pandas as pd pd.DataFrame(stock_change)

	0	1	2	3	4
0	0.770725	1.304082	-0.440435	0.890077	-0.809471
1	0.924080	0.016468	-1.266148	1.523937	-0.853731
2	-1.683781	0.430298	0.806939	0.605574	-0.039604
3	0.757080	-0.398993	0.230271	-0.895857	-1.865902
4	-0.415162	-1.318415	0.162565	-0.674491	-1.262340
5	-0.276872	-0.741545	-0.037554	1.241826	-0.794444
6	-0.254932	-0.410347	-1.850765	-1.286635	-0.285669
7	1.224536	-1.602001	-1.831715	-0.853228	-1.709504
8	2.004615	1.493386	0.339285	-0.177608	-0.396990
9	0.218466	-0.038681	-0.214327	0.006041	1.350111

# 构造行索引序列stock_code = ['股票' + str(i) for i in range(stock_change.shape[0])]stock_code

['股票0', '股票1', '股票2', '股票3', '股票4', '股票5', '股票6', '股票7', '股票8', '股票9']

# 添加行索引data = pd.DataFrame(stock_change,index=stock_code)data

	0	1	2	3	4
股票0	0.770725	1.304082	-0.440435	0.890077	-0.809471
股票1	0.924080	0.016468	-1.266148	1.523937	-0.853731
股票2	-1.683781	0.430298	0.806939	0.605574	-0.039604
股票3	0.757080	-0.398993	0.230271	-0.895857	-1.865902
股票4	-0.415162	-1.318415	0.162565	-0.674491	-1.262340
股票5	-0.276872	-0.741545	-0.037554	1.241826	-0.794444
股票6	-0.254932	-0.410347	-1.850765	-1.286635	-0.285669
股票7	1.224536	-1.602001	-1.831715	-0.853228	-1.709504
股票8	2.004615	1.493386	0.339285	-0.177608	-0.396990
股票9	0.218466	-0.038681	-0.214327	0.006041	1.350111

# 添加列索引date = pd.date_range(start="20200618",periods=5,freq="B") # start 开始时间， periods 间隔时间，freq 按照什么间隔 d w 5hdate

DatetimeIndex(['2020-06-18', '2020-06-19', '2020-06-22', '2020-06-23',               '2020-06-24'],              dtype='datetime64[ns]', freq='B')

# 添加列索引data = pd.DataFrame(stock_change,index=stock_code,columns=date) data

	2020-06-18	2020-06-19	2020-06-22	2020-06-23	2020-06-24
股票0	0.770725	1.304082	-0.440435	0.890077	-0.809471
股票1	0.924080	0.016468	-1.266148	1.523937	-0.853731
股票2	-1.683781	0.430298	0.806939	0.605574	-0.039604
股票3	0.757080	-0.398993	0.230271	-0.895857	-1.865902
股票4	-0.415162	-1.318415	0.162565	-0.674491	-1.262340
股票5	-0.276872	-0.741545	-0.037554	1.241826	-0.794444
股票6	-0.254932	-0.410347	-1.850765	-1.286635	-0.285669
股票7	1.224536	-1.602001	-1.831715	-0.853228	-1.709504
股票8	2.004615	1.493386	0.339285	-0.177608	-0.396990
股票9	0.218466	-0.038681	-0.214327	0.006041	1.350111

DataFrame属性

data.shape

(10, 5)

data.index

Index(['股票0', '股票1', '股票2', '股票3', '股票4', '股票5', '股票6', '股票7', '股票8', '股票9'], dtype='object')

data.columns

DatetimeIndex(['2020-06-18', '2020-06-19', '2020-06-22', '2020-06-23',               '2020-06-24'],              dtype='datetime64[ns]', freq='B')

data.values

array([[ 0.77072465,  1.30408183, -0.44043464,  0.8900768 , -0.80947118],       [ 0.92407994,  0.01646795, -1.26614793,  1.52393669, -0.85373051],       [-1.68378051,  0.4302981 ,  0.8069393 ,  0.60557427, -0.03960376],       [ 0.75708007, -0.39899325,  0.23027082, -0.89585658, -1.86590247],       [-0.41516245, -1.31841546,  0.16256478, -0.67449097, -1.26234013],       [-0.27687242, -0.74154521, -0.03755446,  1.24182603, -0.79444361],       [-0.2549323 , -0.41034663, -1.85076521, -1.28663451, -0.28566877],       [ 1.22453612, -1.60200055, -1.83171522, -0.85322799, -1.70950421],       [ 2.00461483,  1.49338564,  0.33928513, -0.1776084 , -0.39698965],       [ 0.2184662 , -0.03868143, -0.21432675,  0.00604093,  1.35011139]])

data.T

	股票0	股票1	股票2	股票3	股票4	股票5	股票6	股票7	股票8	股票9
2020-06-18	0.770725	0.924080	-1.683781	0.757080	-0.415162	-0.276872	-0.254932	1.224536	2.004615	0.218466
2020-06-19	1.304082	0.016468	0.430298	-0.398993	-1.318415	-0.741545	-0.410347	-1.602001	1.493386	-0.038681
2020-06-22	-0.440435	-1.266148	0.806939	0.230271	0.162565	-0.037554	-1.850765	-1.831715	0.339285	-0.214327
2020-06-23	0.890077	1.523937	0.605574	-0.895857	-0.674491	1.241826	-1.286635	-0.853228	-0.177608	0.006041
2020-06-24	-0.809471	-0.853731	-0.039604	-1.865902	-1.262340	-0.794444	-0.285669	-1.709504	-0.396990	1.350111

DataFrame方法

data.head() # 返回前5行数据

	2020-06-18	2020-06-19	2020-06-22	2020-06-23	2020-06-24
股票0	0.770725	1.304082	-0.440435	0.890077	-0.809471
股票1	0.924080	0.016468	-1.266148	1.523937	-0.853731
股票2	-1.683781	0.430298	0.806939	0.605574	-0.039604
股票3	0.757080	-0.398993	0.230271	-0.895857	-1.865902
股票4	-0.415162	-1.318415	0.162565	-0.674491	-1.262340

data.tail() # 返回后5行数据

	2020-06-18	2020-06-19	2020-06-22	2020-06-23	2020-06-24
股票5	-0.276872	-0.741545	-0.037554	1.241826	-0.794444
股票6	-0.254932	-0.410347	-1.850765	-1.286635	-0.285669
股票7	1.224536	-1.602001	-1.831715	-0.853228	-1.709504
股票8	2.004615	1.493386	0.339285	-0.177608	-0.396990
股票9	0.218466	-0.038681	-0.214327	0.006041	1.350111

3 DataFrame索引的设置

修改行列索引值

data.index[2]

'股票2'

data.index[2] = "股票88"# 注意：单独修改每一列的索引是不行的，在DataFrame中，只能对索引进行整体的修改

---------------------------------------------------------------------------TypeError                                 Traceback (most recent call last)<ipython-input-19-9e95917cc4d9> in <module>----> 1 data.index[2] = "股票88"D:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)   3908    3909     def __setitem__(self, key, value):-> 3910         raise TypeError("Index does not support mutable operations")   3911    3912     def __getitem__(self, key):TypeError: Index does not support mutable operations

stock_ = ["股票_{}".format(i) for i in range(10)]

data.index = stock_

data.index

Index(['股票_0', '股票_1', '股票_2', '股票_3', '股票_4', '股票_5', '股票_6', '股票_7', '股票_8',       '股票_9'],      dtype='object')

重设索引

reset_index（drop=False）
设置新的下标索引
drop：默认为False，不删除原来索引，如果为True，删除原来的索引值

# 重置索引，drop=Falsedata.reset_index()

	index	2020-06-18 00:00:00	2020-06-19 00:00:00	2020-06-22 00:00:00	2020-06-23 00:00:00	2020-06-24 00:00:00
0	股票_0	0.770725	1.304082	-0.440435	0.890077	-0.809471
1	股票_1	0.924080	0.016468	-1.266148	1.523937	-0.853731
2	股票_2	-1.683781	0.430298	0.806939	0.605574	-0.039604
3	股票_3	0.757080	-0.398993	0.230271	-0.895857	-1.865902
4	股票_4	-0.415162	-1.318415	0.162565	-0.674491	-1.262340
5	股票_5	-0.276872	-0.741545	-0.037554	1.241826	-0.794444
6	股票_6	-0.254932	-0.410347	-1.850765	-1.286635	-0.285669
7	股票_7	1.224536	-1.602001	-1.831715	-0.853228	-1.709504
8	股票_8	2.004615	1.493386	0.339285	-0.177608	-0.396990
9	股票_9	0.218466	-0.038681	-0.214327	0.006041	1.350111

# 重置索引，drop=Truedata.reset_index(drop=True)

	2020-06-18	2020-06-19	2020-06-22	2020-06-23	2020-06-24
0	0.770725	1.304082	-0.440435	0.890077	-0.809471
1	0.924080	0.016468	-1.266148	1.523937	-0.853731
2	-1.683781	0.430298	0.806939	0.605574	-0.039604
3	0.757080	-0.398993	0.230271	-0.895857	-1.865902
4	-0.415162	-1.318415	0.162565	-0.674491	-1.262340
5	-0.276872	-0.741545	-0.037554	1.241826	-0.794444
6	-0.254932	-0.410347	-1.850765	-1.286635	-0.285669
7	1.224536	-1.602001	-1.831715	-0.853228	-1.709504
8	2.004615	1.493386	0.339285	-0.177608	-0.396990
9	0.218466	-0.038681	-0.214327	0.006041	1.350111

以某列值设置为新的索引

set_index(keys,drop=True)
keys:列索引名或者列索引名称的列表
drop:boolean,default True 当作新的索引，删除原来的索引列

设置新索引案例

1.创建

df = pd.DataFrame({       'month':[1,4,7,10],    'year':[2012,2014,2013,2014],    'sale':[55,40,84,31]})df

	month	year	sale
0	1	2012	55
1	4	2014	40
2	7	2013	84
3	10	2014	31

2、以月份设置新的索引

df.set_index('month')

	year	sale
month
1	2012	55
4	2014	40
7	2013	84
10	2014	31

1. 设置多个索引，以年和月份

new_df = df.set_index(['year','month'])

new_df

		sale
year	month
2012	1	55
2014	4	40
2013	7	84
2014	10	31

new_df.index

MultiIndex([(2012,  1),            (2014,  4),            (2013,  7),            (2014, 10)],           names=['year', 'month'])

4.1.4 MultiIndex 与 Panel的关系

1 Multilndex多级或分层索引对象。

index属性

names: levels的名称

levels：每个level的元组值

new_df.index.names

FrozenList(['year', 'month'])

new_df.index.levels

FrozenList([[2012, 2013, 2014], [1, 4, 7, 10]])

2 Panel

p = pd.Panel()p# 新版本已移除该函数

D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: The Panel class is removed from pandas. Accessing it from the top-level namespace will also be removed in the next version  """Entry point for launching an IPython kernel.<pandas.__getattr__.<locals>.Panel at 0x203fd31ea08>

data

	2020-06-18	2020-06-19	2020-06-22	2020-06-23	2020-06-24
股票_0	0.770725	1.304082	-0.440435	0.890077	-0.809471
股票_1	0.924080	0.016468	-1.266148	1.523937	-0.853731
股票_2	-1.683781	0.430298	0.806939	0.605574	-0.039604
股票_3	0.757080	-0.398993	0.230271	-0.895857	-1.865902
股票_4	-0.415162	-1.318415	0.162565	-0.674491	-1.262340
股票_5	-0.276872	-0.741545	-0.037554	1.241826	-0.794444
股票_6	-0.254932	-0.410347	-1.850765	-1.286635	-0.285669
股票_7	1.224536	-1.602001	-1.831715	-0.853228	-1.709504
股票_8	2.004615	1.493386	0.339285	-0.177608	-0.396990
股票_9	0.218466	-0.038681	-0.214327	0.006041	1.350111

Series

data.iloc[1,:] # 带索引的一维数组

2020-06-18    0.9240802020-06-19    0.0164682020-06-22   -1.2661482020-06-23    1.5239372020-06-24   -0.853731Freq: B, Name: 股票_1, dtype: float64

type(data.iloc[1,:])

pandas.core.series.Series

属性

data.iloc[1,:].index

DatetimeIndex(['2020-06-18', '2020-06-19', '2020-06-22', '2020-06-23',               '2020-06-24'],              dtype='datetime64[ns]', freq='B')

data.iloc[1,:].values

array([ 0.92407994,  0.01646795, -1.26614793,  1.52393669, -0.85373051])

1. 创建Series

通过已有数据创建

指定内容，默认索引

pd.Series(np.arange(10))

0    01    12    23    34    45    56    67    78    89    9dtype: int32

指定索引

pd.Series([6.7,5.6,3,10,2],index=[1,2,3,4,5])

1     6.72     5.63     3.04    10.05     2.0dtype: float64

通过字典数据创建

pd.Series({       'red':100,    'blue':200,    'green':500,    'yellow':1000})

red        100blue       200green      500yellow    1000dtype: int64

总结

DataFrame 是 Series的容器
Panel 是 DataFrame的容器

4.2 基本数据操作

datas = pd.read_excel("./datas/szfj_baoan.xls")

datas

	district	roomnum	hall	AREA	C_floor	floor_num	school	subway	per_price
0	baoan	3	2	89.3	middle	31	0	0	7.0773
1	baoan	4	2	127.0	high	31	0	0	6.9291
2	baoan	1	1	28.0	low	39	0	0	3.9286
3	baoan	1	1	28.0	middle	30	0	0	3.3568
4	baoan	2	2	78.0	middle	8	1	1	5.0769
...	...	...	...	...	...	...	...	...	...
1246	baoan	4	2	89.3	low	8	0	0	4.2553
1247	baoan	2	1	67.0	middle	30	0	0	3.8060
1248	baoan	2	2	67.4	middle	29	1	0	5.3412
1249	baoan	2	2	73.1	low	15	1	0	5.9508
1250	baoan	3	2	86.2	middle	32	0	1	4.5244

1251 rows × 9 columns

datas.columns

Index(['district', 'roomnum', 'hall', 'AREA', 'C_floor', 'floor_num', 'school',       'subway', 'per_price'],      dtype='object')

# 删除列datas = datas.drop(columns=[ 'school','subway',],axis=0)

datas

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
0	baoan	3	2	89.3	middle	31	7.0773
1	baoan	4	2	127.0	high	31	6.9291
2	baoan	1	1	28.0	low	39	3.9286
3	baoan	1	1	28.0	middle	30	3.3568
4	baoan	2	2	78.0	middle	8	5.0769
...	...	...	...	...	...	...	...
1246	baoan	4	2	89.3	low	8	4.2553
1247	baoan	2	1	67.0	middle	30	3.8060
1248	baoan	2	2	67.4	middle	29	5.3412
1249	baoan	2	2	73.1	low	15	5.9508
1250	baoan	3	2	86.2	middle	32	4.5244

1251 rows × 7 columns

4.2.1 索引操作

1.直接使用行列索引（先列后行）

datas["per_price"][0]

7.0773

2. 按名字索引(先行后列)

datas.loc[0]["per_price"]

7.0773

datas.loc[0,"per_price"]

7.0773

3.按数字索引

datas.iloc[0,6]

7.0773

# 通过索引值获取行名datas.index[0:4]

RangeIndex(start=0, stop=4, step=1)

datas.loc[datas.index[0:4],["district","roomnum"]]

	district	roomnum
0	baoan	3
1	baoan	4
2	baoan	1
3	baoan	1

# datas.columns.get_indexer() 通过列名获取索引值datas.columns.get_indexer(["district","roomnum"])

array([0, 1], dtype=int64)

datas.iloc[0:4,datas.columns.get_indexer(["district","roomnum"])]

	district	roomnum
0	baoan	3
1	baoan	4
2	baoan	1
3	baoan	1

4.2.2 赋值操作

# 直接修改原来的值datas["hall"] = 5

datas.head()

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
0	baoan	3	5	89.3	middle	31	7.0773
1	baoan	4	5	127.0	high	31	6.9291
2	baoan	1	5	28.0	low	39	3.9286
3	baoan	1	5	28.0	middle	30	3.3568
4	baoan	2	5	78.0	middle	8	5.0769

# 或者datas.hall = 1

datas.head()

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
0	baoan	3	1	89.3	middle	31	7.0773
1	baoan	4	1	127.0	high	31	6.9291
2	baoan	1	1	28.0	low	39	3.9286
3	baoan	1	1	28.0	middle	30	3.3568
4	baoan	2	1	78.0	middle	8	5.0769

datas.iloc[0,0] = "zzzz"

datas.head()

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
0	zzzz	3	1	89.3	middle	31	7.0773
1	baoan	4	1	127.0	high	31	6.9291
2	baoan	1	1	28.0	low	39	3.9286
3	baoan	1	1	28.0	middle	30	3.3568
4	baoan	2	1	78.0	middle	8	5.0769

4.2.3 排序

# 对内容进行排序, ascending=False降序排列 ，默认为True升序排列datas.sort_values(by="per_price",ascending=False)

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
917	baoan	4	1	93.59	high	28	21.9040
356	baoan	8	1	248.99	low	7	21.2860
576	baoan	1	1	21.95	middle	22	19.3622
296	baoan	4	1	93.59	high	28	19.2328
186	baoan	3	1	113.60	middle	31	16.5493
...	...	...	...	...	...	...	...
911	baoan	2	1	89.00	middle	16	1.6854
841	baoan	2	1	75.00	high	7	1.6667
1188	baoan	3	1	110.00	middle	33	1.5909
684	baoan	3	1	89.00	middle	26	1.2247
1047	baoan	3	1	98.90	middle	26	1.1931

1251 rows × 7 columns

datas.sort_values(by="per_price")

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
1047	baoan	3	1	98.90	middle	26	1.1931
684	baoan	3	1	89.00	middle	26	1.2247
1188	baoan	3	1	110.00	middle	33	1.5909
841	baoan	2	1	75.00	high	7	1.6667
911	baoan	2	1	89.00	middle	16	1.6854
...	...	...	...	...	...	...	...
186	baoan	3	1	113.60	middle	31	16.5493
296	baoan	4	1	93.59	high	28	19.2328
576	baoan	1	1	21.95	middle	22	19.3622
356	baoan	8	1	248.99	low	7	21.2860
917	baoan	4	1	93.59	high	28	21.9040

1251 rows × 7 columns

# 按照多个字段进行排序# 先按照“district”字段的内容进行排序，如果值相同，再按照“per_price”字段的内容进行排序datas.sort_values(by=["district","per_price"])

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
1047	baoan	3	1	98.90	middle	26	1.1931
684	baoan	3	1	89.00	middle	26	1.2247
1188	baoan	3	1	110.00	middle	33	1.5909
841	baoan	2	1	75.00	high	7	1.6667
911	baoan	2	1	89.00	middle	16	1.6854
...	...	...	...	...	...	...	...
296	baoan	4	1	93.59	high	28	19.2328
576	baoan	1	1	21.95	middle	22	19.3622
356	baoan	8	1	248.99	low	7	21.2860
917	baoan	4	1	93.59	high	28	21.9040
0	zzzz	3	1	89.30	middle	31	7.0773

1251 rows × 7 columns

# 按照行索引大小进行排序,默认从小到大排序datas.sort_index()

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
0	zzzz	3	1	89.3	middle	31	7.0773
1	baoan	4	1	127.0	high	31	6.9291
2	baoan	1	1	28.0	low	39	3.9286
3	baoan	1	1	28.0	middle	30	3.3568
4	baoan	2	1	78.0	middle	8	5.0769
...	...	...	...	...	...	...	...
1246	baoan	4	1	89.3	low	8	4.2553
1247	baoan	2	1	67.0	middle	30	3.8060
1248	baoan	2	1	67.4	middle	29	5.3412
1249	baoan	2	1	73.1	low	15	5.9508
1250	baoan	3	1	86.2	middle	32	4.5244

1251 rows × 7 columns

sr = datas["per_price"]

sr

0       7.07731       6.92912       3.92863       3.35684       5.0769         ...  1246    4.25531247    3.80601248    5.34121249    5.95081250    4.5244Name: per_price, Length: 1251, dtype: float64

# 对Series类型的数据的内容进行排序sr.sort_values()

1047     1.1931684      1.22471188     1.5909841      1.6667911      1.6854         ...   186     16.5493296     19.2328576     19.3622356     21.2860917     21.9040Name: per_price, Length: 1251, dtype: float64

# 对Series类型的数据的索引进行排序sr.sort_index()

0       7.07731       6.92912       3.92863       3.35684       5.0769         ...  1246    4.25531247    3.80601248    5.34121249    5.95081250    4.5244Name: per_price, Length: 1251, dtype: float64

4.3 DataFrame运算

算术运算

# 对Series类型进行操作datas["roomnum"] + 3

0       61       72       43       44       5       ..1246    71247    51248    51249    51250    6Name: roomnum, Length: 1251, dtype: int64

datas["roomnum"].add(3).head()

0    61    72    43    44    5Name: roomnum, dtype: int64

datas.iloc[:,1:4]

	roomnum	hall	AREA
0	3	1	89.3
1	4	1	127.0
2	1	1	28.0
3	1	1	28.0
4	2	1	78.0
...	...	...	...
1246	4	1	89.3
1247	2	1	67.0
1248	2	1	67.4
1249	2	1	73.1
1250	3	1	86.2

1251 rows × 3 columns

# 对DataFrame类型进行操作datas.iloc[:,1:4] + 10

	roomnum	hall	AREA
0	13	11	99.3
1	14	11	137.0
2	11	11	38.0
3	11	11	38.0
4	12	11	88.0
...	...	...	...
1246	14	11	99.3
1247	12	11	77.0
1248	12	11	77.4
1249	12	11	83.1
1250	13	11	96.2

1251 rows × 3 columns

逻辑运算

# 逻辑判断的结果可以作为筛选的依据datas['AREA'] > 100

0       False1        True2       False3       False4       False        ...  1246    False1247    False1248    False1249    False1250    FalseName: AREA, Length: 1251, dtype: bool

# 可以进行布尔索引datas[datas['AREA'] > 100]

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
1	baoan	4	1	127.00	high	31	6.9291
5	baoan	4	1	125.17	middle	15	5.8161
16	baoan	3	1	151.00	high	20	4.9669
25	baoan	3	1	116.00	high	18	5.0000
26	baoan	5	1	151.25	high	30	7.6033
...	...	...	...	...	...	...	...
1232	baoan	5	1	127.17	low	24	5.1113
1238	baoan	4	1	130.74	low	30	13.0029
1239	baoan	3	1	102.10	middle	28	10.8717
1241	baoan	5	1	151.30	high	29	7.2703
1243	baoan	4	1	142.25	high	32	6.3269

322 rows × 7 columns

# 多个逻辑判断# 筛选面积大于100 并且 放假小于40000的数据(datas["AREA"]>100) & (datas["per_price"]< 40000)

0       False1        True2       False3       False4       False        ...  1246    False1247    False1248    False1249    False1250    FalseLength: 1251, dtype: bool

# 布尔索引datas[(datas["AREA"]>100) & (datas["per_price"]< 40000)]

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
1	baoan	4	1	127.00	high	31	6.9291
5	baoan	4	1	125.17	middle	15	5.8161
16	baoan	3	1	151.00	high	20	4.9669
25	baoan	3	1	116.00	high	18	5.0000
26	baoan	5	1	151.25	high	30	7.6033
...	...	...	...	...	...	...	...
1232	baoan	5	1	127.17	low	24	5.1113
1238	baoan	4	1	130.74	low	30	13.0029
1239	baoan	3	1	102.10	middle	28	10.8717
1241	baoan	5	1	151.30	high	29	7.2703
1243	baoan	4	1	142.25	high	32	6.3269

322 rows × 7 columns

逻辑运算函数

# 条件查询函数datas.query("AREA>100 & per_price<40000")

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
1	baoan	4	1	127.00	high	31	6.9291
5	baoan	4	1	125.17	middle	15	5.8161
16	baoan	3	1	151.00	high	20	4.9669
25	baoan	3	1	116.00	high	18	5.0000
26	baoan	5	1	151.25	high	30	7.6033
...	...	...	...	...	...	...	...
1232	baoan	5	1	127.17	low	24	5.1113
1238	baoan	4	1	130.74	low	30	13.0029
1239	baoan	3	1	102.10	middle	28	10.8717
1241	baoan	5	1	151.30	high	29	7.2703
1243	baoan	4	1	142.25	high	32	6.3269

322 rows × 7 columns

datas["roomnum"].isin([4,5])

0       False1        True2       False3       False4       False        ...  1246     True1247    False1248    False1249    False1250    FalseName: roomnum, Length: 1251, dtype: bool

# 可以指定值进行判断，从而进行筛选操作# 筛选出房间数量为4或者5的数据datas[datas["roomnum"].isin([4,5])]

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
1	baoan	4	1	127.00	high	31	6.9291
5	baoan	4	1	125.17	middle	15	5.8161
26	baoan	5	1	151.25	high	30	7.6033
29	baoan	4	1	143.45	middle	25	6.9711
36	baoan	4	1	134.60	middle	32	9.1828
...	...	...	...	...	...	...	...
1232	baoan	5	1	127.17	low	24	5.1113
1238	baoan	4	1	130.74	low	30	13.0029
1241	baoan	5	1	151.30	high	29	7.2703
1243	baoan	4	1	142.25	high	32	6.3269
1246	baoan	4	1	89.30	low	8	4.2553

224 rows × 7 columns

统计运算

# 计算每一列的总数，均值，标准差，最小值，分位数，最大值等datas.describe()

	roomnum	hall	AREA	floor_num	per_price
count	1251.000000	1251.0	1251.000000	1251.000000	1251.000000
mean	2.906475	1.0	92.409976	24.598721	6.643429
std	0.940663	0.0	37.798122	9.332119	2.435132
min	1.000000	1.0	21.950000	1.000000	1.193100
25%	2.000000	1.0	75.000000	17.000000	5.075850
50%	3.000000	1.0	87.800000	28.000000	5.906800
75%	3.000000	1.0	101.375000	31.000000	7.761950
max	8.000000	1.0	352.900000	53.000000	21.904000

统计函数

# axis=0 求每一列的最大值  axis=1求每一行的最大值datas.max(axis=0)

district       zzzzroomnum           8hall              1AREA          352.9C_floor      middlefloor_num        53per_price    21.904dtype: object

# 方差datas.var(axis=0)

roomnum         0.884846hall            0.000000AREA         1428.698032floor_num      87.088446per_price       5.929870dtype: float64

# 标准差datas.std(axis=0)

roomnum       0.940663hall          0.000000AREA         37.798122floor_num     9.332119per_price     2.435132dtype: float64

datas.iloc[:,3]

0        89.31       127.02        28.03        28.04        78.0        ...  1246     89.31247     67.01248     67.41249     73.11250     86.2Name: AREA, Length: 1251, dtype: float64

# 求最大值所在的下标（索引）datas.iloc[:,3].idxmax(axis=0)

datas.iloc[759,3]

352.9

# 求最小值所在的下标（索引）datas.iloc[:,3].idxmin(axis=0)

datas.iloc[576,3]

21.95

累计统计函数

datas["per_price"]

0       7.07731       6.92912       3.92863       3.35684       5.0769         ...  1246    4.25531247    3.80601248    5.34121249    5.95081250    4.5244Name: per_price, Length: 1251, dtype: float64

# 累加datas["per_price"].cumsum()

0          7.07731         14.00642         17.93503         21.29184         26.3687          ...    1246    8291.30761247    8295.11361248    8300.45481249    8306.40561250    8310.9300Name: per_price, Length: 1251, dtype: float64

datas["per_price"].sort_index().cumsum().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x2039a3a3dc8>

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-76oPmxXG-1592894035229)(output_124_1.png)]

import matplotlib.pyplot as pltdatas["per_price"].sort_index().cumsum().plot()plt.show()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-iDjOExYI-1592894035232)(output_125_0.png)]

自定义运算

# 自定义一个计算最大值-最小值的函数datas[["per_price"]].apply(lambda x : x.max()-x.min(),axis=0)

per_price    20.7109dtype: float64

4.4 Pandas画图

# 查看面积和房价之间的关系datas.plot(x="AREA",y="per_price",kind="scatter")

<matplotlib.axes._subplots.AxesSubplot at 0x203a343dec8>

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-G1sOFopU-1592894035233)(output_129_1.png)]

# 查看楼层和房价之间的关系datas.plot(x="floor_num",y="per_price",kind="scatter")

<matplotlib.axes._subplots.AxesSubplot at 0x203a3a81bc8>

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-RXGfuikt-1592894035235)(output_130_1.png)]

datas.plot(x="AREA",y="per_price",kind="barh")

<matplotlib.axes._subplots.AxesSubplot at 0x203a2147f08>

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-nDleMVfm-1592894035236)(output_131_1.png)]

4.5 文件的读取与存储

1.读取csv文件 read_csv()

iris_data = pd.read_csv("./datas/iris.data.csv")

iris_data.head()

	feature1	feature2	feature3	feature4	result
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa

# usecols：指定读取的列名，列表形式iris_data1 = pd.read_csv("./datas/iris.data.csv",usecols=["feature1","feature2","result"])

iris_data1.head()

	feature1	feature2	result
0	5.1	3.5	Iris-setosa
1	4.9	3.0	Iris-setosa
2	4.7	3.2	Iris-setosa
3	4.6	3.1	Iris-setosa
4	5.0	3.6	Iris-setosa

iris_data2 = pd.read_csv("./datas/iris.data2.csv")iris_data2.head()

	5.1	3.5	1.4	0.2	Iris-setosa
0	4.9	3.0	1.4	0.2	Iris-setosa
1	4.7	3.2	1.3	0.2	Iris-setosa
2	4.6	3.1	1.5	0.2	Iris-setosa
3	5.0	3.6	1.4	0.2	Iris-setosa
4	5.4	3.9	1.7	0.4	Iris-setosa

# names:如果数据集本身没有列名，可以自己指定列名iris_data2 = pd.read_csv("./datas/iris.data2.csv",names=["feature1","feature2","feature3","feature4","result"])iris_data2.head()

	feature1	feature2	feature3	feature4	result
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa

datas.head(5)

	district	roomnum	hall	AREA	C_floor	floor_num	per_price
0	zzzz	3	1	89.3	middle	31	7.0773
1	baoan	4	1	127.0	high	31	6.9291
2	baoan	1	1	28.0	low	39	3.9286
3	baoan	1	1	28.0	middle	30	3.3568
4	baoan	2	1	78.0	middle	8	5.0769

# 保存per_price列的数据# 保存的时候index=False 去掉行索引# mode="a" 追加数据# header=False 不要重复追加列名datas[:-1].to_csv("./price_test",columns=['per_price'],index=False,mode="a",header=False)

# 读取，查看数据perice_test = pd.read_csv("./price_test")

perice_test

	per_price
0	7.0773
1	6.9291
2	3.9286
3	3.3568
4	5.0769
...	...
3746	6.1932
3747	4.2553
3748	3.806
3749	5.3412
3750	5.9508

3751 rows × 1 columns

上一篇：第十七次CCF计算机软件能力认证推荐系统 (优先队列,哈希表)

下一篇：LeetCode每日一题781. 森林中的兔子

发表评论

关于作者

喝酒易醉，品茶养心，人生如梦，品茶悟道，何以解忧？唯有杜康！

-- 愿君每日到此一游！

4.1.0 概要

4.1.3 DataFrame

DataFrame属性

DataFrame方法

3 DataFrame索引的设置

重设索引

以某列值设置为新的索引

设置新索引案例

4.1.4 MultiIndex 与 Panel的关系

1 Multilndex多级或分层索引对象。

2 Panel

Series

属性

1. 创建Series

总结

4.2 基本数据操作

4.2.1 索引操作

1.直接使用行列索引（先列后行）

2. 按名字索引(先行后列)

3.按数字索引

4.2.2 赋值操作

4.2.3 排序

4.3 DataFrame运算

4.4 Pandas画图

4.5 文件的读取与存储

发表评论

最新留言

关于作者

推荐文章