09 Python's Pandas Library: Data Processing and Wrangling
Published: 2021-06-29 15:44:24
```python
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import datetime

# Fetch the data
df = web.DataReader("AAPL", "yahoo", datetime.datetime(2019, 4, 1), datetime.date.today())
# Display the data
df
```
Date | High | Low | Open | Close | Volume | Adj Close
---|---|---|---|---|---|---
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
2019-04-08 | 200.229996 | 196.339996 | 196.419998 | 200.100006 | 25881700 | 200.100006 |
2019-04-09 | 202.850006 | 199.229996 | 200.320007 | 199.500000 | 35768200 | 199.500000 |
2019-04-10 | 200.740005 | 198.179993 | 198.679993 | 200.619995 | 21695300 | 200.619995 |
2019-04-11 | 201.000000 | 198.440002 | 200.850006 | 198.949997 | 20900800 | 198.949997 |
2019-04-12 | 200.139999 | 196.210007 | 199.199997 | 198.869995 | 27744300 | 198.869995 |
1 Handling Missing Data
1.1 Dropping rows that contain missing values

```python
df_drop = df.dropna()
df_drop.head()
```
Date | High | Low | Open | Close | Volume | Adj Close
---|---|---|---|---|---|---
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
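The stock data above contains no missing values, so `dropna()` leaves it unchanged. A minimal sketch on a small hypothetical DataFrame (`tiny` and its columns are made up for illustration) makes the effect visible:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values (the stock data above has none)
tiny = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# dropna() drops every row containing at least one NaN
print(tiny.dropna())

# how="all" drops a row only when every value in it is NaN
print(tiny.dropna(how="all"))
```

Here only the first row survives the default call, while `how="all"` keeps all three rows because no row is entirely NaN.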
1.2 Filling missing values

```python
df_fillna = df.fillna(value=0)
df_fillna.head()
```
Date | High | Low | Open | Close | Volume | Adj Close
---|---|---|---|---|---|---
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
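Again the effect is invisible on complete data. A sketch on a small hypothetical frame, showing both a constant fill and forward-fill:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values
tiny = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# Replace every NaN with a constant
filled = tiny.fillna(value=0)
print(filled)

# Or propagate the last valid observation forward
print(tiny.ffill())
```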
1.3 Testing whether values are NaN (boolean result)

```python
df_isnull = pd.isnull(df)
df_isnull.head()
```
Date | High | Low | Open | Close | Volume | Adj Close
---|---|---|---|---|---|---
2019-04-01 | False | False | False | False | False | False |
2019-04-02 | False | False | False | False | False | False |
2019-04-03 | False | False | False | False | False | False |
2019-04-04 | False | False | False | False | False | False |
2019-04-05 | False | False | False | False | False | False |
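A common use of this boolean mask is counting missing values per column by summing it (True counts as 1). A sketch on the same kind of hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing values
tiny = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

mask = pd.isnull(tiny)   # equivalent to tiny.isnull()
print(mask)
print(mask.sum())        # missing-value count per column
```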
2 Applying and Mapping Functions

```python
# Column-wise mean
df.mean()
```

```
High         1.981070e+02
Low          1.950050e+02
Open         1.962690e+02
Close        1.971340e+02
Volume       2.435307e+07
Adj Close    1.971340e+02
dtype: float64
```
```python
# Row-wise mean
df.mean(axis=1)
```

```
Date
2019-04-01    4.643826e+06
2019-04-02    3.794444e+06
2019-04-03    3.878796e+06
2019-04-04    3.185879e+06
2019-04-05    3.087931e+06
2019-04-08    4.313782e+06
2019-04-09    5.961534e+06
2019-04-10    3.616050e+06
2019-04-11    3.483633e+06
2019-04-12    4.624216e+06
dtype: float64
```
```python
# skipna defaults to True, meaning missing values are excluded
df.mean(axis=1, skipna=False)
```

```
Date
2019-04-01    4.643826e+06
2019-04-02    3.794444e+06
2019-04-03    3.878796e+06
2019-04-04    3.185879e+06
2019-04-05    3.087931e+06
2019-04-08    4.313782e+06
2019-04-09    5.961534e+06
2019-04-10    3.616050e+06
2019-04-11    3.483633e+06
2019-04-12    4.624216e+06
dtype: float64
```
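The output is identical here because this data has no missing values. With a NaN present, the two settings diverge, as this small sketch shows:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.mean())               # 2.0 -- the NaN is excluded by default
print(s.mean(skipna=False))   # nan -- the NaN propagates into the result
```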
```python
# Sort by row labels (the index)
sorted_row_df = df.sort_index()
sorted_row_df.head()
```
Date | High | Low | Open | Close | Volume | Adj Close
---|---|---|---|---|---|---
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
```python
# Sort by column labels
sorted_col_df = df.sort_index(axis=1)
sorted_col_df.head()
```
Date | Adj Close | Close | High | Low | Open | Volume
---|---|---|---|---|---|---
2019-04-01 | 191.240005 | 191.240005 | 191.679993 | 188.380005 | 191.639999 | 27862000 |
2019-04-02 | 194.020004 | 194.020004 | 194.460007 | 191.050003 | 191.089996 | 22765700 |
2019-04-03 | 195.350006 | 195.350006 | 196.500000 | 193.149994 | 193.250000 | 23271800 |
2019-04-04 | 195.690002 | 195.690002 | 196.369995 | 193.139999 | 194.789993 | 19114300 |
2019-04-05 | 197.000000 | 197.000000 | 197.100006 | 195.929993 | 196.449997 | 18526600 |
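`sort_index` orders by labels; to order by the values in a column, `sort_values` is the companion method. A sketch on a small hypothetical frame (`demo` and its values are made up, mirroring the column names above):

```python
import pandas as pd

# Hypothetical frame; column names mirror the stock data above
demo = pd.DataFrame({"Open": [193.25, 191.64, 191.09],
                     "Close": [195.35, 191.24, 194.02]})

print(demo.sort_values("Open"))                   # ascending by the Open column
print(demo.sort_values("Open", ascending=False))  # descending
```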
The methods shown above are the most common ones; many others exist. Some are listed below for reference:
count — number of non-NA values
describe — summary statistics for a Series or for each DataFrame column
min, max — minimum and maximum values
argmin, argmax — integer positions of the minimum and maximum values
idxmin, idxmax — index labels of the minimum and maximum values
quantile — sample quantile (between 0 and 1)
sum — sum of values
mean — arithmetic mean of values
median — median of values
mad — mean absolute deviation from the mean
var — sample variance
std — sample standard deviation
skew — sample skewness (third moment)
kurt — sample kurtosis (fourth moment)
cumsum — cumulative sum
cummin, cummax — cumulative minimum and maximum
cumprod — cumulative product
diff — first-order difference
pct_change — percentage change
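A few of these on a small Series, to make their outputs concrete:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0])
print(s.cumsum().tolist())          # cumulative sum: [1.0, 3.0, 7.0]
print(s.diff().tolist()[1:])        # first difference: [1.0, 2.0]
print(s.pct_change().tolist()[1:])  # percentage change: [1.0, 1.0]
print(s.idxmax())                   # index label of the maximum: 2
```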
3 Data Wrangling
Pandas provides many methods for combining Series and DataFrame objects (and, in older versions, Panel objects) under various logical relationships:
concat — stack multiple objects along an axis
append — append rows to a DataFrame
duplicated / drop_duplicates — identify and remove duplicate rows
3.1 Stacking data with concat

```python
df1 = web.DataReader("AAPL", "yahoo", datetime.datetime(2019, 4, 1), datetime.datetime(2019, 4, 3))
df1
```
Date | High | Low | Open | Close | Volume | Adj Close
---|---|---|---|---|---|---
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
```python
df2 = web.DataReader("AAPL", "yahoo", datetime.datetime(2019, 4, 5), datetime.datetime(2019, 4, 8))
df2
```
Date | High | Low | Open | Close | Volume | Adj Close
---|---|---|---|---|---|---
2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
2019-04-08 | 200.229996 | 196.339996 | 196.419998 | 200.100006 | 25881700 | 200.100006 |
Vertical concatenation (the default):

```python
pd.concat([df1, df2], axis=0)
```
Date | High | Low | Open | Close | Volume | Adj Close
---|---|---|---|---|---|---
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
2019-04-04 | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300 | 195.690002 |
2019-04-05 | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600 | 197.000000 |
2019-04-08 | 200.229996 | 196.339996 | 196.419998 | 200.100006 | 25881700 | 200.100006 |
Horizontal concatenation; index labels that do not match are filled with NaN:

```python
pd.concat([df1, df2], axis=1)
```
Date | High | Low | Open | Close | Volume | Adj Close | High | Low | Open | Close | Volume | Adj Close
---|---|---|---|---|---|---|---|---|---|---|---|---
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000.0 | 191.240005 | NaN | NaN | NaN | NaN | NaN | NaN |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700.0 | 194.020004 | NaN | NaN | NaN | NaN | NaN | NaN |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800.0 | 195.350006 | NaN | NaN | NaN | NaN | NaN | NaN |
2019-04-04 | NaN | NaN | NaN | NaN | NaN | NaN | 196.369995 | 193.139999 | 194.789993 | 195.690002 | 19114300.0 | 195.690002 |
2019-04-05 | NaN | NaN | NaN | NaN | NaN | NaN | 197.100006 | 195.929993 | 196.449997 | 197.000000 | 18526600.0 | 197.000000 |
2019-04-08 | NaN | NaN | NaN | NaN | NaN | NaN | 200.229996 | 196.339996 | 196.419998 | 200.100006 | 25881700.0 | 200.100006 |
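That NaN-padding is the default outer join on the non-concatenation axis; passing `join="inner"` keeps only index labels present in both frames. A sketch with small hypothetical frames (`a`, `b`, and their labels are made up):

```python
import pandas as pd

# Hypothetical frames sharing only the label "r2"
a = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
b = pd.DataFrame({"y": [3, 4]}, index=["r2", "r3"])

outer = pd.concat([a, b], axis=1)                 # outer join: r1-r3, NaN where absent
inner = pd.concat([a, b], axis=1, join="inner")   # only the shared label r2
print(outer)
print(inner)
```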
3.2 Appending rows with append

```python
df1
```
Date | High | Low | Open | Close | Volume | Adj Close
---|---|---|---|---|---|---
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700 | 194.020004 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800 | 195.350006 |
```python
s = df1.iloc[0]
s
```

```
High         1.916800e+02
Low          1.883800e+02
Open         1.916400e+02
Close        1.912400e+02
Volume       2.786200e+07
Adj Close    1.912400e+02
Name: 2019-04-01 00:00:00, dtype: float64
```
```python
# ignore_index=False keeps the original index labels
df1.append(s, ignore_index=False)
```
Date | High | Low | Open | Close | Volume | Adj Close
---|---|---|---|---|---|---
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000.0 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700.0 | 194.020004 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800.0 | 195.350006 |
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000.0 | 191.240005 |
```python
# ignore_index=True resets the index
df1.append(s, ignore_index=True)
```
  | High | Low | Open | Close | Volume | Adj Close
---|---|---|---|---|---|---
0 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000.0 | 191.240005 |
1 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700.0 | 194.020004 |
2 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800.0 | 195.350006 |
3 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000.0 | 191.240005 |
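Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0; on current versions the same result comes from `pd.concat`. A sketch with a hypothetical frame (`base` and its column are made up):

```python
import pandas as pd

# Hypothetical frame standing in for df1
base = pd.DataFrame({"a": [1, 2]})
row = base.iloc[0]

# Equivalent of base.append(row, ignore_index=True) on modern pandas:
# turn the Series back into a one-row frame, then concat
out = pd.concat([base, row.to_frame().T], ignore_index=True)
print(out)
```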
3.3 Removing duplicates with duplicated and drop_duplicates

```python
z = df1.append(s, ignore_index=False)
z
```
Date | High | Low | Open | Close | Volume | Adj Close
---|---|---|---|---|---|---
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000.0 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700.0 | 194.020004 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800.0 | 195.350006 |
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000.0 | 191.240005 |
Flag the duplicate rows:

```python
z.duplicated()
```

```
Date
2019-04-01    False
2019-04-02    False
2019-04-03    False
2019-04-01     True
dtype: bool
```
Remove the duplicate rows:

```python
z.drop_duplicates()
```
Date | High | Low | Open | Close | Volume | Adj Close
---|---|---|---|---|---|---
2019-04-01 | 191.679993 | 188.380005 | 191.639999 | 191.240005 | 27862000.0 | 191.240005 |
2019-04-02 | 194.460007 | 191.050003 | 191.089996 | 194.020004 | 22765700.0 | 194.020004 |
2019-04-03 | 196.500000 | 193.149994 | 193.250000 | 195.350006 | 23271800.0 | 195.350006 |
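By default `drop_duplicates` compares entire rows; the `subset` and `keep` parameters restrict the comparison to chosen columns and pick which occurrence survives. A sketch on a small hypothetical frame (`dup` and its values are made up):

```python
import pandas as pd

# Hypothetical frame: two rows share the same Open but differ in Volume
dup = pd.DataFrame({"Open": [191.64, 191.09, 191.64],
                    "Volume": [100, 200, 300]})

print(dup.drop_duplicates())                      # full-row comparison: nothing dropped here
print(dup.drop_duplicates(subset="Open"))         # first occurrence of each Open kept
print(dup.drop_duplicates(subset="Open", keep="last"))  # last occurrence kept instead
```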
4 Grouping

```python
z.groupby("Open").sum()
```
Open | High | Low | Close | Volume | Adj Close
---|---|---|---|---|---
191.089996 | 194.460007 | 191.050003 | 194.020004 | 22765700.0 | 194.020004 |
191.639999 | 383.359985 | 376.760010 | 382.480011 | 55724000.0 | 382.480011 |
193.250000 | 196.500000 | 193.149994 | 195.350006 | 23271800.0 | 195.350006 |
```python
z.groupby(["Open", "Close"]).sum()
```
Open | Close | High | Low | Volume | Adj Close
---|---|---|---|---|---
191.089996 | 194.020004 | 194.460007 | 191.050003 | 22765700.0 | 194.020004 |
191.639999 | 191.240005 | 383.359985 | 376.760010 | 55724000.0 | 382.480011 |
193.250000 | 195.350006 | 196.500000 | 193.149994 | 23271800.0 | 195.350006 |
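`groupby` is not limited to `sum`; `agg` applies several aggregations in one pass. A sketch on a small hypothetical frame (`z2` and its values are made up):

```python
import pandas as pd

# Hypothetical frame: two rows share the same Open value
z2 = pd.DataFrame({"Open": [191.64, 191.09, 191.64],
                   "Volume": [100, 200, 300]})

# Several aggregations of Volume per Open group at once
stats = z2.groupby("Open")["Volume"].agg(["sum", "mean", "count"])
print(stats)
```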
Reprinted from: https://codingchaozhang.blog.csdn.net/article/details/89349197