前言
这次参加群里的实战,利用pandas 进行股票数据清洗与分析,中间遇到太多场景
- 两个Dataframe的合并,类似数据库两个表join,有一列主键key作为关联的唯一性
- 利用Pandas给折线图产生一个特定区间的按月横坐标
- 按照Dataframe中日期列分组统计
- 新增列和修改列的数据类型
- 重置索引
- DataFrame的多种取数据方式
根据这些遇到的场景,利用Google在对每一个进行深入的学习并归纳:
本次主要整理Dataframe的合并
如果你熟悉sql,这些概念理解会更容易,这些函数本质上都是以特定方式组合成新的Dataframe
Concat, Join And Merge
- 准备数据,准备数据我特意让Dataframe不一样多,这样能更接近一些特殊场景
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import pandas as pd
raw_data1 = {
"user_id": ["1", "2", "3", "4", "5"],
"first_name": ["warren", "clyde", "Allen", "Alice", "Ayoung"],
"last_name": ["ke", "gao", "Ali", "li", "wang"],
}
raw_data2 = {
"user_id": ["4", "5", "6", "7", "8", "9"],
"first_name": ["Billy", "Brian", "Bran", "Bryce", "Betty", "mason"],
"last_name": ["liu", "Black", "li", "Brice", "wei", "wei"],
"age": [25, 26, 27, 28, 30, 31],
}
raw_data3 = {
"user_id": ["1", "2", "3", "4", "5", "7", "8", "9", "10", "11"],
"score": [51, 15, 15, 61, 16, 14, 15, 1, 61, 16],
}
df_one = pd.DataFrame(raw_data1, columns=["user_id", "first_name", "last_name"])
df_two = pd.DataFrame(raw_data2, columns=["user_id", "first_name", "last_name", "age"])
df_third = pd.DataFrame(raw_data3, columns=["user_id", "score"])
3个DataFrame的数据结构如下:
df_one View:
user_id | first_name | last_name | |
---|---|---|---|
0 | 1 | warren | ke |
1 | 2 | clyde | gao |
2 | 3 | Allen | Ali |
3 | 4 | Alice | li |
4 | 5 | Ayoung | wang |
df_two View:
user_id | first_name | last_name | age | |
---|---|---|---|---|
0 | 4 | Billy | liu | 25 |
1 | 5 | Brian | Black | 26 |
2 | 6 | Bran | li | 27 |
3 | 7 | Bryce | Brice | 28 |
4 | 8 | Betty | wei | 30 |
5 | 9 | mason | wei | 31 |
df_third View:
user_id | score | |
---|---|---|
0 | 1 | 51 |
1 | 2 | 15 |
2 | 3 | 15 |
3 | 4 | 61 |
4 | 5 | 16 |
5 | 7 | 14 |
6 | 8 | 15 |
7 | 9 | 1 |
8 | 10 | 61 |
9 | 11 | 16 |
Concat之后取最多列,数据row方向(按照行堆砌)合并
1
2
df_new_row = pd.concat([df_one, df_two], sort=True)
print(df_new_row)
df_one有5行,df_two有6行,join之前结果为11行,结果table如下:
age | first_name | user_id | last_name | |
---|---|---|---|---|
0 | NaN | warren | 1 | ke |
1 | NaN | clyde | 2 | gao |
2 | NaN | Allen | 3 | Ali |
3 | NaN | Alice | 4 | li |
4 | NaN | Ayoung | 5 | wang |
0 | 25.0 | Billy | 4 | liu |
1 | 26.0 | Brian | 5 | Black |
2 | 27.0 | Bran | 6 | li |
3 | 28.0 | Bryce | 7 | Brice |
4 | 30.0 | Betty | 8 | wei |
5 | 31.0 | mason | 9 | wei |
Concat之后取最多列,数据col方向取最多行合并,行数取最多df数据行
1
2
df_new_col = pd.concat([df_one, df_two], axis=1)
print(df_new_col)
df_one有5行,df_two有6行,join之前结果为6行,同列名也是重复的,结果table如下:
user_id | first_name | last_name | user_id | first_name | last_name | age | |
---|---|---|---|---|---|---|---|
0 | 1 | warren | ke | 4 | Billy | liu | 25 |
1 | 2 | clyde | gao | 5 | Brian | Black | 26 |
2 | 3 | Allen | Ali | 6 | Bran | li | 27 |
3 | 4 | Alice | li | 7 | Bryce | Brice | 28 |
4 | 5 | Ayoung | wang | 8 | Betty | wei | 30 |
5 | NaN | NaN | NaN | 9 | mason | wei | 31 |
Merge 是将多个df合并并指定主键(Key)相同的行
根据主键user_id进行关联Merge
1
2
df_merge = pd.merge(df_new_row, df_third, on="user_id")
print(df_merge)
age | first_name | last_name | user_id | score | |
---|---|---|---|---|---|
0 | NaN | warren | ke | 1 | 51 |
1 | NaN | clyde | gao | 2 | 15 |
2 | NaN | Allen | Ali | 3 | 15 |
3 | NaN | Alice | li | 4 | 61 |
4 | 25.0 | Billy | liu | 4 | 61 |
5 | NaN | Ayoung | wang | 5 | 16 |
6 | 26.0 | Brian | Black | 5 | 16 |
7 | 28.0 | Bryce | Brice | 7 | 14 |
8 | 30.0 | Betty | wei | 8 | 15 |
9 | 31.0 | mason | wei | 9 | 1 |
根据主键user_id进行关联然后 outer join
1
2
df_merge_outer = pd.merge(df_one, df_two, on="user_id", how="outer")
print(df_merge_outer)
注意列名的变化,个人感觉这个场景使用比较少,结果视图如下:
user_id | first_name_x | last_name_x | first_name_y | last_name_y | age | |
---|---|---|---|---|---|---|
0 | 1 | warren | ke | NaN | NaN | NaN |
1 | 2 | clyde | gao | NaN | NaN | NaN |
2 | 3 | Allen | Ali | NaN | NaN | NaN |
3 | 4 | Alice | li | Billy | liu | 25.0 |
4 | 5 | Ayoung | wang | Brian | Black | 26.0 |
5 | 6 | NaN | NaN | Bran | li | 27.0 |
6 | 7 | NaN | NaN | Bryce | Brice | 28.0 |
7 | 8 | NaN | NaN | Betty | wei | 30.0 |
8 | 9 | NaN | NaN | mason | wei | 31.0 |
根据主键user_id进行关联然后 inner join
1
2
df_merge_inner = pd.merge(df_one, df_two, on="user_id", how="inner")
print(df_merge_inner)
这种场景使用最广,多表关联
user_id | first_name_x | last_name_x | first_name_y | last_name_y | age | |
---|---|---|---|---|---|---|
0 | 4 | Alice | li | Billy | liu | 25 |
1 | 5 | Ayoung | wang | Brian | Black | 26 |
根据主键user_id进行关联然后 left/right join
1
2
df_merge_left = pd.merge(df_one, df_two, on="user_id", how="left",suffixes=("_left", "_right"))
print(df_merge_left)
left/right 两者基本写法一致,我就偷个懒。除了join之外请注意重复列的设置别名
user_id | first_name_left | last_name_left | first_name_right | last_name_right | age | |
---|---|---|---|---|---|---|
0 | 1 | warren | ke | NaN | NaN | NaN |
1 | 2 | clyde | gao | NaN | NaN | NaN |
2 | 3 | Allen | Ali | NaN | NaN | NaN |
3 | 4 | Alice | li | Billy | liu | 25.0 |
4 | 5 | Ayoung | wang | Brian | Black | 26.0 |
DataFrame的合并就整理这么多,下面将整理下列的相关操作