[Dev Notes] Basic Data Analysis Operations with Pandas

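When it comes to data analysis in Python, many people would agree that Pandas is a fast, intuitive, and easy-to-use first choice. Combined with NumPy and Matplotlib, it becomes a valuable tool for big data and machine learning work.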

First, load the CSV into a DataFrame

To make later changes easier, it is a good idea to collect variable settings in a Config class.

import pandas as pd

# Config
class Config:
    TRAIN_CSV_FILEPATH = './train.csv'

# Load the DataFrame from the CSV file
config = Config()
df = pd.read_csv(config.TRAIN_CSV_FILEPATH)
print(df.head(10))

Printing the first 10 rows of the DataFrame gives the following result

   image_id label  image_width  image_height  is_tma
0         4  HGSC        23785         20008   False
1        66  LGSC        48871         48195   False
2        91  HGSC         3388          3388    True
3       281  LGSC        42309         15545   False
4       286    EC        37204         30020   False
5       431  HGSC        39991         40943   False
6       706  HGSC        75606         25965   False
7       970  HGSC        32131         18935   False
8      1020  HGSC        36585         33751   False
9      1080  HGSC        31336         23200   False
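
The loading step can also be tried without the real file. The sketch below is only an illustration: it feeds the first three rows of the output above to read_csv through an in-memory buffer, and the names csv_text, sample, and subset are made up for this example. It also shows the usecols parameter, which may help when only a few columns are needed.

```python
import io

import pandas as pd

# First three rows of the output shown above, written back as CSV text
csv_text = """image_id,label,image_width,image_height,is_tma
4,HGSC,23785,20008,False
66,LGSC,48871,48195,False
91,HGSC,3388,3388,True
"""

# read_csv accepts any file-like object, not only a path
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.dtypes)

# usecols limits parsing to the listed columns
subset = pd.read_csv(io.StringIO(csv_text), usecols=["image_id", "label"])
print(subset)
```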

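Next, inspect the basic characteristics of the DataFrame

# Show the DataFrame's dimensions as (rows, columns)
print(df.shape)

# Print the name of each column
print(df.columns)
# Index(['image_id', 'label', 'image_width', 'image_height', 'is_tma'], dtype='object')

# Count how many rows carry each label value
print(df["label"].value_counts())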

The labels and their counts are as follows

HGSC    222
EC      124
CC       99
LGSC     47
MC       46
Name: label, dtype: int64
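
When proportions matter more than raw counts, value_counts also accepts normalize=True. The following is a minimal sketch on a small made-up Series (labels is a hypothetical stand-in for the "label" column, not data from train.csv).

```python
import pandas as pd

# Hypothetical sample standing in for the "label" column
labels = pd.Series(["HGSC", "HGSC", "EC", "CC", "HGSC", "EC"], name="label")

# Absolute counts, sorted from most to least frequent
counts = labels.value_counts()

# Relative frequencies: fractions that sum to 1.0
shares = labels.value_counts(normalize=True)

print(counts)
print(shares.round(3))
```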

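For a specific column, you can also remove duplicate values, or invalid/null values

# Remove duplicate values (this runs on the "label" Series, so each distinct label is kept once)
print(df["label"].drop_duplicates())

# Remove invalid/null values
print(df["label"].dropna())

# The calls can also be chained
print(df["label"].dropna().drop_duplicates())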
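
The calls above work on a single column. To drop whole rows from the DataFrame instead, the same methods can be called on the DataFrame itself. This is a sketch on made-up data (sample and its values are hypothetical), using the subset parameter of dropna to check only the "label" column.

```python
import pandas as pd

# Hypothetical data: one fully repeated row and one missing label
sample = pd.DataFrame({
    "image_id": [4, 66, 66, 91],
    "label": ["HGSC", "LGSC", "LGSC", None],
})

# Drop rows whose "label" is missing
no_missing = sample.dropna(subset=["label"])

# Drop rows that repeat an earlier row in every column
deduplicated = no_missing.drop_duplicates()

# Distinct label values, ignoring missing ones
unique_labels = sample["label"].dropna().unique()

print(deduplicated)
print(unique_labels)
```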

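Mind the difference between iloc and loc

iloc selects by integer position, while loc selects by index label.

# Take the rows at positions 2 through 4 (the end position 5 is excluded)
print(df.iloc[2:5])

# To use loc with label values, first set that column as the index
print(df.set_index("label").loc["HGSC"])

# You can also select only specific columns to form a subset, which reduces the amount of data and speeds up later operations
print(df[["label", "image_width", "image_height"]])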
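
The difference is easiest to see when the index labels do not match the row positions. Below is a small sketch with a hypothetical DataFrame whose index is 10, 20, 30; note that an iloc slice excludes its end position, while a loc slice includes its end label.

```python
import pandas as pd

# Hypothetical data with index labels that differ from row positions
sample = pd.DataFrame(
    {"label": ["HGSC", "LGSC", "EC"], "image_width": [23785, 48871, 37204]},
    index=[10, 20, 30],
)

# Positions 0 and 1; the end position 2 is excluded
by_position = sample.iloc[0:2]

# Labels 10 through 20; the end label 20 is included
by_label = sample.loc[10:20]

# loc can also address a single cell by row label and column name
single = sample.loc[20, "label"]

print(by_position)
print(by_label)
print(single)
```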

Filtering is truly the essence of Pandas

Pandas filters data with a very intuitive syntax

# Keep the rows whose label equals HGSC
print(df[df["label"]=="HGSC"])

# isin() keeps the rows whose value appears in the given list
print(df[df["image_id"].isin([1252, 1289, 4797, 4827, 6281, 6449, 6843, 7955, 8279, 8280, 8713, 9697, 12222, 12442, 15231])])

# To negate the condition, just prefix it with ~
print(df[~df["image_id"].isin([1252, 1289, 4797, 4827, 6281, 6449, 6843, 7955, 8279, 8280, 8713, 9697, 12222, 12442, 15231])])

# For AND / OR, combine the conditions with the & or | operator (each condition needs its own parentheses)
print(df[(df["image_width"]>23785) & (df["image_height"]>48195)])
print(df[(df["image_width"]>23785) | (df["image_height"]>48195)])

# query('') with a condition string can make the filter easier to read
print(df.query("image_width > 23785 & image_height > 48195"))
print(df.query("image_width > 23785 | image_height > 48195"))
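
Thresholds do not have to be hard-coded in the query string. As a sketch on made-up data, the example below refers to local Python variables with the @ prefix and checks that the query form selects the same rows as the boolean-mask form (sample, min_width, and wanted are hypothetical names).

```python
import pandas as pd

# Hypothetical data with the same column names as the article's CSV
sample = pd.DataFrame({
    "label": ["HGSC", "LGSC", "EC", "HGSC"],
    "image_width": [23785, 48871, 37204, 75606],
    "image_height": [20008, 48195, 30020, 25965],
})

min_width = 30000
wanted = ["HGSC", "EC"]

# Boolean-mask form: each condition in its own parentheses
mask_result = sample[(sample["image_width"] > min_width) & (sample["label"].isin(wanted))]

# query form: @name refers to a local Python variable
query_result = sample.query("image_width > @min_width and label in @wanted")

print(query_result)
```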

Charts make the characteristics of the data easier to grasp at a glance

import pandas as pd
import matplotlib.pyplot as plt

# Config
class Config:
    TRAIN_CSV_FILEPATH = './train.csv'

# Load the DataFrame from the CSV file
config = Config()
df = pd.read_csv(config.TRAIN_CSV_FILEPATH)

# Count the rows for each label (value_counts sorts the result in descending order by default)
value_counts = df["label"].value_counts()

# Four stacked axes, one per chart below (a single subplot would not return an indexable array)
_, axes = plt.subplots(nrows=4, ncols=1, figsize=(10, 15))
# Pie chart showing the proportion of each label
value_counts.plot.pie(ax=axes[0], autopct='%1.2f%%', ylabel="label")
# Bar chart showing the count of each label
value_counts.plot.bar(ax=axes[1], ylabel="count", grid=True)
# Histograms showing how often values occur in a given column
df[["image_width"]].plot.hist(ax=axes[2], grid=True)
df[["image_height"]].plot.hist(ax=axes[3], grid=True)

plt.show()

Bonus: a scatter plot showing how the data variables are distributed

import plotly.express as px

# groupby() can aggregate on multiple columns
# size counts every row in a group, whether or not its values are valid; count only counts valid (non-null) values
# The result goes into a new variable so the original df is not overwritten
grouped_df = df.groupby(["image_width", "image_height", "label"], as_index=False).size()
fig = px.scatter(grouped_df, x="image_width", y="image_height", size="size", color="label", height=600, width=1200)
fig.update_xaxes(range=[1_000, 120_000])
fig.update_yaxes(scaleanchor="x", scaleratio=1, range=[1_000, 60_000])
fig.show()
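
The size versus count distinction mentioned in the comment above can be checked on a small example. This is a sketch with made-up data containing missing widths (sample and its values are hypothetical).

```python
import pandas as pd

# Hypothetical data: two rows have a missing image_width
sample = pd.DataFrame({
    "label": ["HGSC", "HGSC", "EC", "EC", "EC"],
    "image_width": [23785, None, 37204, 39991, None],
})

grouped = sample.groupby("label")

# size: number of rows per group, missing values included
sizes = grouped.size()

# count: number of non-null image_width values per group
counts = grouped["image_width"].count()

print(sizes)
print(counts)
```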