[Dev Notes] Basic Pandas Operations for Data Analysis

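When it comes to data analysis, many people would agree that Pandas is a fast, intuitive, and easy-to-use first choice in the Python ecosystem. Combined with NumPy and Matplotlib, it becomes an invaluable tool for big data and machine learning work.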

First, load the CSV into a DataFrame

To make later changes easier, it is a good idea to keep settings such as file paths together in a Config class.

import pandas as pd

# Config
class Config:
    TRAIN_CSV_FILEPATH = './train.csv'

# Load the DataFrame from the CSV file
config = Config()
df = pd.read_csv(config.TRAIN_CSV_FILEPATH)
print(df.head(10))

Printing the first 10 rows of the DataFrame gives the following

   image_id label  image_width  image_height  is_tma
0         4  HGSC        23785         20008   False
1        66  LGSC        48871         48195   False
2        91  HGSC         3388          3388    True
3       281  LGSC        42309         15545   False
4       286    EC        37204         30020   False
5       431  HGSC        39991         40943   False
6       706  HGSC        75606         25965   False
7       970  HGSC        32131         18935   False
8      1020  HGSC        36585         33751   False
9      1080  HGSC        31336         23200   False

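Next, look at the basic properties of the DataFrame

# Show the shape of the DataFrame as (rows, columns)
print(df.shape)

# Print the name of every column
print(df.columns)
# Index(['image_id', 'label', 'image_width', 'image_height', 'is_tma'], dtype='object')

# Count how many rows carry each label value
print(df["label"].value_counts())

The label values and their counts are as follows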

HGSC    222
EC      124
CC       99
LGSC     47
MC       46
Name: label, dtype: int64
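
As a small aside, value_counts() can also report proportions instead of raw counts. The sketch below uses a made-up Series rather than the real train.csv, so the values are purely illustrative.

```python
import pandas as pd

# Hypothetical toy data standing in for the "label" column of train.csv
labels = pd.Series(["HGSC", "EC", "HGSC", "CC", "HGSC", "EC"], name="label")

# Absolute counts, sorted from most to least frequent
counts = labels.value_counts()

# normalize=True returns each label's share of the total instead
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```

The shares always sum to 1, which makes them handy for a quick look at class imbalance.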

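You can also remove duplicate values, or invalid/missing values, from a specific column

# Remove duplicate values
print(df["label"].drop_duplicates())

# Remove invalid/missing values
print(df["label"].dropna())

# The two calls can also be chained
print(df["label"].dropna().drop_duplicates())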
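
The calls above work on a single column and return a Series. If you want to keep the other columns while deciding by one column, both methods accept a subset argument on the DataFrame itself. A minimal sketch, assuming a made-up toy frame whose column names merely mirror train.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror train.csv, values are made up
toy = pd.DataFrame({
    "image_id": [1, 2, 3, 4, 5],
    "label": ["HGSC", "EC", None, "HGSC", "EC"],
    "image_width": [100, 200, 300, 100, np.nan],
})

# Drop whole rows whose "label" is missing
no_missing = toy.dropna(subset=["label"])

# Keep only the first row seen for each distinct label
first_per_label = no_missing.drop_duplicates(subset=["label"])

print(first_per_label)
```

Only the "label" column is checked here, so the row with a missing image_width survives the dropna() step.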

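Mind the difference between iloc and loc

iloc selects by integer position, while loc selects by index label.

# Take the rows at positions 2 up to 5 (position 5 itself is excluded)
print(df.iloc[2:5])

# To look rows up by a column's values with loc, first set that column as the index
print(df.set_index("label").loc["HGSC"])

# Note that you can select only certain columns to form a subset, which reduces the data volume and speeds things up
print(df[["label", "image_width", "image_height"]])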
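
With the default 0, 1, 2, ... index, position and label happen to coincide, which hides the difference. The sketch below uses a made-up frame with a custom integer index so the two are easy to tell apart.

```python
import pandas as pd

# Hypothetical toy frame with a non-default integer index so that
# position and label visibly differ
toy = pd.DataFrame(
    {"label": ["HGSC", "LGSC", "EC", "CC"]},
    index=[10, 20, 30, 40],
)

# iloc: positions 1 and 2; the end position is excluded
by_position = toy.iloc[1:3]

# loc: index labels 20 through 40; the end label is included
by_label = toy.loc[20:40]

# loc can also combine a row label with a column label
single_value = toy.loc[40, "label"]

print(by_position)
print(by_label)
print(single_value)
```

The thing to remember is that an iloc slice stops before its end position, while a loc slice includes its end label.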

Filtering is where Pandas really shines

Pandas lets you filter data with a very intuitive syntax

# Keep the rows whose label equals HGSC
print(df[df["label"]=="HGSC"])

# isin() keeps the rows whose value appears in the given list
print(df[df["image_id"].isin([1252, 1289, 4797, 4827, 6281, 6449, 6843, 7955, 8279, 8280, 8713, 9697, 12222, 12442, 15231])])

# For the opposite (not in the list), just prefix the condition with ~
print(df[~df["image_id"].isin([1252, 1289, 4797, 4827, 6281, 6449, 6843, 7955, 8279, 8280, 8713, 9697, 12222, 12442, 15231])])

# For AND / OR, combine conditions with the & or | operator
print(df[(df["image_width"]>23785) & (df["image_height"]>48195)])
print(df[(df["image_width"]>23785) | (df["image_height"]>48195)])

# You can also pass the condition to query('') as a string, which is easier to read
print(df.query("image_width > 23785 & image_height > 48195"))
print(df.query("image_width > 23785 | image_height > 48195"))
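
To show that the mask form and the query() form select the same rows, here is a small sketch on a made-up frame. The thresholds and the wanted_ids list are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror train.csv, values are made up
toy = pd.DataFrame({
    "image_id": [4, 66, 91, 281],
    "image_width": [23785, 48871, 3388, 42309],
    "image_height": [20008, 48195, 3388, 15545],
})

wanted_ids = [66, 91]  # hypothetical ids to keep

# Boolean mask: each comparison needs its own parentheses because
# & and | bind more tightly than > and ==
mask_result = toy[(toy["image_width"] > 20000) & (toy["image_height"] > 20000)]

# The same condition as a query string; `and` / `or` also work here
query_result = toy.query("image_width > 20000 and image_height > 20000")

# Inside query(), @name refers to a Python variable
in_result = toy.query("image_id in @wanted_ids")

print(mask_result)
print(in_result)
```

Forgetting the parentheses in the mask form is a common source of confusing errors, which is one reason the query() string can be the more comfortable choice.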

Charts make the characteristics of the data easier to grasp at a glance

import pandas as pd
import matplotlib.pyplot as plt

# Config
class Config:
    TRAIN_CSV_FILEPATH = './train.csv'

# Load the DataFrame from the CSV file
config = Config()
df = pd.read_csv(config.TRAIN_CSV_FILEPATH)

# Count how many rows each label has (Series.value_counts counts a single column and sorts the result in descending order)
value_counts = df["label"].value_counts()

# Four stacked subplots, one for each chart below
_, axes = plt.subplots(nrows=4, ncols=1, figsize=(10, 15))
# Pie chart showing the proportion of each label
value_counts.plot.pie(ax=axes[0], autopct='%1.2f%%', ylabel="label")
# Bar chart showing the count of each label
value_counts.plot.bar(ax=axes[1], ylabel="count", grid=True)
# Histograms showing how often values occur in a given column
df[["image_width"]].plot.hist(ax=axes[2], grid=True)
df[["image_height"]].plot.hist(ax=axes[3], grid=True)

plt.show()

Bonus: a scatter plot showing how the variables are distributed

import plotly.express as px

# groupby() can aggregate on several columns at once
# size counts every row in a group, whether or not its values are valid; count only counts valid (non-null) values
df = df.groupby(["image_width", "image_height", "label"], as_index=False).size()
fig = px.scatter(df, x="image_width", y="image_height", size="size", color="label", height=600, width=1200)
fig.update_xaxes(range=[1_000, 120_000])
fig.update_yaxes(scaleanchor="x", scaleratio=1, range=[1_000, 60_000])
fig.show()
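
The difference between size and count mentioned in the comment above is easiest to see with a missing value in the data. A minimal sketch on a made-up frame (not the real train.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing width
toy = pd.DataFrame({
    "label": ["HGSC", "HGSC", "EC", "EC", "EC"],
    "image_width": [100.0, np.nan, 200.0, 300.0, 400.0],
})

grouped = toy.groupby("label")

# size(): number of rows per group, missing values included
rows_per_label = grouped.size()

# count(): number of non-null values per group, per column
widths_per_label = grouped["image_width"].count()

print(rows_per_label)
print(widths_per_label)
```

For HGSC the two results differ by one, because one of its two rows has no image_width.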