Commit 8cc7f0e3 authored by Toshihiro Nakae

add KDDCup notebooks (fixed #4)

Parent 71a59620

KDDCup99.ipynb

0 → 100644
+141 −0
%% Cell type:markdown id: tags:

# KDDCup99 10% Data Evaluation
- Download the KDDCup99 10% data over the network and evaluate anomaly detection performance.
- Running this notebook requires Python 3.6, tensorflow, pandas, numpy, and sklearn.

%% Cell type:code id: tags:

``` python
import tensorflow as tf
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

from dagmm import DAGMM
```

%% Cell type:markdown id: tags:

## Data Import

%% Cell type:code id: tags:

``` python
url_base = "http://kdd.ics.uci.edu/databases/kddcup99"

# KDDCup 10% Data
url_data = f"{url_base}/kddcup.data_10_percent.gz"
# info data (column names, col types)
url_info = f"{url_base}/kddcup.names"
```

%% Cell type:code id: tags:

``` python
# Import info data
df_info = pd.read_csv(url_info, sep=":", skiprows=1, index_col=False, names=["colname", "type"])
colnames = df_info.colname.values
coltypes = np.where(df_info["type"].str.contains("continuous"), "float", "str")
colnames = np.append(colnames, ["status"])
coltypes = np.append(coltypes, ["str"])

# Import data
df = pd.read_csv(url_data, names=colnames, index_col=False,
                 dtype=dict(zip(colnames, coltypes)))
```

%% Cell type:code id: tags:

``` python
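# One-hot encode the categorical (string) columns
X = pd.get_dummies(df.iloc[:,:-1]).values

# Create target flag
# Following the paper, "normal." records are labeled as the anomaly class (1)
# and all other records as 0, which is the reverse of the usual convention.
y = np.where(df.status == "normal.", 1, 0)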
```
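
%% Cell type:markdown id: tags:

To make the encoding and labeling step concrete, here is a minimal sketch on a hypothetical three-row frame. The toy column values are assumptions chosen for illustration and are not taken from the KDDCup99 data; only `pd.get_dummies` and the `np.where` labeling rule mirror the cell above.

%% Cell type:code id: tags:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one continuous column, one string column, and the label column.
toy = pd.DataFrame({
    "duration": [0.0, 2.0, 5.0],
    "protocol_type": ["tcp", "udp", "tcp"],
    "status": ["normal.", "smurf.", "normal."],
})

# Drop the label column, then expand the string column into indicator columns.
encoded = pd.get_dummies(toy.iloc[:, :-1])
X_toy = encoded.values

# Same labeling rule as the notebook: "normal." -> 1, everything else -> 0.
y_toy = np.where(toy.status == "normal.", 1, 0)

print(list(encoded.columns))
print(X_toy.shape, y_toy.tolist())
```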

%% Cell type:code id: tags:

``` python
# Split data 50/50 into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=123)
# Train only on records labeled 0 (the non-anomaly class)
X_train, y_train = X_train[y_train == 0], y_train[y_train == 0]
```
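
%% Cell type:markdown id: tags:

The split-then-filter pattern can be checked on a small synthetic array. This is only a sketch: the ten-row data and the names `X_demo`, `y_demo`, `X_fit`, and `y_fit` are hypothetical, while `train_test_split` is called with the same `test_size` and `random_state` as above.

%% Cell type:code id: tags:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, six rows labeled 0 and four labeled 1.
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Split in half, as in the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.50, random_state=123
)

# Keep only the rows labeled 0 for fitting; the test half keeps both classes.
mask = y_tr == 0
X_fit, y_fit = X_tr[mask], y_tr[mask]

print(X_fit.shape, X_te.shape)
```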

%% Cell type:markdown id: tags:

## Fit Data to DAGMM Model
The following settings differ from the original paper:
- $\lambda_2$ is set to 0.0001 (paper: 0.005)
- A small value ($10^{-6}$) is added to the diagonal elements of the GMM covariance (paper: no such term)

A standard scaler is applied to the input data (the default in this DAGMM implementation).
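
As an illustration of why a small diagonal term helps, the sketch below builds a singular covariance matrix from two perfectly correlated features and shows that adding $10^{-6}$ to the diagonal makes it positive definite. This is plain NumPy on hypothetical data, not the code inside the `dagmm` package.

%% Cell type:code id: tags:

```python
import numpy as np

# Hypothetical data: the second feature is exactly twice the first,
# so the sample covariance matrix is singular.
x = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
cov = np.cov(x, rowvar=False)

# Add a small constant to the diagonal elements.
eps = 1e-6
cov_reg = cov + eps * np.eye(cov.shape[0])

det_raw = np.linalg.det(cov)
det_reg = np.linalg.det(cov_reg)

# Cholesky factorization succeeds only for positive definite matrices.
chol = np.linalg.cholesky(cov_reg)

print(det_raw, det_reg)
```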

%% Cell type:code id: tags:

``` python
model = DAGMM(
    comp_hiddens=[60, 30, 10, 1], comp_activation=tf.nn.tanh,
    est_hiddens=[10, 4], est_dropout_ratio=0.5, est_activation=tf.nn.tanh,
    learning_rate=0.0001, epoch_size=200, minibatch_size=1024, random_seed=1111
)
```

%% Cell type:code id: tags:

``` python
model.fit(X_train)
```

%% Output

     epoch 100/200 : loss = 80.526
     epoch 200/200 : loss = 72.563

%% Cell type:markdown id: tags:

## Apply model to test data

%% Cell type:code id: tags:

``` python
y_pred = model.predict(X_test)
```

%% Cell type:code id: tags:

``` python
# Energy threshold to detect anomalies = 80th percentile of energies
anomaly_energy_threshold = np.percentile(y_pred, 80)
print(f"Energy threshold to detect anomaly : {anomaly_energy_threshold:.3f}")
```

%% Output

    Energy threshold to detect anomaly : 6.518
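
%% Cell type:markdown id: tags:

The percentile rule can be verified by hand on a short array. The sketch below uses ten hypothetical energy values; with the default linear interpolation of `np.percentile`, the 80th percentile falls between the 8th and 9th sorted values, so the top two values (20%) are flagged.

%% Cell type:code id: tags:

```python
import numpy as np

# Hypothetical energies for ten samples.
energies = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

# 80th percentile: position 0.8 * (10 - 1) = 7.2 in the sorted array -> 8.2.
threshold = np.percentile(energies, 80)

# Samples at or above the threshold are flagged as anomalies (1).
flags = np.where(energies >= threshold, 1, 0)

print(threshold, flags.tolist())
```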

%% Cell type:code id: tags:

``` python
# Detect anomalies from test data
y_pred_flag = np.where(y_pred >= anomaly_energy_threshold, 1, 0)
```

%% Cell type:code id: tags:

``` python
prec, recall, fscore, _ = precision_recall_fscore_support(y_test, y_pred_flag, average="binary")
print(f" Precision = {prec:.3f}")
print(f" Recall    = {recall:.3f}")
print(f" F1-Score  = {fscore:.3f}")
```

%% Output

     Precision = 0.932
     Recall    = 0.942
     F1-Score  = 0.937
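
%% Cell type:markdown id: tags:

For reference, the three metrics with `average="binary"` are computed for the positive class (label 1). The sketch below reproduces the definitions with plain NumPy on hypothetical labels; it is meant to show the arithmetic, not to replace `precision_recall_fscore_support`.

%% Cell type:code id: tags:

```python
import numpy as np

# Hypothetical ground truth and predicted flags (1 = anomaly class).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_hat = np.array([1, 1, 0, 0, 0, 1, 0, 1])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # flagged and truly positive
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # flagged but actually negative
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # positive but not flagged

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{precision:.3f} {recall:.3f} {f1:.3f}")
```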

KDDCup99_ja.ipynb

0 → 100644
+143 −0
%% Cell type:markdown id: tags:

# Anomaly Detection Evaluation on KDDCup99 10% Data
- Download the KDDCup99 10% data over the network and evaluate anomaly detection.
- Running this sample requires Python 3.6, tensorflow, pandas, numpy, and sklearn.

%% Cell type:code id: tags:

``` python
import tensorflow as tf
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

from dagmm import DAGMM
```

%% Cell type:markdown id: tags:

## Data Import

%% Cell type:code id: tags:

``` python
url_base = "http://kdd.ics.uci.edu/databases/kddcup99"

# KDDCup 10% Data
url_data = f"{url_base}/kddcup.data_10_percent.gz"
# Data attribute info (column names, column types)
url_info = f"{url_base}/kddcup.names"
```

%% Cell type:code id: tags:

``` python
# Load the data attribute info
df_info = pd.read_csv(url_info, sep=":", skiprows=1, index_col=False, names=["colname", "type"])
colnames = df_info.colname.values
coltypes = np.where(df_info["type"].str.contains("continuous"), "float", "str")
colnames = np.append(colnames, ["status"])
coltypes = np.append(coltypes, ["str"])

# Import the data itself
df = pd.read_csv(url_data, names=colnames, index_col=False,
                 dtype=dict(zip(colnames, coltypes)))
```

%% Cell type:code id: tags:

``` python
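# One-hot encode (dummy variables)
X = pd.get_dummies(df.iloc[:,:-1]).values

# Create the target variable
# "normal" is treated as anomaly (1); everything else as normal (0)
# This is the reverse of the usual convention, but it follows the paper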
y = np.where(df.status == "normal.", 1, 0)
```

%% Cell type:code id: tags:

``` python
# Split into training and evaluation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=123)
X_train, y_train = X_train[y_train == 0], y_train[y_train == 0]
```

%% Cell type:markdown id: tags:

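## Training DAGMM on the Data
The following points were changed from the paper:
- $\lambda_2$ is set to 0.0001 (0.005 in the paper)
- A small value ($10^{-6}$) is added to the diagonal of the GMM covariance matrix so that it does not become singular (not mentioned in the paper)

These choices probably depend on the data distribution and on how the data is normalized before training (for example, scaling to mean 0 and variance 1).
(With the same parameters as the paper, the accuracy was not very good.)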
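
Normalizing each column to mean 0 and variance 1 can be sketched in a few lines of NumPy. The data and the names `X_raw` and `X_scaled` are hypothetical; the population standard deviation (`ddof=0`) is used here, which is also what scikit-learn's `StandardScaler` computes.

%% Cell type:code id: tags:

```python
import numpy as np

# Hypothetical raw features on very different scales.
X_raw = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

# Per-column mean and population standard deviation.
mean = X_raw.mean(axis=0)
std = X_raw.std(axis=0)

# Guard against constant columns, which would otherwise divide by zero.
std[std == 0] = 1.0

X_scaled = (X_raw - mean) / std

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```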

%% Cell type:code id: tags:

``` python
model = DAGMM(
    comp_hiddens=[60, 30, 10, 1], comp_activation=tf.nn.tanh,
    est_hiddens=[10, 4], est_dropout_ratio=0.5, est_activation=tf.nn.tanh,
    learning_rate=0.0001, epoch_size=200, minibatch_size=1024, random_seed=1111
)
```

%% Cell type:code id: tags:

``` python
model.fit(X_train)
```

%% Output

     epoch 100/200 : loss = 80.526
     epoch 200/200 : loss = 72.563

%% Cell type:markdown id: tags:

## Apply the Trained Model to the Validation Data

%% Cell type:code id: tags:

``` python
y_pred = model.predict(X_test)
```

%% Cell type:code id: tags:

``` python
# The energy threshold is the 80th percentile (upper 20% point) of the energy distribution over all data
anomaly_energy_threshold = np.percentile(y_pred, 80)
print(f"Energy threshold to detect anomaly : {anomaly_energy_threshold:.3f}")
```

%% Output

    Energy threshold to detect anomaly : 6.518

%% Cell type:code id: tags:

``` python
# Flag anomalies in the validation data
y_pred_flag = np.where(y_pred >= anomaly_energy_threshold, 1, 0)
```

%% Cell type:code id: tags:

``` python
prec, recall, fscore, _ = precision_recall_fscore_support(y_test, y_pred_flag, average="binary")
print(f" Precision = {prec:.3f}")
print(f" Recall    = {recall:.3f}")
print(f" F1-Score  = {fscore:.3f}")
```

%% Output

     Precision = 0.932
     Recall    = 0.942
     F1-Score  = 0.937