Merge pull request #14 from tnakae/TestDataEvaluation (a9159cf2) · Commit · AIOps-NanKai / model / DAGMM

KDDCup99.ipynb

0 → 100644

+141 −0

Original line no.	Diff line no.	Diff line
		%% Cell type:markdown id: tags:

		# KDDCup99 10% Data Evaluation
		- Download the KDDCup99 10% data over the network and check anomaly detection performance.
		- Running this notebook requires Python 3.6, tensorflow, pandas, numpy, and sklearn.

		%% Cell type:code id: tags:

		``` python
		import tensorflow as tf
		import numpy as np
		import pandas as pd

		from sklearn.model_selection import train_test_split
		from sklearn.metrics import precision_recall_fscore_support

		from dagmm import DAGMM
		```

		%% Cell type:markdown id: tags:

		## Data Import

		%% Cell type:code id: tags:

		``` python
		url_base = "http://kdd.ics.uci.edu/databases/kddcup99"

		# KDDCup 10% Data
		url_data = f"{url_base}/kddcup.data_10_percent.gz"
		# info data (column names, col types)
		url_info = f"{url_base}/kddcup.names"
		```

		%% Cell type:code id: tags:

		``` python
		# Import info data
		df_info = pd.read_csv(url_info, sep=":", skiprows=1, index_col=False, names=["colname", "type"])
		colnames = df_info.colname.values
		coltypes = np.where(df_info["type"].str.contains("continuous"), "float", "str")
		colnames = np.append(colnames, ["status"])
		coltypes = np.append(coltypes, ["str"])

		# Import data
		df = pd.read_csv(url_data, names=colnames, index_col=False,
		                 dtype=dict(zip(colnames, coltypes)))
		```

		%% Cell type:code id: tags:

		``` python
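		# One-hot encode the categorical columns (dummy variables)
		X = pd.get_dummies(df.iloc[:,:-1]).values

		# Create target flag
		# Following the paper, records whose status is "normal." are labeled as anomalies (1);
		# all other records are labeled as non-anomalies (0).
		y = np.where(df.status == "normal.", 1, 0)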
		```

		%% Cell type:code id: tags:

		``` python
		# Split data into training and test halves
		X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=123)
		# Fit only on training records labeled 0 (non-anomalies)
		X_train, y_train = X_train[y_train == 0], y_train[y_train == 0]
		```
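The labeling and filtering steps in the two cells above can be illustrated on a toy array. This is a minimal sketch, not part of the notebook: the status values and the one-column feature matrix are hypothetical, and only NumPy is assumed.

```python
import numpy as np

# Hypothetical status labels in the style of the KDDCup99 label column.
status = np.array(["normal.", "smurf.", "neptune.", "normal.", "smurf.", "smurf."])
# Hypothetical one-column feature matrix, one row per record.
X = np.arange(len(status), dtype=float).reshape(-1, 1)

# Same rule as the notebook: "normal." -> 1 (anomaly), everything else -> 0.
y = np.where(status == "normal.", 1, 0)

# Keep only the rows labeled 0 for fitting, as the notebook does for X_train.
mask = y == 0
X_fit, y_fit = X[mask], y[mask]

print(y.tolist())     # labels for all six records
print(X_fit.ravel())  # features of the rows kept for fitting
```

The boolean mask is computed once and applied to both arrays so that features and labels stay aligned.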

		%% Cell type:markdown id: tags:

		## Fit Data to DAGMM Model
		The following points differ from the original paper:
		- $\lambda_2$ is set to 0.0001 (paper: 0.005)
		- A small value ($10^{-6}$) is added to the diagonal elements of the GMM covariance (paper: no additional value)

		A standard scaler is applied to the input data (the default in this DAGMM implementation).
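Why a small diagonal value helps can be seen on toy data. The sketch below is an illustration under stated assumptions, not the code used inside this DAGMM implementation: the data are synthetic, the standardization is done by hand with NumPy, and `eps` simply mirrors the $10^{-6}$ value mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the second column is an exact multiple of the first,
# so the sample covariance matrix is (numerically) singular.
x = rng.normal(size=(100, 1))
data = np.hstack([x, 2.0 * x])

# Standardize each column to mean 0 and variance 1.
scaled = (data - data.mean(axis=0)) / data.std(axis=0)

cov = np.cov(scaled, rowvar=False)
eps = 1e-6  # small value added to the diagonal (assumed to match the note above)
cov_reg = cov + eps * np.eye(cov.shape[0])

# Smallest eigenvalue before and after lifting the diagonal.
min_eig_before = np.linalg.eigvalsh(cov).min()
min_eig_after = np.linalg.eigvalsh(cov_reg).min()

# The Cholesky factorization needs a positive definite matrix;
# it succeeds for the regularized covariance.
chol = np.linalg.cholesky(cov_reg)

print(min_eig_before, min_eig_after)
```

Without the added value the smallest eigenvalue is zero up to rounding, so the matrix cannot be reliably inverted or factorized; with it, every eigenvalue is at least about `eps`.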

		%% Cell type:code id: tags:

		``` python
		model = DAGMM(
		    comp_hiddens=[60, 30, 10, 1], comp_activation=tf.nn.tanh,
		    est_hiddens=[10, 4], est_dropout_ratio=0.5, est_activation=tf.nn.tanh,
		    learning_rate=0.0001, epoch_size=200, minibatch_size=1024, random_seed=1111
		)
		```

		%% Cell type:code id: tags:

		``` python
		model.fit(X_train)
		```

		%% Output

		epoch 100/200 : loss = 80.526
		epoch 200/200 : loss = 72.563

		%% Cell type:markdown id: tags:

		## Apply model to test data

		%% Cell type:code id: tags:

		``` python
		y_pred = model.predict(X_test)
		```

		%% Cell type:code id: tags:

		``` python
		# Energy threshold to detect anomalies = 80th percentile of energies
		anomaly_energy_threshold = np.percentile(y_pred, 80)
		print(f"Energy threshold to detect anomaly : {anomaly_energy_threshold:.3f}")
		```

		%% Output

		Energy threshold to detect anomaly : 6.518

		%% Cell type:code id: tags:

		``` python
		# Detect anomalies from test data
		y_pred_flag = np.where(y_pred >= anomaly_energy_threshold, 1, 0)
		```

		%% Cell type:code id: tags:

		``` python
		prec, recall, fscore, _ = precision_recall_fscore_support(y_test, y_pred_flag, average="binary")
		print(f" Precision = {prec:.3f}")
		print(f" Recall = {recall:.3f}")
		print(f" F1-Score = {fscore:.3f}")
		```

		%% Output

		Precision = 0.932
		Recall = 0.942
		F1-Score = 0.937
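The thresholding and scoring steps above can be traced by hand on a small example. This is a sketch with made-up energies and labels (both hypothetical), computing the metrics directly with NumPy instead of sklearn so that each count is visible.

```python
import numpy as np

# Hypothetical energies for ten test samples (higher = more anomalous).
energy = np.array([0.1, 0.4, 0.2, 9.0, 0.3, 0.5, 8.0, 0.6, 0.7, 0.8])
# Hypothetical ground truth: 1 = anomaly, 0 = non-anomaly.
y_true = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 1])

# Same rule as the notebook: flag samples at or above the 80th percentile.
threshold = np.percentile(energy, 80)
y_flag = np.where(energy >= threshold, 1, 0)

# Confusion counts for the positive (anomaly) class.
tp = int(np.sum((y_flag == 1) & (y_true == 1)))
fp = int(np.sum((y_flag == 1) & (y_true == 0)))
fn = int(np.sum((y_flag == 0) & (y_true == 1)))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"threshold={threshold:.2f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three true anomalies exceed the threshold and no non-anomaly does, so precision is 1 while recall is 2/3. A fixed percentile flags a fixed share of the test set, so the result depends on how close that share is to the true anomaly ratio.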

KDDCup99_ja.ipynb

0 → 100644

+143 −0

		%% Cell type:markdown id: tags:

		# Anomaly Detection Evaluation with KDDCup99 10% Data
		- Download the KDDCup99 10% data over the network and evaluate anomaly detection.
		- Running this sample requires Python 3.6, tensorflow, pandas, numpy, and sklearn.

		%% Cell type:code id: tags:

		``` python
		import tensorflow as tf
		import numpy as np
		import pandas as pd

		from sklearn.model_selection import train_test_split
		from sklearn.metrics import precision_recall_fscore_support

		from dagmm import DAGMM
		```

		%% Cell type:markdown id: tags:

		## Data Import

		%% Cell type:code id: tags:

		``` python
		url_base = "http://kdd.ics.uci.edu/databases/kddcup99"

		# KDDCup 10% Data
		url_data = f"{url_base}/kddcup.data_10_percent.gz"
		# Data attribute information (column names and column types)
		url_info = f"{url_base}/kddcup.names"
		```

		%% Cell type:code id: tags:

		``` python
		# Load the data attribute information
		df_info = pd.read_csv(url_info, sep=":", skiprows=1, index_col=False, names=["colname", "type"])
		colnames = df_info.colname.values
		coltypes = np.where(df_info["type"].str.contains("continuous"), "float", "str")
		colnames = np.append(colnames, ["status"])
		coltypes = np.append(coltypes, ["str"])

		# Import the data itself
		df = pd.read_csv(url_data, names=colnames, index_col=False,
		                 dtype=dict(zip(colnames, coltypes)))
		```

		%% Cell type:code id: tags:

		``` python
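		# One-hot encode the data (dummy variables)
		X = pd.get_dummies(df.iloc[:,:-1]).values

		# Create the target variable
		# Records with status "normal." are labeled as anomalies (1); all others are labeled as normal (0).
		# This is the reverse of the usual convention, but it follows the paper.
		y = np.where(df.status == "normal.", 1, 0)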
		```

		%% Cell type:code id: tags:

		``` python
		# Split into training and evaluation sets
		X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=123)
		X_train, y_train = X_train[y_train == 0], y_train[y_train == 0]
		```

		%% Cell type:markdown id: tags:

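		## Fit Data to DAGMM Model
		The following points were changed from the paper:
		- $\lambda_2$ is set to 0.0001 (0.005 in the paper)
		- A small value ($10^{-6}$) is added to the diagonal elements of the GMM covariance matrix so that it does not become singular (the paper does not mention this)

		This presumably depends on the data distribution and on how the data is normalized before training (for example, to mean 0 and variance 1).
		(With the same parameters as the paper, the accuracy was not very good.)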

		%% Cell type:code id: tags:

		``` python
		model = DAGMM(
		    comp_hiddens=[60, 30, 10, 1], comp_activation=tf.nn.tanh,
		    est_hiddens=[10, 4], est_dropout_ratio=0.5, est_activation=tf.nn.tanh,
		    learning_rate=0.0001, epoch_size=200, minibatch_size=1024, random_seed=1111
		)
		```

		%% Cell type:code id: tags:

		``` python
		model.fit(X_train)
		```

		%% Output

		epoch 100/200 : loss = 80.526
		epoch 200/200 : loss = 72.563

		%% Cell type:markdown id: tags:

		## Apply the Trained Model to the Validation Data

		%% Cell type:code id: tags:

		``` python
		y_pred = model.predict(X_test)
		```

		%% Cell type:code id: tags:

		``` python
		# The energy threshold is the 80th percentile (upper 20% point) of the energy distribution over all data
		anomaly_energy_threshold = np.percentile(y_pred, 80)
		print(f"Energy threshold to detect anomaly : {anomaly_energy_threshold:.3f}")
		```

		%% Output

		Energy threshold to detect anomaly : 6.518

		%% Cell type:code id: tags:

		``` python
		# Flag anomalies in the validation data
		y_pred_flag = np.where(y_pred >= anomaly_energy_threshold, 1, 0)
		```

		%% Cell type:code id: tags:

		``` python
		prec, recall, fscore, _ = precision_recall_fscore_support(y_test, y_pred_flag, average="binary")
		print(f" Precision = {prec:.3f}")
		print(f" Recall = {recall:.3f}")
		print(f" F1-Score = {fscore:.3f}")
		```

		%% Output

		Precision = 0.932
		Recall = 0.942
		F1-Score = 0.937

README.md

+11 −6

		@@ -65,8 +65,13 @@ model.restore("./fitted_model")
		```

		## Jupyter Notebook Example
		You can use [jupyter notebook example](Example_DAGMM.ipynb).
		You can use the following Jupyter notebook examples of the DAGMM model.
		- [Simple DAGMM Example notebook](Example_DAGMM.ipynb) :
		This example uses random samples from a mixture of Gaussians.
		If you want to learn the basic usage, this notebook is recommended.
		- [KDDCup99 10% Data Evaluation](KDDCup99.ipynb) :
		Performance evaluation of anomaly detection on the KDDCup99 10% data
		under the same conditions as the original paper (requires pandas).

		# Notes
		## GMM Implementation
		@@ -86,9 +91,9 @@ elements of covariance matrix for more numerical stability
		(it is the same as the TensorFlow GMM implementation,
		and [the author of another DAGMM implementation](https://github.com/danieltan07/dagmm) also points this out)

		## Parameter of GMM Covariance ($\lambda_2$)
		Default value of $\lambda_2$ is set to 0.0001 (0.005 in original paper).
		When $\lambda_2$ is 0.005, covariances of GMM becomes too large to detect
		anomaly points. But perhaps it depends on data and preprocessing
		(for example a method of normalization). Recommended control $\lambda_2$
		## Parameter of GMM Covariance (lambda_2)
		The default value of lambda_2 is 0.0001 (0.005 in the original paper).
		When lambda_2 is 0.005, the GMM covariances become too large to detect
		anomalous points. This may depend on the data distribution and the preprocessing method
		(for example, the normalization method). Adjusting lambda_2 is recommended
		when the performance metrics are not good.
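How lambda_2 interacts with the covariance can be sketched numerically. The snippet assumes the penalty form given in the DAGMM paper, the sum of reciprocals of the diagonal covariance entries over all mixture components, weighted by lambda_2 in the objective; the function name `covariance_penalty` and the example matrices are hypothetical, and the implementation in this repository may differ in detail.

```python
import numpy as np

def covariance_penalty(covariances):
    """Sum of 1 / diagonal entry over all mixture components (assumed paper form)."""
    # covariances has shape (n_components, n_features, n_features).
    diag = np.diagonal(covariances, axis1=1, axis2=2)
    return float(np.sum(1.0 / diag))

# Two hypothetical 2-D mixture components.
covs = np.array([
    [[0.5, 0.0], [0.0, 2.0]],
    [[1.0, 0.1], [0.1, 4.0]],
])

penalty = covariance_penalty(covs)  # 1/0.5 + 1/2 + 1/1 + 1/4

# Contribution of the penalty to the objective for the two lambda_2 values discussed.
for lambda2 in (0.0001, 0.005):
    print(f"lambda_2={lambda2}: weighted penalty = {lambda2 * penalty:.6f}")
```

Because the penalty shrinks as the diagonal entries grow, a larger lambda_2 gives the optimizer more incentive to inflate the covariances, which is consistent with the behavior described above.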

README_ja.md

+9 −4

		@@ -66,8 +66,13 @@ model.restore("./fitted_model")
		```

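		## Jupyter Notebook Examples
		An [example run](./Example_DAGMM_ja.ipynb) in Jupyter notebook is provided.
		The following Jupyter notebook examples are provided.
		- [DAGMM usage example](Example_DAGMM_ja.ipynb) :
		This example shows the result of applying DAGMM to a mixture of Gaussians.
		If you want a quick overview of the usage, look at this example first.
		- [Anomaly detection evaluation with KDDCup99 10% data](KDDCup99_ja.ipynb) :
		This example runs anomaly detection on the KDDCup99 10% data under the same conditions as the paper
		and evaluates the accuracy (requires pandas).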

		# Notes
		## About the Gaussian Mixture Model (GMM) Implementation
		@@ -87,11 +92,11 @@ Jupyter notebook での[実行サンプル](./Example_DAGMM_ja.ipynb)を用意
		[the author of another DAGMM implementation](https://github.com/danieltan07/dagmm) also
		mentions the same issue)

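		## About the covariance parameter $\lambda_2$
		The default value of $\lambda_2$, the parameter that controls the diagonal elements of the covariance, is
		## About the covariance parameter λ2
		The default value of λ2, the parameter that controls the diagonal elements of the covariance, is
		set to 0.0001 (the paper recommends 0.005).
		This is because, with 0.005, the covariance became too large and large clusters
		tended to be selected. However, this probably also depends on the characteristics of the data and
		on the preprocessing procedure (for example, the data normalization method).
		If you do not get the intended accuracy, adjusting the value of $\lambda_2$ is
		If you do not get the intended accuracy, adjusting the value of λ2 is
		recommended.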