Commit 419c8dd1, authored by cuitianyu

Initial commit

.gitattributes

0 → 100644
+3 −0
*.log filter=lfs diff=lfs merge=lfs -text
*.hdf5 filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text

.gitignore

0 → 100644
+138 −0
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
#*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# data and models
#data/*/
#saved_models/*
.DS_Store
*.xml
*.iml
*.tar
*.log

.idea/.gitignore

0 → 100644
+2 −0
# Default ignored files
/workspace.xml

LICENSE

0 → 100644
+21 −0
MIT License

Copyright (c) 2022 LogIntelligence

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

0 → 100644
+79 −0
# NeuralLog
Repository for the paper: [Log-based Anomaly Detection Without Log Parsing](https://ieeexplore.ieee.org/document/9678773).

**Abstract**: Software systems often record important runtime information in system logs for troubleshooting purposes. There have been many studies that use log data to construct machine learning models for detecting system anomalies. Through our empirical study, we find that existing log-based anomaly detection approaches are significantly affected by log parsing errors that are introduced by 1) OOV (out-of-vocabulary) words, and 2) semantic misunderstandings. The log parsing errors could cause the loss of important information for anomaly detection. To address the limitations of existing methods, we propose NeuralLog, a novel log-based anomaly detection approach that does not require log parsing. NeuralLog extracts the semantic meaning of raw log messages and represents them as semantic vectors. These representation vectors are then used to detect anomalies through a Transformer-based classification model, which can capture the contextual information from log sequences. Our experimental results show that the proposed approach can effectively understand the semantic meaning of log messages and achieve accurate anomaly detection results. Overall, NeuralLog achieves F1-scores greater than 0.95 on four public datasets, outperforming the existing approaches.

## Framework
<p align="center"><img src="docs/images/framework.jpg" width="502"><br>An overview of NeuralLog</p>

NeuralLog consists of the following components:
1. **Preprocessing**: Special characters and numbers are removed from log messages.
2. **Neural Representation**: Semantic vectors are extracted from log messages using BERT.
3. **Transformer-based Classification**: A transformer-based classification model containing Positional Encoding and Transformer Encoder is applied to detect anomalies.
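
The preprocessing step can be illustrated with a minimal sketch. This is not the repository's implementation: the function name `clean_log_message` and the exact rule (keep ASCII letters only, lowercase the rest) are assumptions chosen to mirror the description "special characters and numbers are removed".

```python
import re


def clean_log_message(message: str) -> str:
    """Return a lowercase message that keeps only alphabetic tokens.

    Hypothetical helper for illustration; the real NeuralLog
    preprocessing may apply additional or different rules.
    """
    # Replace every run of non-letter characters (digits, punctuation,
    # underscores, whitespace) with a single space.
    letters_only = re.sub(r"[^A-Za-z]+", " ", message)
    # Lowercase and collapse any leading/trailing/repeated whitespace.
    return " ".join(letters_only.lower().split())


raw = ("081109 203615 148 INFO dfs.DataNode$PacketResponder: "
       "PacketResponder 1 for block blk_38865049064139660 terminating")
print(clean_log_message(raw))
# info dfs datanode packetresponder packetresponder for block blk terminating
```

The cleaned text is what a BERT tokenizer would then receive in the Neural Representation step; because BERT uses subword tokenization, unseen words in the cleaned message are still mapped to known pieces.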

[//]: # ([PyTorch version]&#40;https://github.com/LogIntelligence/LogADEmpirical&#41;)
## Requirements
1. Python 3.5 - 3.8
2. tensorflow 2.4
3. transformers
4. tf-models-official 2.4.0
5. scikit-learn
6. pandas
7. numpy
8. gensim
## Demo
- Extract Semantic Vectors

```python
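from neurallog import data_loader

# Raw BGL log file and the directory intended for its embeddings.
log_file = "../data/raw/BGL.log"
emb_dir = "../data/embeddings/BGL"  # not used in this snippet

# Group the log into sliding windows (size 20, step 5), embed each
# message with BERT, and split the result into train and test sets
# using train_ratio=0.8.
(x_tr, y_tr), (x_te, y_te) = data_loader.load_Supercomputers(
    log_file, train_ratio=0.8, windows_size=20,
    step_size=5, e_type='bert')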
```
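
The `windows_size` and `step_size` arguments above suggest a sliding-window grouping of log lines. The sketch below shows one plausible way such windows could be formed; `sliding_windows` is a hypothetical helper, and both the handling of trailing lines (dropped here) and the labeling rule (a window is anomalous if any line in it is) are assumptions rather than confirmed behavior of `data_loader`.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(items: Sequence[T], window_size: int,
                    step_size: int) -> List[List[T]]:
    """Split a sequence into overlapping fixed-size windows.

    Hypothetical helper: trailing items that do not fill a whole
    window are dropped, which may differ from the real loader.
    """
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    return [list(items[start:start + window_size])
            for start in range(0, len(items) - window_size + 1, step_size)]


# 30 per-line labels (0 = normal, 1 = anomalous); only line 22 is anomalous.
line_labels = [0] * 30
line_labels[22] = 1

label_windows = sliding_windows(line_labels, window_size=20, step_size=5)
# Assumed rule: a window is anomalous if it contains any anomalous line.
window_labels = [int(any(w)) for w in label_windows]

print(len(label_windows))  # 3 windows, starting at lines 0, 5 and 10
print(window_labels)       # [0, 1, 1]
```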
- Train/Test Transformer Model

See [notebook](demo/Transformer_based_Classification.ipynb)

- Full demo on the BGL dataset
```shell
$ pip install -r requirements.txt
$ wget https://zenodo.org/record/3227177/files/BGL.tar.gz && tar -xvzf BGL.tar.gz
$ mkdir logs && mv BGL.log logs/.
$ cd demo
$ python NeuralLog.py
```
## Data and Models
Datasets and pre-trained models can be found here: [Data](https://figshare.com/s/6d3c6a83f4828d17be79)
## Results
| Dataset | Metrics | LR | SVM | IM | LogRobust | Log2Vec | NeuralLog |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
|  | Precision | 0.99 | 0.99 | **1.00** | 0.98 | 0.94 | 0.96 |
| HDFS | Recall | 0.92 | 0.94 | 0.88 | **1.00** | 0.94 | **1.00** |
|  | F1-score | 0.96 | 0.96 | 0.94 | **0.99** | 0.94 | 0.98 |
|  | Precision | 0.13 | 0.97 | 0.13 | 0.62 | 0.80 | **0.98** |
| BGL | Recall | 0.93 | 0.30 | 0.30 | 0.96 | **0.98** | **0.98** |
|  | F1-score | 0.23 | 0.46 | 0.18 | 0.75 | 0.88 | **0.98** |
|  | Precision | 0.46 | 0.34 | - | 0.61 | 0.74 | **0.93** |
| Thunderbird | Recall | 0.91 | 0.91 | - | 0.78 | 0.94 | **1.00** |
|  | F1-score | 0.61 | 0.50 | - | 0.68 | 0.84 | **0.96** |
|  | Precision | 0.89 | 0.88 | - | 0.97 | 0.91 | **0.98** |
| Spirit | Recall | 0.96 | **1.00** | - | 0.94 | 0.96 | 0.96 |
|  | F1-score | 0.92 | 0.93 | - | 0.95 | 0.95 | **0.97** |
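
As a quick consistency check on the table, the F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes the NeuralLog column from the precision and recall values reported above; `f1_score` is a local helper written for this example, and because the inputs are already rounded to two decimals, small discrepancies would be possible in general.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for NeuralLog, copied from the table above.
neurallog_results = {
    "HDFS": (0.96, 1.00),
    "BGL": (0.98, 0.98),
    "Thunderbird": (0.93, 1.00),
    "Spirit": (0.98, 0.96),
}

for dataset, (precision, recall) in neurallog_results.items():
    print(f"{dataset}: F1 = {f1_score(precision, recall):.2f}")
# HDFS: F1 = 0.98
# BGL: F1 = 0.98
# Thunderbird: F1 = 0.96
# Spirit: F1 = 0.97
```

The recomputed values match the F1-score rows reported for NeuralLog on all four datasets.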


## Citation
If you find the code and models useful for your research, please cite the following paper:
```
@inproceedings{le2021log,
  title={Log-based anomaly detection without log parsing},
  author={Le, Van-Hoang and Zhang, Hongyu},
  booktitle={2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE)},
  pages={492--504},
  year={2021},
  organization={IEEE}
}
```