Commit f8233fd3 authored by openaiops

Initial commit


LICENSE

MIT License

Copyright (c) 2021 IntelligentDDS

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

# MicroRank
MicroRank is a novel system for locating the root causes of latency issues in microservice environments.

MicroRank extracts service latency from tracing data and then performs anomaly detection.

By combining PageRank with spectrum analysis, MicroRank assigns high scores to the service instances that cause latency issues.

![image](./fig/framwork.png)
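The spectrum-analysis idea can be illustrated with the classic Ochiai suspiciousness formula over normal/abnormal trace coverage (a minimal sketch with toy counts; the function name and numbers are ours, and the repository's `calculate_spectrum_without_delay_list` implements MicroRank's extended version):

```python
import math

def ochiai_score(n_ef, n_nf, n_ep, n_np):
    """Ochiai suspiciousness: how strongly an operation correlates
    with abnormal traces.
    n_ef: abnormal traces that cover the operation
    n_nf: abnormal traces that do not cover it
    n_ep: normal traces that cover it
    n_np: normal traces that do not cover it
    """
    denom = math.sqrt((n_ef + n_nf) * (n_ef + n_ep))
    return n_ef / denom if denom else 0.0

# Toy coverage data: operation -> (n_ef, n_nf, n_ep, n_np)
coverage = {
    "Currencyservice_Convert": (9, 1, 2, 8),
    "Frontend_Home":           (5, 5, 5, 5),
}
# Rank operations by suspiciousness, highest first
ranking = sorted(coverage, key=lambda op: ochiai_score(*coverage[op]),
                 reverse=True)
print(ranking)  # Currencyservice_Convert ranks first
```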

## Paper Download
Our paper has been published at WWW'2021.

The paper can be downloaded below:

[MicroRank: End-to-End Latency Issue Localization with Extended Spectrum Analysis in Microservice Environments](https://dl.acm.org/doi/10.1145/3442381.3449905)

## Reference
Please cite our paper if you find this work helpful. 

```
@inproceedings{microrank,
  title={MicroRank: End-to-End Latency Issue Localization with Extended Spectrum Analysis in Microservice Environments},
  author={Guangba Yu and Pengfei Chen and Hongyang Chen and Zijie Guan and Zicheng Huang and Linxiao Jing and Tianjun Weng and Xinmeng Sun and Xiaoyun Li},
  booktitle={Proceedings of the Web Conference 2021 (WWW’2021)},
  year={2021},
  organization={ACM},
  pages={3087--3098},
  doi={https://doi.org/10.1145/3442381.3449905}
}
```

## Running MicroRank

### Notices
If you want to use MicroRank in a production system, please consider the notices below. 
- Our anomaly detection module is not suitable for every microservice system. If you have a better anomaly detection module for your system, we recommend replacing ours with your approach before RCA.
- MicroRank needs more PageRank iterations in a large microservice system, and the accuracy of RCA may decline in such a system.
- We acknowledge that the accuracy of RCA may degrade when intermittent failures or broken traces are encountered.
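For readers unfamiliar with PageRank, a minimal plain power-iteration sketch over a toy call graph (the function, graph, and parameter values are ours; the repository's `pagerank.py` uses a trace-weighted, personalized variant of this idea):

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank over an adjacency dict
    {node: [successor, ...]}. Returns {node: score}, summing to 1."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node keeps the teleportation mass ...
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, succs in graph.items():
            if succs:
                # ... and passes its damped rank to its successors
                share = damping * rank[v] / len(succs)
                for s in succs:
                    new[s] += share
            else:
                # Dangling node: spread its rank uniformly
                for s in nodes:
                    new[s] += damping * rank[v] / n
        rank = new
    return rank

# Toy call graph: the frontend fans out to two backends
graph = {"frontend": ["checkout", "currency"],
         "checkout": ["currency"],
         "currency": []}
scores = pagerank(graph)
print(scores)
```

With more iterations the scores converge; on a large service graph this is where the extra iterations mentioned above are spent.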

### Replace Database

Replace the Elasticsearch address at line 12 of [preprocess_data.py](preprocess_data.py):
```
# ES address
es_url = 'http://11.11.11.24:9200'
root_index = 'root'
```

### Replace Normal Window
Line 32 in [online_rca.py](online_rca.py).

We need to set a normal window to calculate the normal average latency and variance for each microservice.

A longer window is preferred.

```
# need to replace 
normal_start = '2020-08-28 14:56:43'
normal_end = '2020-08-28 14:57:44'

span_list = get_span(start=timestamp(normal_start), end=timestamp(normal_end))
# print(span_list)
operation_list = get_service_operation_list(span_list)
print(operation_list)
operation_slo = get_operation_slo(
    service_operation_list=operation_list, span_list=span_list)
print(operation_slo)
```
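Conceptually, `get_operation_slo` reduces the normal window to a per-operation mean and standard deviation of latency. A minimal sketch, assuming a flat list of `(operation, duration_ms)` samples instead of the repository's span objects (the function name and sample data are ours):

```python
from collections import defaultdict
from statistics import mean, pstdev

def operation_slo(samples):
    """samples: iterable of (operation, duration_ms) pairs drawn from
    the normal window. Returns {operation: (mean, std_dev)}."""
    by_op = defaultdict(list)
    for op, duration in samples:
        by_op[op].append(duration)
    return {op: (mean(ds), pstdev(ds)) for op, ds in by_op.items()}

samples = [
    ("Currencyservice_Convert", 600), ("Currencyservice_Convert", 603),
    ("Frontend_Home", 120), ("Frontend_Home", 118),
]
slo = operation_slo(samples)
print(slo)  # Currencyservice_Convert -> (601.5, 1.5)
```

A longer window simply feeds more samples per operation into these statistics, which is why it is preferred.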

### Start MicroRank
```
python online_rca.py
```

## File content
```
- anomaly_detector
  - get_slo                                 # get the average latency and variance for each operation
  - system_anormaly_detect                  # determine whether the system is abnormal 
  - trace_anormaly_detect                   # determine whether the single trace is abnormal 
  - trace_list_partition                    # divide traces into normal and abnormal traces
- online_rca.py
  - calculate_spectrum_without_delay_list   # calculate spectrum result
  - online_anomaly_detect_RCA               # run MicroRank
- pagerank.py                               # calculate pagerank result
- preprocess_data.py
  - get_span 
  - get_service_operation_list 
  - get_operation_slo 
  - get_operation_duration_data 
  - get_pagerank_graph 
```
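To make the per-trace check in `anormaly_detector.py` concrete, here is a sketch of the expected-duration test with toy SLO values (`trace_is_anomalous` and the numbers are ours; the repository's `trace_anormaly_detect` applies the same count * (mean + std) formula with a 50 ms tolerance):

```python
def trace_is_anomalous(operation_count, slo, tolerance_ms=50):
    """operation_count: {operation: count, ..., 'duration': microseconds}
    slo: {operation: (mean_ms, std_ms)}
    A trace is anomalous when its real duration exceeds the sum of
    count * (mean + std) over its operations, plus a tolerance."""
    real_ms = float(operation_count['duration']) / 1000.0
    expected_ms = sum(count * (slo[op][0] + slo[op][1])
                      for op, count in operation_count.items()
                      if op != 'duration')
    return real_ms > expected_ms + tolerance_ms

slo = {"Currencyservice_Convert": (600.0, 3.0),
       "Frontend_Home": (120.0, 5.0)}
# 900 ms observed vs 603 + 125 + 50 = 778 ms allowed -> anomalous
trace = {"Currencyservice_Convert": 1, "Frontend_Home": 1,
         "duration": 900000}
print(trace_is_anomalous(trace, slo))  # True
```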
  






WWW2021_MicroRank.pdf


anormaly_detector.py

from preprocess_data import get_operation_slo
from preprocess_data import get_span
from preprocess_data import get_operation_duration_data
from preprocess_data import get_service_operation_list
import time

'''
   Input a long window of trace data and get the SLO of each operation.
   :arg
       start_time, end_time: expect more than one hour of traces
   :return
       dict mapping each operation to [mean, variance] of its latency, e.g.
       {
           "Currencyservice_Convert": [600, 3]
       }
'''


def get_slo(start_time, end_time):
    span_list = get_span(start_time, end_time)
    operation_list = get_service_operation_list(span_list)
    slo = get_operation_slo(
        service_operation_list=operation_list, span_list=span_list)
    return slo


'''
   Input a short window of trace data and calculate the expected duration
   of each trace:
       expect_duration = sum over operations of
           count(operation) * (mean_duration + 1.5 * std_duration)
   A trace is counted as anomalous if real_duration > expect_duration.
   :arg
       start_time, end_time: expect 30s or 1min of traces
   :return
       True if more than 8 anomalous traces are found, else False
'''


def system_anormaly_detect(start_time, end_time, slo, operation_list):
    span_list = get_span(start_time, end_time)
    if len(span_list) == 0:
        print("Error: Current span list is empty ")
        return False
    #operation_list = get_service_operation_list(span_list)
    operation_count = get_operation_duration_data(operation_list, span_list)

    anormaly_trace = 0
    total_trace = 0
    for trace_id in operation_count:
        total_trace += 1
        real_duration = float(operation_count[trace_id]['duration']) / 1000.0
        expect_duration = 0.0
        for operation in operation_count[trace_id]:
            if "duration" == operation:
                continue
            expect_duration += operation_count[trace_id][operation] * (
                slo[operation][0] + 1.5 * slo[operation][1])

        if real_duration > expect_duration:
            anormaly_trace += 1

    print("anormaly_trace", anormaly_trace)
    print("total_trace", total_trace)
    print()
    if anormaly_trace > 8:
        anormaly_rate = float(anormaly_trace) / total_trace
        print("anormaly_rate", anormaly_rate)
        return True

    else:
        return False


'''
   Determine whether a single trace is anomalous.
   :arg
       operation_list: operation_count[traceid], the operations of a single trace
       slo: slo dict
   :return
        True if real_duration > expect_duration + 50 ms, else False
'''


def trace_anormaly_detect(operation_list, slo):
    expect_duration = 0.0
    real_duration = float(operation_list['duration']) / 1000.0
    for operation in operation_list:
        if operation == "duration":
            continue
        expect_duration += operation_list[operation] * \
            (slo[operation][0] + slo[operation][1])

    if real_duration > expect_duration + 50:
        return True
    else:
        return False


'''
   Partition all the traces in operation_count into normal and abnormal lists.
   :arg
       operation_count: all the trace operations,
       operation_count[traceid][operation] = count
   :return
       abnormal_list: abnormal traceid list
       normal_list: normal traceid list
'''


def trace_list_partition(operation_count, slo):
    normal_list = []  # normal traceid list
    abnormal_list = []  # abnormal traceid list
    for traceid in operation_count:
        abnormal = trace_anormaly_detect(
            operation_list=operation_count[traceid], slo=slo)
        if abnormal:
            abnormal_list.append(traceid)
        else:
            normal_list.append(traceid)

    return abnormal_list, normal_list


if __name__ == '__main__':
    def timestamp(datetime):
        timeArray = time.strptime(datetime, "%Y-%m-%d %H:%M:%S")
        ts = int(time.mktime(timeArray)) * 1000
        return ts

    start = '2020-08-23 14:56:43'
    end = '2020-08-23 14:57:44'

    slo = get_slo(start_time=timestamp(start), end_time=timestamp(end))
    span_list = get_span(timestamp(start), timestamp(end))
    operation_list = get_service_operation_list(span_list)
    flag = system_anormaly_detect(start_time=timestamp(start),
                                  end_time=timestamp(end), slo=slo,
                                  operation_list=operation_list)
    print(flag)

fig/framwork.png
