Commit f8233fd3 authored by openaiops

Initial commit


LICENSE

MIT License

Copyright (c) 2021 IntelligentDDS

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

# MicroRank
MicroRank is a novel system for locating the root causes of latency issues in microservice environments.

MicroRank extracts service latency from tracing data and then performs anomaly detection.

By combining PageRank with spectrum analysis, MicroRank assigns high scores to the service instances that cause latency issues.

![image](./fig/framwork.png)
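The spectrum-analysis idea can be illustrated with the classic Ochiai suspiciousness formula over normal/abnormal trace coverage (a minimal sketch with toy counts; the function name and numbers are ours, and the repository's `calculate_spectrum_without_delay_list` implements MicroRank's extended version):

```python
import math

def ochiai_score(n_ef, n_nf, n_ep, n_np):
    """Ochiai suspiciousness: how strongly an operation correlates
    with abnormal traces.
    n_ef: abnormal traces that cover the operation
    n_nf: abnormal traces that do not cover it
    n_ep: normal traces that cover it
    n_np: normal traces that do not cover it
    """
    denom = math.sqrt((n_ef + n_nf) * (n_ef + n_ep))
    return n_ef / denom if denom else 0.0

# Toy coverage data: operation -> (n_ef, n_nf, n_ep, n_np)
coverage = {
    "Currencyservice_Convert": (9, 1, 2, 8),
    "Frontend_Home":           (5, 5, 5, 5),
}
# Rank operations by suspiciousness, highest first
ranking = sorted(coverage, key=lambda op: ochiai_score(*coverage[op]),
                 reverse=True)
print(ranking)  # Currencyservice_Convert ranks first
```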

## Paper Download
Our paper has been published at WWW'2021.

The paper can be downloaded below:

[MicroRank: End-to-End Latency Issue Localization with Extended Spectrum Analysis in Microservice Environments](https://dl.acm.org/doi/10.1145/3442381.3449905)

## Reference
Please cite our paper if you find this work helpful. 

```
@inproceedings{microrank,
  title={MicroRank: End-to-End Latency Issue Localization with Extended Spectrum Analysis in Microservice Environments},
  author={Guangba Yu and Pengfei Chen and Hongyang Chen and Zijie Guan and Zicheng Huang and Linxiao Jing and Tianjun Weng and Xinmeng Sun and Xiaoyun Li},
  booktitle={Proceedings of the Web Conference 2021 (WWW’2021)},
  year={2021},
  organization={ACM},
  pages={3087--3098},
  doi={https://doi.org/10.1145/3442381.3449905}
}
```

## Running MicroRank

### Notices
If you want to use MicroRank in a production system, please consider the notices below. 
- Our anomaly detection module is not suitable for every microservice system. If you have a better anomaly detection module for your system, we recommend replacing ours with your approach before RCA.
- MicroRank needs more PageRank iterations in a large microservice system, and the accuracy of RCA may decline in such a system.
- We acknowledge that the accuracy of RCA may degrade when intermittent failures or broken traces are encountered.
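For readers unfamiliar with PageRank, a minimal plain power-iteration sketch over a toy call graph (the function, graph, and parameter values are ours; the repository's `pagerank.py` uses a trace-weighted, personalized variant of this idea):

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank over an adjacency dict
    {node: [successor, ...]}. Returns {node: score}, summing to 1."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node keeps the teleportation mass ...
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, succs in graph.items():
            if succs:
                # ... and passes its damped rank to its successors
                share = damping * rank[v] / len(succs)
                for s in succs:
                    new[s] += share
            else:
                # Dangling node: spread its rank uniformly
                for s in nodes:
                    new[s] += damping * rank[v] / n
        rank = new
    return rank

# Toy call graph: the frontend fans out to two backends
graph = {"frontend": ["checkout", "currency"],
         "checkout": ["currency"],
         "currency": []}
scores = pagerank(graph)
print(scores)
```

With more iterations the scores converge; on a large service graph this is where the extra iterations mentioned above are spent.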

### Replace Database

Replace the Elasticsearch address at line 12 of [preprocess_data.py](preprocess_data.py):
```
# ES address
es_url = 'http://11.11.11.24:9200'
root_index = 'root'
```

### Replace Normal Window
Line 32 in [online_rca.py](online_rca.py).

We need to set a normal window to calculate the normal average latency and variance for each microservice.

A longer window is preferred.

```
# need to replace 
normal_start = '2020-08-28 14:56:43'
normal_end = '2020-08-28 14:57:44'

span_list = get_span(start=timestamp(normal_start), end=timestamp(normal_end))
# print(span_list)
operation_list = get_service_operation_list(span_list)
print(operation_list)
operation_slo = get_operation_slo(
    service_operation_list=operation_list, span_list=span_list)
print(operation_slo)
```
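Conceptually, `get_operation_slo` reduces the normal window to a per-operation mean and standard deviation of latency. A minimal sketch, assuming a flat list of `(operation, duration_ms)` samples instead of the repository's span objects (the function name and sample data are ours):

```python
from collections import defaultdict
from statistics import mean, pstdev

def operation_slo(samples):
    """samples: iterable of (operation, duration_ms) pairs drawn from
    the normal window. Returns {operation: (mean, std_dev)}."""
    by_op = defaultdict(list)
    for op, duration in samples:
        by_op[op].append(duration)
    return {op: (mean(ds), pstdev(ds)) for op, ds in by_op.items()}

samples = [
    ("Currencyservice_Convert", 600), ("Currencyservice_Convert", 603),
    ("Frontend_Home", 120), ("Frontend_Home", 118),
]
slo = operation_slo(samples)
print(slo)  # Currencyservice_Convert -> (601.5, 1.5)
```

A longer window simply feeds more samples per operation into these statistics, which is why it is preferred.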

### Start MicroRank
```
python online_rca.py
```

## File content
```
- anomaly_detector
  - get_slo                                 # get the average latency and variance for each operation
  - system_anormaly_detect                  # determine whether the system is abnormal 
  - trace_anormaly_detect                   # determine whether the single trace is abnormal 
  - trace_list_partition                    # divide traces into normal and abnormal traces
- online_rca.py
  - calculate_spectrum_without_delay_list   # calculate spectrum result
  - online_anomaly_detect_RCA               # run MicroRank
- pagerank.py                               # calculate pagerank result
- preprocess_data.py
  - get_span 
  - get_service_operation_list 
  - get_operation_slo 
  - get_operation_duration_data 
  - get_pagerank_graph 
```
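To make the per-trace check in `anormaly_detector.py` concrete, here is a sketch of the expected-duration test with toy SLO values (`trace_is_anomalous` and the numbers are ours; the repository's `trace_anormaly_detect` applies the same count * (mean + std) formula with a 50 ms tolerance):

```python
def trace_is_anomalous(operation_count, slo, tolerance_ms=50):
    """operation_count: {operation: count, ..., 'duration': microseconds}
    slo: {operation: (mean_ms, std_ms)}
    A trace is anomalous when its real duration exceeds the sum of
    count * (mean + std) over its operations, plus a tolerance."""
    real_ms = float(operation_count['duration']) / 1000.0
    expected_ms = sum(count * (slo[op][0] + slo[op][1])
                      for op, count in operation_count.items()
                      if op != 'duration')
    return real_ms > expected_ms + tolerance_ms

slo = {"Currencyservice_Convert": (600.0, 3.0),
       "Frontend_Home": (120.0, 5.0)}
# 900 ms observed vs 603 + 125 + 50 = 778 ms allowed -> anomalous
trace = {"Currencyservice_Convert": 1, "Frontend_Home": 1,
         "duration": 900000}
print(trace_is_anomalous(trace, slo))  # True
```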
  






WWW2021_MicroRank.pdf


anormaly_detector.py

from preprocess_data import get_operation_slo
from preprocess_data import get_span
from preprocess_data import get_operation_duration_data
from preprocess_data import get_service_operation_list
import time

'''
   Input a long window of trace data and get the SLO of each operation.
   :arg
       start_time, end_time: expect more than one hour of traces
   :return
       dict mapping each operation to [mean, variance] of its latency, e.g.
       {
           "Currencyservice_Convert": [600, 3]
       }
'''


def get_slo(start_time, end_time):
    span_list = get_span(start_time, end_time)
    operation_list = get_service_operation_list(span_list)
    slo = get_operation_slo(
        service_operation_list=operation_list, span_list=span_list)
    return slo


'''
   Input a short window of trace data and calculate the expected duration
   of each trace:
       expect_duration = sum over operations of
           count(operation) * (mean_duration + 1.5 * std_duration)
   A trace is counted as anomalous if real_duration > expect_duration.
   :arg
       start_time, end_time: expect 30s or 1min of traces
   :return
       True if more than 8 anomalous traces are found, else False
'''


def system_anormaly_detect(start_time, end_time, slo, operation_list):
    span_list = get_span(start_time, end_time)
    if len(span_list) == 0:
        print("Error: Current span list is empty ")
        return False
    #operation_list = get_service_operation_list(span_list)
    operation_count = get_operation_duration_data(operation_list, span_list)

    anormaly_trace = 0
    total_trace = 0
    for trace_id in operation_count:
        total_trace += 1
        real_duration = float(operation_count[trace_id]['duration']) / 1000.0
        expect_duration = 0.0
        for operation in operation_count[trace_id]:
            if "duration" == operation:
                continue
            expect_duration += operation_count[trace_id][operation] * (
                slo[operation][0] + 1.5 * slo[operation][1])

        if real_duration > expect_duration:
            anormaly_trace += 1

    print("anormaly_trace", anormaly_trace)
    print("total_trace", total_trace)
    print()
    if anormaly_trace > 8:
        anormaly_rate = float(anormaly_trace) / total_trace
        print("anormaly_rate", anormaly_rate)
        return True

    else:
        return False


'''
   Determine whether a single trace is anomalous.
   :arg
       operation_list: operation_count[traceid], the operations of a single trace
       slo: slo dict
   :return
        True if real_duration > expect_duration + 50 ms, else False
'''


def trace_anormaly_detect(operation_list, slo):
    expect_duration = 0.0
    real_duration = float(operation_list['duration']) / 1000.0
    for operation in operation_list:
        if operation == "duration":
            continue
        expect_duration += operation_list[operation] * \
            (slo[operation][0] + slo[operation][1])

    if real_duration > expect_duration + 50:
        return True
    else:
        return False


'''
   Partition all the traces in operation_count into normal and abnormal lists.
   :arg
       operation_count: all the trace operations,
       operation_count[traceid][operation] = count
   :return
       abnormal_list: abnormal traceid list
       normal_list: normal traceid list
'''


def trace_list_partition(operation_count, slo):
    normal_list = []  # normal traceid list
    abnormal_list = []  # abnormal traceid list
    for traceid in operation_count:
        abnormal = trace_anormaly_detect(
            operation_list=operation_count[traceid], slo=slo)
        if abnormal:
            abnormal_list.append(traceid)
        else:
            normal_list.append(traceid)

    return abnormal_list, normal_list


if __name__ == '__main__':
    def timestamp(datetime):
        timeArray = time.strptime(datetime, "%Y-%m-%d %H:%M:%S")
        ts = int(time.mktime(timeArray)) * 1000
        return ts

    start = '2020-08-23 14:56:43'
    end = '2020-08-23 14:57:44'

    slo = get_slo(start_time=timestamp(start), end_time=timestamp(end))
    span_list = get_span(timestamp(start), timestamp(end))
    operation_list = get_service_operation_list(span_list)
    flag = system_anormaly_detect(start_time=timestamp(start),
                                  end_time=timestamp(end), slo=slo,
                                  operation_list=operation_list)
    print(flag)

fig/framwork.png
