前言
一些概念
慢查询: 最常见在 mysql 数据库查询预警中,通过配置可以记录耗时超过一定阈值的 查询 SQL;
慢服务:分布式服务中,网关层耗时超过一定阈值的,分布式服务拓扑链路;
特点:
- 慢服务可能不会影响平均 RT,甚至 95-p/99-p 都不一定能发现;
- 慢服务一定会影响部分用户的体验;
- 慢服务会导致端到端可用性降低,可能会触发服务器报警(大部分报警阈值是 99.95);
慢服务定位的困难点
- 分布式服务往往上下游较多,如果是偶发的,很难确定是哪个服务的问题;
- 存在所有服务都未超时,但是网关总体耗时超过客户端阈值,客户端主动断开的情况;(如:Nginx499 错误)
- 即使抓取到原始请求,事后往往也难以复现,与时间点有关联性;
定位慢服务的一些解决方案
全链路埋点
实现: 针对 RPC 框架、ORM 框架、HttpClient、Redis 客户端等各种组件做统一封装,内置埋点逻辑,并提供统一的存储、查询服务,
优点:
- 对业务方接入透明,可复用性高,业务方可以做一定程度的自定义埋点;
- 接入后,业务方可以针对各种查询条件(如:特定的用户、特定的接口、特定的耗时),读取调用拓扑图和关键路径分析;
- 方便迭代和优化;
缺点:
- 有一定的技术复杂度,需要推动公司层面统一技术框架;
- 存储成本非常大,全量存储会产生 T 级别的存储数据;
- 由于存储量的问题,查询速度往往比较慢,甚至由于采样率问题导致无法抓取关键调用;
基于规则的实时采样
实现: 基于一定的规则(按业务定义的慢服务阈值)和采样率,将实时的慢服务上下游链路打印出来;
优点:
- 实现比较轻量,容易维护,自定义化程度高;
- 资源点扩展性强,不必局限于 RPC 调用、DB 查询等通用服务;
缺点:
- 有代码侵入性,需要硬编码;
- 通用性差,较难推广;
国内商店的实现
采用的方案为 "基于规则的实时采样" , 针对商店的资源调用进行切面拦截,采样后打印日志到磁盘文件.
整体设计
拦截器的实现
public class TraceInfoInterceptor extends ControllerInterceptorAdapter {
private static Logger LOGGER = LoggerFactory.getLogger(TraceInfoInterceptor.class);
public TraceInfoInterceptor() {
setPriority(700);
}
/**
* 覆盖这个方法,表示只有标注@TraceInfo 的控制器或方法才会被此拦截器拦截
*
* @return
*/
@Override
public Class<? extends Annotation> getRequiredAnnotationClass() {
return TraceInfo.class;
}
@Override
protected Object round(Invocation inv, InvocationChain chain) throws Exception {
long startTime = System.currentTimeMillis();
Object result = null;
String rosePath = RoseUtils.getRosePath(inv);
SlowLogConfig slowLogConfig = getSlowLogConfig(rosePath);
try {
SlowLogUtils.enableSlowLog(slowLogConfig);
result = super.round(inv, chain);
SlowLogUtils.log(rosePath, startTime, (int) (System.currentTimeMillis() - startTime), true);
} catch (Exception e) {
SlowLogUtils.log(rosePath, startTime, (int) (System.currentTimeMillis() - startTime), false);
throw e;
} finally {
SlowLogUtils.releaseSlowLogContext();
}
return result;
}
private SlowLogConfig getSlowLogConfig(String token) {
Map<String, SlowLogConfig> slowLogConfigFromCache = ConfigManager.getSlowLogConfigFromCache();
if (slowLogConfigFromCache.containsKey(token)) {
return slowLogConfigFromCache.get(token);
}
return slowLogConfigFromCache.get("default");
}
}
线程池的包装
/**
* servlet 容器在应用启动时初始化配置
*/
public class GrantServiceContextListener implements ServletContextListener {
private static final Logger LOGGER = LoggerFactory.getLogger(GrantServiceContextListener.class);
@Override
public void contextInitialized(ServletContextEvent sce) {
//注入各种线程上下文处理器
Task.registerThreadContextHandler(() -> IdentifierInfoHolder.getIndentifierValue(), x -> IdentifierInfoHolder.setIndentifierValue(x), () -> IdentifierInfoHolder.release());
// slowLog上下文透传
Task.registerThreadContextHandler(() -> SlowLogUtils.getSlowLogContext(), x -> SlowLogUtils.setSlowLogContext(x), () -> SlowLogUtils.releaseSlowLogContext());
Task.registerThreadContextHandler(() -> InvocationHolder.getInvocation(), x -> InvocationHolder.setInvocation(x), () -> InvocationHolder.release());
}
@Override
public void contextDestroyed(ServletContextEvent sce) {
}
}
//提交线程池任务方式一
tasks.forEach(task -> threadPoolExecutor.submit(Task.wrap(task)));
//提交线程池任务方式二
CompletableFuture.runAsync(Task.wrap(() -> {
AdResultInfo adResultInfo = handler.requestAds(adsProcessContext);
adsProcessContext.setAdResultInfo(adResultInfo);
handler.handleAdsInfo(adsProcessContext);
}), EXECUTOR_SERVICE);
资源调用切面拦截
public static <T> T getProxyThriftClient(Class iface, Object target1, Object target2) {
Object newProxyInstance = Proxy.newProxyInstance(getAppstoreClassLoader(), new Class[]{iface}, new AppStoreRpcInvocationHandler(target1, target2));
return (T) newProxyInstance;
}
private static ClassLoader getAppstoreClassLoader() {
ClassLoader contextClassLoader = Thread.currentThread().getContextClassLoader();
if (contextClassLoader != null) {
return contextClassLoader;
}
return ThriftClientManager.class.getClassLoader();
}
private static class AppStoreRpcInvocationHandler implements InvocationHandler {
// 旧服务
private Object target1;
// 新服务
private Object target2;
public AppStoreRpcInvocationHandler(Object target1, Object target2) {
this.target1 = target1;
this.target2 = target2;
}
public AppStoreRpcInvocationHandler(Object target1) {
this.target1 = target1;
this.target2 = null;
}
@Override
public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
long startTime = System.currentTimeMillis();
Throwable throwable = null;
Object result = null;
//最终需要invoke的目标方法和类
Object invokeTarget = target1;
Method invokeMethod = method;
try {
//尝试替换 invokeTarget
if (switchService() && target2 != null) {
invokeMethod = target2.getClass().getMethod(method.getName(), method.getParameterTypes());
invokeTarget = target2;
}
} catch (Exception e) {
//替换失败会再次尝试使用target1
LOGGER.error("", e);
}
try {
result = invokeMethod.invoke(invokeTarget, args);
} catch (Exception e) {
throwable = e;
}
long endTime = System.currentTimeMillis();
SlowLogUtils.RpcInvokeInfo rpcInvokeInfo = new SlowLogUtils.RpcInvokeInfo();
String rpcToken = generateRpcToken(invokeMethod);
rpcInvokeInfo.setMethod(rpcToken);
rpcInvokeInfo.setStartTime(startTime);
rpcInvokeInfo.setEndTime(endTime);
rpcInvokeInfo.setCostTime(endTime - startTime);
rpcInvokeInfo.setSuccess(throwable == null);
//统计slowLog 参见wiki:https://wiki.n.miui.com/pages/viewpage.action?pageId=403240360
SlowLogUtils.addRpcInvokeInfo(rpcInvokeInfo);
if (throwable != null) {
throw throwable;
}
return result;
}
private String generateRpcToken(Method invokedMethod) {
String className = invokedMethod.getDeclaringClass().getName();
String methodName = invokedMethod.getName();
if (className.startsWith("com.sun.proxy")) {
try {
Class<?> anInterface = invokedMethod.getDeclaringClass().getInterfaces()[0];
Method realMethod = anInterface.getMethod(invokedMethod.getName(), invokedMethod.getParameterTypes());
if (realMethod != null) {
className = realMethod.getDeclaringClass().getName();
}
} catch (Exception e) {
LOGGER.error("", e);
}
}
return className + "." + methodName;
}
/***
* 服务切换控制
* @return
*/
private boolean switchService() {
return true;
}
}
线上效果
api 慢服务配置:
[
{
"costTimeThreshold": 300,
"rpcCostTimeThreshold": 10,
"logRate": 1,
"token": "/apm/featuredV2"
}
]
2020-11-30 19:42:43 slow method path: /apm/featuredV2 , costTime: 569ms , success: true slowLogContext:
{
"rpcInvokeInfos":[
{
"costTime":32,
"endTime":39,
"method":"com.xiaomi.misearch.appstore.recommend.thrift.RecommendService$Iface.categoryRec",
"startTime":7,
"success":true
},
{
"costTime":207,
"endTime":276,
"method":"com.xiaomi.misearch.appstore.recommend.thrift.RecommendService$Iface.categoryRec",
"startTime":69,
"success":true
},
{
"costTime":80,
"endTime":365,
"method":"com.xiaomi.appstore.thrift.service.AppStoreContentService$Iface.getAppsetListByAppsetIds",
"startTime":285,
"success":true
},
{
"costTime":141,
"endTime":507,
"method":"com.xiaomi.miui.ad.thrift.service.MiuiAdQueryService$Iface.getAdsV3",
"startTime":366,
"success":true
}
],
"slowLogConfig":{
"costTimeThreshold":300,
"logRate":1,
"rpcCostTimeThreshold":10,
"token":"/apm/featuredV2"
}
# 统计慢方法排行
cat slow-8085.log | grep "method\"" | sort -b | uniq -c | sort -nk1 -r
14646 "method":"com.xiaomi.misearch.appstore.recommend.thrift.RecommendService$Iface.categoryRec",
12108 "method":"com.xiaomi.miui.ad.thrift.service.MiuiAdQueryService$Iface.getAdsV3",
7328 "method":"com.xiaomi.appstore.thrift.service.AppStoreContentService$Iface.getAppsetListByAppsetIds",
2714 "method":"com.xiaomi.appstore.thrift.service.AppStoreContentService$Iface.listAppByIds",
2401 "method":"com.xiaomi.appstore.thrift.service.AppStoreContentService$Iface.getFeedPageResult",
1627 "method":"game-personal",
159 "method":"com.xiaomi.appstore.thrift.service.AppStoreContentService$Iface.getCategoryById",
35 "method":"com.xiaomi.market.usercomment.thrift.service.UserCommentService$Iface.getCommentScore",
12 "method":"com.xiaomi.appstore.thrift.service.AppStoreContentService$Iface.getCmsListContent",
总结
发现问题:排查线上问题时候遇到的各种偶发问题、RT 毛刺问题,抽象总结问题;
分析问题:调研分析业界的解决方案,因地适宜,最终采用了"基于规则的实时采样"轻量级方案;
解决问题:"大胆开发,小心上线",验证效果,实践与理论结合;