Our project uses Spring Cloud Gateway as the gateway that receives public-internet traffic from every business line. Two Nginx instances sit in front of the gateway as reverse proxies; the gateway runs a series of pre-processing steps and then forwards requests to the downstream business services. The network path, simplified:

Gateway domain (wmg.test.com) - ... - Nginx - F5 (hardware load-balancer domain fp.wmg.test) - gateway - business systems

One day the team that operates Nginx needed to add two new Nginx machines (the reason is a long story we will skip) and use them to replace the original pair reverse-proxying the gateway.

On a dark and windy night, the Nginx team made the production change: they deployed Nginx on the two new machines and had the network team switch the gateway domain's traffic over to them. The moment the switch completed, people from the business-line teams reported that every API request through the gateway was returning 400. The Nginx team had the network team switch the gateway domain's traffic back to the original two machines, and requests through the gateway recovered. The outage lasted a little over two minutes; SRE rated the incident P1.

The Nginx team said the configuration on the two new machines was identical to the old ones and they couldn't see anything wrong, so they asked me to check the gateway for error logs. "That can't be right," I thought, "if the new configuration really were identical to the old one, requests wouldn't all be failing with 400." I checked anyway: the gateway had no error logs in that time window.

The new Nginx machines' logs showed OPTIONS requests returning 204 normally, while all GET and POST requests returned 400. (OPTIONS is the CORS preflight request, answered directly at the Nginx layer.) Sample lines from the new Nginx's log:

```text
10.x.x.x:63048 - 10.x.x.x:8099 [2025-07-17T10:36:26+08:00] 10.x.x.x:8099 OPTIONS /api/xxx HTTP/1.1 204 0 https://domain/ Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 - [req_time:0.000 s] [upstream_connect_time:- s] [upstream_header_time:- s] [upstream_resp_time:- s] [-]
10.x.x.x:63048 - 10.x.x.x:8099 [2025-07-17T10:36:26+08:00] 10.x.x.x:8099 POST /api/xxx HTTP/1.1 400 0 https://domain/ Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 - [req_time:0.001 s] [upstream_connect_time:0.000 s] [upstream_header_time:0.001 s] [upstream_resp_time:0.001 s] [10.x.x.x:8082]
```

I went to the network team; their traffic-retrospection appliance confirmed that the 400s were indeed returned by the gateway and never reached the business systems behind it. Since 400 means Bad Request, I suspected a problem with the request body and asked the network team to pull the packet captures from that window for analysis. They wouldn't hand over the captures, only the business payloads. Business payloads going through the gateway are encrypted; running the program locally, I could decrypt them without any problem, and I reported that back to the Nginx team.

The Nginx team spent more time trying to locate the problem, got nowhere, and came back to ask me to help analyze and investigate.

## Stepping in

I asked for the address of their test environment so I could first try to reproduce the problem there. A member of the Nginx team said there was no test-environment setup for this change; another member had made the change directly in production. ??

I got hold of the new and old Nginx configuration files and diffed them; they were not identical. The old Nginx's reverse-proxy configuration for the gateway:

```nginx
server {
    listen 8080;
    server_name wmg.test.com;
    add_header X-Frame-Options SAMEORIGIN;
    add_header X-Content-Type-Options nosniff;
    add_header Content-Security-Policy "frame-ancestors 'self'";
    location / {
        proxy_hide_header host;
        client_max_body_size 100m;
        add_header Access-Control-Allow-Origin $http_origin always;
        add_header Access-Control-Allow-Credentials true always;
        add_header Access-Control-Allow-Methods "GET, POST, OPTIONS, DELETE, PUT";
        add_header Access-Control-Allow-Headers ...;
        if ($request_method = OPTIONS) {
            return 204;
        }
        proxy_pass http://fp.wmg.test:8090;
    }
}
```

The new Nginx's configuration:

```nginx
upstream http_gateways {
    server fp.wmg.test:8090;
    keepalive 30;
}
server {
    listen 8080 backlog=512;
    server_name wmg.test.com;
    add_header X-Frame-Options SAMEORIGIN;
    add_header X-Content-Type-Options nosniff;
    add_header Content-Security-Policy "frame-ancestors 'self'";
    location / {
        proxy_hide_header host;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        client_max_body_size 100m;
        add_header Access-Control-Allow-Origin $http_origin always;
        add_header Access-Control-Allow-Credentials true always;
        add_header Access-Control-Allow-Methods "GET, POST, OPTIONS, DELETE, PUT";
        add_header Access-Control-Allow-Headers ...;
        if ($request_method = OPTIONS) {
            return 204;
        }
        proxy_pass http://http_gateways;
    }
}
```

The new configuration differs from the old one in two ways. First, it declares the gateway's F5 load-balancer address in an upstream block:

```nginx
upstream http_gateways {
    server fp.wmg.test:8090;
    keepalive 30;
}
```

Second, it sets the proxied protocol to HTTP/1.1 and enables keep-alive connections:

```nginx
proxy_http_version 1.1;
proxy_set_header Connection "";
```

I had the Nginx team replicate the production setup on a test-environment Nginx using the new configuration: Nginx at 10.100.8.11 listening on port 9104, the gateway at 10.100.22.48 listening on port 8081, with Nginx's port 9104 forwarding to the gateway's port 8081:

```nginx
upstream http_gateways {
    server 10.100.22.48:8081;
    keepalive 30;
}
server {
    listen 9104 backlog=512;
    server_name localhost;
    add_header X-Frame-Options SAMEORIGIN;
    add_header X-Content-Type-Options nosniff;
    add_header Content-Security-Policy "frame-ancestors 'self'";
    location / {
        proxy_hide_header host;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        client_max_body_size 100m;
        add_header Access-Control-Allow-Origin $http_origin always;
        add_header Access-Control-Allow-Credentials true always;
        add_header Access-Control-Allow-Methods "GET, POST, OPTIONS, DELETE, PUT";
        add_header Access-Control-Allow-Headers ...;
        if ($request_method = OPTIONS) {
            return 204;
        }
        proxy_pass http://http_gateways;
    }
}
```
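As an aside, when chasing this kind of proxy problem it can help to temporarily point the test Nginx at a small header-echo stub instead of the real gateway, so you can see exactly which headers the proxy forwards upstream. A minimal sketch of such a stub (a hypothetical debugging aid of my own, Python standard library only; the port and path are illustrative):

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHeaders(BaseHTTPRequestHandler):
    """Answer any GET with a JSON dump of the request headers received."""

    def do_GET(self):
        body = json.dumps(dict(self.headers)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


# Bind an ephemeral port; to stand in for the gateway you would bind its
# real port (8081 here) and point the test Nginx upstream at this stub.
server = HTTPServer(("127.0.0.1", 0), EchoHeaders)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Simulate a proxied request with an explicit Host header and inspect
# what the "upstream" actually received.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/wechat-web/actuator/info", headers={"Host": "wmg.test.com"})
resp = conn.getresponse()
forwarded = json.loads(resp.read())
print(resp.status, forwarded["Host"])  # -> 200 wmg.test.com
server.shutdown()
```

Anything the proxy adds, rewrites, or defaults shows up directly in the echoed JSON, with no packet capture needed.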
## Reproducing the problem

Sending a request through the test Nginx to the gateway and on to the backend service reproduced the problem: the response was 400.

```shell
curl -v -X GET http://10.100.8.11:9104/wechat-web/actuator/info
```

Removing the following two directives made the request return 200 again:

```nginx
proxy_http_version 1.1;
proxy_set_header Connection "";
```

## A blame from out of nowhere

I fed this finding back to the Nginx team. After investigating for half a day, they concluded that the gateway does not support long-lived (keep-alive) connections and that the gateway would have to be modified. ??

That couldn't be right. The gateway has always been released with rolling deploys: drain one machine's traffic on the F5, stop and restart the gateway service on that machine, then bring it back on the F5. Draining a leg on the F5 takes around five minutes every time precisely because long-lived connections exist. So I had to put my own work aside and spend some time proving that the gateway does support keep-alive.

On the Nginx machine I used the command line to hit the gateway's backend-service interface over an explicit keep-alive connection:

```shell
wget -d --header "Connection: keepalive" \
  http://10.100.22.48:8081/wechat-web/actuator/info \
  http://10.100.22.48:8081/wechat-web/actuator/info \
  http://10.100.22.48:8081/wechat-web/actuator/info
```

Hitting Enter produced the following debug output (repeated sections abbreviated):

```text
Setting --header (header) to Connection: keepalive
DEBUG output created by Wget 1.14 on linux-gnu.
--2025-07-17 13:45:08--  http://10.100.22.48:8081/wechat-web/actuator/info
Connecting to 10.100.22.48:8081... connected.
Created socket 3.
---request begin---
GET /wechat-web/actuator/info HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: 10.100.22.48:8081
Connection: keepalive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
transfer-encoding: chunked
Content-Type: application/vnd.spring-boot.actuator.v3+json
Date: Thu, 17 Jul 2025 05:25:34 GMT
---response end---
200 OK
Registered socket 3 for persistent reuse.
2025-07-17 13:45:08 (7.75 MB/s) - 'info' saved [83]

--2025-07-17 13:45:08--  http://10.100.22.48:8081/wechat-web/actuator/info
Reusing existing connection to 10.100.22.48:8081.
Reusing fd 3.
---request begin---
GET /wechat-web/actuator/info HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: 10.100.22.48:8081
Connection: keepalive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
...
---response end---
200 OK
2025-07-17 13:45:08 (9.47 MB/s) - 'info.1' saved [83]

--2025-07-17 13:45:08--  http://10.100.22.48:8081/wechat-web/actuator/info
Reusing existing connection to 10.100.22.48:8081.
Reusing fd 3.
...
200 OK
2025-07-17 13:45:08 (11.1 MB/s) - 'info.2' saved [83]

FINISHED --2025-07-17 13:45:08--
Total wall clock time: 0.1s
Downloaded: 3 files, 249 in 0s (9.25 MB/s)
```

The first request created socket 3, sent `Connection: keepalive`, and succeeded with HTTP status 200. The second request reused the first connection (socket 3) and also returned 200. The third request again reused socket 3 and returned 200. The gateway supports keep-alive.

I reported this back to the Nginx team. They investigated for another long stretch, then came back and said they would have to trouble me to track down and fix the problem after all.

## Deep dive

On the test-environment Nginx machine (10.100.8.11) I used tcpdump to capture the traffic exchanged with the gateway:

```shell
tcpdump -vv -i ens192 host 10.100.22.48 and tcp port 8081 -w /tmp/ng400.cap
```

I located a request whose HTTP response code was 400: the capture shows the `/wechat-web/actuator/info` request being answered with `HTTP/1.1 400 Bad Request`. Examining the request itself, one request header caught my attention: the value of `Host` was `http_gateways`.

Consulting the references: the HTTP/1.1 specification requires that an HTTP/1.1 request carry a Host header.

> - Both clients and servers MUST support the Host request-header.
> - A client that sends an HTTP/1.1 request MUST send a Host header.
> - Servers MUST report a 400 (Bad Request) error if an HTTP/1.1 request does not include a Host request-header.
> - Servers MUST accept absolute URIs.

https://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html#sec5.2
https://www.w3.org/Protocols/rfc2616/rfc2616-sec19.html#sec19.6.1.1
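The hostname restriction that matters here can be illustrated with a quick check: a valid Host is a sequence of dot-separated labels made of letters, digits, and hyphens, optionally followed by a port. A sketch (a rough approximation of the RFC 952/1123 hostname rules that strict HTTP stacks apply, not the gateway's exact validation logic):

```python
import re

# One hostname label: letters/digits, with hyphens allowed inside.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
# Dot-separated labels plus an optional :port suffix.
HOST_RE = re.compile(rf"^{LABEL}(?:\.{LABEL})*(?::\d+)?$")


def host_header_ok(value: str) -> bool:
    """Rough check: would this Host value pass a strict HTTP/1.1 parser?"""
    return bool(HOST_RE.match(value))


for host in ("fp.wmg.test:8090", "http-gateways", "httpgateways", "http_gateways"):
    print(host, host_header_ok(host))
# fp.wmg.test:8090 True
# http-gateways True
# httpgateways True
# http_gateways False
```

Only the underscore variant fails, matching the behavior observed in the capture.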
The Host value may contain the special characters `.` and `-`; `_` is not supported.

Consulting the official Nginx documentation, `proxy_set_header` has two defaults:

```nginx
proxy_set_header Host $proxy_host;
proxy_set_header Connection close;
```

https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_set_header

So with HTTP/1.1 enabled, if no Host is specified, Nginx uses `$proxy_host`. When proxying to an upstream block, `$proxy_host` is the upstream's name, and here that name contains `_`, which is not a legal Host value.

One reason HTTP/1.1 mandates the Host header in the first place is to support hosting multiple domains on a single IP address (name-based virtual hosting), letting backend services dispatch on the incoming Host:

> Older HTTP/1.0 clients assumed a one-to-one relationship of IP addresses and servers; there was no other established mechanism for distinguishing the intended server of a request than the IP address to which that request was directed. The changes outlined above will allow the Internet, once older HTTP clients are no longer common, to support multiple Web sites from a single IP address, greatly simplifying large operational Web servers, where allocation of many IP addresses to a single host has created serious problems.

It follows that any framework that adheres to the HTTP/1.1 specification (Tomcat, Spring Cloud Gateway, ...) responds with 400 when it parses a Host header that is not in a legal format.

I set up a local environment and debugged the gateway code. In Spring Cloud Gateway's HTTP-request adapter class, `ReactorHttpHandlerAdapter`, the `apply` method responds with 400 when parsing the Host fails. The logic of that method:

```java
public Mono<Void> apply(HttpServerRequest reactorRequest, HttpServerResponse reactorResponse) {
    NettyDataBufferFactory bufferFactory = new NettyDataBufferFactory(reactorResponse.alloc());
    try {
        ReactorServerHttpRequest request = new ReactorServerHttpRequest(reactorRequest, bufferFactory);
        ServerHttpResponse response = new ReactorServerHttpResponse(reactorResponse, bufferFactory);
        if (request.getMethod() == HttpMethod.HEAD) {
            response = new HttpHeadResponseDecorator(response);
        }
        return this.httpHandler.handle(request, response)
                .doOnError(ex -> logger.trace(request.getLogPrefix() + "Failed to complete: " + ex.getMessage()))
                .doOnSuccess(aVoid -> logger.trace(request.getLogPrefix() + "Handling completed"));
    }
    catch (URISyntaxException ex) {
        if (logger.isDebugEnabled()) {
            logger.debug("Failed to get request URI: " + ex.getMessage());
        }
        reactorResponse.status(HttpResponseStatus.BAD_REQUEST);
        return Mono.empty();
    }
}
```

Spring Cloud Gateway logs this kind of spec violation at debug level; production runs at info level, which is why no such log line ever appeared.

## Solution

Since HTTP/1.1 requires a Host header, and Nginx falls back to a default whenever the configuration does not explicitly set the Host it passes along, the fix is to add a directive to the Nginx configuration that overrides that default. Per the Nginx documentation, the following directive solves it:

```nginx
proxy_set_header Host $host;
```

After adding this directive to the test-environment 9104 proxy configuration and re-running the request, it returned 200 normally. The complete configuration:

```nginx
upstream http_gateways {
    server 10.100.22.48:8081;
    keepalive 30;
}
server {
    listen 9104 backlog=512;
    server_name wmg.test.com;
    add_header X-Frame-Options SAMEORIGIN;
    add_header X-Content-Type-Options nosniff;
    add_header Content-Security-Policy "frame-ancestors 'self'";
    location / {
        proxy_set_header Host $host;
        proxy_hide_header host;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        client_max_body_size 100m;
        add_header Access-Control-Allow-Origin $http_origin always;
        add_header Access-Control-Allow-Credentials true always;
        add_header Access-Control-Allow-Methods "GET, POST, OPTIONS, DELETE, PUT";
        add_header Access-Control-Allow-Headers ...;
        if ($request_method = OPTIONS) {
            return 204;
        }
        proxy_pass http://http_gateways;
    }
}
```

This is not the only possible fix. You can also rename the upstream to drop the unsupported `_`, for example `http-gateways` or `httpgateways`, or set the Host explicitly to the domain, e.g.:

```nginx
proxy_set_header Host wmg.test.com;
```
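To keep this class of problem from regressing, a pre-deploy check could scan an Nginx configuration for upstream names that would produce an illegal default Host header. A rough sketch (a hypothetical helper of my own, not an official tool; the regex deliberately ignores comments and other edge cases):

```python
import re

# Match "upstream <name> {" and capture the name.
UPSTREAM_RE = re.compile(r"\bupstream\s+([^\s{]+)\s*\{")


def risky_upstreams(nginx_conf: str) -> list:
    """Return upstream names containing '_', which is illegal in a Host
    header and becomes the default Host via $proxy_host when no explicit
    proxy_set_header Host directive is configured."""
    return [name for name in UPSTREAM_RE.findall(nginx_conf) if "_" in name]


conf = """
upstream http_gateways { server fp.wmg.test:8090; keepalive 30; }
upstream http-gateways { server fp.wmg.test:8090; }
"""
print(risky_upstreams(conf))  # -> ['http_gateways']
```

Run against the new configuration above, such a check would have flagged `http_gateways` before the change ever reached production.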