python - Parsing raw HTTP headers

Reposted. Author: IT老高. Updated: 2023-10-28 21:50:20

I have a raw HTTP string and I would like to represent its fields in an object. Is there any way to parse the individual headers from an HTTP string?

'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n
[...]'

Best answer

Update: It’s 2019, so I have rewritten this answer for Python 3, following a confused comment from a programmer trying to use the code. The original Python 2 code is now down at the bottom of the answer.

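The standard library provides excellent tools both for parsing RFC 822 headers and for parsing entire HTTP requests. Here is an example request string (note that Python treats it as one big string, even though we break it across several lines for readability) that we can feed to my examples: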

request_text = (
    b'GET /who/ken/trust.html HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    b'Accept: text/html;q=0.9,text/plain\r\n'
    b'\r\n'
)

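As @TryPyPy points out, you can use Python's email message library to parse the headers, though we should add that the resulting Message object acts like a dictionary of headers once you are done creating it: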

from email.parser import BytesParser
request_line, headers_alone = request_text.split(b'\r\n', 1)
headers = BytesParser().parsebytes(headers_alone)

print(len(headers)) # -> "3"
print(headers.keys()) # -> ['Host', 'Accept-Charset', 'Accept']
print(headers['Host']) # -> "cm.bell-labs.com"
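To make the "acts like a dictionary" remark concrete, here is a minimal sketch of a few Message behaviours that matter for HTTP headers: case-insensitive lookup, a default for missing headers, and access to repeated headers. The sample header block and the variable names are made up for illustration and are not part of the original answer.

```python
from email.parser import BytesParser

# Hypothetical header block with a repeated "Accept" header.
raw_headers = (
    b'Host: example.com\r\n'
    b'Accept: text/html\r\n'
    b'Accept: text/plain\r\n'
    b'\r\n'
)

# headersonly=True tells the parser not to look for a message body.
headers = BytesParser().parsebytes(raw_headers, headersonly=True)

host = headers['HOST']                   # header lookup ignores case
missing = headers.get('X-Missing', '')   # default for an absent header
accepts = headers.get_all('Accept')      # every value of a repeated header
pairs = headers.items()                  # (name, value) pairs in original order
```

Unlike a plain dict, indexing a missing header returns None rather than raising KeyError, and indexing a repeated header returns only its first value, which is why get_all() is used above.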

But this, of course, ignores the request line, or leaves you to parse it yourself. It turns out that there is a much better solution.
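For comparison, parsing the request line by hand might look like the sketch below. The helper name split_request and its error handling are assumptions made for illustration. It only checks that the line has three whitespace-separated parts and does not validate the method or the version.

```python
from email.parser import BytesParser

def split_request(raw: bytes):
    """Return (method, target, version, headers) for a raw HTTP request."""
    # Everything before the first CRLF is the request line.
    request_line, _, header_block = raw.partition(b'\r\n')
    parts = request_line.decode('iso-8859-1').split()
    if len(parts) != 3:
        raise ValueError(f'malformed request line: {request_line!r}')
    method, target, version = parts
    # The rest is an RFC 822-style header block.
    headers = BytesParser().parsebytes(header_block, headersonly=True)
    return method, target, version, headers

sample = (
    b'GET /who/ken/trust.html HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'\r\n'
)
method, target, version, headers = split_request(sample)
```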

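The standard library will parse HTTP for you if you use its BaseHTTPRequestHandler. Though its documentation is a bit obscure (a problem with the whole suite of HTTP and URL tools in the standard library), all you have to do to make it parse a string is (a) wrap your string in a BytesIO(), (b) read the raw_requestline so that it stands ready to be parsed, and (c) capture any error codes that occur during parsing instead of letting it try to write them back to the client (since we don't have one!).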

So here is our specialization of the standard library class:

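from http.server import BaseHTTPRequestHandler
from io import BytesIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        # Skip the socket-based base __init__; feed the parser from memory.
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        # Record the error instead of writing a response to a client.
        # The optional arguments match the Python 3 signature, since
        # parse_request() can pass a third "explain" argument.
        self.error_code = code
        self.error_message = message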

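Again, I wish the standard library folks had realized that HTTP parsing should be factored out in a way that does not require us to write nine lines of code to call it properly, but what can you do? Here is how you would use this simple class: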

# Using this new class is really easy!

request = HTTPRequest(request_text)

print(request.error_code) # None (check this first)
print(request.command) # "GET"
print(request.path) # "/who/ken/trust.html"
print(request.request_version) # "HTTP/1.1"
print(len(request.headers)) # 3
print(request.headers.keys()) # ['Host', 'Accept-Charset', 'Accept']
print(request.headers['host']) # "cm.bell-labs.com"

If there is an error during parsing, error_code will not be None:

# Parsing can result in an error code and message

request = HTTPRequest(b'GET\r\nHeader: Value\r\n\r\n')

print(request.error_code) # 400 (an HTTPStatus member that compares equal to 400)
print(request.error_message) # "Bad request syntax ('GET')"

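I prefer using the standard library like this because I suspect that its authors have already encountered and resolved any edge cases that might bite me if I tried to re-implement an Internet specification myself using regular expressions.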
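As a further sketch, the same approach can be applied to the kind of request shown in the question, with urllib.parse used to break the request target into a path and query parameters. The request below keeps only a few of the question's headers for brevity, and the variable names raw, url and query are made up for illustration. The class is repeated so that the example runs on its own.

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO
from urllib.parse import parse_qs, urlsplit

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        # Record the error rather than replying to a (nonexistent) client.
        self.error_code = code
        self.error_message = message

# Shortened version of the request from the question.
raw = (
    b'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\n'
    b'Host: www.google.com\r\n'
    b'Connection: keep-alive\r\n'
    b'Accept-Encoding: gzip,deflate,sdch\r\n'
    b'\r\n'
)

request = HTTPRequest(raw)
url = urlsplit(request.path)   # request.path still includes the query string
query = parse_qs(url.query)    # maps each parameter name to a list of values

print(request.command, url.path)
print(query)
print(request.headers['Host'])
```

Note that request.path holds the full request target, so the query string has to be separated out explicitly, as done here with urlsplit().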

Old Python 2 code

Here is the original code from when I first wrote this answer:

request_text = (
    'GET /who/ken/trust.html HTTP/1.1\r\n'
    'Host: cm.bell-labs.com\r\n'
    'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    'Accept: text/html;q=0.9,text/plain\r\n'
    '\r\n'
)

And:

# Ignore the request line and parse only the headers

from mimetools import Message
from StringIO import StringIO
request_line, headers_alone = request_text.split('\r\n', 1)
headers = Message(StringIO(headers_alone))

print len(headers) # -> "3"
print headers.keys() # -> ['accept-charset', 'host', 'accept']
print headers['Host'] # -> "cm.bell-labs.com"

And:

from BaseHTTPServer import BaseHTTPRequestHandler
from StringIO import StringIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = StringIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message):
        self.error_code = code
        self.error_message = message

And:

# Using this new class is really easy!

request = HTTPRequest(request_text)

print request.error_code # None (check this first)
print request.command # "GET"
print request.path # "/who/ken/trust.html"
print request.request_version # "HTTP/1.1"
print len(request.headers) # 3
print request.headers.keys() # ['accept-charset', 'host', 'accept']
print request.headers['host'] # "cm.bell-labs.com"

And:

# Parsing can result in an error code and message

request = HTTPRequest('GET\r\nHeader: Value\r\n\r\n')

print request.error_code # 400
print request.error_message # "Bad request syntax ('GET')"

Regarding python - parsing raw HTTP headers, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/4685217/