
python - How can I manually manage memory in Python?

Reposted · Author: 行者123 · Updated: 2023-11-30 22:23:45

When I run my program (a web crawler) in parallel, it consumes an unusually large amount of memory/RAM on my system. I also tested other web crawlers, and mine uses twice as much memory as they do. So my question is: how can I manually manage memory or RAM in Python (if that is possible)?

Here is my code:

from bs4 import BeautifulSoup
import requests
import MySQLdb as sql
import time
import warnings

print("starting")

warnings.filterwarnings('ignore')

db = sql.connect("localhost", "root", "arpit", "website")
cursor = db.cursor()
db.autocommit(True)

print("connected to database")

url = "http://www.example.com"
extension = ".com"
print("scrapping url -", url)

r = requests.head(url)
cursor.execute("insert ignore into urls(urls,status,status_code) values(%s,'pending',%s)", [url, r.status_code])

cursor.execute("select status from urls where status ='pending' limit 1")
result = str(cursor.fetchone())

while (result != "None"):

    cursor.execute("select urls from urls where status ='pending' limit 1")
    result = str(cursor.fetchone())

    s_url = result[2:-3]

    cursor.execute("update urls set status = 'done' where urls= %s ", [s_url])

    if "https" in url:
        url1 = url[12:]
    else:
        url1 = url[11:]

    zone = 0
    while True:
        try:
            r = requests.get(s_url, timeout=60)
            break
        except:
            if s_url == "":
                print("done")
                break
            elif zone >= 4:
                print("this url is not valid -", s_url)
                break
            else:
                print("Oops! may be connection was refused. Try again...", s_url)
                time.sleep(0.2)
                zone = zone + 1

    soup = BeautifulSoup(r.content.lower(), 'lxml')

    links = soup.find_all("a")

    for x in links:
        a = x.get('href')
        if a is not None and a != "":

            if a != "" and a.find("\n") != -1:
                a = a[0:a.find("\n")]

            if a != "" and a[-1] == "/":
                a = a[0:-1]

            if a != "":
                common_extension = [',',' ',"#",'"','.mp3',"jpg",'.wav','.wma','.7z','.deb','.pkg','.rar','.rpm','.tar','.zip','.bin','.dmg','.iso','.toast','.vcd','.csv','.dat','.log','.mdb','.sav','.sql','.apk','.bat','.exe','.jar','.py','.wsf','.fon','.ttf','.bmp','.gif','.ico','.jpeg','.png','.part','.ppt','.pptx','.class','.cpp','.java','.swift','.ods','.xlr','.xls','.xlsx','.bak','.cab','.cfg','.cpl','.dll','.dmp','.icns','.ini','.lnk','.msi','.sys','.tmp','.3g2','.3gp','.avi','.flv','.h264','.m4v','.mkv','.mov','.mp4','.mpg','.vob','.wmv','.doc','.pdf','.txt']
                for ext in common_extension:
                    if ext in a:
                        a = ""
                        break

            if a != "":
                if a[0:5] == '/http':
                    a = a[1:]
                if a[0:6] == '//http':
                    a = a[2:]

                if a[0:len(url1) + 12] == "https://www." + url1:
                    cursor.execute("insert ignore into urls(urls,status,status_code) values(%s,'pending',%s)",
                                   [a, r.status_code])
                elif a[0:len(url1) + 11] == "http://www." + url1:
                    cursor.execute("insert ignore into urls(urls,status,status_code) values(%s,'pending',%s)",
                                   [a, r.status_code])
                elif a[0:len(url1) + 8] == "https://" + url1:
                    cursor.execute("insert ignore into urls(urls,status,status_code) values(%s,'pending',%s)",
                                   [url + (a[(a.find(extension + "/")) + 4:]), r.status_code])
                elif a[0:len(url1) + 7] == "http://" + url1:
                    cursor.execute("insert ignore into urls(urls,status,status_code) values(%s,'pending',%s)",
                                   [url + (a[(a.find(extension + "/")) + 4:]), r.status_code])
                elif a[0:2] == "//" and a[0:3] != "///" and "." not in a and "http" not in a and "www." not in a:
                    cursor.execute("insert ignore into urls(urls,status,status_code) values(%s,'pending',%s)",
                                   [url + a[1:], r.status_code])
                elif a[0:1] == "/" and a[0:2] != "//" and "." not in a and "http" not in a and "www." not in a:
                    cursor.execute("insert ignore into urls(urls,status,status_code) values(%s,'pending',%s)",
                                   [url + a[0:], r.status_code])
                elif 'http' not in a and 'www.' not in a and "." not in a and a[0] != "/":
                    cursor.execute("insert ignore into urls(urls,status,status_code) values(%s,'pending',%s)",
                                   [url + '/' + a, r.status_code])

cursor.execute("alter table urls drop id")
cursor.execute("alter table urls add id int primary key not null auto_increment first")
print("new id is created")

Best Answer

Your code is very memory-inefficient because it does a great deal of slicing, and since strings are immutable, every slice allocates a new object.

For example:

if a[0:5] == '/http':
    a = a[1:]

allocates a new string, copying the first five characters of `a`, compares it with `'/http'`, and then throws it away; moreover, if the comparison succeeds, it allocates another new string, copying `a` from index 1 onward, and discards the old `a`. If `a` is long, or if this happens a lot, it can become a serious problem.
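A minimal sketch of this allocation cost (the URL string is made up for illustration; the size figure assumes CPython, where an ASCII `str` costs roughly its length in bytes plus a fixed header):

```python
import sys

a = "/https://example.com/some/very/long/path" * 100  # a long-ish string

# a[0:5] materializes a brand-new 5-character string just for the comparison
prefix = a[0:5]
assert prefix == "/http"
assert prefix is not a          # a separate object was allocated

# a[1:] copies almost the entire string into yet another new object
tail = a[1:]
assert tail is not a
assert sys.getsizeof(tail) > len(a) - 1  # ~len(a) bytes of fresh allocation (CPython)
```

Done once this is harmless; done for every `<a>` tag on every crawled page, the copies add up.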

Take a look at `memoryview` — it is a way to slice byte strings (`bytes` in Python 3) without copying them.
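For instance, a sketch of zero-copy slicing over a `bytes` URL (the data here is illustrative):

```python
data = b"/https://example.com/path"
view = memoryview(data)

# Slicing a memoryview shares the underlying buffer instead of copying it
head = view[0:5]
assert head.tobytes() == b"/http"   # bytes are materialized only on demand

tail = view[1:]                      # still zero-copy
assert tail.tobytes().startswith(b"https://")
```

Note that `memoryview` works on `bytes`/`bytearray`, not on `str`, so text would need to be handled as bytes to benefit.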

There are many other ways to optimize the code:

  1. Instead of re-defining `common_extension` for every link, define it once before the loop.

  2. Instead of `a[0:5] == '/http'`, use `a.startswith('/http')`.

  3. Instead of the first four `url1` comparisons, use a regular expression, such as `re.match('https?://(www\.)?' + re.escape(url1), a)`.

     If you do that, then instead of concatenating `'https?://(www\.)?'` with `re.escape(url1)` for every link, do it once before the loop — and even `re.compile` the regex there.
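The suggestions above could be combined into something like the following sketch (the `classify` helper and the trimmed extension list are hypothetical stand-ins for the original loop body, shown only to illustrate the pattern):

```python
import re

url1 = "example.com"  # assumed site host, as in the original code

# 1. Build constant data once, outside the per-link loop
COMMON_EXTENSIONS = (',', ' ', '#', '"', '.mp3', '.jpg', '.zip', '.pdf')

# 3. Concatenate and compile the same-site prefix regex once
SAME_SITE = re.compile(r'https?://(www\.)?' + re.escape(url1))

def classify(a):
    # 2. startswith() avoids allocating a throwaway slice for the comparison
    if a.startswith('/http'):
        a = a[1:]
    if any(ext in a for ext in COMMON_EXTENSIONS):
        return 'skip'
    if SAME_SITE.match(a):
        return 'internal'
    return 'other'

assert classify('https://www.example.com/page') == 'internal'
assert classify('http://example.com/page') == 'internal'
assert classify('https://other.org/about') == 'other'
assert classify('https://www.example.com/song.mp3') == 'skip'
```

One compiled pattern replaces four slice-and-compare branches, and the constant tuple is no longer rebuilt for every link.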

Regarding python - How can I manually manage memory in Python?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47962083/
