
php - Optimizing code/database for a 50,000 row table


I have a list of 300 RSS news feeds stored in a database, and every few minutes I grab the contents of every feed. Each feed contains around 10 articles, and I want to store each article in the database.

The problem: my database table is over 50,000 rows and growing fast; every time my script runs to fetch new feeds, it adds at least 100 more rows. It has reached the point where my database is hitting 100% CPU usage.

The question: how do I optimize my code/database?

Note: I do not care about my server's CPU (it runs at under 15%). I care very much about my database's CPU.

Possible solutions I can see:

  • Currently, every time the script runs, it calls $this->set_content_source_cache, which returns an array of every link in the table (array('link', 'link', 'link', etc.)). This is used later to cross-reference and make sure there are no duplicate links. Would skipping this and simply changing the DB so the link column is unique speed things up (see the sketch after this list)? Possibly put this array into memcached so that it only has to be rebuilt once per hour/day?
  • A break statement so that, if a link is already set, the loop moves on to the next source?
  • Only check links that are less than a week old?
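
For the first bullet, the unique constraint could be added with a one-time statement like the following (a sketch: the index name uq_content_link is my own choice, while the content table and link column come from the code below). With this in place, MySQL itself rejects any insert whose link already exists:

ALTER TABLE content ADD UNIQUE INDEX uq_content_link (link);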

Here is what I am doing:

// $this->set_content_source_cache goes through all 50,000 rows and builds an
// array keyed by link, so the isset() lookup below can detect duplicates
$cache_source_array = $this->set_content_source_cache();

$qry = "select source, source_id, source_name, geography_id, industry_id from content_source";
foreach ($this->sql->result($qry) as $row_source) {

    $feed = simplexml_load_file($row_source['source']);

    if (!empty($feed)) {

        for ($i = 0; $i < 10; $i++) {
            // most often there are only 10 items per RSS feed. Since we check
            // every 2 minutes, if there are a few more, then meh, we probably
            // got them last time around
            if (!empty($feed->channel->item[$i])) {
                // make sure the item is not blank; cast to string so values,
                // not SimpleXMLElement objects, are stored
                $title   = (string) $feed->channel->item[$i]->title;
                $content = (string) $feed->channel->item[$i]->description;
                $link    = (string) $feed->channel->item[$i]->link;
                // SimpleXML is case-sensitive: the RSS element is pubDate
                $pubdate = (string) $feed->channel->item[$i]->pubDate;
                $source_id    = $row_source['source_id'];
                $source_name  = $row_source['source_name'];
                $geography_id = $row_source['geography_id'];
                $industry_id  = $row_source['industry_id'];

                // random stuff in here for each link / article to make it data-worthy
                if (!isset($cache_source_array[$link])) {

                    // start the transaction
                    $this->db->trans_start();

                    $qry = "insert into content (headline, content, link, article_date, status, source_id, source_name, ".
                           "industry_id, geography_id) VALUES ".
                           "(?, ?, ?, ?, 2, ?, ?, ?, ?)";
                    $this->db->query($qry, array($title, $content, $link, $pubdate, $source_id, $source_name, $industry_id, $geography_id));

                    // this is my framework's version of mysqli_insert_id()
                    $content_id = $this->db->insert_id();

                    $qry = "insert into content_ratings (content_id, comment_count, company_count, contact_count, report_count, read_count) ".
                           "VALUES ($content_id, 0, 0, 0, 0, 0)";
                    $result2 = $this->db->query($qry);

                    $this->db->trans_complete();

                    if ($this->db->trans_status() == TRUE) {
                        $cache_source_array[$link] = $content_id;
                        echo "Good!<br />";
                    } else {
                        echo "Bad!<br />";
                    }
                } else {
                    // link already exists
                    echo "link exists!";
                }
            }
        }
    } else {
        // feed is empty
    }
}
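
For completeness, the memcached variant of the first bullet might look roughly like the sketch below. It is an assumption-laden sketch: it presumes PHP's Memcached extension, a server on 127.0.0.1:11211, a content_id primary key on the content table, and an arbitrary one-hour TTL:

// sketch: cache the 50,000-link lookup instead of rebuilding it every run
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211); // assumed local memcached server

$cache_source_array = $mc->get('content_links');
if ($cache_source_array === false) {
    // cache miss: rebuild the lookup from the database, at most once per hour
    $cache_source_array = array();
    foreach ($this->sql->result("select link, content_id from content") as $row) {
        $cache_source_array[$row['link']] = $row['content_id'];
    }
    $mc->set('content_links', $cache_source_array, 3600); // 3600 s TTL (arbitrary)
}

Links inserted during a run should still be added to the local array (as the code above already does), or duplicates could be attempted before the next rebuild.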

Best Answer

I think you've already answered your own question:

Currently, every time the script runs, it goes to $this->set_content_source_cache where it returns an array of array('link', 'link', 'link', etc.) from all the rows in the table. This is used to later cross-reference to make sure there are no duplicating links. Would not doing this and simply changing the DB so the link column is unique speed things up?

Yes, creating a primary key or unique index and letting the database throw an error when there is a duplicate is better practice, and it should be more efficient.
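
With such an index in place, the big lookup array can be dropped and the insert itself left to detect duplicates. A minimal sketch of how the inner insert from the question might change, assuming MySQL's INSERT IGNORE and the CodeIgniter-style $this->db layer the question already uses (affected_rows() is that layer's row-count call):

$qry = "insert ignore into content (headline, content, link, article_date, status, ".
       "source_id, source_name, industry_id, geography_id) VALUES ".
       "(?, ?, ?, ?, 2, ?, ?, ?, ?)";
$this->db->query($qry, array($title, $content, $link, $pubdate,
                             $source_id, $source_name, $industry_id, $geography_id));

if ($this->db->affected_rows() > 0) {
    // a new row went in, so create its ratings record
    $content_id = $this->db->insert_id();
    $this->db->query("insert into content_ratings (content_id, comment_count, company_count, ".
                     "contact_count, report_count, read_count) VALUES (?, 0, 0, 0, 0, 0)",
                     array($content_id));
} else {
    // duplicate link: MySQL silently skipped the insert
}

Note that INSERT IGNORE also swallows other errors; the stricter alternative is a plain INSERT that treats a duplicate-key failure (MySQL error 1062) as "link exists".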

References (from an edit):

mysql 5.0 indexes - Unique vs Non Unique

http://dev.mysql.com/doc/refman/5.0/en/create-index.html

Regarding php - Optimizing code/database for a 50,000 row table, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/12488225/
