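I am running etree.HTML( data ), as shown below, on many different data contents. However, for one particular data content, lxml.etree.HTML does not parse it; instead it goes into an infinite loop and consumes 100% CPU.
Does anyone know exactly what in the data below causes this? More importantly, how can I prevent this from happening on an unlimited amount of random, corrupted data?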
Edit: Turns out this is a bug in libxml2 version 2.7.8 and below (at least). I updated to libxml2 2.9.0, and the bug is gone.
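Edit: I know that the script below is itself an infinite loop, but that is not the unwanted behavior I am seeing. It runs fine (as an infinite loop) with healthy data content. With unhealthy data content, as shown below, the loop stops, RAM starts to fill up, and once it is full, all CPUs go into the WAIT state. See this question for the original debugging.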
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
import sys
from lxml import etree
data = '''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<meta charset="UTF-8">
<title>The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked -- Grub Street New York</title>
<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="http://feedproxy.google.com/nymag/grubstreet" />
<meta name="Headline" content="The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked" />
<meta name="keywords" content="april bloomfield, el gordo, frank bruni, gordon ramsay, lawsuits, lists, marcus samuelsson, mario batali, shitlist, spotted pig, sued" />
<meta name="description" content="Racism, fat-shaming, and vegetarian trickery." />
<meta name="Byline" content="Sierra Tishgart" />
<meta name="Type_of_Feature" content="" />
<meta name="Issue_Date" content="March 8, 2013 12:50 PM" />
<meta name="related_stories" content="The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked" />
<meta name="document_type" content="Blog" />
<meta name="category" content="Lists" />
<link rel="image_src" href="http://pixel.nymag.com/imgs/daily/grub/2013/03/08/08-gorgon-ramsay.o.jpg/a_146x97.jpg" />
<link rel="canonical" href="http://newyork.grubstreet.com/2013/03/20-despicable-things-gordon-ramsay.html" id="canonical" />
<script>
var canonicalUrl = "http://newyork.grubstreet.com/2013/03/20-despicable-things-gordon-ramsay.html";
</script>
<meta name="content.tags.primary" content=";network - Grub Street,;city - New York City,;tag - lists" />
<meta name="content.tags" content=";tag - april bloomfield,;tag - el gordo,;tag - frank bruni,;tag - gordon ramsay,;tag - lawsuits,;tag - marcus samuelsson,;tag - mario batali,;tag - shitlist,;tag - spotted pig,;tag - sued" />
<meta name="content.hierarchy" content="New York City:Grub Street" />
<meta name="content.type" content="Blog" />
<meta name="content.subtype" content="Blog Entry" />
<meta property="fb:app_id" content="206283005644" />
<meta property="og:title" content="The 20 Most Despicable Things Gordon Ramsay Has Said and Done, Ranked" />
<meta property="og:description" content="Racism, fat-shaming, and vegetarian trickery." />
<meta property="og:image" content="http://pixel.nymag.com/imgs/daily/grub/2013/03/08/08-gorgon-ramsay.o.jpg/a_146x97.jpg"/>
<meta property="og:url" content="http://newyork.grubstreet.com/2013/03/20-despicable-things-gordon-ramsay.html" />
<meta property="og:type" content="article" />
<meta property="og:site_name" content="Grub Street New York" />
<meta name="viewport" content="width=1020">
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/grubstreet/grubstreet-core.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/section/daily/slideshow.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/echo.css" media="all" />
<link type="text/css" rel="stylesheet" href="http://cache.nymag.com/css/screen/loginRegister.css" media="all" />
<link rel="stylesheet" href="http://cache.nymag.com/css/screen/advertising.css" media="all" />
<link rel="shortcut icon" href="http://images.nymag.com/gfx/grubst/favicon.ico" />
<style type="text/css">
#adsplashtop,#pushdown {padding:5px 5px;}
#pushdown {border-top:1px solid #737373}
</style>
<!--[if IE 6]>
<link rel="stylesheet" href="http://cache.nymag.com/css/screen/grubstreet/win-ie6.css" type="text/css" media="screen, projection" />
<![endif]-->
<!--[if IE 7]>
<link rel="stylesheet" href="http://cache.nymag.com/css/screen/grubstreet/win-ie7.css" type="text/css" media="screen, projection" />
<![endif]-->
<script type="text/javascript">
var NYM = {};
NYM.config = {};
NYM.config.membership = {
"service":"nym"
};
NYM.config.advertising = {
"sitename":"nym.grubstreet"
};
</script>
<script type="text/javascript">
var date = 'March 12, 2013 12:42:38';
var currDate=new Date(date);
var GRUBST = {};
if (!NYM) {
var NYM = {};
NYM.config = {};
NYM.config.membership = {
"service":"nym"
};
NYM.config.advertising = {
"sitename":"nym.grubstreet"
};
}
</script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/modernizr-1.7.min.js"></script>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/jquery-ui-1.8.2.custom.min.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/ad_manager.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/js/2/global.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/skinTakeover.js"></script>
<script type="text/javascript" src="http://cache.nymag.com/scripts/grubstreet-controls.js"></scr
'''
n = 0
while True:
    n += 1
    # On libxml 2.7.8 this call never returns for the truncated data above.
    tree = etree.HTML( data )
    m = tree.xpath("//meta[@property]")
    print('- %d' % n)
    for i in m:
        print(n)
        #print (i.attrib['property'], i.attrib['content'])
To quickly print the versions in use, you can run:
import sys
from lxml import etree
print("%-20s: %s" % ('Python', sys.version_info))
print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
I have:
OS : Ubuntu 12.10 (AWS)
Python : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (3, 1, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
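The version tuples printed above can be compared directly in code. The following is only a minimal sketch: the helper name libxml_may_hang is hypothetical, and the (2, 9, 0) threshold is an assumption taken from the edit above (2.7.8 hangs, 2.9.0 does not); the exact release that fixed the bug is not established here.

```python
def libxml_may_hang(libxml_version, first_good=(2, 9, 0)):
    """Return True if this libxml2 version predates the release reported as unaffected.

    first_good is an assumption based on the report in this question
    (2.7.8 hangs, 2.9.0 does not), not a confirmed fix version.
    """
    return tuple(libxml_version) < tuple(first_good)


if __name__ == '__main__':
    try:
        from lxml import etree
    except ImportError:
        print("lxml is not installed")
    else:
        # etree.LIBXML_VERSION is the tuple shown as "libxml used" above.
        print("libxml %s may hang: %s"
              % (etree.LIBXML_VERSION, libxml_may_hang(etree.LIBXML_VERSION)))
```

With the versions listed above, the check would report that libxml (2, 7, 8) may hang.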
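Best answer
Here is a way to parse partial HTML with lxml, where LH is lxml.html as imported in the full script that follows. It seems to work around the hang that appears to occur with libxml version (2, 7, 8) or earlier: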
parser = LH.HTMLParser()
parser.feed(data)
root = parser.close()
m = root.xpath('//meta[@property]')
import sys
import lxml.html as LH
import lxml.etree as ET
data = '''
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="ie6"> <![endif]-->
<!--[if IE 7]> <html class="ie7"> <![endif]-->
<!--[if IE 8]> <html class="ie8"> <![endif]-->
<!--[if gt IE 8]><!--> <html> <!--<![endif]-->
<head profile="http://gmpg.org/xfn/11">
<meta charset="UTF-8">
<title>
Erased US data shows 1 in 4 missiles in Afghan airstrikes now fired by drone: The Bureau of Investigative Journalism </title>
<meta name="description" content="Drone data has been wiped from the Air Force website.">
<meta name="generator" content="Magicalia 2010" />
<meta name="google-site-verification" content="bGFVI6kAZGjMNNiS6LGvBDWSGydwyWQI3gogCD4xP50" />
<link href="http://cdn-images.mailchimp.com/embedcode/slim-081711.css" rel="stylesheet" type="text/css">
<link rel="stylesheet" href="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/css/screen.css" type="text/css" media="screen, projection" />
<link rel="stylesheet" href="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/css/print.css" type="text/css" media="print" />
<link rel="stylesheet" href="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/style.css?3" type="text/css" media="screen, projection" />
<!--[if IE]>
<link rel="stylesheet" href="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/css/lib/ie.css" type="text/css" media="screen, projection" />
<![endif]-->
<!--[if lt IE 7]>
<script defer type="text/javascript" src="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/pngfix.js"></script>
<![endif]-->
<!--[if gte IE 5.5]>
<script language="javaScript" src="http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/dhtml.js" type="text/javaScript"></script>
<![endif]-->
<link rel="alternate" type="application/rss+xml" title="The Bureau of Investigative Journalism RSS Feed" href="http://www.thebureauinvestigates.com/feed/" />
<link rel="pingback" href="http://www.thebureauinvestigates.com/xmlrpc.php" />
<link rel="alternate" type="application/rss+xml" title="The Bureau of Investigative Journalism » Erased US data shows 1 in 4 missiles in Afghan airstrikes now fired by drone Comments Feed" href="http://www.thebureauinvestigates.com/2013/03/12/erased-us-data-shows-1-in-4-missiles-in-afghan-airstrikes-now-fired-by-drone/feed/" />
<link rel='stylesheet' id='mailchimp-css' href='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/lib/mailchimp.dev.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='donate-css' href='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/lib/donate.dev.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='tubepress-css' href='http://www.thebureauinvestigates.com/wp-content/plugins/tubepress/src/main/web/css/tubepress.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='NextGEN-css' href='http://www.thebureauinvestigates.com/wp-content/plugins/nextgen-gallery/css/nggallery.css?ver=1.0.0' type='text/css' media='screen' />
<link rel='stylesheet' id='shutter-css' href='http://www.thebureauinvestigates.com/wp-content/plugins/nextgen-gallery/shutter/shutter-reloaded.css?ver=1.3.4' type='text/css' media='screen' />
<link rel='stylesheet' id='stbCSS-css' href='http://www.thebureauinvestigates.com/wp-content/plugins/wp-special-textboxes/css/wp-special-textboxes.css.php?ver=4.3.72' type='text/css' media='all' />
<link rel='stylesheet' id='grid-css' href='http://www.thebureauinvestigates.com/wp-content/plugins/big-brother/css/grid.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='reveal-css' href='http://www.thebureauinvestigates.com/wp-content/plugins/big-brother/css/reveal.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='app-css' href='http://www.thebureauinvestigates.com/wp-content/plugins/big-brother/css/app.css?ver=3.5.1' type='text/css' media='all' />
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-includes/js/jquery/jquery.js?ver=1.8.3'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/plugins/tubepress/src/main/web/js/tubepress.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/jquery.cycle.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/search.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/nav/superfish.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/nav/supersubs.js?ver=3.5.1'></script>
<script type='text/javascript' src='http://www.thebureauinvestigates.com/wp-content/themes/dxw_magicalia/js/home.js?ver=3.5.1'></sc
'''
if __name__ == '__main__':
    print("%-20s: %s" % ('Python', sys.version_info))
    print("%-20s: %s" % ('lxml.etree', ET.LXML_VERSION))
    print("%-20s: %s" % ('libxml used', ET.LIBXML_VERSION))
    print("%-20s: %s" % ('libxml compiled', ET.LIBXML_COMPILED_VERSION))
    print("%-20s: %s" % ('libxslt used', ET.LIBXSLT_VERSION))
    print("%-20s: %s" % ('libxslt compiled', ET.LIBXSLT_COMPILED_VERSION))
    n = 0
    while True:
        n += 1
        print('- %d' % n)
        # Use the feed interface instead of etree.HTML( data ).
        parser = LH.HTMLParser()
        parser.feed(data)
        root = parser.close()
        m = root.xpath('//meta[@property]')
        for i in m:
            print(n)
which yields
% test.py
Python : sys.version_info(major=2, minor=7, micro=2, releaselevel='final', serial=0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
- 1
- 2
- 3
- 4
- 5
...
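For arbitrary broken input, a more general guard is to run the parse in a separate process with a time limit, so that a hang cannot block the main program. This is not part of the answer above; it is a minimal sketch that assumes Python 3.7 or later, and the names run_isolated and CHILD_CODE are hypothetical.

```python
import subprocess
import sys

# Child program: read HTML from stdin, parse it with lxml, and print one
# "property<TAB>content" line per <meta property=...> element.
CHILD_CODE = r'''
import sys
from lxml import etree
tree = etree.HTML(sys.stdin.read())
for meta in tree.xpath("//meta[@property]"):
    print("%s\t%s" % (meta.get("property"), meta.get("content")))
'''


def run_isolated(code, stdin_text, timeout_seconds):
    """Run `code` in a fresh Python process.

    Returns the child's stdout as text, or None if the child timed out
    or exited with a non-zero status.
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising this exception.
        return None
    if done.returncode != 0:
        return None
    return done.stdout
```

Calling run_isolated(CHILD_CODE, data, 5) would return the tab-separated meta lines, or None if the parser hangs for more than five seconds. The timeout bounds run time only; it does not cap the child's memory use.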
Regarding "python - How to prevent lxml.etree.HTML( data ) from crashing on certain types of data?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15367001/