
java - HttpURLConnection: fetching a page title returns "Moved Permanently"

Reposted. Author: 塔克拉玛干. Updated: 2023-11-01 19:14:07

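This is the code I wrote in Groovy to fetch the page title from a URL. For some sites, however, I get "Moved Permanently", which I think is caused by a 301 redirect. How can I avoid this and have HttpURLConnection follow the redirect to the correct URL so that I get the right page title?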

For example, for this site I get "Moved Permanently" instead of the correct page title: http://www.nytimes.com/2011/08/14/arts/music/jay-z-and-kanye-wests-watch-the-throne.html


def con = (HttpURLConnection) new URL(url).openConnection()
con.connect()

def inputStream = con.inputStream

HtmlCleaner cleaner = new HtmlCleaner()
CleanerProperties props = cleaner.getProperties()

TagNode node = cleaner.clean(inputStream)
TagNode titleNode = node.findElementByName("title", true);

def title = titleNode.getText().toString()
title = StringEscapeUtils.unescapeHtml(title).trim()
title = title.replace("\n", "");
return title

Best answer

I can get it to work if I manage the redirects myself...

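I think the problem is that the site expects a cookie that it sets partway through the redirect chain, and if it doesn't receive that cookie, it sends you to a login page.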
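To make the cookie hand-off concrete, here is a minimal Java sketch of turning `Set-Cookie` response values into a single `Cookie` request header. The class `CookieHeaders` and the method `toCookieHeader` are hypothetical names introduced for illustration; the input is assumed to be a list of raw header values, for example what `con.getHeaderFields().get("Set-Cookie")` returns on an `HttpURLConnection`. It only strips the attributes after the first `;` and does not implement full cookie matching rules (domain, path, expiry).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CookieHeaders {
    // Hypothetical helper: builds one Cookie request header from raw Set-Cookie values.
    static String toCookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        if (setCookieValues == null) {
            return "";
        }
        for (String value : setCookieValues) {
            if (value == null || value.isEmpty()) {
                continue;
            }
            // Keep only "name=value"; attributes such as Path or HttpOnly follow the first ';'
            String pair = value.split(";", 2)[0].trim();
            if (!pair.isEmpty()) {
                pairs.add(pair);
            }
        }
        // Multiple cookies are sent in one header, separated by "; "
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        List<String> received = Arrays.asList(
                "session=abc123; Path=/; HttpOnly",
                "region=us; Path=/");
        System.out.println(toCookieHeader(received)); // prints "session=abc123; region=us"
    }
}
```

The resulting string could then be passed to `setRequestProperty("Cookie", ...)` on the next request in the chain, so that every cookie set along the way is sent back rather than only the first header value.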

The code below clearly needs some cleanup (and there may be a better way to do this), but it shows how I extracted the title:

@Grab( 'net.sourceforge.htmlcleaner:htmlcleaner:2.2' )
@Grab( 'commons-lang:commons-lang:2.6' )
import org.apache.commons.lang.StringEscapeUtils
import org.htmlcleaner.*

String location = 'http://www.nytimes.com/2011/08/14/arts/music/jay-z-and-kanye-wests-watch-the-throne.html'
String cookie = null
String pageContent = ''

while( location ) {
    new URL( location ).openConnection().with { con ->
        // We'll do redirects ourselves
        con.instanceFollowRedirects = false

        // If we got a cookie last time round, then add it to our request
        if( cookie ) con.setRequestProperty( 'Cookie', cookie )
        con.connect()

        // Get the response code, and the location to jump to (in case of a redirect)
        int responseCode = con.responseCode
        location = con.getHeaderField( "Location" )

        // Try and get a cookie the site will set, we will pass this next time round
        cookie = con.getHeaderField( "Set-Cookie" )

        // Read the HTML and close the inputstream
        pageContent = con.inputStream.withReader { it.text }
    }
}

// Then, clean pageContent and get the title
HtmlCleaner cleaner = new HtmlCleaner()
CleanerProperties props = cleaner.getProperties()

TagNode node = cleaner.clean( pageContent )
TagNode titleNode = node.findElementByName("title", true);

def title = titleNode.text.toString()
title = StringEscapeUtils.unescapeHtml( title ).trim()
title = title.replace( "\n", "" )

println title
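The loop above passes the `Location` header straight to `new URL( location )`, which works when the server sends an absolute URL. As a hedged sketch of how one might also handle relative `Location` values, the Java helper below resolves the header against the URL that was just requested. `RedirectTargets`, `resolve`, `isRedirect`, and the `example.com` URLs are hypothetical names chosen for illustration, not part of the original answer.

```java
import java.net.URI;

class RedirectTargets {
    // True for the HTTP status codes that normally carry a Location header to follow
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    // Resolves a (possibly relative) Location header against the URL just requested.
    // URI.create throws IllegalArgumentException if either string is not a valid URI.
    static String resolve(String currentUrl, String locationHeader) {
        return URI.create(currentUrl).resolve(locationHeader).toString();
    }

    public static void main(String[] args) {
        String current = "http://www.example.com/a/b.html";
        System.out.println(resolve(current, "/c/d.html"));
        System.out.println(resolve(current, "https://other.example.org/x"));
        System.out.println(isRedirect(301));
    }
}
```

An absolute `Location` value comes back unchanged, so the same call can be used for both cases. Checking `isRedirect` on the response code, and capping the number of hops, would also keep the loop from following a `Location` header indefinitely.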

Hope this helps!

Regarding java - HttpURLConnection: fetching a page title returns "Moved Permanently", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/7055957/
