
java - How to split a URL?

Reposted. Author: 塔克拉玛干. Updated: 2023-11-02 08:05:26

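This is the code I use to split URLs, but something is wrong with it. Every link comes out with a doubled path segment, for example www.utem.edu.my/portal/portal. The /portal/portal duplication shows up in every link. Any suggestions for extracting the links from a web page?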

public String crawlURL(String strUrl) {
    String results = ""; // For return
    String protocol = "http://";

    // Assigns the input to the inURL variable and checks to add http
    String inURL = strUrl;
    if (!inURL.toLowerCase().contains("http://".toLowerCase()) &&
            !inURL.toLowerCase().contains("https://".toLowerCase())) {
        inURL = protocol + inURL;
    }

    // Pulls URL contents from the web
    String contectURL = pullURL(inURL);
    if (contectURL == "") { // If it fails, then try with https
        protocol = "https://";
        inURL = protocol + inURL.split("http://")[1];
        contectURL = pullURL(inURL);
    }

    // Declares some variables to be used inside loop
    String aTagAttr = "";
    String href = "";
    String msg = "";

    // Finds A tag and stores its href value into output var
    String bodyTag = contectURL.split("<body")[1]; // Find 1st <body>
    String[] aTags = bodyTag.split(">"); // Splits on every tag

    // To show link different from one another
    int index = 0;

    for (String s : aTags) {
        // Process only if it is A tag and contains href
        if (s.toLowerCase().contains("<a") && s.toLowerCase().contains("href")) {

            aTagAttr = s.split("href")[1]; // Split on href

            // Split on space if it contains it
            if (aTagAttr.toLowerCase().contains("\\s"))
                aTagAttr = aTagAttr.split("\\s")[2];

            // Splits on the link and deals with " or ' quotes
            href = aTagAttr.split(((aTagAttr.toLowerCase().contains("\"")) ? "\"" : "\'"))[1];

            if (!results.toLowerCase().contains(href))
                //results += "~~~ " + href + "\r\n";

            /*
             * Last touches to URL before display
             * Adds http(s):// if not exist
             * Adds base url if not exist
             */

            if (results.toLowerCase().indexOf("about") != -1) {
                // Contains 'about'
            }
            if (!href.toLowerCase().contains("http://") && !href.toLowerCase().contains("https://")) {

                // http:// + baseURL + href
                if (!href.toLowerCase().contains(inURL.split("://")[1]))
                    href = protocol + inURL.split("://")[1] + href;
                else
                    href = protocol + href;
            }

            System.out.println(href); // debug

Best answer

Consider using the URL class...

Use it the way the documentation suggests :)

import java.net.URL;

public class ParseURL {
    public static void main(String[] args) throws Exception {

        // Parse once, then read each component through its getter
        URL aURL = new URL("http://example.com:80/docs/books/tutorial"
                + "/index.html?name=networking#DOWNLOADING");

        System.out.println("protocol = " + aURL.getProtocol());
        System.out.println("authority = " + aURL.getAuthority());
        System.out.println("host = " + aURL.getHost());
        System.out.println("port = " + aURL.getPort());
        System.out.println("path = " + aURL.getPath());
        System.out.println("query = " + aURL.getQuery());
        System.out.println("filename = " + aURL.getFile());
        System.out.println("ref = " + aURL.getRef());
    }
}

Output:

protocol = http
authority = example.com:80
host = example.com
port = 80
etc

After that you can take the elements you need and build a new string/URL :)
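
As a rough illustration of that last step, here is a minimal sketch that rebuilds links from parsed components instead of concatenating strings. It uses java.net.URI rather than URL because URI offers resolve(); the class name LinkResolver, the helper names origin and resolve, and the sample page URL and hrefs are all made up for this example. One plausible cause of the doubled /portal/portal in the question is that the page URL, path included, gets prepended to an href that already begins with that path; resolving the href against the page URL avoids that.

```java
import java.net.URI;

public class LinkResolver {

    // Returns scheme + "://" + authority of a page URL, e.g. "http://example.com:80".
    static String origin(String pageUrl) {
        URI u = URI.create(pageUrl);
        return u.getScheme() + "://" + u.getAuthority();
    }

    // Resolves an href found on a page against that page's URL.
    // Absolute hrefs are returned unchanged; relative ones are merged with the base.
    static String resolve(String pageUrl, String href) {
        URI base = URI.create(pageUrl);
        return base.resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        // Hypothetical page URL and hrefs, chosen to mirror the question
        String page = "http://www.utem.edu.my/portal/";

        // prints http://www.utem.edu.my/portal/index.php (no doubled /portal)
        System.out.println(resolve(page, "/portal/index.php"));

        // prints http://www.utem.edu.my/portal/news.html
        System.out.println(resolve(page, "news.html"));

        // prints http://www.utem.edu.my
        System.out.println(origin(page));
    }
}
```

Note that URI.create throws IllegalArgumentException for hrefs containing characters that are illegal in a URI (for example unencoded spaces), so real crawled input would likely need escaping or a try/catch around the call.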

Regarding "java - How to split a URL?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/38633989/
