java - Appending allowed and disallowed URL rules to a Java list

Reposted. Author: 行者123. Updated: 2023-12-02 00:59:45
I am trying to capture the allow and disallow rules of a robots.txt file in Java using the following code:

package robotest;
public class RoboTest {
    public static void main(String[] args) {
        String robo="user-agent:hello user-agent:ppx user-agent:bot allow:/world disallow:/ajax disallow:/posts user-agent:abc allow:/myposts/like disallow:/none user-agent:* allow:/world";
        String[] strarr=robo.split(" ");
        String[] allowed={};
        String[] disallowed={};
        boolean new_block=false;
        boolean a_or_d=false;
        for (String line: strarr){
            if(line!=""){
                if(line.contains("user-agent:pp")==false && a_or_d){
                    break;
                }
                if (line.contains("user-agent:ppx")||(new_block )){
                    new_block=true;
                    System.out.println(line);
                    if(line.contains("allow") || line.contains("disallow")){
                        a_or_d=true;
                    }
                    if(line.contains("allow:")){
                        //append to allowed
                    }
                    if(line.contains("disallowed")) {
                        //append to disallowed
                    }
                }
            }
            System.out.println(allowed);;
        }
    }
}

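The code does not work the way I expected. The rules in the robots.txt string are separated by spaces. I want to capture the rules for the user agent ppx. After finding user-agent:ppx, the code should look for the allow or disallow block and append those rules to a list. But it does not work, and it is confusing as well. I am also new to regular expressions in Java. Is there a way to solve this?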
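
Some of the checks in the code above can be tried in isolation, which may help show where it goes wrong. The sketch below is only an illustration: the class name `ContainsPitfalls` and the helpers `isAllow` and `isDisallow` are hypothetical and not part of the original post. It shows that `contains("allow:")` is also true for a `disallow:` token, that `contains("disallowed")` never matches because the key is `disallow`, and that a `List` can grow where a fixed-length array cannot. Separately, `line != ""` compares object references rather than contents, so `!line.isEmpty()` is the usual way to test for an empty string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the names here are illustrative only.
class ContainsPitfalls {
    // Prefix checks tell "allow:" tokens apart from "disallow:" tokens.
    static boolean isAllow(String token) { return token.startsWith("allow:"); }
    static boolean isDisallow(String token) { return token.startsWith("disallow:"); }

    public static void main(String[] args) {
        String token = "disallow:/ajax";
        System.out.println(token.contains("allow:"));     // true: "disallow:" contains "allow:"
        System.out.println(token.contains("disallowed")); // false: the key is "disallow"
        System.out.println(isAllow(token));               // false
        System.out.println(isDisallow(token));            // true

        // Arrays have a fixed length, so a List is used to append rules.
        List<String> disallowed = new ArrayList<>();
        if (isDisallow(token)) {
            disallowed.add(token.substring("disallow:".length()));
        }
        System.out.println(disallowed);                   // [/ajax]
    }
}
```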

Best answer

With a few minimal changes to the code:

// Requires imports: java.util.HashSet, java.util.Set, java.util.regex.Matcher, java.util.regex.Pattern
String robo = "user-agent:hello user-agent:ppx user-agent:bot allow:/world disallow:/ajax disallow:/posts user-agent:abc allow:/myposts/like disallow:/none user-agent:* allow:/world";
String[] strarr = robo.split(" ");
Set<String> allowed = new HashSet<>();
Set<String> disallowed = new HashSet<>();
Pattern allowPattern = Pattern.compile("^allow:\\s*(.*)");
Pattern disallowPattern = Pattern.compile("^disallow:\\s*(.*)");
boolean isUserAgentPpx = false;
boolean a_or_d = false;
for (String line : strarr) {
    line = line.trim();

    // Skip empty lines
    if (line.isEmpty()) continue;

    if (line.startsWith("user-agent:")) {
        // If the previous lines were allow/disallow rules, a new user-agent block starts here
        if (a_or_d) {
            a_or_d = false;
            isUserAgentPpx = false;
        }
        // Skip the remaining user-agent lines of this block if we already found 'user-agent: ppx' or 'user-agent: *'
        if (isUserAgentPpx) continue;
        if (line.matches("^user-agent:\\s*(ppx|\\*)$")) {
            isUserAgentPpx = true;
        }
        continue;
    }

    // Process block of allow/disallow
    a_or_d = true;
    if (isUserAgentPpx) {
        // The ^ anchor keeps "disallow:" lines from matching the allow pattern
        Matcher allowMatcher = allowPattern.matcher(line);
        if (allowMatcher.find()) {
            allowed.add(allowMatcher.group(1));
        }
        Matcher disallowMatcher = disallowPattern.matcher(line);
        if (disallowMatcher.find()) {
            disallowed.add(disallowMatcher.group(1));
        }
    }
}

System.out.println("Allowed rules for Ppx:");
for (String s : allowed) {
    System.out.println(s);
}
System.out.println("Disallowed rules for Ppx:");
for (String s : disallowed) {
    System.out.println(s);
}

I am using a Set<String> to store the rules so that duplicates are avoided.
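
For readers who want to run the approach end to end, here is a minimal self-contained sketch of the same loop. It is not the answerer's code: the class name `RobotsRules` and the method `parse` are hypothetical, the user agent is passed in as a parameter instead of being hard-coded, prefix checks replace the regular expressions, and `LinkedHashSet` is swapped in for `HashSet` (my assumption) so that rules keep the order in which they appear. Like the answer, it treats a `user-agent:*` group as applying to the requested agent.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical wrapper around the loop from the answer; names are illustrative.
class RobotsRules {
    final Set<String> allowed = new LinkedHashSet<>();
    final Set<String> disallowed = new LinkedHashSet<>();

    // Collects allow/disallow paths from groups that name `agent` or "*".
    static RobotsRules parse(String robots, String agent) {
        RobotsRules rules = new RobotsRules();
        boolean inMatchingGroup = false; // current group applies to `agent`
        boolean sawRule = false;         // a rule was seen in the current group
        for (String token : robots.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            if (token.startsWith("user-agent:")) {
                if (sawRule) {           // rules ended, so a new group begins
                    sawRule = false;
                    inMatchingGroup = false;
                }
                String name = token.substring("user-agent:".length());
                if (name.equals(agent) || name.equals("*")) {
                    inMatchingGroup = true;
                }
                continue;
            }
            sawRule = true;
            if (!inMatchingGroup) continue;
            // Prefix checks: "disallow:" does not start with "allow:".
            if (token.startsWith("disallow:")) {
                rules.disallowed.add(token.substring("disallow:".length()));
            } else if (token.startsWith("allow:")) {
                rules.allowed.add(token.substring("allow:".length()));
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        String robo = "user-agent:hello user-agent:ppx user-agent:bot allow:/world "
                + "disallow:/ajax disallow:/posts user-agent:abc allow:/myposts/like "
                + "disallow:/none user-agent:* allow:/world";
        RobotsRules rules = parse(robo, "ppx");
        // Copy into a List if list semantics are needed, as the question asks.
        List<String> allowedList = new ArrayList<>(rules.allowed);
        System.out.println("Allowed: " + allowedList);         // Allowed: [/world]
        System.out.println("Disallowed: " + rules.disallowed); // Disallowed: [/ajax, /posts]
    }
}
```

The set removes the duplicate `/world` that comes from the `user-agent:*` group, and copying it into an `ArrayList` afterwards gives the list the question asked for.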

Regarding "java - Appending allowed and disallowed URL rules to a Java list", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60815556/