java - Extracting the content (text) of a URL using TIKA

Reposted. Author: 行者123. Updated: 2023-12-02 08:14:29

How can I extract the text from a URL? My code is extracting the source code of that URL instead...

DefaultHttpClient client = null;
client = new DefaultHttpClient();
client.getCredentialsProvider().setCredentials(
new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT, AuthScope.ANY_REALM),
new UsernamePasswordCredentials("test", "test"));
client.getParams().setParameter(ClientPNames.ALLOW_CIRCULAR_REDIRECTS, true);
HttpGet request = new HttpGet("http://somehost.com");
HttpResponse response = client.execute(request);
HttpEntity entity = response.getEntity();
InputStream content = entity.getContent();

Tika t = new Tika();
Metadata md = new Metadata();
Reader r = t.parse(content, md);
System.out.println(md);
System.out.println("Yes1: " +md.get("keywords"));
System.out.println("Yes2: " +md.get("title"));
System.out.println("Yes3: " +md.get("authors"));

//This gives the source code of that url not the actual content...
String ss= t.parseToString(content);
System.out.println("Yes4: " +ss);

Any suggestions?

Best answer

From what I have read, you can do this with Tika using the following code:

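// Assumes `content` exposes the response body as bytes via getContent();
// the question's `content` is an InputStream, whose bytes must be read first.
byte[] raw = content.getContent();
ContentHandler handler = new BodyContentHandler();  // collects the body text
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();             // detects the document type
parser.parse(new ByteArrayInputStream(raw), handler, metadata, new ParseContext());
// LOG is assumed to be a logger declared elsewhere.
LOG.info("content: " + handler.toString());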

However, when I tested it, I found that handler.toString() was empty!
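
One possible contributing factor, offered as a guess rather than a confirmed diagnosis: the question's code hands the same `content` stream to `t.parse(content, md)` and then to `t.parseToString(content)`. An InputStream can generally be read only once, so a second consumer may see a partial or empty stream, which is presumably why the answer wraps the buffered bytes in `new ByteArrayInputStream(raw)`. The sketch below uses only the JDK (no Tika or HttpClient) to show the effect. The class name `StreamBuffering`, the helper `readAll`, and the sample HTML are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Tika or HttpClient.
final class StreamBuffering {

    /** Reads everything that is left in the stream into a byte array. */
    static byte[] readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for entity.getContent(); the HTML is invented for this demo.
        InputStream content = new ByteArrayInputStream(
                "<html><body>Hello</body></html>".getBytes(StandardCharsets.UTF_8));

        byte[] firstRead = readAll(content);   // consumes the whole stream
        byte[] secondRead = readAll(content);  // the stream is already exhausted

        System.out.println("first read:  " + firstRead.length + " bytes");
        System.out.println("second read: " + secondRead.length + " bytes");

        // Buffer once, then give every consumer its own fresh stream.
        InputStream forMetadata = new ByteArrayInputStream(firstRead);
        InputStream forText = new ByteArrayInputStream(firstRead);
        System.out.println("replayed: " + readAll(forMetadata).length
                + " and " + readAll(forText).length + " bytes");
    }
}
```

Each `new ByteArrayInputStream(firstRead)` can then be passed to its own parser call, which is the pattern the answer's `parser.parse(new ByteArrayInputStream(raw), ...)` line follows.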

Regarding java - extracting the content (text) of a URL using TIKA, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/6713927/
