
java - What is the best way to modify StringTokenizer output so it contains only the English words needed for full-text search?

Reposted. Author: 行者123. Updated: 2023-12-01 19:22:03

To add full-text search to my App Engine application, I added the following field to my model:

private List<String> fullText;

To test the search, I used the following text:

Oxandrolone is a synthetic anabolic steroid derived from dihydrotestosterone  by substituting 2nd carbon atom for oxygen (O). It is widely known for its exceptionally small level of androgenicity accompanied by moderate anabolic effect. Although oxandrolone is a 17-alpha alkylated steroid, its liver toxicity is very small as well. Studies have showed that a daily dose of 20 mg oxandrolone used in the course of 12 weeks had only a negligible impact on the increase of liver enzymes[1][2]. As a DHT derivative, oxandrolone does not aromatize (convert to estrogen, which causes gynecomastia  or male breast tissue). It also does not significantly influence the body's normal testosterone production (HPTA axis) at low dosages (10 mg). When dosages are high, the human body reacts by reducing the production of LH (luteinizing hormone), thinking endogenous testosterone production is too high; this in turn eliminates further stimulation of Leydig cells in the testicles, causing testicular atrophy (shrinking). Oxandrolone used in a dose of 80 mg/day suppressed endogenous testosterone by 67% after 12 weeks of therapy[3].

and applied the following Java code to it:

// Split on the default delimiters (whitespace) and keep tokens longer than three characters.
StringTokenizer st = new StringTokenizer(recordText);
List<String> fullTextSearchSupport = new ArrayList<String>();
while (st.hasMoreTokens())
{
    String token = st.nextToken().trim();
    if (token.length() > 3)
    {
        fullTextSearchSupport.add(token);
    }
}

I got the following ArrayList of string tokens:

[Oxandrolone, synthetic, anabolic, steroid, derived, from, dihydrotestosterone, substituting, carbon, atom, oxygen, (O)., widely, known, exceptionally, small, level, androgenicity, accompanied, moderate, anabolic, effect., Although, oxandrolone, 17-alpha, alkylated, steroid,, liver, toxicity, very, small, well., Studies, have, showed, that, daily, dose, oxandrolone, used, course, weeks, only, negligible, impact, increase, liver, enzymes[1][2]., derivative,, oxandrolone, does, aromatize, (convert, estrogen,, which, causes, gynecomastia, male, breast, tissue)., also, does, significantly, influence, body's, normal, testosterone, production, (HPTA, axis), dosages, mg)., When, dosages, high,, human, body, reacts, reducing, production, (luteinizing, hormone),, thinking, endogenous, testosterone, production, high;, this, turn, eliminates, further, stimulation, Leydig, cells, testicles,, causing, testicular, atrophy, (shrinking)., Oxandrolone, used, dose, mg/day, suppressed, endogenous, testosterone, after, weeks, therapy[3].]

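To my surprise, StringTokenizer leaves punctuation such as commas, periods, square brackets, and parentheses attached to the tokens when it breaks up the string.

For example, for text search, the token: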

derivative,

could be just

derivative

enzymes[1][2].

could simply be:

enzymes

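What is the best way to produce output that contains only the English words needed for text search, without punctuation or special characters?

I tried to cut out the short connecting words (a, by, for) with this condition: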

token.length() > 3

But clearly this is not enough.

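Best answer

Yes, the default delimiters are whitespace characters, but you can specify your own delimiters with the two-argument constructor: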

StringTokenizer st = new StringTokenizer(recordText, ".,! ()[]");
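As a minimal sketch of how this could fit together with the length filter from the question: when a delimiter string is passed, it replaces the default set, so tabs and newlines only split tokens if they are listed explicitly. The class name TokenizerSketch, the tokenize helper, and the exact delimiter set below are assumptions made for illustration, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerSketch {
    // Assumed delimiter set: the default whitespace characters plus common punctuation.
    private static final String DELIMITERS = " \t\n\r\f.,;:!?()[]";

    // Hypothetical helper: splits on DELIMITERS and keeps tokens longer than
    // three characters, mirroring the length filter from the question.
    static List<String> tokenize(String recordText) {
        StringTokenizer st = new StringTokenizer(recordText, DELIMITERS);
        List<String> tokens = new ArrayList<String>();
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.length() > 3) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "As a DHT derivative, oxandrolone had a negligible impact on liver enzymes[1][2].";
        // prints [derivative, oxandrolone, negligible, impact, liver, enzymes]
        System.out.println(tokenize(sample));
    }
}
```

Characters that are not in the delimiter string, such as the apostrophe in body's, the hyphen in 17-alpha, or the slash in mg/day, stay inside their tokens, so the set would need to be extended if those should split as well.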

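Regarding "java - What is the best way to modify StringTokenizer output so it contains only the English words needed for full-text search?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/3695699/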
