
r - Extract the text between two words from all files in a folder in R

Reposted. Author: 行者123. Updated: 2023-12-04 03:07:43

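I have a folder containing many .txt files. I want to read all of them, extract from each file the text located between two words, and store the results in a .csv file.

The text to extract always sits between the same two markers: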

IMPRESSION:  "text to be extracted"  (Dr. Deepak Bhatt)

OR

IMPRESSION : "text to be extracted" (Dr. Deepak Bhatt)

The code I wrote below does not extract the text from all of the files. How can I fix this?

names <- list.files(path = "C:\\Users\\Admin\\Downloads\\data\\data",
                    pattern = "*.txt", all.files = FALSE,
                    full.names = FALSE, recursive = FALSE,
                    ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)

all.names <- lapply(names, readFn)

readFn <- function(i)
{
  file <- read_file(i)

  file <- gsub("[\r\n\t]", " ", file)

  extracted_txt <- rm_between(file,
                              'IMPRESSION :', '(Dr. Deepak Bhatt)',
                              extract = TRUE, trim = TRUE, clean = TRUE)

  if (is.na(extracted_txt))
  {
    extracted_txt <- rm_between(file,
                                'IMPRESSION:', '(Dr. Deepak Bhatt)',
                                extract = TRUE, trim = TRUE, clean = TRUE)
  }
}

output <- do.call(rbind, all.names)
name_of_file <- sub(".txt", "", names)
final_output <- cbind(name_of_file, output)
colnames(final_output) <- c('filename', 'text')
write.csv(final_output, "final_output.csv", row.names = F)

Example 1: file name = 15-1-2011.txt

The optic nerve is normal.


There is diffuse enlargement of the lacrimal gland (more marked on the left side).

IMPRESSION:

Bilateral diffuse irregular enlargement of the lacrimal gland is due to inflammatory enlargement (? Sjogerns syndrome).
The left gland is more enlarged than right.
No mass lesion or cystic lesion noted.
No evidence of retinal detachment.


(Dr. Deepak Bhatt)

(B-Scan findings are interpretation of echoes and need to be correlated clinically)

Example 2: file name = 1-12-48.txt

The ciliary body and ciliary process are normal in position and texture.

There is marked steching of the zonules.


IMPRESSION :

Left sided marked stretching of the zonules noted from 2 to 6 O’clock position.
There is absence of zonules at 3 O’clock position.
The angle is normal and the ciliary body, processes are normal in position.


(Dr. Deepak Bhatt)

(UBM findings are interpretation of echoes and need to be correlated clinically)

Objective:
OUTPUT file: final_output.csv

15-1-2011 Bilateral diffuse.....retinal detachment.

1-12-48 Left sided marked stretching of the zonules ...in position.

Best answer

You can use gsub for this:

text_between_words <- "IMPRESSION:  text to be extracted  (Dr. Deepak Bhatt)"
gsub('IMPRESSION:\\s+(.*)\\s+\\(.*\\)', '\\1', text_between_words)

Result:

[1] "text to be extracted "

Or combine it with trimws to drop the leading and trailing whitespace:

trimws(gsub('IMPRESSION:(.*)\\(.*\\)', '\\1', text_between_words))

Result:

[1] "text to be extracted"

When there is sometimes a space between IMPRESSION and the colon, you can adjust the code to:

text_between_words2 <- "IMPRESSION :  text to be extracted  (Dr. Deepak Bhatt)"
trimws(gsub('IMPRESSION\\s{0,1}:(.*)\\(.*\\)', '\\1', text_between_words2))

As you can see, I added \\s{0,1} between IMPRESSION and the colon. This matches zero or one whitespace character at that position. Result:

[1] "text to be extracted"
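
The \\s{0,1} quantifier accepts at most one whitespace character. As a small sketch, assuming some files might carry several spaces or a tab before the colon (this is an assumption, the question only shows zero or one space), \\s* accepts any amount. The helper name extract_impression is hypothetical.

```r
# Hypothetical helper: tolerate any amount of whitespace before the colon.
extract_impression <- function(x) {
  trimws(gsub('IMPRESSION\\s*:(.*)\\(.*\\)', '\\1', x))
}

# gsub() is vectorised, so all variants can be checked in one call.
variants <- c(
  "IMPRESSION:  text to be extracted  (Dr. Deepak Bhatt)",
  "IMPRESSION :  text to be extracted  (Dr. Deepak Bhatt)",
  "IMPRESSION   :  text to be extracted  (Dr. Deepak Bhatt)"
)
result <- extract_impression(variants)
print(result)
```

All three variants should come back as the same trimmed string.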

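For the adjustment requested in the comments below, the pattern also has to consume the text before and after the markers: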

text_between_words3 <- "Some Text before..... IMPRESSION: text to be extracted (Dr. Deepak Bhatt) text that should go too"
trimws(gsub('.*IMPRESSION\\s{0,1}:(.*)\\(.*\\).*', '\\1', text_between_words3))

Result:

[1] "text to be extracted"
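
One thing worth checking against the real files: the samples in the question have a second parenthesised note after the signature. Because (.*) is greedy, the generic \\(.*\\) ending then matches the last parenthesised group instead of the signature. The sketch below illustrates this on a shortened, made-up version of example 1 that has already been collapsed onto one line; sample_text is illustrative, not one of the original files.

```r
# Shortened stand-in for example 1, already collapsed onto a single line.
sample_text <- paste(
  "There is diffuse enlargement of the lacrimal gland.",
  "IMPRESSION: No evidence of retinal detachment.",
  "(Dr. Deepak Bhatt)",
  "(B-Scan findings are interpretation of echoes and need to be correlated clinically)"
)

# The greedy capture group runs up to the LAST opening parenthesis,
# so the signature ends up inside the extracted text.
greedy <- trimws(gsub('.*IMPRESSION\\s{0,1}:(.*)\\(.*\\).*', '\\1', sample_text))
print(greedy)
```

Ending the match on more specific text than "any parenthesised group" avoids this.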

If that specific name (Dr. Deepak Bhatt) is the only signature that appears in the text, you can also anchor the pattern on it:

trimws(gsub('.*IMPRESSION\\s{0,1}:(.*)\\(Dr. Deepak Bhatt\\).*', '\\1', text_between_words3))
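
To tie this back to the original goal, here is a minimal base-R sketch of the whole folder-to-CSV workflow. It is one possible arrangement, not the asker's exact code: extract_between is a hypothetical helper, the folder is a temporary demo directory with two shortened stand-in files, and the pattern assumes every report is signed "(Dr. Deepak Bhatt)". Two details matter: full.names = TRUE makes list.files return paths that can be read from any working directory, and gsub/sub return their input unchanged when nothing matches, so the helper checks with grepl first and returns NA instead.

```r
# Hypothetical helper: returns the text between "IMPRESSION:" and the
# signature, or NA when the markers are not found in the file.
extract_between <- function(path) {
  # Read the whole file and collapse line breaks and tabs into single spaces.
  txt <- paste(readLines(path, warn = FALSE), collapse = " ")
  txt <- gsub("\\s+", " ", txt)
  pattern <- ".*IMPRESSION\\s*:(.*)\\(Dr\\. Deepak Bhatt\\).*"
  # sub() leaves the input untouched when nothing matches, so check first.
  if (!grepl(pattern, txt)) return(NA_character_)
  trimws(sub(pattern, "\\1", txt))
}

# Demo data in a temporary folder (stand-ins for the real .txt files).
data_dir <- file.path(tempdir(), "impression_demo")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("The optic nerve is normal.", "", "IMPRESSION:", "",
             "No mass lesion or cystic lesion noted.",
             "No evidence of retinal detachment.", "",
             "(Dr. Deepak Bhatt)", "",
             "(B-Scan findings need to be correlated clinically)"),
           file.path(data_dir, "15-1-2011.txt"))
writeLines(c("There is marked stretching of the zonules.", "",
             "IMPRESSION :", "",
             "The angle is normal.", "",
             "(Dr. Deepak Bhatt)"),
           file.path(data_dir, "1-12-48.txt"))

# full.names = TRUE so each path is readable regardless of the working directory;
# the pattern is a regular expression, not a glob.
files <- list.files(data_dir, pattern = "\\.txt$", full.names = TRUE)

final_output <- data.frame(
  filename = sub("\\.txt$", "", basename(files)),
  text = vapply(files, extract_between, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
write.csv(final_output, file.path(data_dir, "final_output.csv"), row.names = FALSE)
print(final_output)
```

For the real data, data_dir would point at the folder from the question and the writeLines calls would be dropped. Files that come back as NA are the ones whose markers differ from the assumed pattern and are worth inspecting by hand.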

Regarding "r - Extract the text between two words from all files in a folder in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47590481/
