linux - How to split lines at the first period character based on word count, and repeat the process on the resulting lines (in pattern space)

Reposted by: 塔克拉玛干 | Updated: 2023-11-02 23:19:15

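I am trying to split a text document so that any line with more than 10 words (a word being any run of characters with whitespace on both sides) is split at the first period character, reading left to right. Any resulting line that still has more than 10 words should be split in the same way.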

Sample input data:

1I got from Dr. Smith, the OK to keep working.
2I got from Dr. Smith, the O.K. to keep working.
3I got from Dr. Smith, the OK to keep working more.
4I got from Dr. Smith, the O.K. to keep working more.
5I got from Dr. Smith, the O.K. to keep working more, although I'm sick.
6I got from Dr. Smith, the O.K. to keep working more, although I'm so sick.

Desired output data:

1I got from Dr. Smith, the OK to keep working.
2I got from Dr. Smith, the O.K. to keep working.
3I got from Dr.
Smith, the OK to keep working more.
4I got from Dr.
Smith, the O.K. to keep working more.
5I got from Dr.
Smith, the O.K. to keep working more, although I'm sick.
6I got from Dr.
Smith, the O.K.
to keep working more, although I'm so sick.

I tried the following code:

sed -r ':a; /((\w)+[., ]+){11}/s/\./\r\n/; ta' grab.txt | tr '\r' '.' > output.txt

That code produces the following incorrect result:

1I got from Dr. Smith, the OK to keep working.
2I got from Dr.
 Smith, the O.K. to keep working.
3I got from Dr.
 Smith, the OK to keep working more.
4I got from Dr.
 Smith, the O.K. to keep working more.
5I got from Dr.
 Smith, the O.K. to keep working more, although I'm sick.
6I got from Dr.
 Smith, the O.K. to keep working more, although I'm so sick.

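Note that lines 1 and 2 both have 10 words, yet line 2 gets split. It seems that periods inside a word (for example O.K.) make the pattern think the line contains more words than it really does.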

Note also that line 6 should end up as 3 lines, because its second resulting line has 11 words, but for some reason it is not split again.
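The suspicion about line 2 can be checked with a small sketch. It compares the whitespace-delimited word count (awk's NF) with the number of "word plus separators" tokens that a pattern like the one in the sed command can find. The function names are made up for this illustration, and `[A-Za-z0-9_]` is used as an approximation of `\w`; the count is the total number of such tokens in the line, not the consecutive run that `{11}` requires, so treat it as a rough diagnostic only.

```shell
# Hypothetical helpers for illustration; both read stdin and print one number per line.

# Words as defined in the question: runs of non-blank characters.
count_words() {
  awk '{ print NF }'
}

# Tokens of the form "word characters followed by [., ]+".
# gsub() replaces each match with itself ("&") and returns how many it found.
count_regex_tokens() {
  awk '{ print gsub(/[A-Za-z0-9_]+[., ]+/, "&") }'
}

line='2I got from Dr. Smith, the O.K. to keep working.'
printf '%s\n' "$line" | count_words         # prints 10
printf '%s\n' "$line" | count_regex_tokens  # prints 11 ("O." and "K. " count separately)
```

For line 2 the regex sees 11 tokens where there are only 10 words, which is consistent with that line being split unexpectedly.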

I am looking for a solution that can be used in a pipeline, reading from a pipe and writing to one.

Thanks.

Best answer

A straightforward solution with awk:

awk '{
  while (NF>10) {
    if (!(i=index($0,".")))
      break
    print substr($0,1,i)
    $0=substr($0,i+1)
    # trim leading blank(s)
    $1=$1
  }
  if ($0!="")
    print
}' file

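As long as a line has more than ten words, it is cut in two at the first period: the first part is printed, the line is replaced by the second part, and the test is repeated. If no period is left, the loop stops and whatever remains is printed as is.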
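One thing to watch: `index($0, ".")` stops at the very first period, so for the remainder of sample line 6 it would cut inside "O.K." (after "O."), while the desired output in the question breaks after "O.K.". If breaking only at a period that is followed by a blank is what is wanted, a possible variant looks like the sketch below. This is an assumption about the intent, not part of the original answer, and the function name `split_long_lines` is made up here. It reads stdin and writes stdout, so it can sit in a pipeline.

```shell
# Hypothetical wrapper: split lines of more than 10 words at the first
# period that is followed by a space or tab, repeating on the remainder.
split_long_lines() {
  awk '{
    while (NF > 10) {
      # match() sets RSTART to the position of the period; stop if none is found
      if (!match($0, /\.[ \t]/))
        break
      print substr($0, 1, RSTART)     # everything up to and including the period
      $0 = substr($0, RSTART + 1)     # keep the rest; fields and NF are recomputed
      $1 = $1                         # rebuild the record: drops leading blanks
                                      # (and squeezes runs of blanks to one space)
    }
    if ($0 != "")
      print
  }'
}
```

With the file names from the question, it could be used as `split_long_lines < grab.txt > output.txt`, or in the middle of a longer pipeline.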

By the way, doing this with sed is not a good idea at all.

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/58310322/