javascript - Split innerHTML into text for translation as JSON in JavaScript

Reposted by: 行者123. Last updated: 2023-11-28 02:31:22

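I am currently working on an application that needs to take the innerHTML of the body and extract the text from it as JSON. That JSON will be sent for translation, and the translated JSON will then be used as input to recreate the same HTML markup with the translated text. Please see the snippets below.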

HTML input

<section>Hello, <div>This is some text which I need to extract.<a class="link">It can be <strong> complicated.</strong></a></div><span>The extracted text should contain the html tag if it has any html tag in the span,p or a tag</span><p>Please see the <span>desired output below.</span></p>Thanks!</section>

JSON output for translation

{
"text1":"Hello, ",
"text2":"This is some text which I need to extract.",
"text3":"It can be <strong> complicated.</strong>",
"text4":"The extracted text should contain the html tag if it has any html tag in the span,p or a tag",
"text5":"Please see the <span>desired output below.</span>",
"text6":"Thanks!"
}

Translated JSON input

{
"text1":"Hello,-in spanish ",
"text2":"This is some text which I need to extract.-in spanish",
"text3":"It can be <strong> complicated.-in spanish</strong>",
"text4":"The extracted text should contain the html tag if it has any html tag in the span,p or a tag-in spanish",
"text5":"Please see the <span>desired output below.-in spanish</span>",
"text6":"Thanks!-in spanish"
}

Translated HTML output

<section>Hello,-in spanish <div>This is some text which I need to extract.-in spanish<a class="link">It can be <strong> complicated.-in spanish</strong></a></div><span>The extracted text should contain the html tag if it has any html tag in the span,p or a tag-in spanish</span><p>Please see the <span>desired output below.-in spanish</span></p>Thanks!-in spanish</section>

I have tried various regular expressions. Below is one of the flows I ended up with, but I could not get the desired output with it.

// encode
const bodyHTML = '<a class="test">hello world<strong> this is gonna be hard</strong></a>';
// replace the quotes with escaped quotes
const htmlContent = bodyHTML.replace(/"/g, '\\"');
let count = 0;
const translationObj = {};
// swap each run of text between ">" and "<" for a numbered ~n~ placeholder
const newHtml = htmlContent.replace(/>(.*?)</g, function(match) {
  // strip the surrounding angle brackets
  match = match.replace(/>|</g, '');
  count = count + 1;
  translationObj[count] = match;

  return '>~' + count + '~<';
});

const translationJSON = '{"1":"hello world in spanish","2":" this is gonna be hard in spanish","3":""}';

// decode
const translatedObj = JSON.parse(translationJSON);
let translatedHtml = newHtml.replace(/~(.*?)~/g, function(match) {
  // strip the surrounding tildes
  match = match.replace(/~/g, '');

  return translatedObj[match];
});
// replace the escaped quotes with quotes
translatedHtml = translatedHtml.replace(/\\"/g, '"');
console.log("bodyHTML", bodyHTML);
console.log("translationObj", translationObj);
console.log("translationJSON", translationJSON);
console.log("newHtml", newHtml);
console.log("translatedHtml", translatedHtml);

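I am looking for a working regex, or any other approach in the JS world, that produces the expected result. I want to get all the text in the HTML as JSON. One more condition: if a text contains inner HTML tags, it should not be split, so that the context of the sentence is not lost. For example, <p>Click <a>here</a></p> should be treated as a single text, "Click <a>here</a>". I hope that clears up any doubts.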

Thanks in advance!

Best answer

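By far the best way to do this is to use an HTML parser and then loop through the text nodes in the tree. You cannot correctly handle a non-regular markup language like HTML with just simple JavaScript regular expressions¹ (many people have wasted a lot of time trying), and that is before taking HTML's own peculiarities into account.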
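A short sketch of why this breaks down, reusing the `/>(.*?)</g` pattern from the question. The helper name `textBetweenTags` and the second input string are made up for illustration.

```javascript
// Capture whatever sits between a ">" and the next "<", as the question's regex does.
function textBetweenTags(html) {
  return [...html.matchAll(/>(.*?)</g)].map((m) => m[1]);
}

// An inline tag splits one sentence into fragments and adds an empty match
// for the gap between "</a>" and "</p>".
const split = textBetweenTags("<p>Click <a>here</a></p>");
console.log(split); // result: ["Click ", "here", ""]

// A ">" inside an attribute value is mistaken for the end of a tag,
// so part of the attribute leaks into the extracted "text".
const leaked = textBetweenTags('<a title="a > b">x</a>');
console.log(leaked); // result: [' b">x']
```

Neither case can be patched reliably by adjusting the pattern, because deciding where a tag ends requires tracking quoting and nesting, which is what a parser does.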

There are at least a couple of well-tested, actively supported DOM parser modules on npm, and probably several.

So the basic structure would be:

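1. Parse the HTML into a DOM.
2. Walk the DOM in a defined order (typically a depth-first traversal), building the object or array of text strings to translate from the text nodes you encounter.
3. If necessary, convert that object/array to JSON, send it off for translation, get the result back, and, if necessary, parse it from JSON into an object/array again.
4. Walk the DOM in the same order, applying the results from the object/array.
5. Serialize the DOM to HTML.
6. Send the result.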

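Here is an example. Naturally, I am using the browser's built-in HTML parser here rather than an npm module, and the API of whatever module you use may be slightly different, but the concept is the same: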

var html = '<section>Hello, <div>This is some text which I need to extract.<a class="link">It can be <strong> complicated.</strong></a></div><span>The extracted text should contain the html tag if it has any html tag in the span,p or a tag</span><p>Please see the <span>desired output below.</span></p>Thanks!</section>';
var dom = parseHTML(html);
var strings = [];
walk(dom, function(node) {
  if (node.nodeType === 3) { // text node
    strings.push(node.nodeValue);
  }
});
console.log("strings = ", strings);
var translation = translate(strings);
console.log("translation = ", translation);
var n = 0;
walk(dom, function(node) {
  if (node.nodeType === 3) { // text node
    node.nodeValue = translation[n++];
  }
});
var newHTML = serialize(dom);
document.getElementById("before").innerHTML = html;
document.getElementById("after").innerHTML = newHTML;


function translate(strings) {
  return strings.map(str => str.toUpperCase());
}

function walk(node, callback) {
  var child;
  callback(node);
  switch (node.nodeType) {
    case 1: // Element
      for (child = node.firstChild; child; child = child.nextSibling) {
        walk(child, callback);
      }
  }
}

// Placeholder for module function
function parseHTML(html) {
  var div = document.createElement("div");
  div.innerHTML = html;
  return div;
}

// Placeholder for module function
function serialize(dom) {
  return dom.innerHTML;
}

<strong>Before:</strong>
<div id="before"></div>
<strong>After:</strong>
<div id="after"></div>
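The example above translates every text node separately, so a sentence with inline markup is still split. The following is a minimal sketch of one way to keep such sentences together, assuming (from the question) that `span`, `p` and `a` elements should be translated as whole units via their `innerHTML`. The names `UNIT_TAGS`, `walkUnits`, `extractUnits` and `applyUnits` are hypothetical, and skipping whitespace-only text nodes is my own assumption.

```javascript
// Assumption: the inner markup of these tags forms one translation unit.
const UNIT_TAGS = new Set(["SPAN", "P", "A"]);

// Visit translation units in document order. A unit is either a unit element
// (taken whole, never descended into) or a non-blank text node.
function walkUnits(node, visit) {
  if (node.nodeType === 3) { // text node
    if (node.nodeValue.trim() !== "") visit(node, "nodeValue");
  } else if (node.nodeType === 1) { // element
    if (UNIT_TAGS.has(node.tagName.toUpperCase())) {
      visit(node, "innerHTML");
    } else {
      // Snapshot the children so later edits cannot disturb the loop.
      for (const child of Array.from(node.childNodes)) walkUnits(child, visit);
    }
  }
}

// Build { text1: ..., text2: ... } in traversal order.
function extractUnits(root) {
  const units = {};
  let n = 0;
  walkUnits(root, (node, prop) => { units["text" + (++n)] = node[prop]; });
  return units;
}

// Write translated units back, relying on the same traversal order.
function applyUnits(root, translated) {
  let n = 0;
  walkUnits(root, (node, prop) => {
    const key = "text" + (++n);
    if (key in translated) node[prop] = translated[key];
  });
}

// Browser-only demo; skipped where no DOM is available.
if (typeof document !== "undefined") {
  const container = document.createElement("div");
  container.innerHTML = "<p>Click <a>here</a></p>Thanks!";
  console.log(JSON.stringify(extractUnits(container)));
}
```

Applied to the sample HTML from the question, this rule should yield the six units listed there (`text1` to `text6`). Note that writing a translation back through `innerHTML` trusts the translated string as markup, so it should only be used with a translation source you trust.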

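¹ Some "regular expression" libraries (or the regular expression features of other languages) are really regular expressions plus extra features that can help with this kind of task, but they are not just regular expressions, and JavaScript's built-in ones do not have those features.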

Regarding "javascript - Split innerHTML into text for translation as JSON in JavaScript", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50493160/
