mysql - Chinese character problem in Solr DataImport

Reposted. Author: 行者123. Updated: 2023-11-29 08:45:41

I am having trouble indexing Chinese/Japanese text in Solr 3.4. I am importing the data with DIH, and the connection block is

<dataSource type="JdbcDataSource"
    driver="com.mysql.jdbc.Driver"
    url="jdbc:mysql://localhost/db_development?useUnicode=true&amp;characterEncoding=UTF-8&amp;characterSetResults=UTF-8"
    user="user"
    useUnicode="true"
    characterEncoding="UTF-8"
    encoding="UTF-8"
    password="password"
    zeroDateTimeBehavior="convertToNull"
    name="app" />

The field type for this field is defined as

<fieldType name="text_commongrams" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.ICUTokenizerFactory" />
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc_cf" mode="compose"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.SynonymFilterFactory"
      synonyms="synonyms.txt"
      ignoreCase="true"
      expand="true" />
    <filter class="solr.CommonGramsFilterFactory"
      words="stopwords_en.txt"
      ignoreCase="true" />
    <filter class="solr.StopFilterFactory"
      words="stopwords_en.txt"
      ignoreCase="true" />
    <filter class="solr.WordDelimiterFilterFactory"
      generateWordParts="1"
      splitOnNumerics="0"
      generateNumberParts="1"
      catenateWords="1"
      catenateNumbers="1"
      catenateAll="0"
      preserveOriginal="1" />
  </analyzer>
</fieldType>

The MySQL character encoding details are as follows

+--------------------------+-----------------------------------------+
| Variable_name            | Value                                   |
+--------------------------+-----------------------------------------+
| character_set_client     | latin1                                  |
| character_set_connection | latin1                                  |
| character_set_database   | latin1                                  |
| character_set_filesystem | binary                                  |
| character_set_results    | latin1                                  |
| character_set_server     | utf8                                    |
| character_set_system     | utf8                                    |
| character_sets_dir       | /opt/local/share/mysql5/mysql/charsets/ |
+--------------------------+-----------------------------------------+

I start Solr with the Java argument -Dfile.encoding=UTF-8.

The input text is JavaOne Tokyo 2012での発表スライド. When I import it into Solr and query the document by ID, the text comes back as JavaOne Tokyo 2012ã§ã®ç™ºè¡¨ã‚¹ãƒ©ã‚¤ãƒ‰
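The garbled output is consistent with UTF-8 bytes being read as latin1 somewhere between the table and Solr (MySQL's latin1 is effectively Windows-1252). The following is a minimal Python sketch, not part of the original question, that reproduces this kind of garbling. The choice of the `cp1252` codec and of `errors="ignore"` for bytes that codec leaves undefined are assumptions made for illustration.

```python
# Illustrative only: reproduce the garbling by decoding UTF-8 bytes as cp1252.
original = "JavaOne Tokyo 2012での発表スライド"

# The Japanese characters become three bytes each in UTF-8.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as single-byte cp1252 turns each byte into its own character.
# errors="ignore" drops byte 0x81, which Python's cp1252 codec leaves undefined.
garbled = utf8_bytes.decode("cp1252", errors="ignore")

print(garbled)
```

The ASCII prefix survives unchanged because UTF-8 and cp1252 agree on that range; only the multi-byte characters are broken apart, which matches the pattern in the query result.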

Can anyone tell me where I am going wrong?

Best answer

So in the end I had to alter my MySQL tables to store strings as UTF8. For details on how to convert an existing table from latin1 to utf8, see here.
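The answer does not show the conversion it used. As a hedged sketch, MySQL's `ALTER TABLE ... CONVERT TO CHARACTER SET` statement is one common route. The helper below only builds the statement text and does not touch a database; the function name and the table name `slides` are hypothetical. If the latin1 columns already hold raw UTF-8 bytes, this statement would likely re-encode them, and a round trip through a binary column type is usually needed instead, so back up the data before trying either approach.

```python
def build_convert_statement(table: str,
                            charset: str = "utf8",
                            collation: str = "utf8_general_ci") -> str:
    """Build an ALTER TABLE statement that converts a table's text columns."""
    # Backtick-quote the identifier, doubling any embedded backticks.
    quoted = "`" + table.replace("`", "``") + "`"
    return (f"ALTER TABLE {quoted} "
            f"CONVERT TO CHARACTER SET {charset} COLLATE {collation};")


# "slides" is a hypothetical table name used only for illustration.
print(build_convert_statement("slides"))
```

The generated statement can be reviewed and then run from a MySQL client against a copy of the table first.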

Regarding mysql - Chinese character problem in Solr DataImport, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/12580155/
