c++ - How to parse an XML file into a columnar storage format?

Reposted by: 太空宇宙 · Updated: 2023-11-04 12:31:16

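I want to parse an XML file in chunks so that it does not exhaust memory, and parse it into a columnar layout, i.e. key1:value1, key2:value2, key3:value3, and so on.

Currently, I am reading the file like this: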

#include <fstream>
#include <map>
#include <string>

using namespace std;

// Minimal definition assumed here; the original post does not show User.
struct User
{
    string Id;
    string DisplayName;
};

string parseFieldFromLine(const string &line, const string &key)
{
    // We're looking for a thing that looks like:
    // [key]="[value]"
    // as part of a larger string.
    // We are given [key], and want to return [value].

    // Find the start of the pattern
    string keyPattern = key + "=\"";
    size_t idx = line.find(keyPattern);

    // No match (find() returns string::npos, not -1 in a signed type)
    if (idx == string::npos)
        return "";

    // Find the closing quote at the end of the pattern
    size_t start = idx + keyPattern.size();
    size_t end = line.find('"', start);

    // No closing quote: stop here instead of reading past the end of the line
    if (end == string::npos)
        return "";

    // Extract [value] from the overall string and return it.
    // We have (start, end); substr() requires (start, length).
    return line.substr(start, end - start);
}

map<string, User> users;

void readUsers(const string &filename)
{
    ifstream fin;
    fin.open(filename.c_str());

    // Each <row ... /> element is expected to sit on a single line
    string line;
    while (getline(fin, line))
    {
        User u;
        u.Id = parseFieldFromLine(line, "Id");
        u.DisplayName = parseFieldFromLine(line, "DisplayName");
        users[u.Id] = u;
    }
}

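As you can see, I call a function that searches each line for a substring. This is fragile: if the file (or a single line) is malformed, I get unexpected values and the failure is silent.

I have read about using an XML parser, but I am new to C++ and cannot tell which one would work best for this key-value format, and I also know little about testing for correctness or efficiency. My current input data looks like this: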
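One way to make that failure visible can be sketched as follows. This is only an illustration under stated assumptions: `findAttribute` is a hypothetical name, attributes are assumed to be separated by a single space as in the sample data, and it is still plain substring search rather than XML parsing (entities such as `&lt;` are not decoded).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: looks for  key="value"  on one line.
// Returns true and stores the raw text in `value` when the attribute is found;
// returns false when it is absent or its closing quote is missing.
bool findAttribute(const std::string &line, const std::string &key,
                   std::string &value)
{
    // The leading space keeps "Id" from matching inside "OwnerUserId".
    const std::string pattern = " " + key + "=\"";
    const std::size_t idx = line.find(pattern);
    if (idx == std::string::npos)
        return false; // attribute not present

    const std::size_t start = idx + pattern.size();
    const std::size_t end = line.find('"', start);
    if (end == std::string::npos)
        return false; // malformed: no closing quote

    value = line.substr(start, end - start);
    return true;
}
```

With a boolean result the caller can tell an empty attribute (`Id=""`) apart from a missing or broken one, and can decide whether to skip the row or abort.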

<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="509" CreationDate="2009-04-30T06:49:01.807" Score="13" ViewCount="903" Body="&lt;p&gt;Our nightly full (and periodic differential) backups are becoming quite large, due mostly to the amount of indexes on our tables; roughly half the backup size is comprised of indexes.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;We're using the &lt;strong&gt;Simple&lt;/strong&gt; recovery model for our backups.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Is there any way, through using &lt;code&gt;FileGroups&lt;/code&gt; or some other file-partitioning method, to &lt;strong&gt;exclude&lt;/strong&gt; indexes from the backups?&lt;/p&gt;&#xA;&#xA;&lt;p&gt;It would be nice if this could be extended to full-text catalogs, as well.&lt;/p&gt;&#xA;" OwnerUserId="3" LastEditorUserId="919" LastEditorDisplayName="" LastEditDate="2009-05-04T02:11:16.667" LastActivityDate="2009-05-10T15:22:39.707" Title="How to exclude indexes from backups in SQL Server 2008" Tags="&lt;sql-server&gt;&lt;backup&gt;&lt;sql-server-2008&gt;&lt;indexes&gt;" AnswerCount="3" CommentCount="0" FavoriteCount="3" />
  <row Id="2" PostTypeId="1" AcceptedAnswerId="1238" CreationDate="2009-04-30T07:04:18.883" Score="18" ViewCount="1951" Body="&lt;p&gt;We've struggled with the RAID controller in our database server, a &lt;a href=&quot;http://www.pc.ibm.com/europe/server/index.html?nl&amp;amp;cc=nl&quot; rel=&quot;nofollow&quot;&gt;Lenovo ThinkServer&lt;/a&gt; RD120. It is a rebranded Adaptec that Lenovo / IBM dubs the &lt;a href=&quot;http://www.redbooks.ibm.com/abstracts/tips0054.html#ServeRAID-8k&quot; rel=&quot;nofollow&quot;&gt;ServeRAID 8k&lt;/a&gt;.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;We have patched this &lt;a href=&quot;http://www.redbooks.ibm.com/abstracts/tips0054.html#ServeRAID-8k&quot; rel=&quot;nofollow&quot;&gt;ServeRAID 8k&lt;/a&gt; up to the very latest and greatest:&lt;/p&gt;&#xA;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;RAID bios version&lt;/li&gt;&#xA;&lt;li&gt;RAID backplane bios version&lt;/li&gt;&#xA;&lt;li&gt;Windows Server 2008 driver&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&#xA;&lt;p&gt;This RAID controller has had multiple critical BIOS updates even in the short 4 month time we've owned it, and the &lt;a href=&quot;ftp://ftp.software.ibm.com/systems/support/system%5Fx/ibm%5Ffw%5Faacraid%5F5.2.0-15427%5Fanyos%5F32-64.chg&quot; rel=&quot;nofollow&quot;&gt;change history&lt;/a&gt; is just.. well, scary. &lt;/p&gt;&#xA;&#xA;&lt;p&gt;We've tried both write-back and write-through strategies on the logical RAID drives. &lt;strong&gt;We still get intermittent I/O errors under heavy disk activity.&lt;/strong&gt; They are not common, but serious when they happen, as they cause SQL Server 2008 I/O timeouts and sometimes failure of SQL connection pools.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;We were at the end of our rope troubleshooting this problem. Short of hardcore stuff like replacing the entire server, or replacing the RAID hardware, we were getting desperate.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;When I first got the server, I had a problem where drive bay #6 wasn't recognized. 
Switching out hard drives to a different brand, strangely, fixed this -- and updating the RAID BIOS (for the first of many times) fixed it permanently, so I was able to use the original &quot;incompatible&quot; drive in bay 6. On a hunch, I began to assume that &lt;a href=&quot;http://www.newegg.com/Product/Product.aspx?Item=N82E16822136143&quot; rel=&quot;nofollow&quot;&gt;the Western Digital SATA hard drives&lt;/a&gt; I chose  were somehow incompatible with the ServeRAID 8k controller.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Buying 6 new hard drives was one of the cheaper options on the table, so I went for &lt;a href=&quot;http://www.newegg.com/Product/Product.aspx?Item=N82E16822145215&quot; rel=&quot;nofollow&quot;&gt;6 Hitachi (aka IBM, aka Lenovo) hard drives&lt;/a&gt; under the theory that an IBM/Lenovo RAID controller is more likely to work with the drives it's typically sold with.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Looks like that hunch paid off -- we've been through three of our heaviest load days (mon,tue,wed) without a single I/O error of any kind. Prior to this we regularly had at least one I/O &quot;event&quot; in this time frame. &lt;strong&gt;It sure looks like switching brands of hard drive has fixed our intermittent RAID I/O problems!&lt;/strong&gt;&lt;/p&gt;&#xA;&#xA;&lt;p&gt;While I understand that IBM/Lenovo probably tests their RAID controller exclusively with their own brand of hard drives, I'm disturbed that a RAID controller would have such subtle I/O problems with particular brands of hard drives.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;So my question is, &lt;strong&gt;is this sort of SATA drive incompatibility common with RAID controllers?&lt;/strong&gt; Are there some brands of drives that work better than others, or are &quot;validated&quot; against particular RAID controller? 
I had sort of assumed that all commodity SATA hard drives were alike and would work reasonably well in any given RAID controller (of sufficient quality).&lt;/p&gt;&#xA;" OwnerUserId="1" LastActivityDate="2011-03-08T08:18:15.380" Title="Do RAID controllers commonly have SATA drive brand compatibility issues?" Tags="&lt;raid&gt;&lt;ibm&gt;&lt;lenovo&gt;&lt;serveraid8k&gt;" AnswerCount="8" FavoriteCount="2" />
  <row Id="3" PostTypeId="1" AcceptedAnswerId="104" CreationDate="2009-04-30T07:48:06.750" Score="26" ViewCount="692" Body="&lt;ul&gt;&#xA;&lt;li&gt;How do you keep your servers up to date?&lt;/li&gt;&#xA;&lt;li&gt;When using a package manager like &lt;a href=&quot;http://wiki.debian.org/Aptitude&quot; rel=&quot;nofollow&quot;&gt;Aptitude&lt;/a&gt;, do you keep an upgrade / install history, and if so, how do you do it?&lt;/li&gt;&#xA;&lt;li&gt;When installing or upgrading packages on multiple servers, are there any ways to speed the process up as much as possible?&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;" OwnerUserId="22" LastEditorUserId="22" LastEditorDisplayName="" LastEditDate="2009-04-30T08:05:02.217" LastActivityDate="2009-06-05T04:01:09.423" Title="Best practices for keeping UNIX packages up to date?" Tags="&lt;unix&gt;&lt;package-management&gt;&lt;server-management&gt;" AnswerCount="11" FavoriteCount="14" />
  <row Id="4" PostTypeId="2" ParentId="3" CreationDate="2009-04-30T07:49:58.027" Score="10" ViewCount="" Body="&lt;p&gt;Regarding your third question: I always run a local repository. Even if it's only for one machine, it saves time in case I need to reinstall (I generally use something like aptitude autoclean), and for two machines, it almost always pays off.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;For the clusters I admin, I don't generally keep explicit logs: I let the package manager do it for me. However, for those machines (as opposed to desktops), I don't use automatic installations, so I do have my notes about what I intended to install to all machines.&lt;/p&gt;&#xA;" OwnerUserId="28" LastActivityDate="2009-04-30T07:49:58.027" CommentCount="1" />
  <row Id="5" PostTypeId="2" ParentId="2" CreationDate="2009-04-30T07:56:20.070" Score="4" ViewCount="" Body="&lt;p&gt;I don't think it's common per se. However, as soon as you start using enterprise storage controllers, whether that be SAN's or standalone RAID controllers, you'll generally want to adhere to their compatibility list rather closely.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;You may be able to save some bucks on the sticker price by buying a cheap range of disks, but that's probably one of the last areas I'd want to save money on - given the importance of data in most scenarios.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;In other words, explicit incompatibility is very uncommon, but explicit compatibility adherence is recommendable.&lt;/p&gt;&#xA;" OwnerUserId="24" LastActivityDate="2009-04-30T07:56:20.070" />
  <row Id="6" PostTypeId="1" AcceptedAnswerId="537" CreationDate="2009-04-30T07:57:06.247" Score="8" ViewCount="2648" Body="&lt;p&gt;Our database currently only has one FileGroup, PRIMARY, which contains roughly 8GB of data (table rows, indexes, full-text catalog).&lt;/p&gt;&#xA;&#xA;&lt;p&gt;When is a good time to split this into secondary data files?  What are some criteria that I should be aware of?&lt;/p&gt;&#xA;" OwnerUserId="3" LastActivityDate="2009-07-08T07:23:49.527" Title="In SQL Server, when should you split your PRIMARY Data FileGroup into secondary data files?" Tags="&lt;sql-server&gt;&lt;files&gt;&lt;filegroups&gt;" AnswerCount="3" FavoriteCount="1" />
  <row Id="7" PostTypeId="1" AcceptedAnswerId="17" CreationDate="2009-04-30T07:57:09.117" Score="12" ViewCount="529" Body="&lt;p&gt;What enterprise virus-scanning systems do you recommend?&lt;/p&gt;&#xA;" OwnerUserId="32" LastActivityDate="2009-04-30T11:51:09.290" Title="What is the best enterprise virus-scanning system?" Tags="&lt;antivirus&gt;" AnswerCount="8" CommentCount="3" FavoriteCount="2" />
  <row Id="8" PostTypeId="2" ParentId="3" CreationDate="2009-04-30T07:57:15.653" Score="0" ViewCount="" Body="&lt;p&gt;You can have a local repository and configure all servers to point to it for updates. Not only you get speed of local downloads, you also get to control which official updates you want installed on your infrastructure in order to prevent any compatibility issues.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;On the Windows side of things, I've used &lt;a href=&quot;http://technet.microsoft.com/en-us/wsus/default.aspx&quot; rel=&quot;nofollow&quot;&gt;Windows Server Update Services&lt;/a&gt; with very satisfying results.&lt;/p&gt;&#xA;" OwnerUserId="36" LastActivityDate="2009-04-30T07:57:15.653" />

The other file:

<?xml version="1.0" encoding="utf-8"?>
<users>
  <row Id="1" Reputation="4220" CreationDate="2009-04-30T07:08:27.067" DisplayName="Jeff Atwood" EmailHash="51d623f33f8b83095db84ff35e15dbe8" LastAccessDate="2011-09-03T13:30:29.990" WebsiteUrl="http://www.codinghorror.com/blog/" Location="El Cerrito, CA" Age="40" AboutMe="&lt;p&gt;&lt;img src=&quot;http://img377.imageshack.us/img377/4074/wargames1xr6.jpg&quot; width=&quot;250&quot;&gt;&lt;/p&gt;&#xA;&#xA;&lt;p&gt;&lt;a href=&quot;http://www.codinghorror.com/blog/archives/001169.html&quot; rel=&quot;nofollow&quot;&gt;Stack Overflow Valued Associate #00001&lt;/a&gt;&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Wondering how our software development process works? &lt;a href=&quot;http://www.youtube.com/watch?v=08xQLGWTSag&quot; rel=&quot;nofollow&quot;&gt;Take a look!&lt;/a&gt;&lt;/p&gt;&#xA;" Views="3562" UpVotes="1995" DownVotes="31" />
  <row Id="2" Reputation="697" CreationDate="2009-04-30T07:08:27.067" DisplayName="Geoff Dalgas" EmailHash="b437f461b3fd27387c5d8ab47a293d35" LastAccessDate="2011-09-05T22:14:06.527" WebsiteUrl="http://stackoverflow.com" Location="Corvallis, OR" Age="34" AboutMe="&lt;p&gt;Developer on the StackOverflow team.  Find me on&lt;/p&gt;&#xA;&#xA;&lt;p&gt;&lt;a href=&quot;http://www.twitter.com/SuperDalgas&quot; rel=&quot;nofollow&quot;&gt;Twitter&lt;/a&gt;&#xA;&lt;br&gt;&lt;br&gt;&#xA;&lt;a href=&quot;http://blog.stackoverflow.com/2009/05/welcome-stack-overflow-valued-associate-00003/&quot; rel=&quot;nofollow&quot;&gt;Stack Overflow Valued Associate #00003&lt;/a&gt; &lt;/p&gt;&#xA;" Views="291" UpVotes="46" DownVotes="2" />
  <row Id="3" Reputation="259" CreationDate="2009-04-30T07:08:27.067" DisplayName="Jarrod Dixon" EmailHash="2dfa19bf5dc5826c1fe54c2c049a1ff1" LastAccessDate="2011-09-01T20:43:27.743" WebsiteUrl="http://stackoverflow.com" Location="New York, NY" Age="32" AboutMe="&lt;p&gt;&lt;a href=&quot;http://blog.stackoverflow.com/2009/01/welcome-stack-overflow-valued-associate-00002/&quot; rel=&quot;nofollow&quot;&gt;Developer on the Stack Overflow team&lt;/a&gt;.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;Was dubbed &lt;strong&gt;SALTY SAILOR&lt;/strong&gt; by Jeff Atwood, as filth and flarn would oft-times fly when dealing with a particularly nasty bug!&lt;/p&gt;&#xA;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Twitter me: &lt;a href=&quot;http://twitter.com/jarrod_dixon&quot; rel=&quot;nofollow&quot;&gt;jarrod_dixon&lt;/a&gt;&lt;/li&gt;&#xA;&lt;li&gt;Email me: jarrod.m.dixon@gmail.com&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;" Views="210" UpVotes="259" DownVotes="4" />

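Best answer

I guess what you are looking for is a SAX parser. It does not read the whole document at once (as a DOM parser does); instead, you define callbacks for specific events (for example, the start of a new XML element). Since you are processing the file element by element, this sounds like a good fit.

I must admit I have never done any XML parsing in C++, but these libraries sound suitable for your problem: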

expat
sequel max
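xerces: was the de facto standard in Java in the early 2000s, but has since been overtaken by other libraries. Still, the C++ implementation appears to be maintained.

Regarding "c++ - How to parse an XML file into a columnar storage format?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58574000/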
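The question also asks for a columnar layout. Whichever parser is chosen, its start-element callback could feed a structure like the one below: a minimal standard-library sketch that keeps one vector per attribute name and pads it so every column has exactly one entry per row. `ColumnStore` and `appendRow` are hypothetical names, and storing every value as a string is a simplifying assumption.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical column store: one vector of values per attribute name.
class ColumnStore
{
public:
    typedef std::vector<std::pair<std::string, std::string> > Attributes;

    // Appends one row given its (name, value) pairs. Attributes that a row
    // does not carry are stored as empty strings, so all columns stay aligned.
    void appendRow(const Attributes &attrs)
    {
        for (std::size_t i = 0; i < attrs.size(); ++i)
        {
            std::vector<std::string> &column = columns_[attrs[i].first];
            column.resize(rows_); // back-fill a column seen for the first time
            column.push_back(attrs[i].second);
        }
        ++rows_;

        // Pad every column that this row did not mention.
        std::map<std::string, std::vector<std::string> >::iterator it;
        for (it = columns_.begin(); it != columns_.end(); ++it)
            it->second.resize(rows_);
    }

    // Throws std::out_of_range if no row has ever carried this attribute.
    const std::vector<std::string> &column(const std::string &name) const
    {
        return columns_.at(name);
    }

    std::size_t rowCount() const { return rows_; }

private:
    std::map<std::string, std::vector<std::string> > columns_;
    std::size_t rows_ = 0;
};
```

The padding matters for data like the posts file above, where answers have a `ParentId` but no `Title`: index `i` in every column then refers to the same row.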
