
c++ - std::regex: Ubuntu (15.10) Clang++ generates a binary that is about 3.5 times faster than Debian 8 Clang++ (both v3.4)

Reposted. Author: 搜寻专家. Updated: 2023-10-31 00:59:22

I created a test program that measures the performance of std::regex when parsing CSV data:

#include <string.h>
#include <cmath>     // floor, ceil
#include <iostream>
#include <numeric>   // std::accumulate
#include <stdexcept>
#include <string>
#include <chrono>
#include <regex>
#include <set>
#include <iomanip>

#define DEFAULT_REGEX \
R"(^((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)" \
R"((L|P|D|DN|R|W|LS|PS|RS|LU|PU|RU|LK|PK|RK|F);)" \
R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)" \
R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;)*);)" \
R"(((?:[^\x00-\x1F\x80-\xFF\\;]|\\\\|\\;|\\:)*))" \
R"((?:;((?:[^\x00-\x1F\x80-\xFF\\;])" \
R"(|\\\\|\\;|\';\')*))?$)"

struct results_t {
    std::string address;
    std::string command;
    std::string client;
    std::string param;
    std::string value;
    std::string error;
};

void std_regex(std::size_t num, const std::string &str, results_t &res) {
    std::smatch pieces;
    static const std::regex pattern{DEFAULT_REGEX};
    for (auto i = 0u; i < num; i++) {
        bool matched = std::regex_match(str, pieces, pattern);
        if (!(matched && pieces.size() == 7)) {
            throw std::runtime_error("ERROR");
        }
    }
    res.address = pieces[1];
    res.command = pieces[2];
    res.client = pieces[3];
    res.param = pieces[4];
    res.value = pieces[5];
    res.error = pieces[6];
}

std::size_t get_median(const std::multiset<std::size_t> &measured_values) {
    std::size_t i = 0;
    std::size_t median = 0;
    for (auto it = measured_values.cbegin();; it++, i++) {
        double tmp = static_cast<double>(measured_values.size() - 1) / 2.0;
        if (i == floor(tmp)) {
            median = *it;
        }
        if (i == ceil(tmp)) {
            median += *it;
            break;
        }
    }
    return static_cast<std::size_t>(static_cast<double>(median) / 2.0 + 0.5);
}

std::size_t get_avg(const std::multiset<std::size_t> &measured_values) {
    // Accumulate in std::size_t: a plain 0 would make the running sum an int.
    return static_cast<std::size_t>(
        std::accumulate(measured_values.cbegin(), measured_values.cend(),
                        std::size_t{0}) /
            static_cast<double>(measured_values.size()) +
        0.5);
}

int main(void) {
    constexpr std::size_t num = 100000;
    constexpr std::size_t measure_num = 250;
    std::string str = "zzz\\\\bbbb;L;babaa;bubu\\;cc;vvvv;asdff";

    std::multiset<std::size_t> measured_values;
    results_t res;

    for (std::size_t i = 0; i < measure_num; i++) {
        auto start = std::chrono::system_clock::now();
        std_regex(num, str, res);
        auto end = std::chrono::system_clock::now();
        measured_values.insert(
            std::chrono::duration_cast<std::chrono::microseconds>(end - start)
                .count());
    }

    std::cout << *measured_values.cbegin() << ";"          // min
              << *measured_values.crbegin() << ";"         // max
              << get_avg(measured_values) << ";"           // average
              << get_median(measured_values) << std::endl; // median
}

On both Ubuntu 15.10 and Debian 8, the code is compiled (without errors or warnings) with:

clang++-3.4 -DCOMPILER='"clang++-3.4"' -Wall -pedantic-errors -Werror -Wextra -DNDEBUG -O3 -mtune=native -march=native -std=c++1y -o eval_clang_3_4 eval.cpp

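As expected, the program shows different timings when different compilers are used. For example, performance improves if you use g++ 5.2 instead of g++ 4.9.

But this evaluation program also shows an interesting behavior: with clang++-3.4 it produces much worse timings on Debian 8 than on Ubuntu 15.10. The software was run both times on the same machine (Intel i7-3770K with 8 GB RAM), and clang++-3.4 was used in both cases.

The evaluation was executed 250 times; the following lines show the statistics of these measurements.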
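The median over the 250 samples can also be computed from a sorted vector. The following is a minimal sketch rather than the code used for the measurements; `median_of` is a hypothetical name, and the sketch assumes a non-empty sample set and rounds half up the way `get_median` does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: median of a non-empty set of samples.
// For an even count, the mean of the two middle values is rounded half up.
std::size_t median_of(std::vector<std::size_t> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    const std::size_t lo = samples[(n - 1) / 2]; // lower middle element
    const std::size_t hi = samples[n / 2];       // upper middle element
    return lo + (hi - lo + 1) / 2;               // avoids overflow of lo + hi
}
```

Taking the vector by value keeps the caller's data untouched; for 250 samples the copy is negligible.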

Here are the measurements on Debian 8, in microseconds per 100,000 matches (min; max; average; median):

691244;1160628;713112;700739

Here are the measurements on Ubuntu 15.10, in microseconds per 100,000 matches (min; max; average; median):

198484;290986;202656;200637

If the difference were around 10% or 20%, I would not care, but in this case the difference is around 350%.
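For reference, that factor can be checked from the medians listed above. A small sketch (the helper name `slowdown` is made up for illustration):

```cpp
#include <cassert>

// Hypothetical helper: how many times longer `slow_us` is than `fast_us`.
double slowdown(double slow_us, double fast_us) {
    assert(fast_us > 0.0);
    return slow_us / fast_us;
}
```

With the medians above, `slowdown(700739, 200637)` comes out at roughly 3.5, which is where the 350% figure comes from.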

Why is there such a big difference when running this binary?

Best answer

I did some more benchmarks, elaborating on the tests in my earlier answer.
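When the same compiler version behaves so differently on two distributions, it may be worth checking which C++ standard library each binary was built against. This is an assumption about where to look, not something the measurements here prove. A minimal sketch using the version macros that libstdc++ (`__GLIBCXX__`) and libc++ (`_LIBCPP_VERSION`) define; `stdlib_name` is a hypothetical helper:

```cpp
#include <string> // any standard header makes the library's version macros visible

// Hypothetical helper: names the standard library this translation unit
// was compiled against, with its version macro value appended.
std::string stdlib_name() {
#if defined(_LIBCPP_VERSION)
    return "libc++ " + std::to_string(_LIBCPP_VERSION);
#elif defined(__GLIBCXX__)
    return "libstdc++ " + std::to_string(__GLIBCXX__); // release date stamp
#else
    return "unknown";
#endif
}
```

Printing `stdlib_name()` from the benchmark binary on both systems would show whether the two builds actually share the same `std::regex` implementation.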

I have created alternative parser implementations in:

  • Spirit Qi (v2.x)
  • Spirit X3 (C++14 only, experimental)
  • A hand-written parser (written in C++14 style, but easy to write in C++03)

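Performance results:

Clearly, the hand-written parser is the undisputed winner, regardless of the compiler used.

Spirit X3 comes in a close second.

Spirit Qi exactly matches std_regex performance, except with libc++, because std_regex is just slow there.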

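Summary:

I recommend using Spirit or a hand-written parser, because:

  • maintaining the regex is a nightmare (in fact, I already spotted errors)
  • all three alternatives give you more useful results (the escape sequences are actually interpreted, so you do not have to process them again)
  • the X3 grammar is very easy to maintain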
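To illustrate the second point: with the regex, a captured field still contains its backslashes, so a second pass like the following would be needed afterwards. This is a minimal sketch, not code from the benchmark; `unescape` is a hypothetical name and it only handles the `\\` and `\;` escapes.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical post-processing step for a regex capture:
// replaces each escape (\\ or \;) with the character it stands for.
std::string unescape(const std::string &field) {
    std::string out;
    out.reserve(field.size());
    for (std::size_t i = 0; i < field.size(); ++i) {
        if (field[i] == '\\') {
            if (++i == field.size())
                throw std::runtime_error("dangling backslash");
            if (field[i] != '\\' && field[i] != ';')
                throw std::runtime_error("invalid escape");
        }
        out += field[i]; // either a plain character or the escaped one
    }
    return out;
}
```

For example, the capture `bubu\;cc` becomes `bubu;cc`, which is what the Spirit and hand-written parsers produce directly.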

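Alternative 1: Spirit X3

If you can afford to use an experimental Boost library that requires C++14, this is my personal favorite. Look at the code and you will see why: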

// Needs <boost/spirit/home/x3.hpp> and, for the std::tie attribute,
// <boost/fusion/adapted/std_tuple.hpp>.
void spiritX3(const std::string &payload, results_t &res) {

    using namespace boost::spirit::x3;
    // Either a backslash followed by a character from `set`,
    // or any printable character that is not in `set`.
    auto escaping = [](auto&& set) { return ('\\' >> char_(set)) | (print - char_(set)); };
    auto text = escaping(";\\");

    symbols<unused_type> cmds;
    cmds += "L", "P", "D", "DN", "R", "W", "LS", "PS", "RS", "LU", "PU", "RU", "LK", "PK", "RK", "F";

    auto address_ = *text;
    auto command_ = raw [ cmds ];
    auto client_  = *text;
    auto param_   = *text;
    auto value_   = *escaping(";:\\"); // note the ':'
    auto error_   = *("'" >> char_(';') >> "'" | escaping(";\\"));

    auto attr = std::tie(res.address, res.command, res.client, res.param, res.value, res.error);

    if (!parse(
            payload.begin(), payload.end(),
            address_ >> ';' >> command_ >> ';' >> client_ >> ';' >> param_ >> ';' >> value_ >> -(';' >> error_),
            attr))
    {
        throw std::runtime_error("ERROR");
    }
}

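Alternative 2: Spirit Qi

This is basically a direct reflection of the X3 grammar, but with some macro abuse to make up for Qi's limitations (you could also "fix" this by repeating code).

Spirit Qi has the advantage of being fully C++03 compatible, and it has been part of stable Boost for nearly a decade: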

void spiritQi(const std::string &payload, results_t &res) {

    using namespace boost::spirit::qi;

#define ESCAPING(set) (('\\' >> char_(set)) | (print - char_(set)))
#define TEXT *ESCAPING(";\\")

    symbols<char, unused_type> cmds;
    cmds += "L", "P", "D", "DN", "R", "W", "LS", "PS", "RS", "LU", "PU", "RU", "LK", "PK", "RK", "F";

    using It = std::string::const_iterator;
    rule<It, std::string()> address_ = TEXT;
    rule<It, std::string()> command_ = raw [ cmds ];
    rule<It, std::string()> client_  = TEXT;
    rule<It, std::string()> param_   = TEXT;
    rule<It, std::string()> value_   = *ESCAPING(";:\\"); // note the ':'
    rule<It, std::string()> error_   = *("'" >> char_(';') >> "'" | ESCAPING(";\\"));

    BOOST_SPIRIT_DEBUG_NODES((address_)(command_)(client_)(param_)(value_)(error_))

#undef TEXT
#undef ESCAPING

    auto attr = std::tie(res.address, res.command, res.client, res.param, res.value, res.error);

    if (!parse(
            payload.begin(), payload.end(),
            address_ >> ';' >> command_ >> ';' >> client_ >> ';' >> param_ >> ';' >> value_ >> -(';' >> error_),
            attr))
    {
        throw std::runtime_error("ERROR");
    }
}

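Alternative 3: Hand-written parsing code

This code takes no dependencies and is completely standard C++.

Of course, as you can see, it requires more coding.

We made it "self-contained" by using C++14 lambdas, but it is "easy" to write equivalent parsing code in C++03, which should give the same performance after optimization.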

void manual(const std::string &payload, results_t &res) {

    using It = std::string::const_iterator;
    It it = payload.begin();
    It const end = payload.end();

    // Appends one field to `into`, resolving backslash escapes. Stops at an
    // unescaped member of escape_set or at a non-printable character.
    // `specials` gets the first chance to consume input at each position.
    auto consume = [&](char const* escape_set, std::string& into, auto&& specials) {
        while (it != end)
            if (!specials(into)) switch (*it) {
                case '\\':
                    if (++it != end && strchr(escape_set, *it))
                        into += *it++;
                    else
                        throw "invalid escape";
                    break;
                default:
                    // cast: passing a negative char to isprint is undefined
                    if (isprint(static_cast<unsigned char>(*it)) && !strchr(escape_set, *it))
                        into += *it++;
                    else
                        return true;
            }
        return true;
    };

    auto escaping = [&](char const* escape_set, std::string& into) {
        return consume(escape_set, into, [](std::string&) { return false; });
    };

    // Consumes the literal `what` if it is next in the input; otherwise
    // leaves the position unchanged and returns false.
    auto matched = [&](char const* what) {
        auto saved = it;
        auto wit = what;
        while (*wit) {
            if (it != end && *wit == *it)
                { ++wit; ++it; }
            else {
                it = saved;
                // throw "expected: '" + std::string(what);
                return false;
            }
        }

        return true;
    };

    auto expect = [&](char const* what) {
        if (!matched(what))
            throw "expected: '" + std::string(what);
        return true;
    };

    auto cmd = [&](std::string& into) {
        // Longer commands come before their prefixes: with "D" listed first,
        // matched("D") would also accept the start of "DN".
        static const char *const cmds[] = { "DN", "D", "F", "LK", "LS", "LU", "L", "PK", "PS", "PU", "P", "RK", "RS", "RU", "R", "W" };
        for (auto cmd : cmds)
            if (matched(cmd)) {
                into.assign(cmd);
                return true;
            }
        return false;
    };

    bool ok = escaping(";\\", res.address) && expect(";")
           && cmd(res.command)             && expect(";")
           && escaping(";\\", res.client)  && expect(";")
           && escaping(";\\", res.param)   && expect(";")
           && escaping(":;\\", res.value);

    // In the error field, ';' (a semicolon in single quotes) stands for a literal semicolon.
    auto squoted_semicolon = [&](std::string& into) {
        if (!matched("';'"))
            return false;
        into += ';';
        return true;
    };

    // The error field is optional.
    ok &= (it==end) || (expect(";") && consume(";\\", res.error, squoted_semicolon));

    if (!ok)
        throw std::runtime_error("ERROR");
}
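One thing to watch in a hand-written keyword matcher is that several commands are prefixes of others (`D` and `DN`, `L` and `LS`, and so on), so the matcher should prefer the longest command. A standalone sketch of that idea; `longest_command` is a hypothetical helper and is not part of the benchmarked code:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the longest known command that starts at
// `pos` in `input`, or an empty string if none matches.
// Assumes pos <= input.size().
std::string longest_command(const std::string &input, std::size_t pos) {
    static const char *const cmds[] = { "D", "DN", "F", "L", "LK", "LS", "LU", "P",
                                        "PK", "PS", "PU", "R", "RK", "RS", "RU", "W" };
    std::string best;
    for (const char *cmd : cmds) {
        const std::string candidate(cmd);
        // Keep the candidate only if it is longer than the best so far
        // and the input at `pos` starts with it.
        if (candidate.size() > best.size() &&
            input.compare(pos, candidate.size(), candidate) == 0)
            best = candidate;
    }
    return best;
}
```

The Spirit `symbols` parser already behaves this way, which is why the order of the entries does not matter in the two Spirit versions.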

Sample output

Output for clang 3.6 configured with libc++:

---- parsed with regex:
address: zzz\\bbbb
command: L
client: babaa
param: bubu\;cc
value: vvvv
error: asd';'ff
---- parsed with manual parser (note: unescaping taken care of):
address: zzz\bbbb
command: L
client: babaa
param: bubu;cc
value: vvvv
error: asd;ff
---- parsed with spirit Qi (note: unescaping taken care of):
address: zzz\bbbb
command: L
client: babaa
param: bubu;cc
value: vvvv
error: asd;ff
clock resolution: mean is 16.9379 ns (40960002 iterations)

benchmarking std_regex
collecting 100 samples, 1 iterations each, in estimated 4.968 ms
mean: 15.2716 μs, lb 14.8763 μs, ub 16.1072 μs, ci 0.95
std dev: 2.81028 μs, lb 1668.21 ns, ub 5.63468 μs, ci 0.95
found 1 outliers among 100 samples (1%)
variance is severely inflated by outliers

benchmarking spirit Qi
collecting 100 samples, 7 iterations each, in estimated 1780.1 μs
mean: 2.15209 μs, lb 2.06754 μs, ub 2.22874 μs, ci 0.95
std dev: 412.372 ns, lb 369.921 ns, ub 453.462 ns, ci 0.95
found 0 outliers among 100 samples (0%)
variance is severely inflated by outliers

benchmarking manual
collecting 100 samples, 37 iterations each, in estimated 1705.7 μs
mean: 451.902 ns, lb 448.665 ns, ub 459.504 ns, ci 0.95
std dev: 23.7123 ns, lb 7.42683 ns, ub 41.7546 ns, ci 0.95
found 2 outliers among 100 samples (2%)
variance is severely inflated by outliers
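The figures above appear to come from a benchmarking harness that also reports confidence intervals and outliers. For a quick sanity check without such a tool, a mean per-call time can be measured with `std::chrono::steady_clock`. This is only a rough sketch under that assumption; `mean_ns_per_call` is a hypothetical name, and it does none of the statistical analysis shown above.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: mean wall-clock time per call of `f`, in nanoseconds.
// `iterations` must be greater than zero.
template <typename F>
double mean_ns_per_call(F &&f, std::size_t iterations) {
    using clock = std::chrono::steady_clock; // monotonic, unlike system_clock
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
        f();
    const auto stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

A call such as `mean_ns_per_call([&] { manual(str, res); }, 100000)` would give a number comparable in spirit to the means above, provided the optimizer is not allowed to discard the parsed result.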

Source: this question and answer originally appeared on Stack Overflow: https://stackoverflow.com/questions/33477982/
