data-warehouse - Data warehouse: Working with accumulated data

Reposted by: 行者123 | Updated: 2023-12-04 06:46:21

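Our data warehouse uses accumulated data from a data source (and there is no way to invert the accumulation) to build a snowflake schema. One requirement we have to meet is that the schema must be usable for reports based on date ranges.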

Our schema looks like this (simplified):

+------------------------------------------+
| fact                                     |
+-------+-----------------+----------------+
|    id | statisticsDimId | dateRangeDimId |
+-------+-----------------+----------------+
|     1 |               1 |             10 |
|     2 |               2 |             11 |
|     3 |               3 |             12 |
|     4 |               4 |             13 |
|     5 |               5 |             14 |
|     6 |               5 |             15 |
|     7 |               5 |             16 |
|   ... |             ... |            ... |
| 10001 |            9908 |             11 |
| 10002 |            9909 |             11 |
+-------+-----------------+----------------+

+-------------------------------------------------+
| date_range_dimension                            |
+-------+--------------------+--------------------+
|    id | startDateTime      | endDateTime        |
+-------+--------------------+--------------------+
|    10 | '2012-01-01 00:00' | '2012-01-01 23:59' |
|    11 | '2012-01-01 00:00' | '2012-01-02 23:59' |
|    12 | '2012-01-01 00:00' | '2012-01-03 23:59' |
|    13 | '2012-01-01 00:00' | '2012-01-04 23:59' |
|    14 | '2012-01-01 00:00' | '2012-01-05 23:59' |
|    15 | '2012-01-01 00:00' | '2012-01-06 23:59' |
|    16 | '2012-01-01 00:00' | '2012-01-07 23:59' |
|    17 | '2012-01-01 00:00' | '2012-01-08 23:59' |
|    18 | '2012-01-01 00:00' | '2012-01-09 23:59' |
|   ... |                ... |                ... |
+-------+--------------------+--------------------+

+-----------------------------------------------------+
| statistics_dimension                                |
+-------+-------------------+-------------------+-----+
|    id | accumulatedValue1 | accumulatedValue2 | ... |
+-------+-------------------+-------------------+-----+
|     1 |    [not relevant] |    [not relevant] | ... |
|     2 |    [not relevant] |    [not relevant] | ... |
|     3 |    [not relevant] |    [not relevant] | ... |
|     4 |    [not relevant] |    [not relevant] | ... |
|     5 |    [not relevant] |    [not relevant] | ... |
|     6 |    [not relevant] |    [not relevant] | ... |
|     7 |    [not relevant] |    [not relevant] | ... |
|   ... |    [not relevant] |    [not relevant] | ... |
|   ... |    [not relevant] |    [not relevant] | ... |
| 10001 |    [not relevant] |    [not relevant] | ... |
| 10002 |    [not relevant] |    [not relevant] | ... |
+-------+-------------------+-------------------+-----+

We want to build our report data set with something like this:

-- [start] and [end] are placeholders for the report's date range
SELECT *
FROM fact
INNER JOIN statistics_dimension
    ON (fact.statisticsDimId = statistics_dimension.id)
INNER JOIN date_range_dimension
    ON (fact.dateRangeDimId = date_range_dimension.id)
WHERE
    date_range_dimension.startDateTime = [start]
AND
    date_range_dimension.endDateTime = [end]
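The problem is that the data in our statistics dimension is already accumulated, and we cannot invert the accumulation. We estimated the number of rows in our fact table and got 5,250,137,022,180. Our data has about 2.5 million date range permutations, and because of the accumulation we need to precompute all of them into our date dimension and fact table. For the same reason, SQL's SUM function does not work for us (you cannot add two values that belong to non-distinct sets).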
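
A rough sketch of the arithmetic behind these two points, in Python. It assumes that a "date range permutation" means any (start, end) pair at day granularity, which the question does not state, and the running totals 3, 5, 9 are invented purely to show the double counting.

```python
def count_date_ranges(num_days: int) -> int:
    """Number of (start, end) pairs with start <= end over num_days days."""
    return num_days * (num_days + 1) // 2


def days_for_ranges(target: int) -> int:
    """Smallest number of days whose range count reaches target."""
    days = 0
    while count_date_ranges(days) < target:
        days += 1
    return days


# One week already yields 28 distinct ranges.
week_ranges = count_date_ranges(7)

# Under the day-granularity assumption, 2.5 million ranges correspond
# to a little over six years of daily data.
days_needed = days_for_ranges(2_500_000)

# Hypothetical running totals reported at the end of day 1, 2 and 3.
cumulative = [3, 5, 9]
naive_sum = sum(cumulative)   # counts day 1 three times and day 2 twice
true_total = cumulative[-1]   # the last snapshot already contains everything

print(week_ranges, days_needed, naive_sum, true_total)
```

The last two values differ (17 versus 9), which is the double counting that rules out SUM over overlapping accumulated values.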

Is there a best practice we could follow to make this computationally feasible? Is there something wrong with our schema design?

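We need to report on online training data. The data source is a legacy data provider, parts of which are more than 10 years old, so nobody can reconstruct its internal logic. The statistics dimension contains, for example, the progress (in %) a user has made in a web-based training (WBT), the number of calls per WBT page, the status of a WBT for a user (e.g. "completed"), and so on. The important point about the data provider is that it only gives us a snapshot of the current state. We have no access to historical data.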

Best answer

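I'm assuming you are running this on some pretty powerful hardware. Your design has one major drawback: the join between the fact table and the "statistics" dimension.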

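Generally, a fact table contains dimension keys and measures. It looks to me like there is probably a 1-1 relationship between your "statistics" dimension and your fact table. Since a fact table is essentially a many-to-many relationship table, it doesn't make sense to keep your statistics in a separate table. Also, you say the statistics table holds information "by user".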

Any time you say "by X" in warehousing, you can almost always be sure that X should be a dimension.

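I would look into building your fact table with the measures directly on it. I'm not sure what you are trying to achieve by "inverting" the accumulation on the statistics table. Do you mean that it is accumulated across date ranges? Across users? If the data isn't atomic, the best you can do is report what you've got...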
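
One possible reading of this advice, sketched with Python's `sqlite3`: the accumulated value sits directly on the fact table as a measure, the user becomes its own dimension, and each load of the provider's current-state snapshot is stored under its load date. The table names `user_dimension` and `snapshot_fact`, the column `snapshotDate`, the helper `state_as_of`, and all values are hypothetical; nothing here is confirmed by the question or the answer.

```python
import sqlite3

# Hypothetical layout: measures on the fact table, user as a dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_dimension (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE snapshot_fact (
    userDimId INTEGER REFERENCES user_dimension(id),
    snapshotDate TEXT,            -- date the snapshot was loaded
    accumulatedValue1 INTEGER,    -- measure stored as delivered
    PRIMARY KEY (userDimId, snapshotDate)
);

INSERT INTO user_dimension VALUES (1, 'alice'), (2, 'bob');
INSERT INTO snapshot_fact VALUES
    (1, '2012-01-01', 10), (1, '2012-01-02', 25), (1, '2012-01-03', 40),
    (2, '2012-01-01', 5),  (2, '2012-01-03', 7);
""")


def state_as_of(end_date):
    """Latest stored snapshot per user on or before end_date."""
    return conn.execute(
        """
        SELECT u.name, f.snapshotDate, f.accumulatedValue1
        FROM snapshot_fact AS f
        INNER JOIN user_dimension AS u ON u.id = f.userDimId
        WHERE f.snapshotDate = (
            SELECT MAX(f2.snapshotDate)
            FROM snapshot_fact AS f2
            WHERE f2.userDimId = f.userDimId
              AND f2.snapshotDate <= ?
        )
        ORDER BY u.name
        """,
        (end_date,),
    ).fetchall()


print(state_as_of("2012-01-02"))
```

With this layout the row count grows with users times load dates rather than with every possible date range, and a report reads the accumulated value as of a given date without summing anything. Whether that satisfies the date-range reporting requirement depends on what the accumulated values actually mean, which the question leaves open.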

Regarding "data-warehouse - Data warehouse: Working with accumulated data", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/13913957/
