python - Speeding up a Django database function for geographic interpolation of missing values

Reposted by: 行者123. Updated: 2023-11-29 13:15:34

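I have a large database of commercial property addresses (about 5 million rows), of which 200,000 rows are missing the floor area. The properties are classified by industry, and I know the rent for each one.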

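My approach to interpolating the missing floor areas is to filter for similarly classified properties within a specified radius of the property whose floor area is unknown, and then calculate the floor area from the median cost per square metre of those nearby properties.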
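The approach described above can be sketched in plain Python, independent of pandas or Django. This is a minimal illustration only: the dictionary keys, the `haversine_m` helper, and the sample coordinates are assumptions made for the example, not part of the original code.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import median

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def estimate_floor_area(target, neighbours, radius_m):
    """Estimate the target's floor area from the median rent per m2 nearby.

    Only neighbours in the same category, with a positive floor area and
    within radius_m of the target, are used. Returns None if there are none.
    """
    ratios = [
        n["rental_valuation"] / n["floor_area"]
        for n in neighbours
        if n["category"] == target["category"]
        and n["floor_area"] and n["floor_area"] > 0
        and haversine_m(target["lat"], target["lon"], n["lat"], n["lon"]) <= radius_m
    ]
    if not ratios:
        return None
    return target["rental_valuation"] / median(ratios)


# Hypothetical sample data: three retail neighbours and one office.
neighbours = [
    {"category": "retail", "lat": 51.5000, "lon": -0.1000, "rental_valuation": 20000.0, "floor_area": 100.0},
    {"category": "retail", "lat": 51.5010, "lon": -0.1010, "rental_valuation": 30000.0, "floor_area": 100.0},
    {"category": "retail", "lat": 51.4990, "lon": -0.0990, "rental_valuation": 25000.0, "floor_area": 100.0},
    {"category": "office", "lat": 51.5000, "lon": -0.1000, "rental_valuation": 90000.0, "floor_area": 100.0},
]
target = {"category": "retail", "lat": 51.5005, "lon": -0.1005, "rental_valuation": 50000.0}
estimated = estimate_floor_area(target, neighbours, radius_m=1000)
```

Here the three retail neighbours give rents of 200, 300 and 250 per m2, so the median is 250 and the estimated floor area is 50000 / 250 = 200 m2.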

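Originally I solved this with pandas, but that has become a problem as the dataset has grown (even when using group_by). It frequently exceeds the available memory and then stops. When it does work, it takes about 3 hours to complete.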

I am testing whether the same task can be done in the database. The function I wrote for the radial fill is as follows:

def _radial_fill(self):
    # Initial query selecting all latest locations, and excluding null rental valuations
    q = Location.objects.order_by("locode","-update_cycle") \
                        .distinct("locode")
    # Chained Q objects to use in filter
    f = Q(rental_valuation__isnull=False) & \
        Q(use_category__grouped_by__isnull=False) & \
        Q(pc__isnull=False)
    # All property categories at subgroup level
    for c in LocationCategory.objects.filter(use_category="SGP").all():
        # Start looking for appropriate interpolation locations
        fc = f & Q(use_category__grouped_by=c)
        for l in q.filter(fc & Q(floor_area__isnull=True)).all():
            r_degree = 0
            while True:
                # Default Distance is metres, so multiply accordingly
                r = (constants.BOUNDS**r_degree)*1000 # metres
                ql = q.annotate(distance=Distance("pc__point", l.pc.point)) \
                      .filter(fc & Q(floor_area__isnull=False) & Q(distance__lte=r)) \
                      .values("rental_valuation", "floor_area")
                if len(ql) < constants.LOWER_RANGE:
                    if r > constants.UPPER_RADIUS*1000:
                        # Further than the longest possible distance
                        break
                    r_degree += 1
                else:
                    m = median([x["rental_valuation"]/x["floor_area"]
                                for x in ql if x["floor_area"] > 0.0])
                    l.floor_area = l.rental_valuation / m
                    l.save()
                    break

My problem is that this function takes 6 days to run. There has to be a faster way, right? I'm sure I'm doing something wrong...
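A rough way to see where the time goes is to count queries: every row with a missing floor area triggers one database query per radius tried. The sketch below reproduces only the radius schedule of the loop above. The values passed for `constants.BOUNDS` and `constants.UPPER_RADIUS` are hypothetical, since the question does not state them.

```python
def search_radii(bounds, upper_radius_km):
    """Yield the radii (in metres) that the expanding search would try.

    Mirrors r = (BOUNDS ** r_degree) * 1000: the loop stops only after a
    radius beyond the upper limit has been tried. Assumes bounds > 1,
    otherwise the radius never grows.
    """
    degree = 0
    while True:
        radius_m = (bounds ** degree) * 1000
        yield radius_m
        if radius_m > upper_radius_km * 1000:
            return
        degree += 1


# Hypothetical constants, for illustration only.
radii = list(search_radii(bounds=2, upper_radius_km=10))
worst_case_queries = 200_000 * len(radii)  # rows missing floor area x radii tried
```

With these made-up values, a property in a sparse area costs five queries, each of which annotates and filters distances over the whole candidate set. That per-row, per-radius cost may go a long way towards explaining a multi-day run time.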

The models are as follows:

class LocationCategory(models.Model):
    # Category types
    GRP = "GRP"
    SGP = "SGP"
    UST = "UST"
    CATEGORIES = (
        (GRP, "Group"),
        (SGP, "Sub-group"),
        (UST, "Use type"),
    )
    slug = models.CharField(max_length=24, primary_key=True, unique=True)
    usecode = models.CharField(max_length=14, db_index=True)
    use_category = models.CharField(max_length=3, choices=CATEGORIES,
                                    db_index=True, default=UST)
    grouped_by = models.ForeignKey("self", null=True, blank=True,
                                   on_delete=models.SET_NULL,
                                   related_name="category_by_group")

class Location(models.Model):
    # Hereditament identity and location
    slug = models.CharField(max_length=24, db_index=True)
    locode = models.CharField(max_length=14, db_index=True)
    pc = models.ForeignKey("Postcode", null=True, blank=True,
                           on_delete=models.SET_NULL,
                           related_name="locations_by_pc")
    use_category = models.ForeignKey(LocationCategory, null=True, blank=True,
                                     on_delete=models.SET_NULL,
                                     related_name="locations_by_category")
    # History fields
    update_cycle = models.CharField(max_length=14, db_index=True)
    # Location-specific econometric data
    floor_area = models.FloatField(blank=True, null=True)
    rental_valuation = models.FloatField(blank=True, null=True)

class Postcode(models.Model):
    pc = models.CharField(max_length=7, primary_key=True, unique=True) # Postcode excl space
    pcs = models.CharField(max_length=8, unique=True)                  # Postcode incl space
    # http://spatialreference.org/ref/epsg/osgb-1936-british-national-grid/
    point = models.PointField(srid=4326)

Using Django 2.0 and PostgreSQL 10.

Update

With the following code changes, I have improved the run time by 35%:

# Initial query selecting all latest locations, and excluding null rental valuations
q = Location.objects.order_by("slug","-update_cycle") \
                    .distinct("slug")
# Chained Q objects to use in filter
f = Q(rental_valuation__isnull=False) & \
    Q(pc__isnull=False) & \
    Q(use_category__grouped_by_id=category_id)
# All property categories at subgroup level
# Start looking for appropriate interpolation locations
for l in q.filter(f & Q(floor_area__isnull=True)).all().iterator():
    r = q.filter(f & Q(floor_area__isnull=False) & ~Q(floor_area=0.0))
    rl = Location.objects.filter(id__in = r).annotate(distance=D("pc__point", l.pc.point)) \
                                            .order_by("distance")[:constants.LOWER_RANGE] \
                                            .annotate(floor_ratio = F("rental_valuation")/
                                                                    F("floor_area")) \
                                            .values("floor_ratio")
    if len(rl) == constants.LOWER_RANGE:
        m = median([h["floor_ratio"] for h in rl])
        l.floor_area = l.rental_valuation / m
        l.save()

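The id__in=r filter is inefficient, but it seems to be the only way to keep the distinct queryset while adding a new annotation and ordering by it. Given that the r query can return around 100,000 rows, any annotation applied there, and the subsequent ordering by distance, can take a very long time.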
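The updated code replaces the expanding radius with a "k nearest neighbours" selection: order the usable candidates by distance, keep the closest `constants.LOWER_RANGE`, and take the median of their rent per m2. A minimal in-memory sketch of that logic follows. It assumes projected `x`/`y` coordinates in metres and hypothetical dictionary keys, and it uses `heapq.nsmallest` instead of a full sort, so it illustrates the selection step rather than the database query itself.

```python
import heapq
from math import hypot
from statistics import median


def knn_floor_area(target, candidates, k):
    """Estimate floor area from the median rent per m2 of the k nearest candidates.

    Candidates with a missing or zero floor area are ignored. Returns None
    when fewer than k usable candidates exist, mirroring the
    len(rl) == constants.LOWER_RANGE check.
    """
    usable = [c for c in candidates if c["floor_area"]]  # drops None and 0.0
    if len(usable) < k:
        return None
    nearest = heapq.nsmallest(
        k,
        usable,
        key=lambda c: hypot(c["x"] - target["x"], c["y"] - target["y"]),
    )
    ratios = [c["rental_valuation"] / c["floor_area"] for c in nearest]
    return target["rental_valuation"] / median(ratios)


# Hypothetical sample data.
candidates = [
    {"x": 10.0, "y": 0.0, "rental_valuation": 100.0, "floor_area": 10.0},
    {"x": 0.0, "y": 20.0, "rental_valuation": 400.0, "floor_area": 20.0},
    {"x": 30.0, "y": 0.0, "rental_valuation": 900.0, "floor_area": 30.0},
    {"x": 1000.0, "y": 1000.0, "rental_valuation": 5000.0, "floor_area": 10.0},
    {"x": 5.0, "y": 5.0, "rental_valuation": 100.0, "floor_area": 0.0},
]
target = {"x": 0.0, "y": 0.0, "rental_valuation": 1000.0}
estimate = knn_floor_area(target, candidates, k=3)
```

The three nearest usable candidates have ratios 10, 20 and 30, so the median is 20 and the estimate is 1000 / 20 = 50 m2. The distant outlier and the zero-area row do not affect the result.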

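However, I ran into a lot of problems when trying to implement the Subquery functionality: AttributeError: 'ResolvedOuterRef' object has no attribute '_output_field_or_none'. I think it has something to do with the annotation, but I can't find much information about it.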

The relevant refactored code is:

rl = Location.objects.filter(id__in = r).annotate(distance=D("pc__point", OuterRef('pc__point'))) \
                                        .order_by("distance")[:constants.LOWER_RANGE] \
                                        .annotate(floor_ratio = F("rental_valuation")/
                                                                F("floor_area")) \
                                        .distinct("floor_ratio")

and:

l.update(floor_area= F("rental_valuation") / CustomAVG(Subquery(locs),0))

I can see that this approach should be very efficient, but getting it right seems to be well beyond my skill level.

Best answer

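You can simplify your approach by using Django's built-in query methods, which are (mostly) optimized. More specifically, we will use:

Subquery and OuterRef (Django version >= 1.11).
annotate and AVG from Django aggregation.
The dwithin lookup.
F() expressions (a detailed use case for F() can be found in my Q&A-style example: How to execute arithmetic operations between Model fields in django).

We will create a custom aggregate class to apply our AVG function (the method is inspired by this excellent answer: Django 1.11 Annotating a Subquery Aggregate):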

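from django.db.models import FloatField, Subquery

class CustomAVG(Subquery):
    # PostgreSQL 10 requires an alias for a subquery in the FROM clause
    template = "(SELECT AVG(area_value) FROM (%(subquery)s) AS _sub)"
    output_field = FloatField()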

We will use it to compute the following average:

for location in Location.objects.filter(rental_valuation__isnull=True):
    location.update(
        rental_valuation=CustomAVG(
            Subquery(
                Location.objects.filter(
                    pc__point__dwithin=(OuterRef('pc__point'), D(m=1000)),
                    rental_valuation__isnull=False
                ).annotate(area_value=F('rental_valuation')/F('floor_area'))
                .distinct('area_value')
            )
        )
    )

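Breakdown of the above:

We collect all the Location objects that have no rental_valuation and iterate over the list.
Subquery: we select the Location objects that lie within a circle of radius 1000 m (change this as you wish) centred on the current location's point, and annotate them with the cost/m2 calculation (using F() to take the rental_valuation and floor_area column values of each object) as a column named area_value. For more accurate results, we select only the distinct values of this column.
We apply CustomAVG to the Subquery and update the current location's rental_valuation.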
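Read as plain Python, the annotated subquery plus CustomAVG compute roughly the following for one location. This is a sketch under stated assumptions: the input rows are taken to be already restricted to the 1000 m radius, the function name is hypothetical, and rows with a missing rent or a missing or zero floor area are skipped here, which the SQL version does not do explicitly.

```python
def avg_distinct_area_value(rows):
    """Average of the distinct rent-per-m2 values in rows.

    rows: (rental_valuation, floor_area) pairs for the neighbours that the
    dwithin filter would have kept. Returns None if no value is usable.
    """
    values = {rent / area for rent, area in rows if rent is not None and area}
    return sum(values) / len(values) if values else None


# Hypothetical neighbours: two share the same ratio (20.0), one has no rent.
rows = [(200.0, 10.0), (400.0, 20.0), (900.0, 30.0), (None, 10.0)]
avg_value = avg_distinct_area_value(rows)
```

Because duplicates are collapsed before averaging, the result is (20 + 30) / 2 = 25 rather than the plain mean of about 23.3. Note also that this is a mean, whereas the question uses the median. To fill in a missing floor area as the question asks, one would presumably divide the location's rental_valuation by this value.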

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/49570712/
