gpt4 book ai didi

performance - 如何提高 BigQuery 中 GeoIP 查询的性能?

转载 作者:行者123 更新时间:2023-12-04 01:39:53 26 4
gpt4 key购买 nike

我已经在 BigQuery 中加载了我的应用程序日志,我需要根据这些日志中的 IP 地址计算国家/地区。

我在我的表和从 MaxMind 下载的 GeoIP 映射表之间编写了一个连接查询。 .

理想的查询是 OUTER JOIN使用范围过滤器,但是 BQ仅支持 =在加入条件。
所以查询执行 INNER JOIN并处理 JOIN 两边的缺失值.

我已经修改了我的原始查询,以便它可以在 Wikipedia 公共(public)数据集上运行。

有人可以帮我让它跑得更快吗?

SELECT id, client_ip, client_ip_code, B.Country_Name as Country_Name

FROM
(SELECT id, contributor_ip as client_ip, INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code, 1 AS One
FROM [publicdata:samples.wikipedia] Limit 1000) AS A1

JOIN
(SELECT From_IP_Code, To_IP_Code, Country_Name, 1 AS One
FROM

-- 3 IP sets: 1.valid ranges, 2.Gaps, 3. Gap at the end of the set
-- all Ranges of valid IPs:
(SELECT From_IP_Code, To_IP_Code, Country_Name FROM [QA_DATASET.GeoIP])

-- Missing rages lower from From_IP
,(SELECT
PriorRangeEndIP + 1 From_IP_Code,
From_IP_Code - 1 AS To_IP_Code,
'NA' AS Country_Name
FROM

-- use of LAG function to find prior valid range
(SELECT
From_IP_Code,
To_IP_Code, Country_Name,
LAG(To_IP_Code, 1, INTEGER(0))
OVER(ORDER BY From_IP_Code asc) PriorRangeEndIP
FROM [QA_DATASET.GeoIP]) A

-- If gap from prior valid range is > 1 than its a gap to fill
WHERE From_IP_Code > PriorRangeEndIP + 1)

-- Missing rages higher tan Max To_IP
,(SELECT MAX(To_IP_Code) + 1 as From_IP_Code, INTEGER(4311810304) as To_IP_Code, 'NA' AS Country_Name
FROM [QA_DATASET.GeoIP])
) AS B
ON A1.ONE = B.ONE -- fake join condition to overcome allowed use of only = in joins

-- Join condition where valid IP exists on left
WHERE
A1.client_ip_code >= B.From_IP_Code
AND A1.client_ip_code <= B.To_IP_Code
OR (A1.client_ip_code IS NULL
AND B.From_IP_Code = 1) -- where there is no valid IP on left contributor_ip

最佳答案

灵感来自 https://medium.com/@hoffa/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2我想出了这个处理 ipv6 地址的解决方案:

WITH test_data AS (
SELECT '2a02:2f0c:570c:fe00:1db7:21c4:21fa:f89' AS ip UNION ALL
SELECT '79.114.150.111' AS ip
)
-- replace the input_data with your data
, ipv4 AS (
SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
FROM test_data
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
), ipv4d AS (
SELECT ip, city_name, country_name, latitude, longitude
FROM (
SELECT ip, ip_bytes & NET.IP_NET_MASK(4, mask) network_bin, mask
FROM ipv4, UNNEST(GENERATE_ARRAY(8,32)) mask
)
JOIN `demo_bq_dataset.geoip_city_v4`
USING (network_bin, mask)
), ipv6 AS (
SELECT DISTINCT ip, NET.SAFE_IP_FROM_STRING(ip) AS ip_bytes
FROM test_data
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 16
), ipv6d AS (
SELECT ip, city_name, country_name, latitude, longitude
FROM (
SELECT ip, ip_bytes & NET.IP_NET_MASK(16, mask) network_bin, mask
FROM ipv6, UNNEST(GENERATE_ARRAY(19,64)) mask
)
JOIN `demo_bq_dataset.geoip_city_v6`
USING (network_bin, mask)
)
SELECT * FROM ipv4d
UNION ALL
SELECT * FROM ipv6d
为了得到 geoip_city_v4geoip_city_v6您需要从 https://maxmind.com/ 下载 geoip 数据库
您可以按照本教程更新和准备数据集 https://hodo.dev/posts/post37-gcp-bigquery-geoip/ .

关于performance - 如何提高 BigQuery 中 GeoIP 查询的性能?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/19618105/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com