
r - Sparklyr/Hive : how to use regex (regexp_replace) correctly?

Reposted. Author: 行者123. Updated: 2023-12-04 09:31:00

Consider the following example:

dataframe_test<- data_frame(mydate = c('2011-03-01T00:00:04.226Z', '2011-03-01T00:00:04.226Z'))

# A tibble: 2 x 1
mydate
<chr>
1 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z

sdf <- copy_to(sc, dataframe_test, overwrite = TRUE)

> sdf
# Source: table<dataframe_test> [?? x 1]
# Database: spark_connection
mydate
<chr>
1 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z

I would like to modify the character timestamp so that it has a more conventional format. I tried to do so using regexp_replace, but it failed.
> sdf <- sdf %>% mutate(regex = regexp_replace(mydate, '(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2}).(\\d{3})Z', '$1-$2-$3 $4:$5:$6.$7'))
> sdf
# Source: lazy query [?? x 2]
# Database: spark_connection
mydate regex
<chr> <chr>
1 2011-03-01T00:00:04.226Z 2011-03-01T00:00:04.226Z
2 2011-03-01T00:00:04.226Z 2011-03-01T00:00:04.226Z

Any ideas? What is the correct syntax?

Best Answer

Spark SQL and Hive provide two different functions:

  • regexp_extract - takes a string, a pattern, and the index of the group to be extracted.
  • regexp_replace - takes a string, a pattern, and a replacement string.

  • The former can be used to extract a single group, with the index semantics being the same as for java.util.regex.Matcher.

    For regexp_replace, the pattern has to match the whole string; if there is no match, the input string is returned:
    sdf %>% mutate(
      regex = regexp_replace(mydate, '^([0-9]{4}).*', "$1"),
      regexp_bad = regexp_replace(mydate, '([0-9]{4})', "$1"))

    ## Source: query [2 x 3]
    ## Database: spark connection master=local[8] app=sparklyr local=TRUE
    ##
    ## # A tibble: 2 x 3
    ## mydate regex regexp_bad
    ## <chr> <chr> <chr>
    ## 1 2011-03-01T00:00:04.226Z 2011 2011-03-01T00:00:04.226Z
    ## 2 2011-03-01T00:00:04.226Z 2011 2011-03-01T00:00:04.226Z

    whereas with regexp_extract it is not needed:
    sdf %>% mutate(regex = regexp_extract(mydate, '([0-9]{4})', 1))

    ## Source: query [2 x 2]
    ## Database: spark connection master=local[8] app=sparklyr local=TRUE
    ##
    ## # A tibble: 2 x 2
    ## mydate regex
    ## <chr> <chr>
    ## 1 2011-03-01T00:00:04.226Z 2011
    ## 2 2011-03-01T00:00:04.226Z 2011
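    To illustrate the Matcher-style group indexing mentioned above, a minimal sketch (it assumes the same `sdf` and a live Spark connection): index 0 returns the whole match, and positive indices return the corresponding capture group.

    ```r
    # Sketch only -- requires a live Spark connection and the sdf defined above.
    # Index 0 is the whole match; 1 and 2 are the capture groups, following
    # java.util.regex.Matcher semantics.
    sdf %>% mutate(
      whole = regexp_extract(mydate, '([0-9]{4})-([0-9]{2})', 0),  # "2011-03"
      year  = regexp_extract(mydate, '([0-9]{4})-([0-9]{2})', 1),  # "2011"
      month = regexp_extract(mydate, '([0-9]{4})-([0-9]{2})', 2))  # "03"
    ```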

    Moreover, due to the indirect execution (R -> Java), you have to escape twice:
    sdf %>% mutate(
      regex = regexp_replace(
        mydate,
        '^(\\\\d{4})-(\\\\d{2})-(\\\\d{2})T(\\\\d{2}):(\\\\d{2}):(\\\\d{2}).(\\\\d{3})Z$',
        '$1-$2-$3 $4:$5:$6.$7'))
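    To see why quadruple backslashes are needed, you can inspect the SQL that dplyr generates before it reaches Spark (a sketch, assuming `show_query()` works with the Spark backend): each `\\\\` in the R source becomes `\\` in the SQL string literal, which the SQL parser in turn reduces to the single `\` the Java regex engine expects.

    ```r
    # Sketch: render the translated SQL instead of executing it, to verify how
    # the escapes collapse at each layer (R source -> SQL literal -> Java regex).
    sdf %>%
      mutate(regex = regexp_replace(
        mydate,
        '^(\\\\d{4})-(\\\\d{2})-(\\\\d{2})T(\\\\d{2}):(\\\\d{2}):(\\\\d{2}).(\\\\d{3})Z$',
        '$1-$2-$3 $4:$5:$6.$7')) %>%
      show_query()
    ```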

    Normally, one would use Spark datetime functions:
    spark_session(sc) %>%
      invoke("sql",
        "SELECT *, DATE_FORMAT(CAST(mydate AS timestamp), 'yyyy-MM-dd HH:mm:ss.SSS') parsed FROM dataframe_test") %>%
      sdf_register


    ## Source: query [2 x 2]
    ## Database: spark connection master=local[8] app=sparklyr local=TRUE
    ##
    ## # A tibble: 2 x 2
    ## mydate parsed
    ## <chr> <chr>
    ## 1 2011-03-01T00:00:04.226Z 2011-03-01 01:00:04.226
    ## 2 2011-03-01T00:00:04.226Z 2011-03-01 01:00:04.226

    Unfortunately, sparklyr seems to be extremely limited in this respect, and treats timestamps as strings.
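    That said, newer sparklyr versions pass functions that dplyr does not recognize straight through to Spark SQL, so the conversion can also be written inside mutate. A sketch; the to_timestamp format string is an assumption for this ISO-8601 input and requires Spark 2.2+:

    ```r
    # Sketch -- relies on sparklyr forwarding unrecognized functions to Spark SQL.
    # to_timestamp parses the ISO-8601 string; date_format renders it back
    # in the desired layout.
    sdf %>% mutate(
      parsed = date_format(
        to_timestamp(mydate, "yyyy-MM-dd'T'HH:mm:ss.SSSX"),
        "yyyy-MM-dd HH:mm:ss.SSS"))
    ```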

    See also: change string in DF using hive command and mutate with sparklyr

    Regarding "r - Sparklyr/Hive: how to use regex (regexp_replace) correctly?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44658852/
