
I have a StringType() column in a PySpark DataFrame. I want to extract all instances of a regex pattern from that string and put them into a new column of type ArrayType(StringType()).

Suppose the regex pattern is [a-z]*([0-9]*)

Input df:

stringValue
+----------+
a1234bc123
av1tb12h18
abcd

Output df:

stringValue    output
+-----------+-------------------+
a1234bc123     ['1234', '123']
av1tb12h18     ['1', '12', '18']
abcd           []

3 Answers


In Spark 3.1+, regexp_extract_all is available.

regexp_extract_all(str, regexp[, idx]) - Extract all strings in str that match the regexp expression and correspond to the regex group index.

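from pyspark.sql import functions as F

# Group 1 captures each run of digits; note [0-9]+ (one or more) rather than [0-9]*
df = df.withColumn('output', F.expr("regexp_extract_all(stringValue, '[a-z]*([0-9]+)', 1)"))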

df.show()
#+-----------+-----------+
#|stringValue|     output|
#+-----------+-----------+
#| a1234bc123|[1234, 123]|
#| av1tb12h18|[1, 12, 18]|
#|       abcd|         []|
#+-----------+-----------+
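
The pattern in this answer uses [0-9]+ where the question has [0-9]*. The following is a minimal plain-Python sketch of why that may matter. It uses the standard re module rather than Spark (Spark uses Java regular expressions, so edge cases can differ), and extract_all is a hypothetical helper name, not a Spark function:

```python
import re

def extract_all(pattern, text, idx=1):
    """Hypothetical stand-in for regexp_extract_all: group `idx` of every match."""
    return [m.group(idx) for m in re.finditer(pattern, text)]

for s in ["a1234bc123", "av1tb12h18", "abcd"]:
    plus = extract_all(r"[a-z]*([0-9]+)", s)   # digits required in each match
    star = extract_all(r"[a-z]*([0-9]*)", s)   # digits optional, so empty captures are possible
    print(s, plus, star)
```

With the * quantifier the pattern can match text that contains no digits at all, so empty strings can show up in the result and would need to be filtered out afterwards.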
Answered 2021-03-24T10:33:20.323

Try using split and array_remove from functions in Spark:

  1. Create a test DataFrame
from pyspark.sql import functions as F
df = spark.createDataFrame([("a1234bc123",), ("av1tb12h18",), ("abcd",)],["stringValue"])
df.show()

Original DataFrame:

+-----------+
|stringValue|
+-----------+
| a1234bc123|
| av1tb12h18|
|       abcd|
+-----------+
  2. Use split on its own to break the string apart at every letter, leaving the numbers
df = df.withColumn("mid", F.split('stringValue', r'[a-zA-Z]'))
df.show()

Output:

+-----------+-----------------+
|stringValue|              mid|
+-----------+-----------------+
| a1234bc123|  [, 1234, , 123]|
| av1tb12h18|[, , 1, , 12, 18]|
|       abcd|       [, , , , ]|
+-----------+-----------------+
  3. Finally, use array_remove to drop the empty (non-numeric) elements
df = df.withColumn("output", F.array_remove('mid', ''))
df.show()

Final output:

+-----------+-----------------+-----------+
|stringValue|              mid|     output|
+-----------+-----------------+-----------+
| a1234bc123|  [, 1234, , 123]|[1234, 123]|
| av1tb12h18|[, , 1, , 12, 18]|[1, 12, 18]|
|       abcd|       [, , , , ]|         []|
+-----------+-----------------+-----------+

Answered 2019-08-26T20:31:39.510

You can use a combination of regexp_replace and split from the functions module

import pyspark.sql.types as t
import pyspark.sql.functions as f

l1 = [('anystring',),('a1234bc123',),('av1tb12h18',)]
df = spark.createDataFrame(l1).toDF('col')
df.show()
+----------+
|       col|
+----------+
| anystring|
|a1234bc123|
|av1tb12h18|
+----------+

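Now replace each regex match and then split on ",". Here $1 refers to capture group 1 (the digits) of the match, so it is empty when the match contains no digits.

e.g. replace('anystring')
$0 = anystring
$1 = ""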

dfl1 = df.withColumn('temp', f.split(f.regexp_replace("col", "[a-z]*([0-9]*)", "$1,"), ","))

dfl1.show()
+----------+---------------+
|       col|           temp|
+----------+---------------+
| anystring|         [, , ]|
|a1234bc123|[1234, 123, , ]|
|av1tb12h18|[1, 12, 18, , ]|
+----------+---------------+
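
The replace-then-split step above can be approximated in plain Python to make the role of the backreference visible. This is a sketch with the standard re module (Python 3.7+ assumed for its handling of empty matches), not Spark, and replace_then_split is a hypothetical name:

```python
import re

def replace_then_split(text):
    # \1 is Python's spelling of the $1 backreference used in the Spark call:
    # every match is replaced by its captured digits followed by a comma.
    joined = re.sub(r"[a-z]*([0-9]*)", r"\1,", text)
    # Splitting on the commas leaves the digit runs plus some empty strings.
    return joined.split(",")

print(replace_then_split("a1234bc123"))
print(replace_then_split("anystring"))
```

The trailing empty strings come from matches that captured no digits, which is why a clean-up step is still needed afterwards.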

Spark < 2.4

Use a UDF to remove the empty values from the array

def func_drop_from_array(arr):
    return [x for x in arr if x != '']

drop_from_array = f.udf(func_drop_from_array, t.ArrayType(t.StringType()))

dfl1.withColumn('final', drop_from_array('temp')).show()
+----------+---------------+-----------+
|       col|           temp|      final|
+----------+---------------+-----------+
| anystring|         [, , ]|         []|
|a1234bc123|[1234, 123, , ]|[1234, 123]|
|av1tb12h18|[1, 12, 18, , ]|[1, 12, 18]|
+----------+---------------+-----------+
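
If the temp column can be null (an assumption; it is never null in the example data above), the Python function receives None and the list comprehension would raise a TypeError. A minimal None-tolerant sketch of the same filter, under the hypothetical name drop_empty_strings:

```python
def drop_empty_strings(arr):
    # Assumption: a null array column value arrives in the Python UDF as None.
    if arr is None:
        return None
    # Same filter as func_drop_from_array: keep everything except empty strings.
    return [x for x in arr if x != ""]

print(drop_empty_strings(["1234", "123", "", ""]))
print(drop_empty_strings(None))
```

It can be wrapped with f.udf(drop_empty_strings, t.ArrayType(t.StringType())) in the same way as func_drop_from_array.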

Spark >= 2.4

Use array_remove

dfl1.withColumn('final', f.array_remove('temp','')).show()

+----------+---------------+-----------+
|       col|           temp|      final|
+----------+---------------+-----------+
| anystring|         [, , ]|         []|
|a1234bc123|[1234, 123, , ]|[1234, 123]|
|av1tb12h18|[1, 12, 18, , ]|[1, 12, 18]|
+----------+---------------+-----------+
Answered 2019-08-26T21:32:07.630