I have a JSON file where, unfortunately, every line is prefixed with some unwanted text:
2019-07-02T22:53:16.848Z LOGFILE {"key":{"host":"example1.net","srcIP":"1.0.0.0","dstIp":"2.0.0.0"},"count":4,"last_seen":"2019-07-02T22:48:15.362Z"}
2019-07-02T22:53:16.937Z LOGFILE {"key":{"host":"example2.net","srcIP":"1.0.0.1","dstIp":"2.0.0.1"},"count":2,"last_seen":"2019-07-02T22:53:07.018Z"}
...
I want to load this file like so:
from pyspark.sql import SparkSession, SQLContext
spark = SparkSession \
.builder \
.appName("LogParser") \
.getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
df = sqlContext.read.json('log_sample.json')
But I need a way to strip the unwanted leading text, e.g. 2019-07-02T22:53:16.848Z LOGFILE, so that each line becomes valid JSON first. Can you explain how to apply a regex before I call sqlContext.read.json()? Otherwise it complains that every line is a _corrupt_record. Many thanks!
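One common approach (a sketch, not a definitive answer) is to load the file as plain text, strip the prefix from each line with a regex, and only then hand the cleaned strings to the JSON reader. Since the timestamp and the word LOGFILE never contain a brace, simply dropping everything before the first { is both simple and robust; the sample line below is copied from the log excerpt above:

```python
import re

# Assumption: the unwanted prefix never contains '{', so we can delete
# everything up to the first opening brace of the JSON object.
prefix = re.compile(r'^[^{]*')

line = ('2019-07-02T22:53:16.848Z LOGFILE '
        '{"key":{"host":"example1.net","srcIP":"1.0.0.0","dstIp":"2.0.0.0"},'
        '"count":4,"last_seen":"2019-07-02T22:48:15.362Z"}')

# Remove the prefix; the anchored pattern matches only at the start of the line.
cleaned = prefix.sub('', line, count=1)
print(cleaned[:7])  # -> {"key":
```

In PySpark you would apply the same substitution per line and pass the result to the JSON reader, e.g. df = spark.read.json(sc.textFile('log_sample.json').map(lambda l: prefix.sub('', l, count=1))) — spark.read.json accepts an RDD of JSON strings in Spark 2.x, which avoids writing an intermediate cleaned file.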