I am having trouble reading data from Elasticsearch into a Spark cluster (I am using a Zeppelin environment, so all connection settings are configured in the Zeppelin interpreter settings).
First, I tried to read it with PySpark:
%pyspark
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
df = spark.read.format("org.elasticsearch.spark.sql").load("index")
df = df.limit(100).drop('tags').drop('a.b')
# If the 'tags' field is not dropped, PySpark cannot map the Scala field and throws an exception.
# If the limit is not set, PySpark will probably try to fetch the whole index at once.
# If 'a.b' is not dropped, the dot in the field name causes a mapping error: https://github.com/elastic/elasticsearch-hadoop/issues/853
df = df.cache()
z.show(df)
Unfortunately, in that case I ran into a lot of mapping problems. Since my dataset has many fields containing dots, I decided to try reading the data in Scala instead (so I could process it in PySpark later):
%spark
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer
import org.elasticsearch.spark._

val conf = new SparkConf()
conf.set("spark.es.mapping.date.rich", "false")
conf.set("spark.serializer", classOf[KryoSerializer].getName)
// Note: in Zeppelin, sc is created by the interpreter before this runs,
// so settings on this new SparkConf are never actually applied to it.

val EsReadRDD = sc.esRDD("index")
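For reference, the connector settings do not have to go through a SparkConf at all; elasticsearch-hadoop also accepts per-read options, including es.read.field.exclude, which pushes the field dropping into the connector instead of doing it after the load. A minimal sketch, assuming the esRDD(resource, cfg) overload and reusing the problematic field names (tags, a.b) from the PySpark attempt above:

// Per-read connector options; es.read.field.exclude drops the
// problematic fields before Spark ever tries to map them.
val cfg = Map(
  "es.mapping.date.rich"  -> "false",
  "es.read.field.exclude" -> "tags,a.b"
)
val FilteredRDD = sc.esRDD("index", cfg)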
However, even with Scala I can only retrieve a small number of records, for example:
EsReadRDD.take(10).foreach(println)
For some reason, collect() does not work:
val esdf = EsReadRDD.collect() // does not work, probably because the data is too large
The error is:
Job aborted due to stage failure: Task 0 in stage 833.0 failed 4 times, most recent failure: Lost task 0.3 in stage 833.0 (TID 479, 10.10.11.37, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
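For context, collect() ships every record of the RDD back to the driver JVM, so on a full index a memory-related executor or driver failure like the one above is the usual outcome. A minimal sketch of bounded alternatives (the output path is only an example):

// Alternatives that avoid materializing the whole index on the driver:
val total  = EsReadRDD.count()            // aggregates on the executors
val sample = EsReadRDD.take(1000)         // pulls a bounded number of records
EsReadRDD.saveAsTextFile("/tmp/es_dump")  // writes out in parallel; path is an example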
I also tried converting it to a DataFrame, but got an error:
val esdf = EsReadRDD.toDF()
java.lang.UnsupportedOperationException: No Encoder found for scala.AnyRef
- map value class: "java.lang.Object"
- field (class: "scala.collection.Map", name: "_2")
- root class: "scala.Tuple2"
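The trace itself points at the cause: esRDD yields an RDD[(String, Map[String, AnyRef])], and Spark has no Encoder for AnyRef values. A minimal sketch of one way around it, stringifying the document values first so the tuple becomes encodable (nested objects end up as their toString form, which is fine for inspection but lossy):

import spark.implicits._

// Map[String, AnyRef] has no Encoder; Map[String, String] does,
// so convert the values to strings before calling toDF.
val esdf = EsReadRDD
  .map { case (id, doc) => (id, doc.map { case (k, v) => (k, String.valueOf(v)) }) }
  .toDF("id", "doc")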
Do you have any ideas on how to handle this?