I need to group a set of CSV rows by one of the columns and do some processing on each group.
JavaRDD<String> lines = sc.textFile("somefile.csv");
JavaPairRDD<String, String> pairRDD = lines.mapToPair(new SomeParser());
List<String> keys = pairRDD.keys().distinct().collect();
for (String key : keys)
{
    List<String> rows = pairRDD.lookup(key);       // re-scans the RDD for every key
    int noOfVisits = rows.size();
    String country = COMMA.split(rows.get(0))[6];  // 7th CSV column
    long accessDuration = getAccessDuration(rows, timeFormat);
    Map<String, Integer> counts = getCounts(rows);
    int whitepapers = counts.get("whitepapers");
    int tutorials = counts.get("tutorials");
    int workshops = counts.get("workshops");
    int casestudies = counts.get("casestudies");
    int productPages = counts.get("productpages");
}
private static long dateParser(String dateString) throws ParseException {
SimpleDateFormat format = new SimpleDateFormat("MMM dd yyyy HH:mma");
Date date = format.parse(dateString);
return date.getTime();
}
dateParser is called for each row; the min and max of the parsed timestamps in each group then give the access duration. The other fields are simple string matches.
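For reference, the duration step described above can be sketched on its own, outside Spark. This is only an illustration of what getAccessDuration is assumed to do (max minus min of the parsed timestamps); the format string is the one used in dateParser, with an explicit Locale added so "Jan" etc. parse regardless of the JVM default, and the checked ParseException wrapped so the helper is easier to call:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.List;
import java.util.Locale;

public class AccessDuration {
    // Same pattern as dateParser in the question; ParseException is wrapped
    // in an unchecked exception here to keep the sketch simple.
    static long dateParser(String dateString) {
        SimpleDateFormat format = new SimpleDateFormat("MMM dd yyyy HH:mma", Locale.ENGLISH);
        try {
            return format.parse(dateString).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad timestamp: " + dateString, e);
        }
    }

    // Access duration for one group: max timestamp minus min timestamp, in millis.
    static long getAccessDuration(List<String> timestamps) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (String ts : timestamps) {
            long t = dateParser(ts);
            min = Math.min(min, t);
            max = Math.max(max, t);
        }
        return max - min;
    }
}
```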
pairRDD.lookup is very slow. Is there a better way to do this with Spark?
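For context on why collect() + lookup() per key is slow: each lookup re-scans the whole RDD, so the usual alternative is a single pass with aggregateByKey/combineByKey, folding each row into a small per-key accumulator as it streams by. Below is a minimal single-JVM sketch of that fold (the Spark calls themselves are omitted; the class and field names here are hypothetical, and only visit count and duration are accumulated):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OnePassStats {
    // Per-key accumulator, analogous to what aggregateByKey builds per partition.
    static class Stats {
        int noOfVisits;
        long minTime = Long.MAX_VALUE;
        long maxTime = Long.MIN_VALUE;

        Stats add(long time) {      // seqOp: fold one row into the accumulator
            noOfVisits++;
            minTime = Math.min(minTime, time);
            maxTime = Math.max(maxTime, time);
            return this;
        }

        Stats merge(Stats other) {  // combOp: merge two partial accumulators
            noOfVisits += other.noOfVisits;
            minTime = Math.min(minTime, other.minTime);
            maxTime = Math.max(maxTime, other.maxTime);
            return this;
        }

        long accessDuration() { return maxTime - minTime; }
    }

    // One pass over (key, timestampMillis) pairs: no per-key re-scan needed.
    static Map<String, Stats> aggregate(List<Map.Entry<String, Long>> rows) {
        Map<String, Stats> byKey = new HashMap<>();
        for (Map.Entry<String, Long> row : rows) {
            byKey.computeIfAbsent(row.getKey(), k -> new Stats()).add(row.getValue());
        }
        return byKey;
    }
}
```

The same shape extends to the string-match counts (whitepapers, tutorials, ...) by adding counter fields to the accumulator.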