python - Selecting random columns for each group of pyspark RDD/dataframe

Question

My dataframe as got 10,0000 columns, I have to apply some logic on each group (key is region and dept). Each group will use max 30 columns from 10k columns, the 30 columns list is from the second data set column "colList". Each group will have 2-3 millions rows. My approach is to do group by key and call function like below. But it fails - 1. shuffle and 2.data group is more than 2G(can be solved by re-partition but its costly), 3. very slow

def testfunc(iter):
   <<got some complex business logic which cant be done in spark API>>

resRDD = df.rdd.groupBy(region, dept).map(lambda x: testfunc(x))

Input:

region dept week val0 val1  val2  val3 ... val10000   
 US    CS   1     1    2    1     1   ...  2 
 US    CS   2     1.5  2    3     1   ...  2
 US    CS   3     1    2    2     2.1      2
 US    ELE  1     1.1  2    2     2.1      2
 US    ELE  2     2.1  2    2     2.1      2
 US    ELE  3     1    2    1     2   .... 2
 UE    CS   1     2    2    1     2   .... 2

Columns to pick for each group: (data set 2)

region dept colList   
 US    CS   val0,val10,val100,val2000 
 US    ELE  val2,val5,val800,val900
 UE    CS   val21,val54,val806,val9000

My second solution is create a new data set from input data with only 30 columns and rename the columns to col1 to col30. Then use a mapping list for each columns and group. Then i can apply groupbyKey (assuming), which will be Skinner than original input of 10K columns.

region dept week col0 col1  col2  col3 ... col30   
 US    CS   1     1    2    1     1   ...  2 
 US    CS   2     1.5  2    3     1   ...  2
 US    CS   3     1    2    2     2.1      2
 US    ELE  1     1.1  2    2     2.1      2
 US    ELE  2     2.1  2    2     2.1      2
 US    ELE  3     1    2    1     2   .... 2
 UE    CS   1     2    2    1     2   .... 2

Can any one help to convert Input with 10K to 30 columns? Or any other alternative should be fine to avoid group by.

score 0 · Accepted Answer

您可以使用 create_map 函数将所有 10k 列转换为每行映射。现在使用 UDF 获取地图、区域和部门，并将地图稀释到 30 列，并确保所有 30 列始终具有相同的名称。最后，您可以包装您的复杂函数以接收地图而不是原始的 10K 列。希望这将使它足够小以正常工作。

如果没有，您可以获得一个不同的区域和部门，并假设您可以循环遍历一个并分组另一个。

python - Selecting random columns for each group of pyspark RDD/dataframe

1 回答 1

Related

Reference