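Dataset structure:    website    request IP
For example:
www.Abaidu.com      192.168.1.101
www.Abaidu.com      192.168.1.102
www.Abaidu.com      192.168.1.103
www.Ataobao.com     192.168.1.101
www.Ataobao.com     192.168.1.102
www.Ajd.com         192.168.1.101
The result I want in the end is:
www.Abaidu.com-www.Ataobao.com    overlap (number of shared IPs)    2
www.Abaidu.com-www.Ajd.com        overlap (number of shared IPs)    1
www.Ataobao.com-www.Ajd.com       overlap (number of shared IPs)    1
How should I do this with Spark RDDs?
I'm a beginner, so a pointer in the right direction would be appreciated.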

Solutions »

  1.   

    Try this in spark-shell (enter :paste first so the multi-line chain is read as one expression):
    // Each record is Array(website, ip).
    val array = Array(Array("www.Abaidu.com", "192.168.1.101"),
                      Array("www.Abaidu.com", "192.168.1.102"),
                      Array("www.Abaidu.com", "192.168.1.103"),
                      Array("www.Ataobao.com", "192.168.1.101"),
                      Array("www.Ataobao.com", "192.168.1.102"),
                      Array("www.Ajd.com", "192.168.1.101"))
    val rdd = sc.parallelize(array)
    // Pair every record with every other record, then keep pairs of different
    // sites that share an IP. Requiring r._1(0) > r._2(0) keeps each unordered
    // site pair exactly once.
    rdd.cartesian(rdd)
      .filter(r => r._1(0) > r._2(0) && r._1(1) == r._2(1))
      .map(r => ((r._1(0), r._2(0)), 1))      // key on the site pair itself
      .reduceByKey(_ + _)                     // number of shared IPs per site pair
      .map { case ((a, b), n) => (a, b, n) }  // (site, site, shared IP count)
      .collect
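
    Note that cartesian compares every record with every other record, so the work grows quadratically with the size of the dataset. A possible alternative is to join the data with itself on the IP, so that only records sharing an IP are ever paired. Below is a minimal sketch of that idea, not a tuned implementation. It assumes spark-shell (where sc is already defined) and that you paste it with :paste. The names records, byIp and overlap are only illustrative.

```scala
// Assumes spark-shell, where `sc` (the SparkContext) is predefined.
val records = Seq(
  ("www.Abaidu.com", "192.168.1.101"),
  ("www.Abaidu.com", "192.168.1.102"),
  ("www.Abaidu.com", "192.168.1.103"),
  ("www.Ataobao.com", "192.168.1.101"),
  ("www.Ataobao.com", "192.168.1.102"),
  ("www.Ajd.com", "192.168.1.101"))

// Key by IP and drop repeated (ip, site) rows so each IP counts once per site.
val byIp = sc.parallelize(records).map { case (site, ip) => (ip, site) }.distinct()

// The self-join on IP yields every ordered pair of sites that share that IP;
// a < b keeps each unordered pair once and drops a site paired with itself.
val overlap = byIp.join(byIp)
  .filter { case (_, (a, b)) => a < b }
  .map { case (_, (a, b)) => ((a, b), 1) }
  .reduceByKey(_ + _)

overlap.collect().foreach { case ((a, b), n) => println(s"$a-$b\t$n") }
```

    On the sample data this should give 2 shared IPs for the Abaidu/Ataobao pair and 1 for each of the other two pairs; the order of the printed lines is not guaranteed. If a single IP hits a very large number of sites, the join for that IP can still get big, so this is a trade-off rather than a guaranteed win.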