Spark源码学习笔记(二十三)

Spark SQL之CBO优化

CBO是相对应于RBO的概念，RBO就是应用了一系列的规则到逻辑计划中。CBO是基于代价计算，会考虑到数据的统计、分布等等情况。

CBO优化join reorder，也就是有多个表join，cbo优化会按特定的顺序进行join，仅支持InnerLike类型的join，目前只有inner join和cross join属于InnerLike类型

/**
 * The explicitCartesian flag indicates if the inner join was constructed with a CROSS join
 * indicating a cartesian product has been explicitly requested.
 */
sealed abstract class InnerLike extends JoinType {
  def explicitCartesian: Boolean
}

case object Inner extends InnerLike {
  override def explicitCartesian: Boolean = false
  override def sql: String = "INNER"
}

case object Cross extends InnerLike {
  override def explicitCartesian: Boolean = true
  override def sql: String = "CROSS"
}

join reorder需要开启参数spark.sql.cbo.joinReorder.enabled与spark.sql.cbo.enabled

多表连接顺序优化算法使用了动态规划寻找最优join顺序优势：动态规划算法能够求得整个搜索空间中最优解缺陷：当联接表数量增加时，算法需要搜索的空间增加的非常快，计算最优联接顺序代价很高

计划优劣性计算公式如下面代码中，代码中已经给出了注释

 /**
   * Partial join order in a specific level.
   *
   * @param itemIds Set of item ids participating in this partial plan.
   * @param plan The plan tree with the lowest cost for these items found so far.
   * @param joinConds Join conditions included in the plan.
   * @param planCost The cost of this plan tree is the sum of costs of all intermediate joins.
   */
  case class JoinPlan(
      itemIds: Set[Int],
      plan: LogicalPlan,
      joinConds: Set[Expression],
      planCost: Cost) {

    /** Get the cost of the root node of this plan tree. */
    def rootCost(conf: SQLConf): Cost = {
      if (itemIds.size > 1) {
        val rootStats = plan.stats
        Cost(rootStats.rowCount.get, rootStats.sizeInBytes)
      } else {
        // If the plan is a leaf item, it has zero cost.
        Cost(0, 0)
      }
    }

    def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
      // card为结果行数，size为结果大小
      if (other.planCost.card == 0 || other.planCost.size == 0) {
        false
      } else {
        // 当前结果集行数/其他计划结果集行数
        val relativeRows = BigDecimal(this.planCost.card) / BigDecimal(other.planCost.card)
        // 当前结果集空间大小/其他计划结果集大小
        val relativeSize = BigDecimal(this.planCost.size) / BigDecimal(other.planCost.size)
        // joinReorderCardWeight的值为参数spark.sql.cbo.joinReorder.card.weight的值，默认0.7，所以当前结果集条目越少，空间越小，会更优
        relativeRows * conf.joinReorderCardWeight +
          relativeSize * (1 - conf.joinReorderCardWeight) < 1
      }
    }
  }

上一章中讲了broadcast join会严格要求join顺序，但是inner join 左右表都可以build，reoder仅支持InnerLike join，所以同时使用join reorder和broadcast join无需注意表顺序，reorder会自动调整。

当然，CBO优化还有其他的地方，比如两个表Join，A表过滤前1亿数据，过滤后10W数据，B表1000万数据，默认会使用B表左右build side，开启CBO优化后，会使用A表作为build side，如果A表足够小，满足broadcast join，则会采用broadcast join

总结下CBO优化就是如下三点：
 1. 选择合适的build side   2. 优化join类型   3. join reoder，优化多表join顺序