The Optimizer is the stage that runs after the Analyzer has produced the Resolved Logical Plan and optimizes that plan.
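1. Batch Finish Analysis
This batch contains 5 optimization rules, and each rule is executed once.
1.1 EliminateSubqueryAliases
This rule eliminates subquery aliases, which appear as SubqueryAlias nodes in the logical operator tree. A SubqueryAlias only provides scoping information for the query, so once the Analyzer phase has finished the node can be removed: the rule simply replaces each SubqueryAlias with its child. In the query of this section the subquery alias is t, and the Analyzed Logical Plan still contains a SubqueryAlias t node.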
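The rewrite itself is small. The following is a minimal sketch on a toy plan tree, not Spark's implementation: the `Node` class and the operator names stored as strings are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """Toy logical operator: a name plus child operators (hypothetical model)."""
    name: str
    children: Tuple["Node", ...] = ()


def eliminate_subquery_aliases(plan: Node) -> Node:
    """Replace every SubqueryAlias node with its (already rewritten) child."""
    children = tuple(eliminate_subquery_aliases(c) for c in plan.children)
    if plan.name.startswith("SubqueryAlias"):
        return children[0]  # a SubqueryAlias has exactly one child
    return Node(plan.name, children)


analyzed = Node("Aggregate [sum(len)]", (
    Node("SubqueryAlias t", (
        Node("Aggregate [c1]", (
            Node("SubqueryAlias spark_catalog.test.t1", (
                Node("HiveTableRelation test.t1"),
            )),
        )),
    )),
))

optimized = eliminate_subquery_aliases(analyzed)
```

The resulting tree keeps the two Aggregate nodes and the relation and drops both alias nodes, which is the same shape as the Optimized Logical Plan printed by Spark for this query.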
explain extended select sum(len) from (select c1, length(c1) len from t1 group by c1) t;

== Analyzed Logical Plan ==
sum(len): bigint
Aggregate [sum(len#56) AS sum(len)#64L]
+- SubqueryAlias t
   +- Aggregate [c1#62], [c1#62, length(c1#62) AS len#56]
      +- SubqueryAlias spark_catalog.test.t1
         +- HiveTableRelation [test.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#62], Partition Cols: []]

== Optimized Logical Plan ==
Aggregate [sum(len#56) AS sum(len)#64L]
+- Aggregate [c1#62], [length(c1#62) AS len#56]
   +- HiveTableRelation [test.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#62], Partition Cols: []]

1.2 ReplaceExpressions
ReplaceExpressions performs expression replacement. Its 4 replacement rules are shown below.
case e: RuntimeReplaceable => e.child
case CountIf(predicate) => Count(new NullIf(predicate, Literal.FalseLiteral))
case BoolOr(arg) => Max(arg)
case BoolAnd(arg) => Min(arg)

1.2.1 RuntimeReplaceable
RuntimeReplaceable is a trait with many subclasses; each of them is replaced by its child node. For example, the child of Nvl is Coalesce(Seq(left, right)), so during optimization nvl is replaced by that child.
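The idea can be sketched outside Spark as follows. This is a toy model under stated assumptions: the Python classes `Nvl` and `Coalesce` and the helper `replace_expressions` are hypothetical stand-ins, and "exposes a `child` attribute" plays the role of "extends RuntimeReplaceable".

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Coalesce:
    args: Tuple[Optional[str], ...]

    def eval(self) -> Optional[str]:
        # First non-null argument, or None when every argument is null.
        return next((a for a in self.args if a is not None), None)


@dataclass(frozen=True)
class Nvl:
    left: Optional[str]
    right: Optional[str]

    @property
    def child(self) -> Coalesce:
        # The expression that actually carries the semantics of nvl.
        return Coalesce((self.left, self.right))


def replace_expressions(expr):
    """Swap an expression that exposes `child` for that child (toy rule)."""
    return expr.child if hasattr(expr, "child") else expr


replaced = replace_expressions(Nvl(None, "v12"))
```

After the replacement only `Coalesce` remains in the tree, so no separate evaluation logic is needed for `Nvl` itself.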
case class Nvl(left: Expression, right: Expression, child: Expression) extends RuntimeReplaceable {
  def this(left: Expression, right: Expression) {
    this(left, right, Coalesce(Seq(left, right)))
  }
  // remaining members omitted
}

explain extended SELECT nvl(c1,c2) FROM VALUES ('v1', 'v12'), ('v2', 'v22'), ('v3', 'v32') AS tab(c1, c2);

Output:
== Analyzed Logical Plan ==
nvl(c1, c2): string
Project [nvl(c1#85, c2#86) AS nvl(c1, c2)#87]
+- SubqueryAlias tab
   +- LocalRelation [c1#85, c2#86]

== Optimized Logical Plan ==
LocalRelation [nvl(c1, c2)#87]

1.2.2 bool_or
bool_or is replaced with max.
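Why the replacement is valid can be checked with plain booleans: when false orders before true, the maximum is true exactly when some value is true, and the minimum is true exactly when all values are true (the mirror-image rule used for bool_and). The sketch below is only an illustration in Python; it ignores NULL handling, and the reference functions `bool_or` and `bool_and` are defined here for comparison.

```python
def bool_or(values):
    """Reference semantics: true if at least one value is true."""
    return any(values)


def bool_and(values):
    """Reference semantics: true only if every value is true."""
    return all(values)


col = [True, False, False]  # the rows used by the VALUES clause in this section

# False < True, so max/min over booleans reproduce or/and.
or_via_max = max(col)
and_via_min = min(col)
```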
explain extended SELECT bool_or(col) FROM
VALUES (true), (false), (false) AS tab(col);

Output:
== Analyzed Logical Plan ==
bool_or(col): boolean
Aggregate [bool_or(col#101) AS bool_or(col)#103]
+- SubqueryAlias tab
   +- LocalRelation [col#101]

== Optimized Logical Plan ==
Aggregate [max(col#101) AS bool_or(col)#103]
+- LocalRelation [col#101]

1.2.3 bool_and
bool_and is replaced with min.
explain extended SELECT bool_and(col) FROM
VALUES (true), (false), (false) AS tab(col);

Output:
== Analyzed Logical Plan ==
bool_and(col): boolean
Aggregate [bool_and(col#112) AS bool_and(col)#114]
+- SubqueryAlias tab
   +- LocalRelation [col#112]

== Optimized Logical Plan ==
Aggregate [min(col#112) AS bool_and(col)#114]
+- LocalRelation [col#112]

1.3 ComputeCurrentTime
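ComputeCurrentTime evaluates the expressions that depend on the current time. A single SQL statement may contain several such expressions, for example CurrentDate and CurrentTimestamp; the rule guarantees that all of them return the same value within one SQL query.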
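The essential trick is to read the clock once and substitute literals everywhere. A minimal sketch of that idea follows; the function `compute_current_time` and the string markers standing for time expressions are assumptions for illustration, not Spark's API.

```python
from datetime import datetime, timezone


def compute_current_time(expressions, now=None):
    """Replace every time-dependent marker with a literal from ONE clock read."""
    now = now or datetime.now(timezone.utc)  # the clock is read a single time
    literals = {
        "current_timestamp": now,
        "now": now,
        "current_date": now.date(),
    }
    # Anything that is not a time marker is left untouched.
    return [literals.get(e, e) for e in expressions]


fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)  # arbitrary instant for the demo
result = compute_current_time(["current_timestamp", "now", "current_date", "c1"], fixed)
```

Because both timestamp markers are mapped to the same object, two occurrences in one query can never disagree, even if planning takes a noticeable amount of time.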
subQuery.transformAllExpressionsWithPruning(transformCondition) {
  case cd: CurrentDate =>
    Literal.create(DateTimeUtils.microsToDays(currentTimestampMicros, cd.zoneId), DateType)
  case CurrentTimestamp() | Now() => currentTime
  case CurrentTimeZone() => timezone
  case localTimestamp: LocalTimestamp =>
    val asDateTime = LocalDateTime.ofInstant(instant, localTimestamp.zoneId)
    Literal.create(localDateTimeToMicros(asDateTime), TimestampNTZType)
}

2. Batch Union
CombineUnions merges adjacent Union nodes into a single Union node. The three-branch UNION query in this section illustrates this.
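Flattening nested unions is a short recursive rewrite. The sketch below models a plan as `("union", [children])` tuples, which is an assumption for illustration; it also leaves out the Distinct nodes that sit between the two unions in the real plan.

```python
def combine_unions(plan):
    """Flatten directly nested ("union", [...]) nodes into one union."""
    if not (isinstance(plan, tuple) and plan[0] == "union"):
        return plan  # leaf relation: nothing to do
    flat = []
    for child in plan[1]:
        child = combine_unions(child)
        if isinstance(child, tuple) and child[0] == "union":
            flat.extend(child[1])  # absorb the nested union's children
        else:
            flat.append(child)
    return ("union", flat)


nested = ("union", [("union", ["q1", "q2"]), "q3"])
flattened = combine_unions(nested)
```

A single n-ary Union gives later rules one node to reason about instead of a chain of binary ones.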
explain extended
select c1 from t1
union
select c1 from t1 where length(c1) 2
union
select c1 from t1 where length(c1) 3;

The output is shown below: the Analyzed Logical Plan contains 2 Union nodes, while the Optimized Logical Plan contains only 1.
== Analyzed Logical Plan ==
c1: string
Distinct
+- Union false, false
   :- Distinct
   :  +- Union false, false
   :     :- Project [c1#161]
   :     :  +- SubqueryAlias spark_catalog.test.t1
   :     :     +- HiveTableRelation [test.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#161], Partition Cols: []]
   :     +- Project [c1#162]
   :        +- Filter (length(c1#162) 2)
   :           +- SubqueryAlias spark_catalog.test.t1
   :              +- HiveTableRelation [test.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#162], Partition Cols: []]
   +- Project [c1#163]
      +- Filter (length(c1#163) 3)
         +- SubqueryAlias spark_catalog.test.t1
            +- HiveTableRelation [test.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#163], Partition Cols: []]

== Optimized Logical Plan ==
Aggregate [c1#161], [c1#161]
+- Union false, false
   :- HiveTableRelation [test.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#161], Partition Cols: []]
   :- Filter (isnotnull(c1#162) AND (length(c1#162) 2))
   :  +- HiveTableRelation [test.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#162], Partition Cols: []]
   +- Filter (isnotnull(c1#163) AND (length(c1#163) 3))
      +- HiveTableRelation [test.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#163], Partition Cols: []]

3. Batch Subquery
3.1 OptimizeSubqueries
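When a SQL statement contains a subquery, a SubqueryExpression is generated in the logical operator tree. When the OptimizeSubqueries rule encounters a SubqueryExpression, it invokes the Optimizer again on that expression's sub-plan.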
4. Batch Replace Operators
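This batch performs operator replacement. Some query operators in a SQL statement can be rewritten directly in terms of existing operators, which avoids duplicating transformation logic.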
4.1 ReplaceIntersectWithSemiJoin
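This rule replaces the Intersect operator with a Left-Semi Join operator; logically the two are equivalent. Note that ReplaceIntersectWithSemiJoin applies only to INTERSECT DISTINCT statements, not to INTERSECT ALL. In addition, duplicate attributes must be eliminated before this rule runs, otherwise the generated Join condition would be incorrect. The example in this section uses two tables, t1 and t2, loaded from the same file.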
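The equivalence can be checked on plain Python lists. This is a sketch of the set-level argument only, with hypothetical helper names; Python's `==` treats `None == None` as true, which here plays the role of a null-safe equality in the join condition.

```python
def left_semi_join(left, right):
    """Keep each left row that has at least one equal row on the right."""
    return [l for l in left if any(l == r for r in right)]


def distinct(rows):
    """Drop duplicates, keeping first occurrences in order."""
    seen, out = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out


t2 = ["else", "done", "else", None, "fi"]
t1 = ["done", "else", "else", None]

via_join = distinct(left_semi_join(t2, t1))  # Aggregate on top of Join LeftSemi
via_set = set(t2) & set(t1)                  # reference INTERSECT DISTINCT
```

The de-duplication step is what restricts the rewrite to INTERSECT DISTINCT: a semi join alone would keep both copies of `"else"` from the left side.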
create table t1(c1 string) stored as textfile;
create table t2(c1 string) stored as textfile;
load data local inpath '/etc/profile' overwrite into table t1;
load data local inpath '/etc/profile' overwrite into table t2;

Find the rows whose length is 4:
select c1 from t1 where length(c1)=4;

Output:
else
else
else
done
Time taken: 0.064 seconds, Fetched 4 row(s)

intersect distinct
explain extended
select c1 from t2 where length(c1)5
intersect distinct
select c1 from t1 where length(c1)4;

The output is shown below. The Analyzed Logical Plan contains Intersect, while in the Optimized Logical Plan it has become Join LeftSemi.
== Analyzed Logical Plan ==
c1: string
Intersect false
:- Project [c1#149]
:  +- Filter (length(c1#149) 5)
:     +- SubqueryAlias spark_catalog.hzz.t2
:        +- HiveTableRelation [hzz.t2, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#149], Partition Cols: []]
+- Project [c1#150]
   +- Filter (length(c1#150) 4)
      +- SubqueryAlias spark_catalog.hzz.t1
         +- HiveTableRelation [hzz.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#150], Partition Cols: []]

== Optimized Logical Plan ==
Aggregate [c1#149], [c1#149]
+- Join LeftSemi, (c1#149 <=> c1#150)
   :- Filter (isnotnull(c1#149) AND (length(c1#149) 5))
   :  +- HiveTableRelation [hzz.t2, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#149], Partition Cols: []]
   +- Filter (isnotnull(c1#150) AND (length(c1#150) 4))
      +- HiveTableRelation [hzz.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#150], Partition Cols: []]

4.2 ReplaceExceptWithAntiJoin
This rule replaces Except with an Anti Join. The query in this section shows the rewrite.
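As a quick plausibility check, an anti join followed by de-duplication gives the same rows as a set difference. The sketch below uses plain Python lists and a hypothetical `left_anti_join` helper; it says nothing about how Spark executes the join.

```python
def left_anti_join(left, right):
    """Keep each left row that has no equal row on the right."""
    return [l for l in left if all(l != r for r in right)]


t2 = ["else", "done", "export", "export", None]
t1 = ["done", "else", "else"]

# EXCEPT (distinct) corresponds to a de-duplicated left anti join.
via_join = list(dict.fromkeys(left_anti_join(t2, t1)))
via_set = set(t2) - set(t1)
```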
explain extended
select c1 from t2 where length(c1) 5
except
select c1 from t1 where length(c1)4;

Output:
== Analyzed Logical Plan ==
c1: string
Except false
:- Project [c1#156]
:  +- Filter (length(c1#156) 5)
:     +- SubqueryAlias spark_catalog.hzz.t2
:        +- HiveTableRelation [hzz.t2, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#156], Partition Cols: []]
+- Project [c1#157]
   +- Filter (length(c1#157) 4)
      +- SubqueryAlias spark_catalog.hzz.t1
         +- HiveTableRelation [hzz.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#157], Partition Cols: []]

== Optimized Logical Plan ==
Aggregate [c1#156], [c1#156]
+- Join LeftAnti, (c1#156 <=> c1#157)
   :- Filter (isnotnull(c1#156) AND (length(c1#156) 5))
   :  +- HiveTableRelation [hzz.t2, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#156], Partition Cols: []]
   +- Filter (isnotnull(c1#157) AND (length(c1#157) 4))
      +- HiveTableRelation [hzz.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#157], Partition Cols: []]

4.3 ReplaceDistinctWithAggregate
This rule replaces a Distinct operator with an Aggregate that groups by the output columns. Example:
explain extended
select distinct c1 from t1;

Output:
== Analyzed Logical Plan ==
c1: string
Distinct
+- Project [c1#163]
   +- SubqueryAlias spark_catalog.hzz.t1
      +- HiveTableRelation [hzz.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#163], Partition Cols: []]

== Optimized Logical Plan ==
Aggregate [c1#163], [c1#163]
+- HiveTableRelation [hzz.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#163], Partition Cols: []]

5. Batch Aggregate
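5.1 RemoveLiteralFromGroupExpressions
This rule removes constants from the group by clause. In the example, every group by expression is a constant, so they are all replaced by a single 0.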
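A minimal sketch of the rewrite follows, assuming literals are modelled as `("lit", value)` tuples and everything else is a column name; the function name is hypothetical. When all grouping expressions are constants, one dummy literal is kept so that the query stays a grouped aggregation, which returns no row on empty input instead of one.

```python
def remove_literals_from_group_by(grouping):
    """Drop constant grouping expressions; literals are ("lit", value) tuples."""
    non_literals = [g for g in grouping
                    if not (isinstance(g, tuple) and g[0] == "lit")]
    if non_literals:
        return non_literals
    if grouping:
        # Everything was constant: keep one cheap literal rather than
        # turning the query into an ungrouped aggregation.
        return [("lit", 0)]
    return grouping


only_constants = remove_literals_from_group_by([("lit", "aa"), ("lit", "bb")])
mixed = remove_literals_from_group_by(["c1", ("lit", "aa")])
```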
explain extended
select sum(length(c1)) from t1 group by 'aa','bb';

== Analyzed Logical Plan ==
sum(length(c1)): bigint
Aggregate [aa, bb], [sum(length(c1#189)) AS sum(length(c1))#191L]
+- SubqueryAlias spark_catalog.hzz.t1
   +- HiveTableRelation [hzz.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#189], Partition Cols: []]

== Optimized Logical Plan ==
Aggregate [0], [sum(length(c1#189)) AS sum(length(c1))#191L]
+- HiveTableRelation [hzz.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#189], Partition Cols: []]

5.2 RemoveRepetitionFromGroupExpressions
This rule removes duplicate expressions from the group by clause, for example:
explain extended
select sum(length(c1)) from t1 group by c1,c1;

Output:
== Analyzed Logical Plan ==
sum(length(c1)): bigint
Aggregate [c1#201, c1#201], [sum(length(c1#201)) AS sum(length(c1))#203L]
+- SubqueryAlias spark_catalog.hzz.t1
   +- HiveTableRelation [hzz.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#201], Partition Cols: []]

== Optimized Logical Plan ==
Aggregate [c1#201], [sum(length(c1#201)) AS sum(length(c1))#203L]
+- HiveTableRelation [hzz.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#201], Partition Cols: []]

6. Batch Operator Optimizations
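This batch covers 3 broad categories: 1. operator push-down; 2. operator combination; 3. constant folding and strength reduction. Operator push-down includes predicate push-down and column pruning. The rules and their operations are listed below.

Optimization rule                    Operation
PushProjectionThroughUnion           Column pruning push-down
ReorderJoin                          Join order optimization (unrelated to CostBasedJoinReorder)
EliminateOuterJoin                   OuterJoin elimination
PushPredicateThroughJoin             Predicate push-down to the Join operator
PushDownPredicate                    Predicate push-down
LimitPushDown                        Limit operator push-down
ColumnPruning                        Column pruning
InferFiltersFromConstraints          Filter inference from constraints
CollapseRepartition                  Repartition combination
CollapseProject                      Project operator combination
CollapseWindow                       Window combination
CombineFilters                       Filter operator combination
CombineLimits                        Limit operator combination
CombineUnions                        Union operator combination
NullPropagation                      Null propagation
FoldablePropagation                  Foldable expression propagation
OptimizeIn                           In operation optimization
ConstantFolding                      Constant folding
ReorderAssociativeOperator           Reordering of associative operators
LikeSimplification                   Like operator simplification
BooleanSimplification                Boolean operator simplification
SimplifyConditionals                 Conditional simplification
RemoveDispensableExpressions         Dispensable expression elimination
SimplifyBinaryComparison             Comparison operator simplification
PruneFilters                         Filter condition pruning
EliminateSorts                       Sort operator elimination
SimplifyCasts                        Cast operator simplification
SimplifyCaseConversionExpressions    Case conversion expression simplification
RewriteCorrelatedScalarSubquery      Correlated subquery rewriting
EliminateSerialization               Serialization elimination
RemoveAliasOnlyProject               Alias elimination
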
InferFiltersFromConstraints
explain extended
select t1.c1 from t1 join t2
on t1.c1 = t2.c1
where t2.c1 = 'done';

From t2.c1 = t1.c1 and t2.c1 = 'done', the optimizer infers t1.c1 = 'done'.
== Analyzed Logical Plan ==
c1: string
Project [c1#235]
+- Filter (c1#236 = done)
   +- Join Inner, (c1#235 = c1#236)
      :- SubqueryAlias spark_catalog.hzz.t1
      :  +- HiveTableRelation [hzz.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#235], Partition Cols: []]
      +- SubqueryAlias spark_catalog.hzz.t2
         +- HiveTableRelation [hzz.t2, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#236], Partition Cols: []]

== Optimized Logical Plan ==
Project [c1#235]
+- Join Inner, (c1#235 = c1#236)
   :- Filter ((c1#235 = done) AND isnotnull(c1#235))
   :  +- HiveTableRelation [hzz.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#235], Partition Cols: []]
   +- Filter (isnotnull(c1#236) AND (c1#236 = done))
      +- HiveTableRelation [hzz.t2, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#236], Partition Cols: []]
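The inference used above can be sketched as a small fixed-point computation. This is a toy model under stated assumptions: constraints are limited to `column = column` and `column = literal`, and the function name `infer_filters` is hypothetical.

```python
def infer_filters(equalities, constants):
    """Propagate `column = literal` facts across `column = column` constraints.

    equalities: list of (col_a, col_b) pairs known to be equal
    constants:  dict mapping a column to the literal it must equal
    """
    inferred = dict(constants)
    changed = True
    while changed:  # repeat until no new fact appears
        changed = False
        for a, b in equalities:
            for src, dst in ((a, b), (b, a)):
                if src in inferred and dst not in inferred:
                    inferred[dst] = inferred[src]
                    changed = True
    return inferred


facts = infer_filters([("t1.c1", "t2.c1")], {"t2.c1": "done"})
```

The new fact on `t1.c1` is what lets a filter be placed on the t1 side of the join, so fewer rows reach the join at all.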
ConstantFolding
In the Analyzed Logical Plan the Filter still contains (1 + (2 * 3)); in the Optimized Logical Plan it has become the concrete value 7.
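Constant folding is a bottom-up evaluation of every subtree whose inputs are all literals. A minimal sketch on `(op, left, right)` tuples follows; the tuple encoding and the `fold` helper are assumptions for illustration, with strings standing for column references.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(expr):
    """Bottom-up constant folding over (op, left, right) tuples."""
    if not isinstance(expr, tuple):
        return expr  # literal or column name
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)  # both sides constant: evaluate now
    return (op, left, right)


folded = fold(("+", 1, ("*", 2, 3)))
partial = fold(("+", "c1", ("*", 2, 3)))
```

In the second call only the constant subtree is evaluated; the addition stays in the tree because one operand is a column.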
explain extended
select c1 from t1 where length(c1) 1+2*3;

== Analyzed Logical Plan ==
c1: string
Project [c1#266]
+- Filter (length(c1#266) (1 + (2 * 3)))
   +- SubqueryAlias spark_catalog.hzz.t1
      +- HiveTableRelation [hzz.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#266], Partition Cols: []]

== Optimized Logical Plan ==
Filter (isnotnull(c1#266) AND (length(c1#266) 7))
+- HiveTableRelation [hzz.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#266], Partition Cols: []]

RemoveDispensableExpressions
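In the following SQL, the constant comparison between the literals 1 and 2 can be eliminated: it always evaluates to true, and a true conjunct can be dropped from the AND.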
explain extended
select c1 from t1 where 1 2 and length(c1) 4;

== Analyzed Logical Plan ==
c1: string
Project [c1#272]
+- Filter ((1 2) AND (length(c1#272) 4))
   +- SubqueryAlias spark_catalog.hzz.t1
      +- HiveTableRelation [hzz.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#272], Partition Cols: []]

== Optimized Logical Plan ==
Filter (isnotnull(c1#272) AND (length(c1#272) 4))
+- HiveTableRelation [hzz.t1, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [c1#272], Partition Cols: []]

7. Batch Check Cartesian Products
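CheckCartesianProducts checks whether the logical operator tree contains a Cartesian-product Join. If such a join exists and the SQL does not explicitly use a cross join, an exception is thrown. When spark.sql.crossJoin.enabled is true, this rule is skipped.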
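The control flow of such a check can be sketched as follows. This is a deliberately simplified model: joins are dictionaries with a `type` and an optional `condition`, "no condition" is taken as the only sign of a Cartesian product, and `CartesianProductError` is a hypothetical stand-in for the exception Spark raises.

```python
class CartesianProductError(Exception):
    """Hypothetical stand-in for the exception raised by the real rule."""


def check_cartesian_products(joins, cross_join_enabled=False):
    """joins: list of dicts with keys 'type' and 'condition' (None if absent)."""
    if cross_join_enabled:
        return  # the check is skipped entirely
    for join in joins:
        implicit = join["type"] != "cross" and join["condition"] is None
        if implicit:
            raise CartesianProductError(f"implicit cartesian product: {join}")


explicit_ok = [{"type": "inner", "condition": "a = b"},
               {"type": "cross", "condition": None}]
check_cartesian_products(explicit_ok)  # passes: the cross join is explicit
```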
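8. Batch Decimal Optimizations
DecimalAggregates
In general, when an aggregate query has to handle the precision of floating-point numbers, performance suffers considerably. For fixed-precision Decimal types, the DecimalAggregates rule treats the values as unscaled Long values during execution, which speeds up aggregation.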
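The idea of aggregating on unscaled integers can be illustrated with Python's `decimal` module. The fixed scale of 2 and the helper names are assumptions for this sketch, and it ignores the precision bounds a real implementation has to respect to avoid overflow.

```python
from decimal import Decimal

SCALE = 2  # assumed: every value has the type DECIMAL(p, 2)


def to_unscaled(value: Decimal) -> int:
    """12.34 with scale 2 is represented by the integer 1234."""
    return int(value.scaleb(SCALE))


def from_unscaled(unscaled: int) -> Decimal:
    """Re-attach the scale once, after the aggregation."""
    return Decimal(unscaled).scaleb(-SCALE)


prices = [Decimal("12.34"), Decimal("0.66"), Decimal("7.00")]

total_unscaled = sum(to_unscaled(p) for p in prices)  # plain integer addition
total = from_unscaled(total_unscaled)
```

The per-row work is integer addition; the decimal scale is restored a single time on the final result.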
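9. Batch Typed Filter Optimization
CombineTypedFilters
When the logical operator tree contains two TypedFilter conditions that apply to objects of the same type, the CombineTypedFilters rule merges them into a single filter function.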
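Merging two predicates into one function can be sketched in a few lines. The helper `combine_typed_filters` and the two sample predicates are hypothetical; the point is only that one pass with the fused function selects the same rows as two consecutive filters.

```python
def combine_typed_filters(first, second):
    """Fuse two predicates on the same object type into one function."""
    def combined(obj):
        return first(obj) and second(obj)  # `second` runs only if `first` passes
    return combined


def is_long(s):
    return len(s) > 3


def starts_with_e(s):
    return s.startswith("e")


rows = ["else", "done", "fi", "export"]
two_passes = list(filter(starts_with_e, filter(is_long, rows)))
one_pass = list(filter(combine_typed_filters(is_long, starts_with_e), rows))
```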
10. Batch LocalRelation
ConvertToLocalRelation converts local operations on a LocalRelation into another LocalRelation. For example, VALUES ('v1', 'v12'), ('v2', 'v22'), ('v3', 'v32') AS tab(c1, c2) is a local relation.
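Since the rows of a local relation are already in memory, the filter and projection on top of it can be evaluated during optimization. The sketch below models that with a list of dictionaries; the function name and the row encoding are assumptions for illustration.

```python
def convert_to_local_relation(rows, predicate, columns):
    """Apply a Filter and a Project to in-memory rows at optimization time."""
    return [tuple(row[c] for c in columns) for row in rows if predicate(row)]


tab = [
    {"c1": "v1", "c2": "v12"},
    {"c1": "v2", "c2": "v22"},
    {"c1": "v3", "c2": "v32"},
]

local = convert_to_local_relation(tab, lambda r: r["c1"] == "v1", ["c1"])
empty = convert_to_local_relation(tab, lambda r: r["c1"] == "v4", ["c1"])
```

The second call yields an empty relation, which is the situation in which operators above it can be folded away as well.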
explain extended SELECT c1 FROM VALUES ('v1', 'v12'), ('v2', 'v22'), ('v3', 'v32') AS tab(c1, c2) where c1 = 'v1';

In the output, the Parsed Logical Plan contains an UnresolvedInlineTable. In the Analyzed Logical Plan the UnresolvedInlineTable has been converted to a LocalRelation. The Optimized Logical Plan consists of a single LocalRelation: the LocalRelation and the operations on top of it have been converted into a new LocalRelation.
== Parsed Logical Plan ==
Project [c1]
+- Filter (c1 = v1)
   +- SubqueryAlias tab
      +- UnresolvedInlineTable [c1, c2], [[v1, v12], [v2, v22], [v3, v32]]

== Analyzed Logical Plan ==
c1: string
Project [c1#323]
+- Filter (c1#323 = v1)
   +- SubqueryAlias tab
      +- LocalRelation [c1#323, c2#324]

== Optimized Logical Plan ==
LocalRelation [c1#323]

PropagateEmptyRelation folds empty LocalRelations.
explain extended select t1.c1 from (SELECT c1 FROM VALUES ('v1', 'v12'), ('v2', 'v22'), ('v3', 'v32') AS tab(c1, c2) where c1 = 'v4') t1 join (SELECT c1 FROM VALUES ('v1', 'v12'), ('v2', 'v22'), ('v3', 'v32') AS tab(c1, c2) where c1 = 'v4') t2 where t1.c1 = t2.c1;

The result is shown below. In the Analyzed Logical Plan the two subqueries are still joined. In the Optimized Logical Plan only a single LocalRelation <empty> remains, where <empty> marks the LocalRelation as empty: both subqueries are optimized to LocalRelation <empty>, so their join is also LocalRelation <empty>.
== Analyzed Logical Plan ==
c1: string
Project [c1#337]
+- Filter (c1#337 = c1#339)
   +- Join Inner
      :- SubqueryAlias t1
      :  +- Project [c1#337]
      :     +- Filter (c1#337 = v4)
      :        +- SubqueryAlias tab
      :           +- LocalRelation [c1#337, c2#338]
      +- SubqueryAlias t2
         +- Project [c1#339]
            +- Filter (c1#339 = v4)
               +- SubqueryAlias tab
                  +- LocalRelation [c1#339, c2#340]

== Optimized Logical Plan ==
LocalRelation <empty>, [c1#337]

== Physical Plan ==
LocalTableScan <empty>, [c1#337]

11. Batch OptimizeCodegen
OptimizeCodegen
The current Optimizer no longer contains the OptimizeCodegen rule.
12. Batch RewriteSubquery
This batch contains two optimization rules: RewritePredicateSubquery and CollapseProject.