Querying Tables
1. Batch Query
A Paimon batch read returns all the data in one snapshot of the table. By default, a batch read returns the latest snapshot.

-- Flink SQL
SET 'execution.runtime-mode' = 'batch';

2. Batch Time Travel
A Paimon batch read can also target a specific snapshot or tag.

Flink dynamic options

-- read the snapshot with id 1L
SELECT * FROM t /*+ OPTIONS('scan.snapshot-id' = '1') */;

-- read the snapshot from specified timestamp in unix milliseconds
SELECT * FROM t /*+ OPTIONS('scan.timestamp-millis' = '1678883047356') */;

-- read tag 'my-tag'
SELECT * FROM t /*+ OPTIONS('scan.tag-name' = 'my-tag') */;

Flink 1.18+

-- read the snapshot from specified timestamp
SELECT * FROM t FOR SYSTEM_TIME AS OF TIMESTAMP '2023-01-01 00:00:00';

-- you can also use some simple expressions (see flink document to get supported functions)
SELECT * FROM t FOR SYSTEM_TIME AS OF TIMESTAMP '2023-01-01 00:00:00' + INTERVAL '1' DAY;

Spark3
Requires Spark 3.3+. Use VERSION AS OF and TIMESTAMP AS OF in the query for time travel.

-- read the snapshot with id 1L (use snapshot id as version)
SELECT * FROM t VERSION AS OF 1;

-- read the snapshot from specified timestamp
SELECT * FROM t TIMESTAMP AS OF '2023-06-01 00:00:00.123';

-- read the snapshot from specified timestamp in unix seconds
SELECT * FROM t TIMESTAMP AS OF 1678883047;

-- read tag 'my-tag'
SELECT * FROM t VERSION AS OF 'my-tag';

If a tag's name is a number that also equals a snapshot ID, the VERSION AS OF syntax resolves the tag first.
For example, if a tag named 1 is based on snapshot 2, the statement SELECT * FROM t VERSION AS OF '1' actually queries snapshot 2 (that is, tag 1), not snapshot 1.
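The time travel examples in this section take timestamps in two units: scan.timestamp-millis expects Unix milliseconds, while the Spark TIMESTAMP AS OF example with a bare number uses Unix seconds. A minimal Python sketch for deriving both values from one moment in time follows; the helper names are illustrative and not part of Paimon.

```python
from datetime import datetime, timedelta, timezone

# Reference point for Unix time, expressed as a timezone-aware datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_millis(moment: datetime) -> int:
    """Whole milliseconds since the Unix epoch (hypothetical helper)."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime to avoid local-time ambiguity")
    # Integer division of timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


def to_unix_seconds(moment: datetime) -> int:
    """Whole seconds since the Unix epoch (hypothetical helper)."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime to avoid local-time ambiguity")
    return (moment - EPOCH) // timedelta(seconds=1)


moment = datetime(2023, 3, 15, 12, 24, 7, 356000, tzinfo=timezone.utc)
print(to_unix_millis(moment))   # value for scan.timestamp-millis
print(to_unix_seconds(moment))  # value for TIMESTAMP AS OF <seconds>
```

Mixing the two units shifts the requested point in time by a factor of 1000, so it is worth checking which one each engine expects.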
Spark3-DF

// read the snapshot from specified timestamp in unix milliseconds
spark.read
    .option("scan.timestamp-millis", "1678883047000")
    .format("paimon")
    .load("path/to/table")

// read the snapshot with id 1L (use snapshot id as version)
spark.read
    .option("scan.snapshot-id", 1)
    .format("paimon")
    .load("path/to/table")

// read tag 'my-tag'
spark.read
    .option("scan.tag-name", "my-tag")
    .format("paimon")
    .load("path/to/table")

Hive
Hive requires the following configuration parameters to be added to hive-site.xml:

<!--This parameter is used to configure the whitelist of permissible configuration items allowed for use in SQL standard authorization mode.-->
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist</name>
  <value>mapred.*|hive.*|mapreduce.*|spark.*</value>
</property>

<!--This parameter is an additional configuration for hive.security.authorization.sqlstd.confwhitelist. It allows you to add other permissible configuration items to the existing whitelist.-->
<property>
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>mapred.*|hive.*|mapreduce.*|spark.*</value>
</property>

-- read the snapshot with id 1L (use snapshot id as version)
SET paimon.scan.snapshot-id=1;
SELECT * FROM t;
SET paimon.scan.snapshot-id=null;

-- read the snapshot from specified timestamp in unix milliseconds
SET paimon.scan.timestamp-millis=1679486589444;
SELECT * FROM t;
SET paimon.scan.timestamp-millis=null;

-- read tag 'my-tag'
SET paimon.scan.tag-name=my-tag;
SELECT * FROM t;
SET paimon.scan.tag-name=null;

3. Batch Incremental
Reads the incremental changes between a start snapshot and an end snapshot.
For example:
'5,10' means the changes between snapshot 5 and snapshot 10. 'TAG1,TAG3' means the changes between TAG1 and TAG3.

Flink

-- incremental between snapshot ids
SELECT * FROM t /*+ OPTIONS('incremental-between' = '12,20') */;

-- incremental between snapshot time mills
SELECT * FROM t /*+ OPTIONS('incremental-between-timestamp' = '1692169000000,1692169900000') */;

Spark3
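Because the range is passed as a single comma-separated string, it is easy to mistype when queries are assembled by hand. As a rough sketch, the string and the surrounding Flink dynamic options hint could be generated like this; both function names are hypothetical helpers, not Paimon or Flink APIs.

```python
from typing import Mapping, Union


def incremental_between(start: Union[int, str], end: Union[int, str]) -> str:
    """Join a start and end snapshot id (or tag name) into 'start,end'."""
    return f"{start},{end}"


def options_hint(options: Mapping[str, str]) -> str:
    """Render a Flink SQL dynamic table options hint from key/value pairs."""
    pairs = ", ".join(f"'{key}' = '{value}'" for key, value in options.items())
    return f"/*+ OPTIONS({pairs}) */"


hint = options_hint({"incremental-between": incremental_between(12, 20)})
sql = f"SELECT * FROM t {hint};"
print(sql)
```

The sketch does no escaping, so it is only suitable for trusted values such as snapshot ids and simple tag names.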
Requires Spark 3.2+.
Paimon supports incremental queries in Spark SQL, implemented as a Spark table-valued function. To enable this, the following configuration is required:

--conf spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions

You can then use paimon_incremental_query in a query to extract the incremental data:

-- read the incremental data between snapshot id 12 and snapshot id 20.
SELECT * FROM paimon_incremental_query('tableName', 12, 20);

Spark-DF

// incremental between snapshot ids
spark.read()
  .format("paimon")
  .option("incremental-between", "12,20")
  .load("path/to/table")

// incremental between snapshot time mills
spark.read()
  .format("paimon")
  .option("incremental-between-timestamp", "1692169000000,1692169900000")
  .load("path/to/table")

Hive

-- incremental between snapshot ids
SET paimon.incremental-between='12,20';
SELECT * FROM t;
SET paimon.incremental-between=null;

-- incremental between snapshot time mills
SET paimon.incremental-between-timestamp='1692169000000,1692169900000';
SELECT * FROM t;
SET paimon.incremental-between-timestamp=null;

Batch SQL is not allowed to return DELETE records, so -D records are dropped. To see DELETE records, query the audit_log table:

SELECT * FROM t$audit_log /*+ OPTIONS('incremental-between' = '12,20') */;

4. Streaming Query
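By default, a streaming read produces the latest snapshot of the table on first startup and then continues to read the latest changes.

-- Flink SQL
SET 'execution.runtime-mode' = 'streaming';

To do a streaming read without the initial snapshot data, use the latest scan mode:

-- Continuously reads latest changes without producing a snapshot at the beginning.
SELECT * FROM t /*+ OPTIONS('scan.mode' = 'latest') */;

4. Streaming Time Travel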
If you only want to process data for today and later, you can filter by partition:

SELECT * FROM t WHERE dt > '2023-06-26';

If the table is not partitioned, or you cannot filter by partition, you can use a time travel streaming read.

Flink dynamic options

-- read changes from snapshot id 1L
SELECT * FROM t /*+ OPTIONS('scan.snapshot-id' = '1') */;

-- read changes from snapshot specified timestamp
SELECT * FROM t /*+ OPTIONS('scan.timestamp-millis' = '1678883047356') */;

-- read snapshot id 1L upon first startup, and continue to read the changes
SELECT * FROM t /*+ OPTIONS('scan.mode' = 'from-snapshot-full', 'scan.snapshot-id' = '1') */;

Flink 1.18+

-- read the snapshot from specified timestamp
SELECT * FROM t FOR SYSTEM_TIME AS OF TIMESTAMP '2023-01-01 00:00:00';

-- you can also use some simple expressions (see flink document to get supported functions)
SELECT * FROM t FOR SYSTEM_TIME AS OF TIMESTAMP '2023-01-01 00:00:00' + INTERVAL '1' DAY;

A time travel streaming read relies on snapshots, but by default snapshots are only retained for 1 hour, which can prevent you from reading older incremental data.
For that reason, Paimon provides another streaming read mode, scan.file-creation-time-millis, which keeps the files generated after timeMillis.

SELECT * FROM t /*+ OPTIONS('scan.file-creation-time-millis' = '1678883047356') */;

5. Consumer ID
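You can specify a consumer-id when streaming-reading a table.

SELECT * FROM t /*+ OPTIONS('consumer-id' = 'myid') */;

When a Paimon table is read in streaming mode, the next snapshot ID is recorded in the file system. This has several advantages:
1. When the previous job is stopped, a newly started job can continue from the previous progress without restoring from state. The new read starts from the next snapshot ID found in the consumer file. If you do not want this behavior, set 'consumer.ignore-progress' to true.
2. When deciding whether a snapshot has expired, Paimon looks at all consumers of the table in the file system. If any consumer still depends on a snapshot, that snapshot is not deleted by expiration.
3. When no watermark is defined, the Paimon table passes the watermark in the snapshot to the downstream Paimon table, so you can track watermark progress across the whole pipeline.

Note: a consumer prevents snapshot expiration. You can specify 'consumer.expiration-time' to manage the lifetime of consumers.

By default, the consumer uses exactly-once mode to record consumption progress, which strictly ensures that the value recorded in the consumer is the snapshot ID + 1 that all readers have exactly consumed.
You can set consumer.mode to at-least-once to let readers consume snapshots at different rates; the slowest snapshot ID among all readers is then recorded in the consumer. This mode enables additional capabilities, such as watermark alignment.

Note:
1. When no watermark is defined, a consumer in at-least-once mode cannot pass the watermark in the snapshot to the downstream.
2. Because exactly-once mode and at-least-once mode are implemented completely differently, their Flink state is incompatible, and a job cannot be restored from state when switching modes.

You can reset a consumer to a given next snapshot ID, or delete a consumer, by its consumer ID. First stop the streaming job that uses this consumer ID, then run the reset-consumer action job.

Flink

<FLINK_HOME>/bin/flink run \
    /path/to/paimon-flink-action-0.7.0-incubating.jar \
    reset-consumer \
    --warehouse <warehouse-path> \
    --database <database-name> \
    --table <table-name> \
    --consumer_id <consumer-id> \
    [--next_snapshot <next-snapshot-id>] \
    [--catalog_conf <paimon-catalog-conf> [--catalog_conf <paimon-catalog-conf> ...]]

To delete the consumer, do not specify the --next_snapshot parameter.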
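When the reset is scripted, the one difference between resetting and deleting a consumer is whether --next_snapshot is passed. The following Python sketch assembles the argument list using only the flags shown in the command above; the function name and the example paths and names (/opt/flink, file:///tmp/paimon, default, t, myid) are placeholders chosen for illustration, and --catalog_conf is left out for brevity.

```python
import shlex
from typing import List, Optional


def reset_consumer_args(
    flink_home: str,
    action_jar: str,
    warehouse: str,
    database: str,
    table: str,
    consumer_id: str,
    next_snapshot: Optional[int] = None,
) -> List[str]:
    """Build the argv for the reset-consumer action (hypothetical helper).

    Leaving next_snapshot as None omits --next_snapshot, which per the
    text above deletes the consumer instead of resetting it.
    """
    args = [
        f"{flink_home}/bin/flink", "run",
        action_jar,
        "reset-consumer",
        "--warehouse", warehouse,
        "--database", database,
        "--table", table,
        "--consumer_id", consumer_id,
    ]
    if next_snapshot is not None:
        args += ["--next_snapshot", str(next_snapshot)]
    return args


reset_cmd = reset_consumer_args(
    "/opt/flink",                                         # placeholder FLINK_HOME
    "/path/to/paimon-flink-action-0.7.0-incubating.jar",
    "file:///tmp/paimon",                                 # placeholder warehouse
    "default", "t", "myid",
    next_snapshot=5,
)
delete_cmd = reset_cmd[:-2]  # same command without --next_snapshot

# Print the commands for review; nothing is executed here.
print(shlex.join(reset_cmd))
print(shlex.join(delete_cmd))
```

The list can be handed to subprocess.run once the streaming job that uses the consumer has been stopped.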
6. Read Overwrite
By default, a streaming read ignores the commits generated by INSERT OVERWRITE. To read OVERWRITE commits, configure streaming-read-overwrite.

a) Read Parallelism

Flink

By default, the parallelism of a batch read equals the number of splits, and the parallelism of a streaming read equals the number of buckets, but neither exceeds scan.infer-parallelism.max.
Disabling scan.infer-parallelism makes the source use the global parallelism configuration. You can also set the parallelism manually with scan.parallelism.

Key: scan.infer-parallelism
Default: true
Type: Boolean
Description: If it is false, the parallelism of the source is set by the global parallelism. Otherwise, the source parallelism is inferred from the number of splits (batch mode) or the number of buckets (streaming mode).

Key: scan.infer-parallelism.max
Default: 1024
Type: Integer
Description: If scan.infer-parallelism is true, limit the parallelism of the source through this option.

Key: scan.parallelism
Default: (none)
Type: Integer
Description: Define a custom parallelism for the scan source. By default, if this option is not defined, the planner derives the parallelism for each statement individually, also considering the global configuration. If the user enables scan.infer-parallelism, the planner uses the inferred parallelism.
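The inference rule described above can be written out as a small Python model. This is only a sketch of the documented behavior, not Paimon's implementation: the function and parameter names are invented for illustration, the floor of 1 is an assumption, and the manual scan.parallelism override is deliberately left out because its precedence is not spelled out here.

```python
def inferred_source_parallelism(
    mode: str,
    split_count: int,
    bucket_count: int,
    global_parallelism: int,
    infer_parallelism: bool = True,   # models scan.infer-parallelism
    infer_parallelism_max: int = 1024,  # models scan.infer-parallelism.max
) -> int:
    """Toy model of how the source parallelism is chosen."""
    if mode not in ("batch", "streaming"):
        raise ValueError("mode must be 'batch' or 'streaming'")
    if not infer_parallelism:
        # Inference disabled: fall back to the global parallelism.
        return global_parallelism
    # Batch reads follow the split count, streaming reads the bucket count.
    candidate = split_count if mode == "batch" else bucket_count
    # Cap at the configured maximum; never go below 1 (assumption).
    return max(1, min(candidate, infer_parallelism_max))


print(inferred_source_parallelism("batch", 50, 8, 4))
print(inferred_source_parallelism("streaming", 50, 8, 4))
print(inferred_source_parallelism("batch", 5000, 8, 4))
print(inferred_source_parallelism("batch", 50, 8, 4, infer_parallelism=False))
```

In this model a table with many small splits is capped at 1024 source tasks unless scan.infer-parallelism.max is changed.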
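7. Query Optimization
It is strongly recommended to filter on both the partition and the primary key in a query, which speeds up data skipping.
The filter functions that can accelerate data skipping are:
=
<
<=
>
>=
IN (...)
LIKE 'abc%'
IS NULL

Paimon sorts data by primary key, which speeds up point queries and range queries. With a composite primary key, query filters should match the leftmost prefix of the primary key to get a good speedup.
Suppose the table is defined as follows:

CREATE TABLE orders (
    catalog_id BIGINT,
    order_id BIGINT,
    .....,
    PRIMARY KEY (catalog_id, order_id) NOT ENFORCED -- composite primary key
);

Queries get a good speedup when they specify a filter on the leftmost prefix of the primary key:

SELECT * FROM orders WHERE catalog_id=1025;

SELECT * FROM orders WHERE catalog_id=1025 AND order_id=29495;

SELECT * FROM orders
  WHERE catalog_id=1025
  AND order_id>2035 AND order_id<6000;

The following filters cannot accelerate the query well:

SELECT * FROM orders WHERE order_id=29495;

SELECT * FROM orders WHERE catalog_id=1025 OR order_id=29495;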
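
The leftmost-prefix rule can be checked mechanically before a query is run. The Python sketch below is a simplified illustration, not part of Paimon: it only looks at which columns appear in AND-ed filters, it does not parse SQL, and it does not model OR conditions such as the last query above.

```python
from typing import Iterable, Sequence


def matches_leftmost_prefix(primary_key: Sequence[str], filtered_columns: Iterable[str]) -> bool:
    """Return True if the filtered primary-key columns form a leftmost prefix.

    primary_key      -- key columns in declaration order
    filtered_columns -- columns constrained by AND-ed filters in the query
    """
    filtered = set(filtered_columns)
    # Keep only primary-key columns, in key order; other columns are ignored.
    pk_filtered = [column for column in primary_key if column in filtered]
    if not pk_filtered:
        return False
    return pk_filtered == list(primary_key[: len(pk_filtered)])


orders_pk = ["catalog_id", "order_id"]
print(matches_leftmost_prefix(orders_pk, ["catalog_id"]))              # WHERE catalog_id=1025
print(matches_leftmost_prefix(orders_pk, ["catalog_id", "order_id"]))  # both key columns
print(matches_leftmost_prefix(orders_pk, ["order_id"]))                # WHERE order_id=29495
```

For the orders table, the first two calls correspond to the accelerated queries and the third to the filter on order_id alone.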