DDL - Tables
These statements are used to create and modify data tables in OmniSci.
- Nomenclature Constraints
- CREATE TABLE
- CREATE TABLE AS SELECT
- ALTER TABLE
- DROP TABLE
- DUMP TABLE
- RESTORE TABLE
- TRUNCATE TABLE
- OPTIMIZE TABLE
- VALIDATE CLUSTER
Nomenclature Constraints
Table names must use the NAME format, described in regex notation as:
[A-Za-z_][A-Za-z0-9\$_]*
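For example, the following statements use names that conform to the NAME format, while the commented-out statement does not because its table name starts with a digit (all names here are hypothetical):
CREATE TABLE trips_2017 (id BIGINT);
CREATE TABLE _staging_trips (id BIGINT);
-- Invalid: a table name may not start with a digit
-- CREATE TABLE 2017_trips (id BIGINT);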
CREATE TABLE
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] <table>
  (<column> <type> [NOT NULL] [DEFAULT <value>] [ENCODING <encodingSpec>],
  [SHARD KEY (<column>)],
  [SHARED DICTIONARY (<column>) REFERENCES <table>(<column>)], ...)
  [WITH (<property> = value, ...)];
Create a table named <table>, specifying <columns> and table properties. Table and column names cannot include quotes, spaces, or special characters.
Supported Datatypes
Datatype | Size (bytes) | Notes |
---|---|---|
BIGINT | 8 | Minimum value: -9,223,372,036,854,775,807; maximum value: 9,223,372,036,854,775,807. |
BOOLEAN | 1 | TRUE: 'true', '1', 't'. FALSE: 'false', '0', 'f'. Text values are not case-sensitive. |
DATE * | 4 | Minimum value: -2,147,483,648; maximum value: 2,147,483,647. Range in years: +/-5,883,517 around epoch; maximum date January 1, 5885487 (approximately). Supported formats when using COPY FROM: mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, dd/mmm/yyyy. |
DATE ENCODING FIXED(32) | 4 | Default encoding; equivalent to DATE. Range in years: +/-5,883,517 around epoch; maximum date January 1, 5885487 (approximately). Values range from -185542587187200 to 185542587100800. Supported formats when using COPY FROM: mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, dd/mmm/yyyy. |
DATE ENCODING FIXED(16) | 2 | Range: -32,768 to 32,767. Range in years: +/-90 around epoch (April 14, 1880 to September 9, 2059). Values range from -2831155200 to 2831068800. Supported formats when using COPY FROM: mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, dd/mmm/yyyy. |
DECIMAL | 2, 4, or 8 | Takes precision and scale parameters: DECIMAL(precision,scale). Size depends on precision. |
DOUBLE | 8 | Variable precision. Minimum value: -1.79 x 10^308; maximum value: 1.79 x 10^308. |
FLOAT | 4 | Variable precision. Minimum value: -3.4 x 10^38; maximum value: 3.4 x 10^38. |
INTEGER | 4 | Minimum value: -2,147,483,647; maximum value: 2,147,483,647. |
SMALLINT | 2 | Minimum value: -32,767; maximum value: 32,767. |
TEXT ENCODING DICT | 4 | Maximum cardinality: 2 billion distinct string values. |
TEXT ENCODING NONE | Variable | Size of the string + 6 bytes. |
TIME | 8 | Minimum value: 00:00:00; maximum value: 23:59:59. |
TIMESTAMP | 8 | Linux timestamp from -30610224000 (1/1/1000 00:00:00.000) through 29379542399 (12/31/2900 23:59:59.999). Can also be inserted and stored in human-readable format. |
TINYINT | 1 | Minimum value: -127; maximum value: 127. |
* In OmniSci release 4.4.0 and higher, you can use existing 8-byte DATE columns, but you can create only 4-byte DATE columns (default) and 2-byte DATE columns (see DATE ENCODING FIXED(16)).
For more information, see Datatypes and Fixed Encoding.
For geospatial datatypes, see Geospatial Primitives.
Examples
Create a table named tweets and specify the columns, including type, in the table.
CREATE TABLE IF NOT EXISTS tweets (
  tweet_id BIGINT NOT NULL,
  tweet_time TIMESTAMP NOT NULL ENCODING FIXED(32),
  lat FLOAT,
  lon FLOAT,
  sender_id BIGINT NOT NULL,
  sender_name TEXT NOT NULL ENCODING DICT,
  location TEXT ENCODING DICT,
  source TEXT ENCODING DICT,
  reply_to_user_id BIGINT,
  reply_to_tweet_id BIGINT,
  lang TEXT ENCODING DICT,
  followers INT,
  followees INT,
  tweet_count INT,
  join_time TIMESTAMP ENCODING FIXED(32),
  tweet_text TEXT,
  state TEXT ENCODING DICT,
  county TEXT ENCODING DICT,
  place_name TEXT,
  state_abbr TEXT ENCODING DICT,
  county_state TEXT ENCODING DICT,
  origin TEXT ENCODING DICT,
  phone_numbers BIGINT);
Create a table named customers that accepts a default value of the current timestamp for the timeCreated column:
CREATE TABLE customers (
  accountId TEXT,
  name TEXT,
  timeCreated TIMESTAMP DEFAULT CURRENT_TIMESTAMP);
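As a sketch of how the default is applied, this hypothetical INSERT omits timeCreated, so the column receives the current timestamp:
-- timeCreated is populated with CURRENT_TIMESTAMP automatically
INSERT INTO customers (accountId, name) VALUES ('acct-1001', 'Alice Example');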
Supported Encoding
Encoding | Description |
---|---|
DICT | Dictionary encoding on string columns (default for TEXT columns). Limit of 2 billion unique string values. |
FIXED (bits) | Fixed-length encoding of integer or timestamp columns. See Datatypes and Fixed Encoding. |
NONE | No encoding. Valid only on TEXT columns. No dictionary is created. Aggregate operations are not possible on this column type. |
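To illustrate these encoding specifications, the following sketch (table and column names are hypothetical) applies fixed-length encoding to an integer and a timestamp column, dictionary encoding to a categorical column, and no encoding to a free-form text column:
CREATE TABLE sensor_log (
  device_id BIGINT ENCODING FIXED(32),
  recorded_at TIMESTAMP ENCODING FIXED(32),
  status TEXT ENCODING DICT,
  raw_payload TEXT ENCODING NONE);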
WITH Clause Properties
Property | Description |
---|---|
fragment_size | Number of rows per fragment, the unit of the table used for query processing. Default: 32 million rows; you should not normally need to change this. |
max_rows | Used primarily for streaming datasets to limit the number of rows in a table, to avoid running out of memory or impeding performance. When the max_rows limit is reached, the oldest fragment is removed. When populating a table from a file, make sure that your row count is below the max_rows setting. If you attempt to load more rows at one time than the max_rows setting defines, the records up to the max_rows limit are removed, leaving only the additional rows. Default: 2^62. In a distributed system, the maximum number of rows is calculated as max_rows * leaf_count. In a sharded distributed system, the maximum number of rows is calculated as max_rows * shard_count. |
page_size | Number of I/O page bytes. Default: 1 MB; you should not normally need to change this. |
partitions | Partition strategy option. |
shard_count | Number of shards to create, typically equal to the number of GPUs across which the data table is distributed. |
sort_column | Name of the column on which to sort during bulk import. |
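For example, a hypothetical streaming table might cap its row count and sort on the time column during bulk import; the property values shown here are illustrative, not recommendations:
CREATE TABLE telemetry (
  device_id INTEGER,
  reading DOUBLE,
  recorded_at TIMESTAMP)
WITH (max_rows = 1000000000, sort_column = 'recorded_at');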
Sharding
Sharding partitions a database table across multiple servers so each server has a part of the table with the same columns but with different rows. Partitioning is based on a sharding key defined when you create the table.
Without sharding, the dimension tables involved in a join are replicated and sent to each GPU, which is not feasible for dimension tables with many rows. Specifying a shard key makes it possible for the query to execute efficiently on large dimension tables.
Currently, specifying a shard key is useful only for joins:
- If two tables specify a shard key with the same type and the same number of shards, a join on that key only sends a part of the dimension table column data to each GPU.
- For multi-node installs, the dimension table does not need to be replicated and the join executes locally on each leaf.
Constraints
- A shard key must specify a single column to shard on. There is no support for sharding by a combination of keys.
- Only one shard key can be specified per table.
- Data are partitioned according to the shard key and the number of shards (shard_count).
- A value in the column specified as a shard key is always sent to the same partition.
- The number of shards should be equal to the number of GPUs in the cluster.
- Sharding is allowed on the following column types:
  - DATE
  - INT
  - TEXT ENCODING DICT
  - TIME
  - TIMESTAMP
- Tables must share the dictionary for the column to be involved in sharded joins. If the dictionary is not specified as shared, the join does not take advantage of sharding. Dictionaries are reference-counted and dropped only when the last reference drops.
Recommendations
- Set shard_count to the number of GPUs you eventually want to distribute the data table across.
- Referenced tables must also be shard_count-aligned.
- Sharding should be minimized because it can introduce load skew across resources, compared to when sharding is not used.
Examples
Basic sharding:
CREATE TABLE customers (
  accountId TEXT,
  name TEXT,
  SHARD KEY (accountId))
WITH (shard_count = 4);
Sharding with shared dictionary:
CREATE TABLE transactions (
  accountId TEXT,
  action TEXT,
  SHARD KEY (accountId),
  SHARED DICTIONARY (accountId) REFERENCES customers(accountId))
WITH (shard_count = 4);
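With both tables sharded on accountId and sharing its dictionary, a join on that key executes locally on each shard. A query of this hypothetical form would benefit:
SELECT c.name, COUNT(*) AS txn_count
FROM transactions t
JOIN customers c ON t.accountId = c.accountId
GROUP BY c.name;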
Temporary Tables
Using the TEMPORARY argument creates a table that persists only while the server is live. Temporary tables are useful for storing intermediate result sets that you access more than once.
Temporary tables do not support updates or deletes.
Example
CREATE TEMPORARY TABLE customers (
  accountId TEXT,
  name TEXT,
  timeCreated TIMESTAMP DEFAULT CURRENT_TIMESTAMP);
CREATE TABLE AS SELECT
CREATE TABLE [IF NOT EXISTS] <newTableName> AS (<SELECT statement>) [WITH (<property> = value, ...)];
Create a table with the specified columns, copying any data that meets the SELECT statement criteria.
WITH Clause Properties
Property | Description |
---|---|
fragment_size | Number of rows per fragment, the unit of the table used for query processing. Default: 32 million rows; you should not normally need to change this. |
max_chunk_size | Size of the chunk, a unit of the table used for query processing. Default: 1073741824 bytes (1 GB); you should not normally need to change this. |
max_rows | Used primarily for streaming datasets to limit the number of rows in a table. When the max_rows limit is reached, the oldest fragment is removed. When populating a table from a file, make sure that your row count is below the max_rows setting. If you attempt to load more rows at one time than the max_rows setting defines, the records up to the max_rows limit are removed, leaving only the additional rows. Default: 2^62. |
page_size | Number of I/O page bytes. Default: 1 MB; you should not normally need to change this. |
partitions | Partition strategy option. |
vacuum | Formats the table to more efficiently handle DELETE requests. The only parameter available is delayed. Rather than immediately removing deleted rows, vacuum marks items to be deleted, and they are removed at an optimal time. |
Examples
Create the table newTable. Populate the table with all information from the table oldTable, effectively creating a duplicate of the original table.
CREATE TABLE newTable AS (SELECT * FROM oldTable);
Create a table named trousers. Populate it with data from the columns name, waist, and inseam from the table wardrobe.
CREATE TABLE trousers AS (SELECT name, waist, inseam FROM wardrobe);
Create a table named cosmos. Populate it with data from the columns star and planet from the table universe where planet has the class M.
CREATE TABLE IF NOT EXISTS cosmos AS (SELECT star, planet FROM universe WHERE class='M');
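A CREATE TABLE AS SELECT statement can also set WITH properties. For example, this hypothetical copy of older rows enables delayed vacuuming because frequent deletes are expected:
CREATE TABLE tweets_archive AS (SELECT * FROM tweets WHERE tweet_time < '2015-01-01 00:00:00')
WITH (vacuum = 'delayed');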
ALTER TABLE
ALTER TABLE <table> RENAME TO <table>;
ALTER TABLE <table> RENAME COLUMN <column> TO <column>;
ALTER TABLE <table> ADD [COLUMN] <column> <type> [NOT NULL] [ENCODING <encodingSpec>];
ALTER TABLE <table> ADD (<column> <type> [NOT NULL] [ENCODING <encodingSpec>], ...);
ALTER TABLE <table> ADD (<column> <type> DEFAULT <value>);
Examples
Rename the table tweets to retweets.
ALTER TABLE tweets RENAME TO retweets;
Rename the column source to device in the table retweets.
ALTER TABLE retweets RENAME COLUMN source TO device;
Add the column lang to the table tweets using a TEXT ENCODING DICTIONARY.
ALTER TABLE tweets ADD COLUMN lang TEXT ENCODING DICT;
Add the columns lang and encode to the table tweets using a TEXT ENCODING DICTIONARY for each.
ALTER TABLE tweets ADD (lang TEXT ENCODING DICT, encode TEXT ENCODING DICT);
Add the column pt_dropoff to the table tweets with a default value of point(0 0).
ALTER TABLE tweets ADD pt_dropoff POINT DEFAULT 'point(0 0)';
DROP TABLE
DROP TABLE [IF EXISTS] <table>;
Deletes the table structure, all data from the table, and any dictionary content unless it is a shared dictionary.
Example
DROP TABLE IF EXISTS tweets;
DUMP TABLE
DUMP TABLE <table> TO '<filepath>' [WITH (COMPRESSION='<compression_program>')];
Archives the data and dictionary files of table <table> to the file <filepath>.
Valid values for <compression_program> include:
- gzip (default)
- lz4
- none
If you do not choose a compression option, the system uses gzip if it is available. If gzip is not installed, the file is not compressed.
The file path must be enclosed in single quotes.
Example
DUMP TABLE tweets TO '/opt/archive/tweetsBackup.gz' WITH (COMPRESSION='gzip');
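Compression can also be disabled explicitly; the archive path below is hypothetical:
DUMP TABLE tweets TO '/opt/archive/tweetsBackup.tar' WITH (COMPRESSION='none');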
RESTORE TABLE
RESTORE TABLE <table> FROM '<filepath>' [WITH (COMPRESSION='<compression_program>')];
Restores the data and dictionary files of table <table> from the file at <filepath>. If you specified a compression program when you used the DUMP TABLE command, you must specify the same compression method during RESTORE.
Restoring a table decompresses and then reimports the table. You must have enough disk space for both the new table and the archived table, as well as enough scratch space to decompress the archive and reimport it.
The file path must be enclosed in single quotes.
Example
RESTORE TABLE tweets FROM '/opt/archive/tweetsBackup.gz' WITH (COMPRESSION='gzip');