Create Databases and Tables

Create Databases

Use function database to create a new database or to get the handle of an existing database.

Syntax of function database: database(directory, [partitionType], [partitionScheme], [locations], [engine=’OLAP’], [atomic=’TRANS’])

To establish a distributed database in the distributed file system, parameter directory should start with “dfs://”.

When we create a new distributed database, we need to specify partitionType and partitionScheme. When we reopen an existing distributed database, we only need to specify directory. We cannot overwrite an existing distributed database with a different partitionType or partitionScheme. We may, however, append new partitions to a value or a range domain.

The table below shows the partition types. PartitionScheme is an integer scalar for the sequential domain, a tuple for the hash domain, and a vector for all other domains. The interpretation of the partition scheme depends on the partition type. partitionScheme supports the following data types: CHAR, SHORT, INT, LONG, DATE, MONTH, TIME, MINUTE, SECOND, DATETIME, DATEHOUR and SYMBOL.

Partition Type	Partition Type Symbol	Partition Scheme
sequential domain	SEQ	An integer scalar. It means the number of partitions.
range domain	RANGE	A vector. Any two adjacent elements of the vector define the range for a partition.
hash domain	HASH	A tuple. The first element is the data type of partitioning column. The second element is the number of partitions.
value domain	VALUE	A vector. Each element of the vector defines a partition.
list domain	LIST	A vector. Each element of the vector defines a partition.
composite domain	COMPO	A vector. Each element of the vector is a database handle.

After we create a database, we cannot revise the partition type or partition scheme with function database.

Create Dimension Table

A dimension table is a non-partition table in a distributed database. It is used to store data sets in small size that are not frequently updated. We can use the function createTable to create a dimension table.

Create Partitioned Tables

Use function createPartitionedTable to create a distributed table.

Syntax of function createPartitionedTable: createPartitionedTable(dbHandle, table, tableName, [partitionColumns], [compressMethods])

We need to provide a model table whose schema is adopted by the newly created distributed table. All partition domains other than the sequential domain must specify partition column or columns.

$ db=database("dfs://db1",VALUE,1 2 3)
$ timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
$ sym = `C`MS`MS`MS`IBM`IBM`C`C`C
$ price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
$ qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
$ t = table(timestamp, sym, qty, price)

$ dt=db.createTable(t,`dt).append!(t)
$ select * from dt;

timestamp	sym	qty	price
09:34:07	C	2200	49.6
09:36:42	MS	1900	29.46
09:36:51	MS	2100	29.52
09:36:59	MS	3200	30.02
09:32:47	IBM	6800	174.97
09:35:26	IBM	5400	175.23
09:34:16	C	1300	50.76
09:34:26	C	2500	50.32
09:38:12	C	8800	51.29

Range Domain

In a range domain (RANGE), partitions are determined by ranges specified by any two adjacent elements of the partition scheme vector. The starting value is inclusive and the ending value is exclusive. A row with value of the partitioning column falling within a range belongs to the partition defined by this range.

Example 1. Create a distributed database dfs://rangedb of range domain and save an in-memory table as a distributed table in the database.

$ n=1000000
$ ID=rand(10, n)
$ x=rand(1.0, n)
$ t=table(ID, x)
$ db=database("dfs://rangedb", RANGE,  0 5 10)
$ pt=db.createPartitionedTable(t, `pt, `ID)
$ pt.append!(t)
$ select count(x) from pt;

count_x
1000000

The database above has 2 partitions: [0,5) and [5,10). Table t is saved as a partitioned table pt with the partitioning column of ID.

To get a database handle, use function database and only specify the database path.

$ db=database("dfs://rangedb")

Value Domain

$ n=1000000
$ month=take(2000.01M..2016.12M, n)
$ x=rand(1.0, n)
$ t=table(month, x)
$ db=database("dfs://valuedb", VALUE, 2000.01M..2016.12M)
$ pt = db.createPartitionedTable(t, `pt, `month)
$ pt.append!(t)
$ select count(x) from pt;

count_x
1000000

The database above has 17*12=204 partitions. Each partition is a month between January 2000 and December 2016. Table t is saved as a partitioned table pt with the partitioning column of month.

For a database with value domain, we can increase the number of partitions after it is created. Please check addValuePartitions for details.

Hash Domain

In a hash domain (HASH), we need to specify the data type of the partitioning column and the number of partitions.

$ n=1000000
$ ID=rand(10, n)
$ x=rand(1.0, n)
$ t=table(ID, x)
$ db=database("dfs://hashdb", HASH,  [INT, 2])
$ pt = db.createPartitionedTable(t, `pt, `ID)
$ pt.append!(t)
$ select count(x) from pt;

count_x
1000000

The database dfs://hashdb above has 2 partitions. Table t is saved as a partitioned table pt with the partitioning column of ID.

List Domain

In a list domain (LIST), each element of the partition scheme vector corresponds to a partition.

$ n=1000000
$ ticker = rand(`MSFT`GOOG`FB`ORCL`IBM,n);
$ x=rand(1.0, n)
$ t=table(ticker, x)
$ db=database("dfs://listdb", LIST, [`IBM`ORCL`MSFT, `GOOG`FB])
$ pt = db.createPartitionedTable(t, `pt, `ticker)
$ pt.append!(t)
$ select count(x) from pt;

count_x
1000000

The database dfs://listdb above has 2 partitions. The first partition contains 3 tickers and the second contains 2 tickers.

Composite Domain

A composite domain (COMPO) can have 2 or 3 partitioning columns. Each partitioning column can be of range, value, list or hash domain.

$ n=1000000
$ ID=rand(100, n)
$ dates=2017.08.07..2017.08.11
$ date=rand(dates, n)
$ x=rand(10.0, n)
$ t=table(ID, date, x)

$ dbDate = database(, VALUE, 2017.08.07..2017.08.11)
$ dbID = database(, RANGE, 0 50 100)
$ db = database("dfs://compodb", COMPO, [dbDate, dbID])
$ pt = db.createPartitionedTable(t, `pt, `date`ID)
$ pt.append!(t)
$ select count(x) from pt;

count_x
1000000

The database dfs://compodb above has 2 levels of partitions. One level uses the value domain with 5 partitions. The other level uses the range domain with 2 partitions. In total the database has 5*2=10 partitions.