Example 1. Load large text files




Very often we need to import data in large text files into a database. This could be the bottleneck of the entire data analysis process. To speed up the process, we can load a file in parallel.  


In this and some other examples, we will import a text file USstocks.csv. It contains daily records of thousands of US stocks from 1990 to 2016. You can get the data from multiple sources. The most common source of historical data for US stocks is the Center for Research in Security Prices (CRSP) from the University of Chicago.


In these examples, we will use the following columns out of the 22 columns in USstocks.csv:


Data column name

Description

PERMNO

security identification number

TICKER

ticker symbol

date

calendar date

PRC

price per share

SHROUT

shares outstanding

RET

daily stock return

VOL

daily stock volume


To load USstocks.csv in parallel:



USstocks = ploadText("C:/DolphinDB/Data/USstocks.csv");



In comparison, to load the file in a non-parallel fashion:



USstocks = loadText("C:/DolphinDB/Data/USstocks.csv");



Loading a file in parallel is generally faster than in a non-parallel fashion.



>timer USstocks = ploadText("C:/DolphinDB/Data/USstocks.csv");

Time elapsed: 22415.377 ms


>timer USstocks = loadText("C:/DolphinDB/Data/USstocks.csv");

Time elapsed: 59841.574 ms



Loading a text file in parallel requires a minimum size of available memory that is twice the size of the table. In comparison, loading a text file in a non-parallel fashion requires that the size of the available memory is larger than the size of the table. If we need to import a data file that will produce a table larger than available memory, we can save the data as a partitioned table in a distributed database. Please see Example 2.


When the system loads an external text file, it makes a random sampling of each column and infers its likely data type.


We recommend users check the data types of all the columns in the text file with function extractTextSchema. In this example, as some of the values of RET are 'C', the system determines that the data type of RET is STRING.



schema = extractTextSchema("c:/DolphinDB/Data/USstocks.csv");

schema;


name    type

------- ------

PERMNO  INT

date    DATE

SHRCD   INT

TICKER  SYMBOL

TRDSTAT CHAR

PERMCO  INT

HSICCD  INT

CUSIP   SYMBOL

DLSTCD  SYMBOL

DLPRC   SYMBOL

DLRET   SYMBOL

BIDLO   DOUBLE

ASKHI   DOUBLE

PRC     DOUBLE

VOL     INT

RET     STRING

BID     DOUBLE

ASK     DOUBLE

SHROUT  INT

CFACPR  DOUBLE

CFACSHR DOUBLE

OPENPRC DOUBLE



The following script changes the data type of RET to DOUBLE, and loads the text file into DolphinDB with the correct schema.



update schema set type=`DOUBLE where name=`RET;

USstocks = ploadText("c:/DolphinDB/Data/USstocks.csv", , schema);

 


The first 20 observations of the table USstocks in the Data Browser window:




Download source code here.