Configurations
SmartLoader main configuration file is default.yml which is located in SmartLoader’s root directory. Configurations from default.yml are inherited into configuration files created in the configs folder. Smartloader will use a different configuration file for each different CSV file or set of files to be loaded.
Configuration File properties
The configuration file should be located in the configs folder and should have the same name of the CSV file to be imported with a .yml extension. For example, importing a CSV file called myfile.csv should have a configuration file in the configs folder called myfile.yml.
Configuration files have several properties that can be set to control the data loading. These properties can be set and used using the following method listed in order of priority:
As command line arguments
Inserted in the configuration file
As environment variables (Configuration options have to be proceeded by SMARTLOADER_)
Example if used as a command line arguments:
java -jar smartloader.jar -generateConfigOnly -createTable=true -linesToAnalyze=1000
Example if used inside the config file:
generateConfigOnly
createTable:true
linesToAnalyze:1000
Example fi used as an environment variable:
SMARTLOADER_createTable=True
SMARTLOADER_linesToAnalyze=1000
!!CAUTION!!
ALL CONFIGURATION FILE PROPERTIES ARE CASE SENSITIVE
CONFIG FILES CANNOT CONTAIN TABS. ONLY SPACES.
List of properties:
-addAuditSid Add audit column if it does not
exist(works only in 'generate
config' step)
-addDateId add date column if it does not
exist
-archiveFolderPathOverride <arg> Specify path to the archive
folder regardless where workDir
or dataDir settings are pointing
-auditSidColumnNumber <arg> audit_sid column number in the source file
-auditTableName Set the name of the table to receive
audit records. Default is dim_audit.
-auto Generate all the configs and
insert the data to the database
-charset <arg> Specify the charset of the
incoming file. For example
'utf-8','big5','cp1251'. Default
'utf-8'
-createTable Create table if it does not
exist(works only if not -auto)
-customArgs <arg> Array of custom arguments that
should be arranged in pairs. This
arguments will be packed into the
map and will be accessible from
groovy expressions. Example
[-customArgs 1 red 2 green 3
blue] this args will be converted
to map [1:'red', 2:'green',
3:'blue']
-daemon run application continuously in
background
-dataDir <arg> the program data directory for
incoming, archive and rejected
folders
-dbPassword <arg> the target database password
-dbUrl <arg> the target database url
connection string
-dbUser <arg> the target database user name
-debug Increase verbosity of error
logging
-delimiter <arg> column delimiter
-direct <arg> Loading to vertica will
use 'DIRECT' loading method
-doNotWriteAudit <arg> If set - do not write audit_sid
to dim_audit table
-enclosingSymbol <arg> if your data is wrapped in
quotes, set this property
to '"'
-encrypt <arg> encrypt input string
-filesToProcess <arg> Specify file names to process, in
case if you want to process files
from different folder
-flexTable <arg> If this property will be set to
true - file will be parsed like
json file, and will be loaded to
flex table
-generateConfigOnly Generate config only
-generateSqlFile Generates separated files with
"create table" sql
-generateSqlFileOnly Generates separated files with
"create table" sql for existing
configuration files and quits. Files
will have the same name as the configuration
file with a .SQL extension
-header <arg> Count of the lines in the header,
disables automatic header
recognition:
-header -1: Default value. Header will be recognized by smartLoader
-header 0: Source file has no header
-header 1 (or any positive number): Number of lines in the header
-help Print this help
-incomingFolderPathOverride <arg> Specify path to the incoming
folder regardless where workDir
or dataDir settings are pointing
-inputFileDecryptionSecretPath <arg> Path in Vault where there are all
setting for decryption of
encrypted files in incoming
folder
-instanceName <arg> Instance name that is used for
the locks
-jsonFlattenArrays <arg> If provided json file contains
nested arrays structures, parser
may try to make them flat
-jsonFlattenMaps <arg> If provided json file contains
nested maps structures, parser
may try to make them flat
-keepAuditSid Do not update audit sid column
with a new value
-linesToAnalyze <arg> number of lines to recognize the
structure of the file
-loadingMode <arg> Specify 'by_time' to upload the
file by modify time of the files,
from the earliest to newest.
Specify 'by_category' to first
sort the files in batches by
tables where the files should be
downloaded, and then sort the
batches by modify time
-moveToArchive <arg> Move file to archive or not(by
default=true)
-notificationEmail <arg> Email where the report will be
sent
-nullValue <arg> Value of the column that will be
treated as empty
-onEndExpression Used to execute Groovy code once when SmartLoader
finishes processing input file
-onStartExpression Used to execute Groovy code once when
SmarLoader begins processing the input
file
-overrideExtension <arg> If you need, you can manually set
extension for the file, for
example process .dat files (By
default smartLoader does not know
what linereader to use for such
files)
-parallelism <arg> Number of simultaneous loaders
-processId <arg> To be used in dim_audit table
-quotes <arg> Value quotes. This is used if field
values contain delimiter characters. it will wrap
the field contents with double quotes.
For example, if the file is comma-delimited
and one of the fields contains text that
has a commas, then setting this property
to (quotes: "\"") will cause SmartLoader
to ignore the commas.
-removeProcessedFiles <arg> If true files will not be
archived after the processing,
but just removed. If false -
files moved to the archive folder
-skipColumns <columns> columns are not to insert to the
table (one option - one column,
can be used several times)
-skipExtension <arg> If the file name ends with this
string, it will be skipped by the
smartloader. Default - ".temp"
-skipLines <arg> how many lines should be skipped
before analyzing or loading
-smtpHost <arg> smtp host for sending email.
Default 'localhost'
-smtpPassword <arg> smtp password of the notification
sender
-smtpUser <arg> smtp user of the notification
sender
-stringMinSize <arg> minimal length of a string in the
schema
-table <arg> The target table name where the data
will be loaded
-tablePrefix <arg> Prefix that will be added to the
table name
-useColumnsListFromDb <arg> If this property is set to true -
Columns list in configuration
file can be empty. Column names
will be obtained from the db
-useColumnsListFromIncomingFile <arg> If this property is set to true -
Columns list in configuration
file can be empty. Column names
will be obtained from the file
header
-useConfigFromPath <arg> Disable searching config file
based on file name, but instead
force smartLoader to use this
provided config for all files
-vaultPassword <arg> Password in Vault that stores
encryption keys
-vaultRoleId <arg> RoleId in Vault that stores
encryption keys
-vaultUrl <arg> Url to Vault that stores
encryption keys
-workDir <arg> The program working directory
Environment Variables only
SMARTLOADER_LOGS_FORMAT: Defines SmartLoader logs format, set to “%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n” by default.
SMARTLOADER_LOGS_LEVEL: Defines SmarLoader logs level, set to INFO by default
You can find documentation on logger configuration on its official page: http://logback.qos.ch/documentation.html
SMARTLOADER_LOGS_EPHEMERAL: Controls file logging. By default it is set to 1 that disables file logs. Set it to 0 to enable file logging
Working With Fields
Field Types
SmartLoader automatically detects delimiters, field names and filed types and generates corresponding entires in the configuration file using the following format:
columns:
- !field
additionalData: yyyy-dd-MM
name: Date
size: 32
type: Timestamp
- !field
name: underlying
size: 16
type: String
It is always advised to examine the data for each field and make any adjustments if necessary. SmartLoader can detect the following field types:
Timestamp
String
Integer
Long
Numeric
SmartLoader allows the creation of calculated new fields based on the CSV file underlying columns. This can be accomplished using the field expression: property. For example, assume a csv file myfile.csv with one column: A 1 2 3
Let’s add a new column “B” with value {field A * 2}. We accomplish this with the following configuration change:
---
columns:
- !field
name: B
size: 25
type: String
generated: true
expression: 'toInt(f("a"))*2'
- !field
name: a
size: 7
type: Integer
…
We used the following two new field properties:
generated:true - this tells smartloader that this is a generated field
expression: ‘toInt(f(”a”))*2’ - this is the calculation that SmartLoader will use to populate the value of the newly created field. (Refer to the API Reference Guide documentation for more information about the toInt() and f() functions)
Using the above configuration, SmartLoader will load the following data set into the database: A,B 1,2 2,4 3,6
Skippig Fields
SmartLoader allows for fields that are present in the CSV file to be skipped and excluded from being loaded in the database. This can be accomplished by using setting property skipColumns and listing the names of the fields to be excluded:
...
skipColumns:
- _SKIP_A
- _SKIP_B
- _SKIP_C
- _SKIP_D
...
A good practice (but is not required) is to rename the field name to be excluded to start with “SKIP”. For example, if skipping field B, then we recommend to name it “SKIP_B”. This allows for better field management as you can grep “SKIP” and get the list of fields to be skipped.
Working With Multiple Files
We covered so far importing a single CSV file where we had to create a corresponding configuration file that has the same name. Such approach will not scale when processing multiple incoming text files with the same structure but different names. These are cases where the filename consists of a fixed name and variable part such as a count or a date (i.e. http_MON_344.csv, http_TUE_343.csv, http_WED_342.csv, etc..). Following the method we outlined so far will require the creation of as many configuration files as there are CSV files. SmartLoader simplifies this process with the creation of routing.cfg configuration file.
Routing.cfg Properties
the routing.cfg file is located in the configs folder and is intended to map the relationship between input files and their corresponding configuration files. It has the following structure:
fileNameRegularExpression = configurationFileName
fileNameRegularExpression: is a regular expression with a pattern that matches the desired files
configurationFileName: the name of the configuration file that maps to the CSV files that match the pattern
Given the above example where we are processing the following multiple files with the same file structure:
http_MON_344.csv
http_TUE_343.csv
http_WED_342.csv
STEP1: Create a config file based on a single file and name it http.yml
STEP2: Create a routing.cfg file with the following entry: http_.*=http
The entry is using a regular expression to select files that start with {http_} + {any character or digit that follow}. SmartLoader will use the http.yml configuration file to load any file with the name that matches the regular expression pattern. The routing.cfg configuration file can contain multiple entries of file pattern selections and configuration filename combinations.
… http_.*=http .+-options-.+=options …
NOTE:All entries in rounting.cfg must be in lower case
Routing.cfg also supports conditional execution statements using the following format:
?{expression}pattern=config
?{expression} is a Groovy expression that, if it evaluates to true, the pattern=config statement will execute
pattern=config: is the pattern defined using regular expression that corresponds with the defined config file
For example, given the previous example, if we want to only execute the pattern http_.=http on Mondays. we would use the following statement in routing.cfg:
?{Date.parse(”yyyy-MM-dd”, “2011-03-07”).format(”EEE”) == “Mon”} http_MON*=http
The expression is a Groovy expression that checks if today’s date is Monday and execute the pattern if true
Working With Compressed Files
SmartLoader natively reads compressed files in the following formats:
.ZIP (Files must be compressed with the DEFLATE option)
.JZIP
.XZ
Logging
Smartloader generates a data loading log for each processed file and saves the log in the logs folder using the same filename with a .log extension. The log file contains the number of records read and or rejected. Check the files in the rejected folder to gain more insights about why the file or records were rejected.
Rejected Files
SmartLoader will reject records or entire files that do not conform to the rules defined in the configuration file. SamrtLoader will create a file with the same filename as the source file with a .rejected extension. For example, if the source filename is myfile.zip, the rejected filename will be myfile.zip.rejected
NOTE: Rejected files are written in CSV format even though the source file may be compressed. To view rejected files, change the extension to CSV and view it using a text editor