Configurations

SmartLoader main configuration file is default.yml which is located in SmartLoader’s root directory. Configurations from default.yml are inherited into configuration files created in the configs folder. Smartloader will use a different configuration file for each different CSV file or set of files to be loaded.

Configuration File properties

The configuration file should be located in the configs folder and should have the same name of the CSV file to be imported with a .yml extension. For example, importing a CSV file called myfile.csv should have a configuration file in the configs folder called myfile.yml.

Configuration files have several properties that can be set to control the data loading. These properties can be set and used using the following method listed in order of priority:

  1. As command line arguments

  2. Inserted in the configuration file

  3. As environment variables (Configuration options have to be proceeded by SMARTLOADER_)

Example if used as a command line arguments:

java -jar smartloader.jar -generateConfigOnly -createTable=true -linesToAnalyze=1000

Example if used inside the config file:

  • generateConfigOnly

  • createTable:true

  • linesToAnalyze:1000

Example fi used as an environment variable:

  • SMARTLOADER_createTable=True

  • SMARTLOADER_linesToAnalyze=1000

!!CAUTION!!

  • ALL CONFIGURATION FILE PROPERTIES ARE CASE SENSITIVE

  • CONFIG FILES CANNOT CONTAIN TABS. ONLY SPACES.

List of properties:

 -addAuditSid                            Add audit column if it does not
                                         exist(works only in 'generate
                                         config' step)

 -addDateId                              add date column if it does not
                                         exist

 -archiveFolderPathOverride <arg>        Specify path to the archive
                                         folder regardless where workDir
                                         or dataDir settings are pointing

 -auditSidColumnNumber <arg>             audit_sid column number in the source file

 -auditTableName                         Set the name of the table to receive
                                         audit records. Default is dim_audit.

 -auto                                   Generate all the configs and
                                         insert the data to the database

 -charset <arg>                          Specify the charset of the
                                         incoming file. For example
                                         'utf-8','big5','cp1251'. Default
                                         'utf-8'

 -createTable                            Create table if it does not
                                         exist(works only if not -auto)

 -customArgs <arg>                       Array of custom arguments that
                                         should be arranged in pairs. This
                                         arguments will be packed into the
                                         map and will be accessible from
                                         groovy expressions. Example
                                         [-customArgs 1 red 2 green 3
                                         blue] this args will be converted
                                         to map [1:'red', 2:'green',
                                         3:'blue']

 -daemon                                 run application continuously in
                                         background

 -dataDir <arg>                          the program data directory for
                                         incoming, archive and rejected
                                         folders

 -dbPassword <arg>                       the target database password

 -dbUrl <arg>                            the target database url
                                         connection string

 -dbUser <arg>                           the target database user name

 -debug                                  Increase verbosity of error
                                         logging

 -delimiter <arg>                        column delimiter

 -direct <arg>                           Loading to  vertica will
                                         use 'DIRECT' loading method

 -doNotWriteAudit <arg>                  If set - do not write audit_sid
                                         to dim_audit table

 -enclosingSymbol <arg>                  if your data is wrapped in
                                         quotes, set this property
                                         to '"'

 -encrypt <arg>                          encrypt input string

 -filesToProcess <arg>                   Specify file names to process, in
                                         case if you want to process files
                                         from different folder

 -flexTable <arg>                        If this property will be set to
                                         true - file will be parsed like
                                         json file, and will be loaded to
                                         flex table

 -generateConfigOnly                     Generate config only

 -generateSqlFile                        Generates separated files with
                                         "create table" sql

 -generateSqlFileOnly                    Generates separated files with
                                         "create table" sql for existing
                                         configuration files and quits. Files
                                         will have the same name as the configuration
                                         file with a .SQL extension

 -header <arg>                           Count of the lines in the header,
                                         disables automatic header
                                         recognition:
                                         -header -1: Default value. Header will be recognized by smartLoader
                                         -header 0: Source file has no header
                                         -header 1 (or any positive number): Number of lines in the header  

 -help                                   Print this help

 -incomingFolderPathOverride <arg>       Specify path to the incoming
                                         folder regardless where workDir
                                         or dataDir settings are pointing

 -inputFileDecryptionSecretPath <arg>    Path in Vault where there are all
                                         setting for decryption of
                                         encrypted files in incoming
                                         folder

 -instanceName <arg>                     Instance name that is used for
                                         the locks

 -jsonFlattenArrays <arg>                If provided json file contains
                                         nested arrays structures, parser
                                         may try to make them flat

 -jsonFlattenMaps <arg>                  If provided json file contains
                                         nested maps structures, parser
                                         may try to make them flat

 -keepAuditSid                           Do not update audit sid column
                                         with a new value

 -linesToAnalyze <arg>                   number of lines to recognize the
                                         structure of the file

 -loadingMode <arg>                      Specify 'by_time' to upload the
                                         file by modify time of the files,
                                         from the earliest to newest.
                                         Specify 'by_category' to first
                                         sort the files in batches by
                                         tables where the files should be
                                         downloaded, and then sort the
                                         batches by modify time

 -moveToArchive <arg>                    Move file to archive or not(by
                                         default=true)

 -notificationEmail <arg>                Email where the report will be
                                         sent

 -nullValue <arg>                        Value of the column that will be
                                         treated as empty

 -onEndExpression                        Used to execute Groovy code once when SmartLoader
                                         finishes processing input file

 -onStartExpression                      Used to execute Groovy code once when
                                         SmarLoader begins processing the input
                                         file

 -overrideExtension <arg>                If you need, you can manually set
                                         extension for the file, for
                                         example process .dat files (By
                                         default smartLoader does not know
                                         what linereader to use for such
                                         files)

 -parallelism <arg>                      Number of simultaneous loaders

 -processId <arg>                        To be used in dim_audit table

 -quotes <arg>                           Value quotes. This is used if field
                                         values contain delimiter characters. it will wrap
                                         the field contents with double quotes.
                                         For example, if the file is comma-delimited
                                         and one of the fields contains text that
                                         has a commas, then setting this property
                                         to (quotes: "\"") will cause SmartLoader
                                         to ignore the commas.


 -removeProcessedFiles <arg>             If true files will not be
                                         archived after the processing,
                                         but just removed. If false -
                                         files moved to the archive folder

 -skipColumns <columns>                  columns are not to insert to the
                                         table (one option - one column,
                                         can be used several times)

 -skipExtension <arg>                    If the file name ends with this
                                         string, it will be skipped by the
                                         smartloader. Default - ".temp"

 -skipLines <arg>                        how many lines should be skipped
                                         before analyzing or loading

 -smtpHost <arg>                         smtp host for sending email.
                                         Default 'localhost'

 -smtpPassword <arg>                     smtp password of the notification
                                         sender

 -smtpUser <arg>                         smtp user of the notification
                                         sender

 -stringMinSize <arg>                    minimal length of a string in the
                                         schema

 -table <arg>                            The target table name where the data
                                         will be loaded

 -tablePrefix <arg>                      Prefix that will be added to the
                                         table name

 -useColumnsListFromDb <arg>             If this property is set to true -
                                         Columns list in configuration
                                         file can be empty. Column names
                                         will be obtained from the db

 -useColumnsListFromIncomingFile <arg>   If this property is set to true -
                                         Columns list in configuration
                                         file can be empty. Column names
                                         will be obtained from the file
                                         header

 -useConfigFromPath <arg>                Disable searching config file
                                         based on file name, but instead
                                         force smartLoader to use this
                                         provided config for all files

 -vaultPassword <arg>                    Password in Vault that stores
                                         encryption keys

 -vaultRoleId <arg>                      RoleId in Vault that stores
                                         encryption keys

 -vaultUrl <arg>                         Url to Vault that stores
                                         encryption keys

 -workDir <arg>                          The program working directory

Environment Variables only

SMARTLOADER_LOGS_FORMAT: Defines SmartLoader logs format, set to “%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n” by default.

SMARTLOADER_LOGS_LEVEL: Defines SmarLoader logs level, set to INFO by default

You can find documentation on logger configuration on its official page: http://logback.qos.ch/documentation.html

SMARTLOADER_LOGS_EPHEMERAL: Controls file logging. By default it is set to 1 that disables file logs. Set it to 0 to enable file logging

Working With Fields

Field Types

SmartLoader automatically detects delimiters, field names and filed types and generates corresponding entires in the configuration file using the following format:

columns:
    - !field
        additionalData: yyyy-dd-MM
        name: Date
        size: 32
        type: Timestamp
    - !field
        name: underlying
        size: 16
        type: String

It is always advised to examine the data for each field and make any adjustments if necessary. SmartLoader can detect the following field types:

  • Timestamp

  • String

  • Integer

  • Long

  • Numeric

SmartLoader allows the creation of calculated new fields based on the CSV file underlying columns. This can be accomplished using the field expression: property. For example, assume a csv file myfile.csv with one column: A 1 2 3

Let’s add a new column “B” with value {field A * 2}. We accomplish this with the following configuration change:

 ---
columns:
    - !field
        name: B
        size: 25
        type: String
        generated: true
        expression: 'toInt(f("a"))*2'
    - !field
        name: a
        size: 7
        type: Integer
…

We used the following two new field properties:

  • generated:true - this tells smartloader that this is a generated field

  • expression: ‘toInt(f(”a”))*2’ - this is the calculation that SmartLoader will use to populate the value of the newly created field. (Refer to the API Reference Guide documentation for more information about the toInt() and f() functions)

Using the above configuration, SmartLoader will load the following data set into the database: A,B 1,2 2,4 3,6

Skippig Fields

SmartLoader allows for fields that are present in the CSV file to be skipped and excluded from being loaded in the database. This can be accomplished by using setting property skipColumns and listing the names of the fields to be excluded:

...
skipColumns:
 - _SKIP_A
 - _SKIP_B
 - _SKIP_C
 - _SKIP_D
...

A good practice (but is not required) is to rename the field name to be excluded to start with “SKIP”. For example, if skipping field B, then we recommend to name it “SKIP_B”. This allows for better field management as you can grep “SKIP” and get the list of fields to be skipped.

Working With Multiple Files

We covered so far importing a single CSV file where we had to create a corresponding configuration file that has the same name. Such approach will not scale when processing multiple incoming text files with the same structure but different names. These are cases where the filename consists of a fixed name and variable part such as a count or a date (i.e. http_MON_344.csv, http_TUE_343.csv, http_WED_342.csv, etc..). Following the method we outlined so far will require the creation of as many configuration files as there are CSV files. SmartLoader simplifies this process with the creation of routing.cfg configuration file.

Routing.cfg Properties

the routing.cfg file is located in the configs folder and is intended to map the relationship between input files and their corresponding configuration files. It has the following structure:

fileNameRegularExpression = configurationFileName

  • fileNameRegularExpression: is a regular expression with a pattern that matches the desired files

  • configurationFileName: the name of the configuration file that maps to the CSV files that match the pattern

Given the above example where we are processing the following multiple files with the same file structure:

  • http_MON_344.csv

  • http_TUE_343.csv

  • http_WED_342.csv

    STEP1: Create a config file based on a single file and name it http.yml

    STEP2: Create a routing.cfg file with the following entry: http_.*=http

The entry is using a regular expression to select files that start with {http_} + {any character or digit that follow}. SmartLoader will use the http.yml configuration file to load any file with the name that matches the regular expression pattern. The routing.cfg configuration file can contain multiple entries of file pattern selections and configuration filename combinations.

… http_.*=http .+-options-.+=options …

NOTE:All entries in rounting.cfg must be in lower case

Routing.cfg also supports conditional execution statements using the following format:

?{expression}pattern=config

  • ?{expression} is a Groovy expression that, if it evaluates to true, the pattern=config statement will execute

  • pattern=config: is the pattern defined using regular expression that corresponds with the defined config file

For example, given the previous example, if we want to only execute the pattern http_.=http on Mondays. we would use the following statement in routing.cfg:

?{Date.parse(”yyyy-MM-dd”, “2011-03-07”).format(”EEE”) == “Mon”} http_MON*=http

The expression is a Groovy expression that checks if today’s date is Monday and execute the pattern if true

Working With Compressed Files

SmartLoader natively reads compressed files in the following formats:

  • .ZIP (Files must be compressed with the DEFLATE option)

  • .JZIP

  • .XZ

Logging

Smartloader generates a data loading log for each processed file and saves the log in the logs folder using the same filename with a .log extension. The log file contains the number of records read and or rejected. Check the files in the rejected folder to gain more insights about why the file or records were rejected.

Rejected Files

SmartLoader will reject records or entire files that do not conform to the rules defined in the configuration file. SamrtLoader will create a file with the same filename as the source file with a .rejected extension. For example, if the source filename is myfile.zip, the rejected filename will be myfile.zip.rejected

NOTE: Rejected files are written in CSV format even though the source file may be compressed. To view rejected files, change the extension to CSV and view it using a text editor