Configuring Parsers with gsqlcmd

Contents:

Parsing HTML
Parsing XML
Parsing JSON
Parsing Plain Text
Parsing CSV

Parsing HTML

gsqlcmd allows you to extract data from HTML tables easily.

To extract the required table data, use the /rootPath option.

For example, to download dividends from the nasdaq.com website, run:

gsqlcmd download http://www.nasdaq.com/symbol/aapl/dividend-history aapl.htm

Next, get the available tables with the following command:

gsqlcmd show-html-tables aapl.htm tables.htm /openOutput

The resulting tables.htm file will look like this:

Example of the show-html-tables mode for dividends from nasdaq.com

You’ll notice that the dividends are in table 3.

To convert the downloaded file, use:

gsqlcmd convert aapl.htm dividends.csv /rootPath=3 /quoteChar=

Alternatively, to extract data directly from the web page, use:

gsqlcmd convert http://www.nasdaq.com/symbol/aapl/dividend-history dividends.csv /rootPath=3 /quoteChar=

The resulting dividends.csv will contain data like this:

ExEffDate;Type;CashAmount;DeclarationDate;RecordDate;PaymentDate
2019-02-08;Cash;0.73;2019-01-29;2019-02-11;2019-02-14
2018-11-08;Cash;0.73;2018-11-01;2018-11-12;2018-11-15
2018-08-10;Cash;0.73;2018-07-31;2018-08-13;2018-08-16
2018-05-11;Cash;0.73;2018-05-01;2018-05-14;2018-05-17
2018-02-09;Cash;0.63;2018-02-01;2018-02-12;2018-02-15

gsqlcmd usually detects the correct table automatically. So, you can try a simpler command first:

gsqlcmd convert aapl.htm dividends.csv

If the page contains multiple tables with the required data, run the convert command for each table.

If the tables share the same structure, specify multiple indexes in the /rootPath option, separated by commas, like this: /rootPath=20,21,22.

If a table lacks headers, use the /noSourceHeaders option. In this case, gsqlcmd will name columns F1, F2, etc.

Parsing XML

gsqlcmd can extract table data from hierarchical XML files.

In most cases, gsqlcmd automatically detects the desired data. However, you can customize the parser using the /rootPath and other parser options.

For example, consider test.xml with the following content:

<?xml version="1.0" encoding="utf-8"?>
<root>
    <parent>
        <row id="1"><f1>data11</f1><f2>data12</f2></row>
        <row id="2"><f1>data21</f1><f2>data22</f2></row>
        <row id="3"><f1>data31</f1><f2>data32</f2></row>
    </parent>
</root>

The basic convert command is:

gsqlcmd convert test.xml test.csv

The resulting test.csv will look like this:

id;"f1";"f2"
1;"data11";"data12"
2;"data21";"data22"
3;"data31";"data32"

gsqlcmd selects the "best" root for the data. This command is equivalent to:

gsqlcmd convert test.xml test.csv /rootPath=root.parent.row

You can change the root path to extract data from a specific node. For example:

gsqlcmd convert test.xml test.csv /rootPath=root.parent

This will yield:

row_id;"row_f1";"row_f2"
1;"data11";"data12"
2;"data21";"data22"
3;"data31";"data32"

To remove quotes, use the /quoteChar option:

gsqlcmd convert test.xml test.csv /rootPath=root.parent /quoteChar=

The output will be:

row_id;row_f1;row_f2
1;data11;data12
2;data21;data22
3;data31;data32

Use the show-roots mode to list applicable roots:

gsqlcmd show-roots test.xml

root
root.parent
root.parent.row
root.parent.row.id
root.parent.row.f1
root.parent.row.f2

Here are some options to customize the parser:

/attributedColumns=<col>.<attr>[;...]
/collapsedNodes=<node>[;...]
/ignoredTags=<tag>[;...]
/keptNodes=<node>[;...]
/requiredColumns=<name>[;...]
/skippedNodes=<node>[;...]

For example, to skip certain nodes, use the /skippedNodes option:

gsqlcmd convert test.xml test.csv /quoteChar= /skippedNodes=f2

id;f1
1;data11
2;data21
3;data31

Parsing JSON

Parsing JSON is similar to parsing XML since both formats are hierarchical.

For example, consider test.json with the following content:

{"root":
  {"parent":[
    {"id":1,"f1":"data11","f2":"data12"},
    {"id":2,"f1":"data21","f2":"data22"},
    {"id":3,"f1":"data31","f2":"data32"}
  ]}
}

The basic convert command will return:

gsqlcmd convert test.json test.csv

Id;"F1";"F2"
1;"data11";"data12"
2;"data21";"data22"
3;"data31";"data32"

The JSON parser capitalizes the first letters of column names.

To list applicable roots, use the show-roots command:

gsqlcmd show-roots test.json

root
root.parent
root.parent.row
root.parent.row.id
root.parent.row.f1
root.parent.row.f2

Select the required data using the /rootPath option:

gsqlcmd convert test.json test.csv /rootPath=root /quoteChar=

Parent_id;Parent_f1;Parent_f2
1;data11;data12
2;data21;data22
3;data31;data32

The JSON parser supports multiple roots. For example:

gsqlcmd convert test.json test.csv /rootPath=root.parent.f1,root.parent.f2 /quoteChar=

F1;F2
data11;data12
data21;data22
data31;data32

You can use * instead of node names:

gsqlcmd convert test.json test.csv /rootPath=*.f1,*.f2 /quoteChar=

Check the Downloading and converting to CSV folder in the gsqlcmd package for ready-to-use samples.

Here are some options for customizing the parser:

/keptNodes=<node>[;...]
/requiredColumns=<name>[;...]
/skippedNodes=<node>[;...]

Parsing Plain Text

gsqlcmd can extract data from plain text files if the text contains recognizable columns, like this:

id  f1      f2
1   data11  data12
2   data21  data22
3   data31  data32

Use the convert command to extract data:

gsqlcmd convert test.txt test.csv

The resulting test.csv will look like this:

id;"f1";"f2"
1;"data11";"data12"
2;"data21";"data22"
3;"data31";"data32"

You can also use Microsoft ODBC and OLE DB text drivers to select and convert plain text data.

First, update the schema.ini file using the make-ini command:

gsqlcmd make-ini test.txt schema.ini

Then, use the exec command to select data:

gsqlcmd exec test.txt "SELECT * FROM test.txt" test.csv

You can use SQL-supported features like WHERE or GROUP in these commands.

Parsing CSV

gsqlcmd allows you to extract data from CSV files and convert it to any supported output format.

Here are some useful formatting options:

/add=<header>=<value | function>[;...]
/addRowNum
/dateFormat=<format>
/dateTimeFormat=<format>
/escapeChar=<char>
/noBOM
/outputCodepage=<codepage>
/outputCulture=<name>
/[outputFormat=]asText | asTxt | asCsv | asHtm | asHtml | asXml | asJson
/[[output]Separator](formatting-options.md#option-outputSeparator)=<separator> | Tab
/quoteChar=<char>
/rowNumBase=<integer>
/timeFormat=<format>

For example, to add new columns like Symbol and Date, and convert the output file to UTF-8 encoding, use:

gsqlcmd convert test.csv test-65001.csv /outputEncoding=65001 /add=Symbol=AAPL;Date=Date()

You can also use Microsoft ODBC and OLE DB text drivers to select and convert CSV data.

First, update the schema.ini file using the make-ini command:

gsqlcmd make-ini test.csv schema.ini

Then, use the exec command to select data:

gsqlcmd exec test.csv "SELECT 'AAPL' AS Symbol, :date AS [Date], * FROM test.csv" output.csv /set=date=Date()

You can use SQL-supported features like WHERE or GROUP in these commands.