Configuring Parsers with gsqlcmd
Contents:
Parsing HTML
gsqlcmd allows you to extract data from HTML tables easily.
To extract the required table data, use the /rootPath option.
For example, to download dividends from the nasdaq.com website, run:
gsqlcmd download http://www.nasdaq.com/symbol/aapl/dividend-history aapl.htm
Next, get the available tables with the following command:
gsqlcmd show-html-tables aapl.htm tables.htm /openOutput
The resulting tables.htm
file will look like this:
You’ll notice that the dividends are in table 3.
To convert the downloaded file, use:
gsqlcmd convert aapl.htm dividends.csv /rootPath=3 /quoteChar=
Alternatively, to extract data directly from the web page, use:
gsqlcmd convert http://www.nasdaq.com/symbol/aapl/dividend-history dividends.csv /rootPath=3 /quoteChar=
The resulting dividends.csv
will contain data like this:
ExEffDate;Type;CashAmount;DeclarationDate;RecordDate;PaymentDate 2019-02-08;Cash;0.73;2019-01-29;2019-02-11;2019-02-14 2018-11-08;Cash;0.73;2018-11-01;2018-11-12;2018-11-15 2018-08-10;Cash;0.73;2018-07-31;2018-08-13;2018-08-16 2018-05-11;Cash;0.73;2018-05-01;2018-05-14;2018-05-17 2018-02-09;Cash;0.63;2018-02-01;2018-02-12;2018-02-15
gsqlcmd usually detects the correct table automatically. So, you can try a simpler command first:
gsqlcmd convert aapl.htm dividends.csv
If the page contains multiple tables with the required data, run the convert command for each table.
If the tables share the same structure, specify multiple indexes in the /rootPath option, separated by commas, like this: /rootPath=20,21,22
.
If a table lacks headers, use the /noSourceHeaders option. In this case, gsqlcmd will name columns F1, F2, etc.
Parsing XML
gsqlcmd can extract table data from hierarchical XML files.
In most cases, gsqlcmd automatically detects the desired data. However, you can customize the parser using the /rootPath and other parser options.
For example, consider test.xml
with the following content:
<?xml version="1.0" encoding="utf-8"?> <root> <parent> <row id="1"><f1>data11</f1><f2>data12</f2></row> <row id="2"><f1>data21</f1><f2>data22</f2></row> <row id="3"><f1>data31</f1><f2>data32</f2></row> </parent> </root>
The basic convert command is:
gsqlcmd convert test.xml test.csv
The resulting test.csv
will look like this:
id;"f1";"f2" 1;"data11";"data12" 2;"data21";"data22" 3;"data31";"data32"
gsqlcmd selects the "best" root for the data. This command is equivalent to:
gsqlcmd convert test.xml test.csv /rootPath=root.parent.row
You can change the root path to extract data from a specific node. For example:
gsqlcmd convert test.xml test.csv /rootPath=root.parent
This will yield:
row_id;"row_f1";"row_f2" 1;"data11";"data12" 2;"data21";"data22" 3;"data31";"data32"
To remove quotes, use the /quoteChar option:
gsqlcmd convert test.xml test.csv /rootPath=root.parent /quoteChar=
The output will be:
row_id;row_f1;row_f2 1;data11;data12 2;data21;data22 3;data31;data32
Use the show-roots mode to list applicable roots:
gsqlcmd show-roots test.xml root root.parent root.parent.row root.parent.row.id root.parent.row.f1 root.parent.row.f2
Here are some options to customize the parser:
/attributedColumns=<col>.<attr>[;...] /collapsedNodes=<node>[;...] /ignoredTags=<tag>[;...] /keptNodes=<node>[;...] /requiredColumns=<name>[;...] /skippedNodes=<node>[;...]
For example, to skip certain nodes, use the /skippedNodes option:
gsqlcmd convert test.xml test.csv /quoteChar= /skippedNodes=f2 id;f1 1;data11 2;data21 3;data31
Parsing JSON
Parsing JSON is similar to parsing XML since both formats are hierarchical.
For example, consider test.json
with the following content:
{"root": {"parent":[ {"id":1,"f1":"data11","f2":"data12"}, {"id":2,"f1":"data21","f2":"data22"}, {"id":3,"f1":"data31","f2":"data32"} ]} }
The basic convert command will return:
gsqlcmd convert test.json test.csv Id;"F1";"F2" 1;"data11";"data12" 2;"data21";"data22" 3;"data31";"data32"
The JSON parser capitalizes the first letters of column names.
To list applicable roots, use the show-roots command:
gsqlcmd show-roots test.json root root.parent root.parent.row root.parent.row.id root.parent.row.f1 root.parent.row.f2
Select the required data using the /rootPath option:
gsqlcmd convert test.json test.csv /rootPath=root /quoteChar= Parent_id;Parent_f1;Parent_f2 1;data11;data12 2;data21;data22 3;data31;data32
The JSON parser supports multiple roots. For example:
gsqlcmd convert test.json test.csv /rootPath=root.parent.f1,root.parent.f2 /quoteChar= F1;F2 data11;data12 data21;data22 data31;data32
You can use *
instead of node names:
gsqlcmd convert test.json test.csv /rootPath=*.f1,*.f2 /quoteChar=
Check the Downloading and converting to CSV folder in the gsqlcmd package for ready-to-use samples.
Here are some options for customizing the parser:
/keptNodes=<node>[;...] /requiredColumns=<name>[;...] /skippedNodes=<node>[;...]
Parsing Plain Text
gsqlcmd can extract data from plain text files if the text contains recognizable columns, like this:
id f1 f2 1 data11 data12 2 data21 data22 3 data31 data32
Use the convert command to extract data:
gsqlcmd convert test.txt test.csv
The resulting test.csv
will look like this:
id;"f1";"f2" 1;"data11";"data12" 2;"data21";"data22" 3;"data31";"data32"
You can also use Microsoft ODBC and OLE DB text drivers to select and convert plain text data.
First, update the schema.ini
file using the make-ini command:
gsqlcmd make-ini test.txt schema.ini
Then, use the exec command to select data:
gsqlcmd exec test.txt "SELECT * FROM test.txt" test.csv
You can use SQL-supported features like WHERE or GROUP in these commands.
Parsing CSV
gsqlcmd allows you to extract data from CSV files and convert it to any supported output format.
Here are some useful formatting options:
/add=<header>=<value | function>[;...] /addRowNum /dateFormat=<format> /dateTimeFormat=<format> /escapeChar=<char> /noBOM /outputCodepage=<codepage> /outputCulture=<name> /[outputFormat=]asText | asTxt | asCsv | asHtm | asHtml | asXml | asJson /[[output]Separator](formatting-options.md#option-outputSeparator)=<separator> | Tab /quoteChar=<char> /rowNumBase=<integer> /timeFormat=<format>
For example, to add new columns like Symbol and Date, and convert the output file to UTF-8 encoding, use:
gsqlcmd convert test.csv test-65001.csv /outputEncoding=65001 /add=Symbol=AAPL;Date=Date()
You can also use Microsoft ODBC and OLE DB text drivers to select and convert CSV data.
First, update the schema.ini
file using the make-ini command:
gsqlcmd make-ini test.csv schema.ini
Then, use the exec command to select data:
gsqlcmd exec test.csv "SELECT 'AAPL' AS Symbol, :date AS [Date], * FROM test.csv" output.csv /set=date=Date()
You can use SQL-supported features like WHERE or GROUP in these commands.