Skip to content

Latest commit

 

History

History

cmslitemetadata_to_redshift

CMS Lite Metadata to Redshift Microservice

This folder contains the cmslitemetadata_to_redshift.py script and configuration file cmslite_gdx.json that enable the the S3-to-Redshift microservice implemented on the GDX-Analytics platform. It also includes a requirements.txt to simplify pip installing requisite files with a single command: pip install -r requirements.

cmslitemetadata_to_redshift.py

The CMS Lite Metadata to Redshift microservice requires a json configuration file passed as a second command line argument to run. The configuration file format is described in more detail below. Usage is like:

python s3_to_redshift.py configfile.json

Testing files

There are two test files that can be used for bulding test tables and a config file to run the script as required. cmslite_test.json reference the test schema in redshift and point to a test directory s3://sp-ca-bc-gov-131565110619-12-microservices/client/test_cmslite/ in s3.

test.cmslite_test.sql will build required tables under the test schema in redshift.

Usage :

pipenv install --ignore-pipfile pipenv run python cmslitemetadata_to_redshift.py cmslite_test.json

Overview

This microservice was built from the S3 to Redshift microservice. The input csv for this microservice to consume is prepared by the Content Management Framework servers containing metadata pertaining to pages in the CMS Lite system. The processing steps are very similar to the S3 to Redshift microservice with some notable feature additions to nested metadata in the input file that result in lookup tables and dictionary tables.

The "delim" and "nested_delim" keys in the cmslite_gdx.json file direct the microservice on what lookup and dictionary tables to insert content on. Lookup tables are built for input columns that containing a collection of elements split by nested delimiters (nested columns). The lookup tables contain two columns, node_id and id. The id column contains Integer representations of the elements of a nested column. Additional dictionary columns then provide a mapping of lookup table IDs to their human readable term.

For example, in the lookup table metadata_languages there may be a row where the node_id column reads 111AAAA1A111111A1AA1111A1A1A11A, and the id columns reads 1. Then, in the dcterms_languages dictionary table, we may find that where id=1, the value column reads English. Lookup tables are substantially larger than dictionary tables since every page from the metadata is represented in a lookup table and may be repeated with different IDs, whereas the dictionary tables contain only one column per id. The id values themselves are generated by the microservice itself.

This microservice runs in truncate mode, meaning that the full metadata is loaded every time it runs. Input files are usually around 70MB for the production data set, and the processing time is generally between 20 and 30 minutes. The interaction with Redshift is accomplished in a single transaction in order to not result in inadvertent data integrity conflicts if a single lookup or dictionary table failed to load.

Log files are appended at the debug level into file called cmslitemetadata_to_redshift.log under a logs/ folder which much be created manually. Info level logs are output to stdout. In the log file, events are logged with the format showing the log level, the function name, the timestamp with milliseconds, and the message: INFO:__main__:2010-10-10 10:00:00,000:<log message here>.

Configuration

Environment Variables

The S3 to Redshift microservice requires the following environment variables be set to run correctly.

  • pgpass: the database password for the microservice user;
  • AWS_ACCESS_KEY_ID: the AWS access key for the account authorized to perform COPY commands from S3 to Redshift; and,
  • AWS_SECRET_ACCESS_KEY: the AWS secret access key for the account authorized to perform COPY commands from S3 to Redshift.

Configuration File

The JSON configuration is required as a second argument when running the cmslitemetadata_to_redshift.py script. It follows this structure:

  • "bucket": the label defining the S3 bucket that the microservice will reference.
  • "source": the top level S3 prefix for source objects after the bucket label, as in: <bucket>/<source>/<client>/<doc>.
  • "destination": the top level S3 prefix for processed objects to be stored, as in: <bucket>/<destination>/<client>/<doc>.
  • "directory": the S3 prefix to follow source or destination and before the <doc> objects.
  • "doc": a regex pattern representing the final object after all directory prefixes, as in: <bucket>/<source>/<client>/<doc>.
  • "dbschema": An optional String defaulting to 'microservice' (currently unused by s3_to_redshift.py).
  • "dbtable": The table to COPY the processed data into with the schema, as in: <schema>.<table>.
  • "column_count": The number of columns the processed dataframe should contain.
  • "columns": A list containing the column names of the input file.
  • "columns_metadata": columns to be retained for the main metadata table
  • "columns_lookup": columns containing nested collections of data to be split into lookup tables.
  • "dbtables_dictionaries": The dictionary table names (there must be as elements in this list as appeared in columns_lookup).
  • "dbtables_metadata": The lookup table names (there must be as elements in this list as appeared in columns_lookup).
  • "replace": a list of replacement objects represented as:
  • "field": the column in which to make a replacement
  • "old": the string to be replaced
  • "new": the replacement string
  • "dateformat" a list of dictionaries containing keys: field and format
  • "field": a column name containing datetime format data.
  • "format": strftime to parse time. See strftime documentation for more information on choices.
  • "dtype_dic_strings": A dictionary where keys are the names of columns in the input data, and the keys are strings defining the datatype of that column.
  • "delim": specify the character that delimits data in the input csv.
  • "nested_delim": specify the character that delimits nested collections of data in the columns_lookup list.
  • "truncate": boolean (true or false) that determines if the Redshift table will be truncated before inserting data, or instead if the table will be extended with the inserted data.
  • "sql_query": an argument to provide the location of sql queries if need to be used in the python script.

The structure of the config file should resemble the following:

{
  "bucket": String,
  "source": String,
  "destination": String,
  "directory": String,
  "doc": String,
  "dbschema": String,
  "dbtable": String,
  "columns": [String],
  "column_count": Integer,
  "columns_metadata": [String],
  "columns_lookup": [String],
  "dbtables_dictionaries": [String],
  "dbtables_metadata": [String],
  "replace": [
    {
      "field": String,
      "old": String,
      "new": String
    }
  ],
  "dateformat": [
    {
      "field": String,
      "format": String
    }
  ],
  "dtype_dic_strings": [String],
  "delim": String,
  "nested_delim": String,
  "truncate": boolean
}

Project Status

As we determine new CMS Lite metadata to make available in Looker, this configuration file and the microservice script itself will change. This project is ongoing.

Getting Help

Please Contact the GDX Service desk for any analytics service help.

Contributors

The GDX analytics team will be the main contributors to this project currently and will maintain the code.

License

Copyright 2015 Province of British Columbia

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.