For every star on GitHub, we'll donate $2 to clean up our waterways. Star us now!
Let’s learn by example.
Throughout this tutorial, we’ll walk you through the creation of a end-to-end modern ELT stack.
In parts 1, 2, and 3 we built an ELT pipeline. We took all the data from the commits on one repository at GitHub and extracted the authors working on it. However now we realized, we stored a lot of information where we really might want to hide a few of those pieces.
We’re going to do light-weight transformations, also called “inline data mappings” to clean up the data before storing them anywhere. We will use these inline data mappings to hide all emails inside the JSON blob we receive. In the Meltano world, these data mappings are also called stream maps.
If you're having trouble throughout this tutorial, you can always head over to the Slack channel to get help.
To add inline data mappings, we need a new plugin. We’re going to use the mapper “transform-field”. To add this plugin, use the
meltano add mapper command:
We’re now going to add two mapping to this mapper.
To add our first mapping, we’re going to edit the
meltano.yml file located inside your root project directory. Modify the block for the
transform-field mapper as shown below:
mappers: - name: transform-field variant: transferwise pip_url: pipelinewise-transform-field executable: transform-field mappings: - name: hide-github-mails config: transformations: - field_id: "commit" tap_stream_name: "commits" field_paths: ["author/email", "committer/email"] type: "HASH"
Let’s go through this step-by-step
mappers: [...] mappings: - name: hide-github-mails config: transformations: [...]
These lines define the name “hide-github-mails” as the name of our mapping. We can call the mapping using this name, and ignoring any reference to the actual mapper “transform-field”.
[...] transformations: - field_id: "commit" tap_stream_name: "commits" field_paths: ["author/email", "committer/email"] type: "HASH"
These lines define one transformation. We instruct to target the stream “commits”, and therein the field “commit”. We then use the field paths to navigate to the two emails we know are contained within this message and set the type to “HASH”. Using “HASH” means we will still be able to tell whether two emails are the same, but not be able to read the email. They will be replaced with a SHA-256 hash of the email.
Now we’re ready to run the data integration process with these modifications again. To do so, we’ll need to clean up first, since we already ran the EL process in part 1. The primary key is still the same and as such the ingestion would fail.
Drop the table inside your local postgres by running a docker exec:
docker exec meltano_postgres psql -U meltano -c 'DROP TABLE tap_github.commits; DROP TABLE analytics.authors;'
Now we can run the full process again using the
meltano runcommand. We add the parameter –full-refresh to ignore the state Meltano has stored.
If everything was configured correctly, you should now see your data flow from your source into your destination! Take your favourite SQL tool, connect to the database using the connection details set and check the table
commits inside the schema
tap_github. The JSON blob inside the column
commit should now contain no e-mail adresses but rather the hashed values for both fields.
There we have it, a complete ELT pipeline with inline data mappings, congratulations!
One last thing for you to do: try to run the following command to celebrate:
Next, head over to Part 5, scheduling of jobs.