Tutorial Part 1 - Connect to Data
Let’s learn by example.
Throughout this tutorial, we’ll walk you through one of many applications of Meltano. All of them are about:
- connecting to data
- processing it
- and finally storing it in some permanent storage or pushing it to another component
In this part, we're going to start with connecting to data.
We're going to take data from a "source", namely GitHub, and extract a list of commits to one repository.
To test that this part works, we will dump the data into JSON files. In Part 2, we will then place this data into a PostgreSQL database.
We'll assume you have Meltano installed already. You can tell Meltano is installed and which version by running meltano version
$ meltano --version
meltano, version 3.4.0
This tutorial is written using meltano >= v3.0.0.
If you don't have a GitHub account to follow along, you could either exchange the commands for a different tap, like GitLab or PostgreSQL, or you can create a free GitHub account. You will also need a personal access token to your GitHub account.
If you're having trouble throughout this tutorial, you can always head over to the Slack channel to get help.
Create Your Meltano Project
Step 1 is to create a new Meltano project that (among other things) will hold the plugins that implement the details of our pipeline.
-
Navigate to the directory that you'd like to hold your Meltano projects.
-
Initialize a new project in a directory of your choosing using
meltano init
:
meltano init my-meltano-project
$ meltano init my-new-project
Created my-new-project
Creating project files...
my-new-project/
|-- .meltano
|-- meltano.yml
|-- README.md
|-- requirements.txt
|-- output/.gitignore
|-- .gitignore
|-- extract/.gitkeep
|-- load/.gitkeep
|-- transform/.gitkeep
|-- analyze/.gitkeep
|-- notebook/.gitkeep
|-- orchestrate/.gitkeep
Creating system database... Done!
... Project my-new-project has been created!
Meltano Environments initialized with dev, staging, and prod.
To learn more about Environments visit: https://docs.meltano.com/concepts/environments
Next steps:
cd my-new-project
Visit https://docs.meltano.com/getting-started#create-your-meltano-project to learn where to go from here.
This action will create a new directory with, among other things, your meltano.yml
project file. Your file will look something like this:
version: 1
default_environment: dev
project_id: <unique-GUID>
environments:
- name: dev
- name: staging
- name: prod
- Navigate to the newly created project directory:
cd my-meltano-project
Add an Extractor to Pull Data from a Source
Now that you have your very own Meltano project, it's time to add plugins to it. We're going to add an extrator for GitHub to get our data. An extractor is responsible for pulling data out of any data source. In this case, we choose a specific one with the --variant
option to make this tutorial easy to work with.
- Add the GitHub extractor
meltano add extractor tap-github
$ meltano add extractor tap-github
Added extractor 'tap-github' to your Meltano project
Variant: meltanolabs (default)
Repository: https://github.com/meltanolabs/tap-github
Documentation: https://hub.meltano.com/extractors/tap-github--meltanolabs
2024-01-01T00:25:40.604941Z [info ] Installing extractor 'tap-github'
2024-01-01T00:25:53.152127Z [info ] Installed extractor 'tap-github'
To learn more about extractor 'tap-github', visit https://hub.meltano.com/extractors/tap-github--meltanolabs
This will add the new plugin to your meltano.yml
project file:
plugins:
extractors:
- name: tap-github
variant: meltanolabs
pip_url: git+https://github.com/MeltanoLabs/tap-github.git
- Test that the installation was successful by calling
meltano invoke
:
$ meltano invoke tap-github --help
If you see the extractor's help message printed, the plugin was definitely installed successfully.
$ meltano invoke tap-github --help
2024-09-19T09:32:05.162591Z [info ] Environment 'dev' is active
Usage: tap-github [OPTIONS]
Execute the Singer tap.
Options:
--state PATH Use a bookmarks file for incremental replication.
--catalog PATH Use a Singer catalog file with the tap.
--test TEXT Use --test to sync a single record for each
stream. Use --test=schema to test schema output
without syncing records.
--discover Run the tap in discovery mode.
--config TEXT Configuration file location or 'ENV' to use
environment variables.
--format [json|markdown] Specify output style for --about
--about Display package metadata and settings.
--version Display the package version.
--help Show this message and exit.
Configure the Extractor
The GitHub tap requires configuration before it can start extracting data.
- The simplest way to configure a new plugin in Meltano is using the mode
interactive
:
meltano config tap-github set --interactive
- Follow the prompts to step through all available settings, the ones you'll need to fill out are repositories (formatted like
["sbalnojan/meltano-lightdash"]
), start_date, and your auth_token.
$ meltano config tap-github set --interactive
Configuring Extractor 'tap-github' Interactively
[...]
Settings
1. additional_auth_tokens: [...]
2. auth_token: GitHub token to authenticate ...
[...]
8. repositories: An array of strings containing the github repos to be ...
[...]
11. start_date:
[...]
To learn more about extractor 'tap-github' and its settings, visit https://hub.meltano.com/extractors/tap-github--meltanolabs
Loop through all settings (all), select a setting by number (1 - 16), or exit (e)? [all]:
$ 2
[...]Description:
GitHub token to authenticate with.
New value (redacted):
$
Repeat for confirmation:
$
<[... other 2 values...]
This will add the configuration to your meltano.yml
project file:
plugins:
extractors:
- name: tap-github
config:
start_date: '2024-01-01'
repositories:
- sbalnojan/meltano-lightdash
It will also add your secret auth token to the file .env
:
TAP_GITHUB_AUTH_TOKEN='ghp_XXX' # your token!
- Double check the config by running
meltano config tap-github
:
meltano config tap-github list
$ meltano config tap-github list
2024-07-08T16:27:36.433823Z [info ] The default environment 'dev' will be ignored for `meltano config`. To configure a specific environment, please use the option `--environment=<environment name>`.
additional_auth_tokens [env: TAP_GITHUB_ADDITIONAL_AUTH_TOKENS] current value: None (default)
Additional Auth Tokens: List of GitHub tokens to authenticate with. Streams will loop through them when hitting rate limits.
auth_token [env: TAP_GITHUB_AUTH_TOKEN] current value: (redacted) (from the TAP_GITHUB_AUTH_TOKEN variable in `.env`)
Auth Token: GitHub token to authenticate with.
...
repositories [env: TAP_GITHUB_REPOSITORIES] current value: ['meltano/meltnao', 'meltano/hub', 'meltano/sdk'] (from `meltano.yml`)
Repositories: An array of strings containing the github repos to be included
...
start_date [env: TAP_GITHUB_START_DATE] current value: '2024-01-01' (from `meltano.yml`)
Select Entities and Attributes to Extract
Now that the extractor has been configured, it'll know where and how to find your data, but won't yet know which specific entities and attributes (tables and columns) you're interested in.
By default, Meltano will instruct extractors to extract all supported entities and attributes, but we're going to select specific ones for this tutorial.
- Find out what entities and attributes are available, using
meltano select YOUR_TAP --list --all
:
meltano select tap-github --list --all
$ meltano select tap-github --list
2023-05-17T20:07:24.085827Z [info ] The default environment 'dev' will be ignored for 'meltano select'. To configure a specific environment, please use the option '--environment=environment name'.
Legend:
selected
excluded
automatic
Enabled patterns:
*.*
Selected attributes:
[selected ] assignees._sdc_repository
[automatic]
[...]
[selected ] commits.comments_url
[selected ] commits.commit
[selected ] commits.commit.author
[...]
[selected ] teams.repositories_url
[selected ] teams.slug
[selected ] teams.url
- Select the entities and attributes for extraction using
meltano select
:
meltano select tap-github commits url
meltano select tap-github commits sha
meltano select tap-github commits commit_timestamp
This will add the selection rules to your meltano.yml
project file:
version: 1
default_environment: dev
environments:
- name: dev
- name: staging
- name: prod
project_id: YOUR_ID
plugins:
extractors:
- name: tap-github
variant: meltanolabs
pip_url: git+https://github.com/MeltanoLabs/tap-github.git
config:
start_date: '2024-01-01'
repositories:
- sbalnojan/meltano-lightdash
select:
- commits.url
- commits.sha
- commits.commit_timestamp
- Run
meltano select --list
to double-check your selection:
meltano select tap-github --list
Add a dummy loader to dump the data into JSON
To test that the extraction process works, we add a JSON target.
-
Add the JSON target using
meltano add loader target-jsonl --variant=andyh1203
.
$ meltano add loader target-jsonl --variant=andyh1203</span>
Added loader 'target-jsonl' to your Meltano project
Variant: andyh1203 (default)
Repository: https://github.com/andyh1203/target-jsonl
Documentation: https://hub.meltano.com/loaders/target-jsonl--andyh1203
2024-01-01T00:25:40.604941Z [info ] Installing loader 'target-jsonl'
2024-01-01T00:25:53.152127Z [info ] Installed loader 'target-jsonl'
To learn more about loader 'target-jsonl', visit https://hub.meltano.com/loaders/target-jsonl--andyh1203
This target requires zero configuration, it just outputs the data into a jsonl
file.
Do a test run to verify the extraction process works
Now that your Meltano project, extractor, and dummy loader are set up, we can test run the extraction process.
There's just one step here: run your newly added extractor and jsonl loader in a pipeline using meltano run
:
meltano run tap-github target-jsonl
$ meltano run tap-github target-jsonl
2024-09-19T13:53:36.403099Z [info ] Environment 'dev' is active
2024-09-19T13:53:41.062802Z [info ] Found state from 2024-09-19 13:53:17.415907.
2024-09-19T13:53:41.071885Z [warning ] No state was found, complete import.
2024-09-19T13:53:43.054384Z [info ] INFO Starting sync of repository: sbalnojan/meltano-lightdash cmd_type=elb consumer=False name=tap-github producer=True stdio=stderr string_id=tap-github
2024-09-19T13:53:43.553171Z [info ] INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.4796161651611328, "tags": {"endpoint": "commits", "http_status_code": 200, "status": "succeeded"}} cmd_type=elb consumer=False name=tap-github producer=True stdio=stderr string_id=tap-github
2024-09-19T13:53:43.561190Z [info ] INFO METRIC: {"type": "counter", "metric": "record_count", "value": 1, "tags": {"endpoint": "commits"}} cmd_type=elb consumer=False name=tap-github producer=True stdio=stderr string_id=tap-github</span>
2024-09-19T13:53:43.735250Z [info ] Incremental state has been updated at 2024-09-19 13:53:43.734535.
2024-09-19T13:53:43.820467Z [info ] Block run completed. block_type=ExtractLoadBlocks err=None set_number=0 success=True</span>
You should see data flowing from your source into the jsonl file.
You can verify that it worked by looking inside the newly created file called output/commits.jsonl
.
$ cat output/commits.jsonl
{"sha": "409bdd601e0531833665f538bccecd0f69e101c0", "node_id": "C_kwDOH_twHNoAKDQwOWJkZDYwMWUwNTMxODMzNjY1ZjUzOGJjY2VjZDBmNjllMTAxYzA", "url": "https://api.github.com/repos/sbalnojan/meltano-lightdash/commits/409bdd601e0531833665f538bccecd0f69e101c0", "commit_timestamp": "2024-09-14T12:41:21Z"}
Next Steps
Next, head over to Part 2: Loading extracted data into a target.