
Tutorial Part 1 - Extract Data

Let’s learn by example.

Throughout this tutorial, we’ll walk you through the creation of an end-to-end modern ELT stack. In this part, we’re going to start with the data extraction process.

We’re going to take data from a “source”, namely GitHub, and extract a list of commits to one repository.

To test that this part works, we will dump the data into JSON files. In Part 2, we will then place this data into a PostgreSQL database.

We’ll assume you have Meltano installed already. You can check that Meltano is installed, and which version you have, by running meltano --version:

```console
$ meltano --version
meltano, version 2.6.0
```

This tutorial is written using meltano >= v2.0.0.
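
If you want to script the version check, one option is to parse the output of meltano --version. The sketch below is only an illustration: the helper names meltano_major and check_meltano_version are hypothetical, and it assumes the output keeps the "meltano, version X.Y.Z" format shown above.

```shell
# Hypothetical helper: print the major version from a line such as
# "meltano, version 2.6.0" (assumed output format of `meltano --version`).
meltano_major() {
  full_version="${1##* }"               # text after the last space, e.g. "2.6.0"
  printf '%s\n' "${full_version%%.*}"   # text before the first dot, e.g. "2"
}

# Hypothetical helper: succeed when the installed Meltano is v2 or newer.
check_meltano_version() {
  version_line="$(meltano --version)" || return 1
  [ "$(meltano_major "$version_line")" -ge 2 ]
}
```

Calling check_meltano_version at the top of a setup script lets the script stop early when an older Meltano is installed.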

If you don’t have a GitHub account to follow along, you can either swap the commands for a different tap, such as GitLab or PostgreSQL, or create a free GitHub account. You will also need a personal access token for your GitHub account.

If you're having trouble throughout this tutorial, you can always head over to the Slack channel to get help.

Create Your Meltano Project

Step 1 is to create a new Meltano project that (among other things) will hold the plugins that implement the details of our ELT pipeline.

  1. Navigate to the directory that you’d like to hold your Meltano projects.

  2. Initialize a new project in a directory of your choosing using meltano init:

meltano init my-meltano-project

```console
$ meltano init my-new-project
Created my-new-project
Creating project files...
  my-new-project/
   |-- .meltano
   |-- meltano.yml
   |--
   |-- requirements.txt
   |-- output/.gitignore
   |-- .gitignore
   |-- extract/.gitkeep
   |-- load/.gitkeep
   |-- transform/.gitkeep
   |-- analyze/.gitkeep
   |-- notebook/.gitkeep
   |-- orchestrate/.gitkeep
Creating system database... Done!
...
Project my-new-project has been created!

Meltano Environments initialized with dev, staging, and prod.
To learn more about Environments visit:

Next steps:
  cd my-new-project
  Visit to learn where to go from here.
```

This action will create a new directory with, among other things, your meltano.yml project file. Your file will look something like this:

   version: 1
   default_environment: dev
   project_id: <unique-GUID>
   environments:
   - name: dev
   - name: staging
   - name: prod

  3. Navigate to the newly created project directory:

cd my-meltano-project

Add an Extractor to Pull Data from a Source

Now that you have your very own Meltano project, it’s time to add plugins to it. We’re going to add an extractor for GitHub to get our data. An extractor is responsible for pulling data out of a data source. In this case, we pick a specific variant with the --variant option to make this tutorial easy to work with.

  1. Add the GitHub extractor
$ meltano add extractor tap-github --variant=meltanolabs

```console
$ meltano add extractor tap-github --variant=meltanolabs
2022-09-19T09:32:05.162591Z [info ] Environment 'dev' is active
Added extractor 'tap-github' to your Meltano project
Variant: meltanolabs (default)
Repository:
Documentation:

Installing extractor 'tap-github'...
Installed extractor 'tap-github'

To learn more about extractor 'tap-github', visit
```

This will add the new plugin to your meltano.yml project file:

plugins:
  extractors:
  - name: tap-github
    variant: meltanolabs
    pip_url: git+

  2. Test that the installation was successful by calling meltano invoke:
$ meltano invoke tap-github --help

If you see the extractor’s help message printed, the plugin was installed successfully.

```console
$ meltano invoke tap-github --help
2022-09-19T09:32:05.162591Z [info ] Environment 'dev' is active
usage: tap-github [-h] -c CONFIG [-s STATE] [-p PROPERTIES] [--catalog CATALOG] [-d]

options:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Config file
  -s STATE, --state STATE
                        State file
  -p PROPERTIES, --properties PROPERTIES
                        Property selections: DEPRECATED, Please use --catalog instead
  --catalog CATALOG     Catalog file
  -d, --discover        Do schema discovery
```
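
The same installation check can be scripted by relying on the command's exit status instead of reading the help text. This is a minimal sketch: plugin_responds is a hypothetical helper name, and it assumes that meltano invoke exits with a non-zero status when the plugin is missing or broken.

```shell
# Hypothetical helper: succeed when `meltano invoke <plugin> --help` exits cleanly.
# The help text itself is discarded; only the exit status is used.
plugin_responds() {
  meltano invoke "$1" --help >/dev/null 2>&1
}
```

For example, plugin_responds tap-github can guard later steps in a setup script so that they only run once the extractor is installed.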

Configure the Extractor

The GitHub tap requires configuration before it can start extracting data.

  1. The simplest way to configure a new plugin in Meltano is to use interactive mode:
$ meltano config tap-github set --interactive

  2. Follow the prompts to step through all available settings. The ones you’ll need to fill out are repositories, start_date, and your auth_token.

```console
$ meltano config tap-github set --interactive
Configuring Extractor 'tap-github'
[...]
Settings
1. user_agent: [...]
3. auth_token: GitHub token to authenticate ...
[...]
8. repositories: An array of strings containing the github repos to be ...
[...]
11. start_date: [...]

To learn more about extractor 'tap-github' and its settings, visit

Loop through all settings (all), select a setting by number (1 - 11), or exit (e)? [all]: $ 3
[...]
Description: GitHub token to authenticate with.
New value: $
Repeat for confirmation: $
[... other 2 values...]
```

This will add the configuration to your meltano.yml project file:

        - name: tap-github
          config:
            start_date: 2022-01-01
            repositories:
              - sbalnojan/meltano-lightdash

It will also add your secret auth token to the file .env:

  TAP_GITHUB_AUTH_TOKEN='ghp_XXX' # your token!

  3. Double-check the config by running meltano config tap-github:
meltano config tap-github

```console
$ meltano config tap-github
2022-09-19T11:26:22.888257Z [info ] Environment 'dev' is active
2022-09-19T11:26:23.573556Z [info ] The default environment (dev) will be ignored for `meltano config`. To configure a specific Environment, please use option `--environment=[]`.
{
  "repository": "sbalnojan/meltano-lightdash",
  "start_date": "2022-01-01"
}
```
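
To confirm that the token was written to .env without printing it to your terminal, a small check along these lines may help. The helper name env_file_has_key is hypothetical; the variable name TAP_GITHUB_AUTH_TOKEN is the one shown in the .env example above.

```shell
# Hypothetical helper: succeed when FILE ($1) contains a line that assigns KEY ($2).
# grep -q prints nothing, so the secret value never reaches the terminal.
env_file_has_key() {
  grep -q "^[[:space:]]*$2=" "$1"
}
```

Running env_file_has_key .env TAP_GITHUB_AUTH_TOKEN from the project directory returns success only when the assignment is present.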

Select Entities and Attributes to Extract

Now that the extractor has been configured, it’ll know where and how to find your data, but won’t yet know which specific entities and attributes (tables and columns) you’re interested in.

By default, Meltano will instruct extractors to extract all supported entities and attributes, but we’re going to select specific ones for this tutorial.

  1. Find out what entities and attributes are available, using meltano select YOUR_TAP --list --all:
meltano select tap-github --list --all

```console
$ meltano select tap-github --list
2022-09-19T10:59:43.554214Z [info ] Environment 'dev' is active
Legend:
        selected
        excluded
        automatic

Enabled patterns:
        *.*

Selected attributes:
        [selected ] assignees._sdc_repository
        [automatic] [...]
        [selected ] commits.comments_url
        [selected ] commits.commit
        [selected ] [...]
        [selected ] teams.repositories_url
        [selected ] teams.slug
        [selected ] teams.url
```

  2. Select the entities and attributes for extraction using meltano select:
meltano select tap-github commits url
meltano select tap-github commits sha
meltano select tap-github commits commit_timestamp

This will add the selection rules to your meltano.yml project file:

  version: 1
  default_environment: dev
  environments:
  - name: dev
    config:
      plugins:
        extractors:
        - name: tap-github
          select:
          - commits.url
          - commits.sha
          - commits.commit_timestamp
  - name: staging
  - name: prod
  project_id: YOUR_ID
  plugins:
    extractors:
    - name: tap-github
      variant: meltanolabs
      pip_url: git+
      config:
        start_date: 2022-01-01
        repository: sbalnojan/meltano-lightdash

  3. Run meltano select --list to double-check your selection:
meltano select tap-github --list

Add a dummy loader to dump the data into JSON

To test that the extraction process works, we add a JSON target.

  1. Add the JSON target using meltano add loader target-jsonl:

```console
$ meltano add loader target-jsonl
2022-09-19T13:47:42.389423Z [info ] Environment 'dev' is active
To add it to your project another time so that each can be configured differently,
add a new plugin inheriting from the existing one with its own unique name:
    meltano add loader target-jsonl--new --inherit-from target-jsonl

Installing loader 'target-jsonl'...
Installed loader 'target-jsonl'

To learn more about loader 'target-jsonl', visit
```

This target requires no configuration; it simply writes the data to a JSONL file.

Do a test run to verify the extraction process works

Now that your Meltano project, extractor, and dummy loader are set up, we can test run the extraction process.

There’s just one step here: run your newly added extractor and jsonl loader in a pipeline using meltano run:

$ meltano run tap-github target-jsonl

```console
$ meltano run tap-github target-jsonl
2022-09-19T13:53:36.403099Z [info ] Environment 'dev' is active
2022-09-19T13:53:41.062802Z [info ] Found state from 2022-09-19 13:53:17.415907.
2022-09-19T13:53:41.071885Z [warning ] No state was found, complete import.
2022-09-19T13:53:43.054384Z [info ] INFO Starting sync of repository: sbalnojan/meltano-lightdash cmd_type=elb consumer=False name=tap-github producer=True stdio=stderr string_id=tap-github
2022-09-19T13:53:43.553171Z [info ] INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.4796161651611328, "tags": {"endpoint": "commits", "http_status_code": 200, "status": "succeeded"}} cmd_type=elb consumer=False name=tap-github producer=True stdio=stderr string_id=tap-github
2022-09-19T13:53:43.561190Z [info ] INFO METRIC: {"type": "counter", "metric": "record_count", "value": 1, "tags": {"endpoint": "commits"}} cmd_type=elb consumer=False name=tap-github producer=True stdio=stderr string_id=tap-github
2022-09-19T13:53:43.735250Z [info ] Incremental state has been updated at 2022-09-19 13:53:43.734535.
2022-09-19T13:53:43.820467Z [info ] Block run completed. block_type=ExtractLoadBlocks err=None set_number=0 success=True
```

You should see data flowing from your source into the jsonl file. You can verify that it worked by looking inside the newly created file called output/commits.jsonl.

```console
$ cat output/commits.jsonl
{"sha": "409bdd601e0531833665f538bccecd0f69e101c0", "url": ""}
```
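
For a larger repository, reading the file by eye becomes impractical. One rough way to sanity-check the output is to count the records and how many of them carry a "sha" key. The helper name summarize_commits is hypothetical, and the sketch assumes one JSON record per line, as in the sample above.

```shell
# Hypothetical helper: report how many lines a JSONL file has and how many
# of those lines contain a "sha" key (a plain text match, not a JSON parse).
summarize_commits() {
  jsonl_file="$1"
  total="$(grep -c '' "$jsonl_file")"          # every line counts as one record
  with_sha="$(grep -c '"sha"' "$jsonl_file")"  # lines that mention a "sha" key
  printf '%s records, %s with sha\n' "$total" "$with_sha"
}
```

Calling summarize_commits output/commits.jsonl after a run should report matching counts if every extracted commit includes its sha.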

Next Steps

Next, head over to Part 2: Loading extracted data into a target.