Create a Custom Extractor
As much as we'd like to support all the data sources out there, we'll need your help to get there. If you find a data source that Meltano doesn't support right now, it might be time to get your hands dirty.
What Are Custom Extractors?
Custom extractors are scripts or tools that source data from unconventional data sources like a custom database or a SaaS API like Appwrite and present it in a form that can be loaded into the desired target sink. The term “custom” here means the extractor isn’t a native part of the tool.
A custom extractor must present the extracted data in a form that the tool can use to load it into the target sink like a data warehouse.
Singer is a widely used open-source specification for implementing data extractors and loaders.
Singer refers to extractor scripts as taps and loader scripts as targets. If you use Singer for your ELT needs, a custom extractor is basically a Singer tap written for your organization’s needs.
These Singer taps and targets can be cumbersome to run manually so the easiest way to run them is as Meltano extractor/loader plugins. Meltano’s EL features handle all of the Singer complexity of configuration, stream discovery, and state management.
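To make the tap side of this concrete, here is a minimal sketch of the kind of output a Singer tap writes: one JSON message per line, using the SCHEMA, RECORD, and STATE message types from the Singer specification. The function names and the shape of the bookmark inside the STATE message are illustrative assumptions, not part of Meltano or the SDK.

```python
import json
import sys


def singer_messages(stream, schema, key_properties, records):
    """Yield Singer SCHEMA, RECORD, and STATE messages as dictionaries."""
    # A SCHEMA message describes the stream before its records are sent.
    yield {
        "type": "SCHEMA",
        "stream": stream,
        "schema": schema,
        "key_properties": key_properties,
    }
    last_id = None
    for record in records:
        # Each RECORD message carries one extracted row.
        yield {"type": "RECORD", "stream": stream, "record": record}
        last_id = record.get("id")
    # A STATE message carries a bookmark; its inner shape is chosen by the tap
    # (the "bookmarks"/"last_id" layout here is only an illustration).
    yield {"type": "STATE", "value": {"bookmarks": {stream: {"last_id": last_id}}}}


def write_messages(messages, out=sys.stdout):
    """Write each message as one line of JSON, as a tap does on stdout."""
    for message in messages:
        out.write(json.dumps(message) + "\n")


if __name__ == "__main__":
    demo_schema = {"type": "object", "properties": {"id": {"type": "integer"}}}
    write_messages(singer_messages("comments", demo_schema, ["id"], [{"id": 1}]))
```

A target reads these lines from standard input, which is why a tap and a target can be connected with a pipe. Meltano and the SDK generate and route these messages for you.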
How to Create a Custom Extractor
The following steps will demonstrate how to implement a custom extractor to extract data from a JSON placeholder REST API to a JSONL file using Meltano and the Meltano SDK. You can check out the complete custom extractor code on this GitHub repo.
There are a few prerequisites that you need before continuing. The first step details how you can install these dependencies.
- Python3 for running Python-based scripts
- Pip3 for installing pipx
- Poetry for dependency management in your custom extractor
- Cookiecutter for installing the template repository
1. Installing the Dependencies
You can install Python3 from the official website. Python usually comes packaged with a package installer known as pip.
You can verify that pip is installed by running the below command:
pip3 --version
Any version number above 20 should be able to install pipx. You can then run the command below to install pipx, meltano, cookiecutter, and poetry.
pip3 install pipx
pipx ensurepath
source ~/.bashrc
pipx install meltano
pipx install cookiecutter
pipx install poetry
Pipx is a wrapper around pip that simplifies the process of installing Python programs that need to be added to your PATH (e.g., Meltano).
You will use cookiecutter to clone the Meltano SDK template repo for implementing a custom extractor.
Poetry serves as the dependency manager for the project. If you have used npm before, poetry serves a similar purpose but for Python projects.
2. Create a Project Using the Cookiecutter Template
Run the following command to create the project files from the cookiecutter template:
cookiecutter https://github.com/meltano/sdk --directory="cookiecutter/tap-template"
After running the above command, you will be prompted to configure your project.
- Type jsonplaceholder as your source name.
- Then input your first name and last name.
- You can leave the tap_id and library name as the default suggested names.
- If you are planning to distribute your project on PyPI and the tap_id is already in use, you can set a variant name, e.g. 'meltanolabs' or 'pipelinewise'.
- For the stream type, select REST, and for the auth method, select Custom or N/A.
- Finally, you can choose whether to add a CI/CD template. Either choice works for this tutorial.
source_name [MySourceName]: jsonplaceholder
admin_name [FirstName LastName]: <Your First Name, Your Last Name>
tap_id [tap-jsonplaceholder]:
library_name [tap_jsonplaceholder]:
variant [None (Skip)]:
Select stream_type:
1 - REST
2 - GraphQL
3 - SQL
4 - Other
Choose from 1, 2, 3, 4 [1]: 1
Select auth_method:
1 - API Key
2 - Bearer Token
3 - Basic Auth
4 - OAuth2
5 - JWT
6 - Custom or N/A
Choose from 1, 2, 3, 4, 5, 6 [1]: 6
Select include_cicd_sample_template:
1 - GitHub
2 - None (Skip)
Choose from 1, 2 [1]:
The result of the above command is a new directory, tap-jsonplaceholder, that contains boilerplate code for developing your tap and a meltano.yml file that you can use to test your custom extractor.
You can view the cookiecutter template in the cookiecutter directory in the Meltano SDK repository.
3. Install the Custom Extractor Python dependencies Using Poetry
Change into the tap-jsonplaceholder directory and install the Python dependencies using Poetry:
cd tap-jsonplaceholder
# [Optional] but useful if you need to debug your custom extractor
poetry config virtualenvs.in-project true
poetry install
See Debug A Custom Extractor to learn more about the optional Poetry step above.
4. Configure the Custom Extractor to Consume Data from the Source
To configure your custom extractor to consume data from the JSON placeholder API, you need to set the API's base URL and the streams that will be replicated.
Open the file tap-jsonplaceholder/tap_jsonplaceholder/tap.py and replace its content with the content below:
"""jsonplaceholder tap class."""

from typing import List

from singer_sdk import Tap, Stream
from singer_sdk import typing as th  # JSON schema typing helpers

from tap_jsonplaceholder.streams import jsonplaceholderStream, CommentsStream

# Stream classes that this tap exposes.
STREAM_TYPES = [CommentsStream]


class Tapjsonplaceholder(Tap):
    """jsonplaceholder tap class."""

    name = "tap-jsonplaceholder"

    def discover_streams(self) -> List[Stream]:
        """Return a list of discovered streams."""
        return [stream_class(tap=self) for stream_class in STREAM_TYPES]
Then replace the content of tap-jsonplaceholder/tap_jsonplaceholder/streams.py with the content below:
"""Stream type classes for tap-jsonplaceholder."""

from singer_sdk import typing as th  # JSON Schema typing helpers

from tap_jsonplaceholder.client import jsonplaceholderStream


class CommentsStream(jsonplaceholderStream):
    """Comments stream, read from the /comments path."""

    primary_keys = ["id"]
    path = "/comments"
    name = "comments"
    # Properties of each extracted record.
    schema = th.PropertiesList(
        th.Property("postId", th.IntegerType),
        th.Property("id", th.IntegerType),
        th.Property("name", th.StringType),
        th.Property("email", th.StringType),
        th.Property("body", th.StringType),
    ).to_dict()
The tap.py file defines the tap settings and the available streams, which is the comments stream in this case. You can find the available stream types in the STREAM_TYPES list.
The streams.py file configures the comments stream to use the /comments path and also sets the properties of the extracted fields.
Finally, change the value of url_base in the tap-jsonplaceholder/tap_jsonplaceholder/client.py file to https://jsonplaceholder.typicode.com.
...
class jsonplaceholderStream(RESTStream):
    """jsonplaceholder stream class."""

    # TODO: Set the API's base URL here:
    url_base = "https://jsonplaceholder.typicode.com"
...
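To see how url_base, path, and the declared properties fit together, the following standard-library sketch approximates what the configured comments stream does: it joins the base URL and the path, requests the result, and keeps the five declared fields of each item. This is a simplified illustration, not the SDK's implementation. The names build_url, fetch_json, and extract_comments are hypothetical, and the sketch assumes the endpoint returns a JSON array.

```python
import json
from urllib.request import urlopen

URL_BASE = "https://jsonplaceholder.typicode.com"  # value of url_base in client.py
PATH = "/comments"                                 # value of path in streams.py
FIELDS = ("postId", "id", "name", "email", "body")  # properties declared in the schema


def build_url(url_base=URL_BASE, path=PATH):
    """Join the base URL and the stream path into the request URL."""
    return f"{url_base}{path}"


def fetch_json(url):
    """Fetch a URL and decode the response body as JSON (performs a network call)."""
    with urlopen(url) as response:
        return json.load(response)


def extract_comments(fetch=fetch_json):
    """Yield one dictionary per comment, restricted to the declared fields."""
    for item in fetch(build_url()):
        yield {field: item.get(field) for field in FIELDS}


# Example usage (requires network access):
#     for comment in extract_comments():
#         print(comment["id"], comment["email"])
```

Passing a different fetch function makes it possible to try the logic without calling the live API. The SDK adds pagination, authentication, type conformance, and state handling on top of this basic loop.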
5. Install The Newly Created Tap
Navigate to your project root directory in your shell and run the following commands:
meltano install
meltano add loader target-jsonl
These commands install your newly created tap, tap-jsonplaceholder, and a loader, target-jsonl, to the default Meltano project. They also create an output directory where the extracted data will be loaded.
Execute the command below to run an ELT pipeline that replicates data from the REST API to JSONL files:
meltano run tap-jsonplaceholder target-jsonl
You can inspect the output directory for the extracted JSON data.
Use the command below to get the first five lines of the extracted comments JSON file:
head -n 5 output/comments.jsonl
You should see the following:
{"postId": 1, "id": 1, "name": "id labore ex et quam laborum", "email": "Eliseo@gardner.biz", "body": "laudantium enim quasi est quidem magnam voluptate ipsam eos\ntempora quo necessitatibus\ndolor quam autem quasi\nreiciendis et nam sapiente accusantium"}
{"postId": 1, "id": 2, "name": "quo vero reiciendis velit similique earum", "email": "Jayne_Kuhic@sydney.com", "body": "est natus enim nihil est dolore omnis voluptatem numquam\net omnis occaecati quod ullam at\nvoluptatem error expedita pariatur\nnihil sint nostrum voluptatem reiciendis et"}
{"postId": 1, "id": 3, "name": "odio adipisci rerum aut animi", "email": "Nikita@garfield.biz", "body": "quia molestiae reprehenderit quasi aspernatur\naut expedita occaecati aliquam eveniet laudantium\nomnis quibusdam delectus saepe quia accusamus maiores nam est\ncum et ducimus et vero voluptates excepturi deleniti ratione"}
{"postId": 1, "id": 4, "name": "alias odio sit", "email": "Lew@alysha.tv", "body": "non et atque\noccaecati deserunt quas accusantium unde odit nobis qui voluptatem\nquia voluptas consequuntur itaque dolor\net qui rerum deleniti ut occaecati"}
{"postId": 1, "id": 5, "name": "vero eaque aliquid doloribus et culpa", "email": "Hayden@althea.biz", "body": "harum non quasi et ratione\ntempore iure ex voluptates in ratione\nharum architecto fugit inventore cupiditate\nvoluptates magni quo et"}
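If you would rather check the file programmatically than by eye, a short script along these lines can parse the JSONL output and report records that lack any of the five fields declared in the stream schema. The function names are hypothetical, and the path output/comments.jsonl is taken from the command above.

```python
import json
from pathlib import Path

EXPECTED_FIELDS = {"postId", "id", "name", "email", "body"}


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def records_missing_fields(records, expected=EXPECTED_FIELDS):
    """Return (index, sorted missing field names) for each incomplete record."""
    problems = []
    for index, record in enumerate(records):
        missing = sorted(expected - record.keys())
        if missing:
            problems.append((index, missing))
    return problems


if __name__ == "__main__":
    output_file = Path("output/comments.jsonl")
    # Only run the check when the pipeline has already produced the file.
    if output_file.exists():
        with output_file.open(encoding="utf-8") as handle:
            records = parse_jsonl(handle)
        print(f"{len(records)} records, problems: {records_missing_fields(records)}")
```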
Interacting with your new plugin
Now that your plugin is installed and configured, you are ready to interact with it using Meltano.
Use meltano invoke to run your plugin in isolation:
meltano invoke tap-my-custom-source
You can also use the --discover flag to see details about the supported streams:
meltano invoke tap-my-custom-source --discover
You can also use meltano select to parse your catalog and list all available entities and attributes:
meltano select --list --all
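For a sense of what "entities and attributes" means here, the sketch below walks a Singer catalog, the JSON document a tap prints in discovery mode, and lists each stream with its property names. It assumes the standard catalog layout (a top-level streams array whose entries carry tap_stream_id and a JSON schema). The list_entities function and the inline sample catalog are illustrative only.

```python
def list_entities(catalog):
    """Map each stream in a Singer catalog dict to its sorted attribute names."""
    entities = {}
    for stream in catalog.get("streams", []):
        # Prefer tap_stream_id and fall back to the stream name.
        name = stream.get("tap_stream_id") or stream.get("stream")
        properties = stream.get("schema", {}).get("properties", {})
        entities[name] = sorted(properties)
    return entities


if __name__ == "__main__":
    # Hand-written sample shaped like discovery output for the comments stream.
    sample_catalog = {
        "streams": [
            {
                "tap_stream_id": "comments",
                "stream": "comments",
                "schema": {
                    "type": "object",
                    "properties": {
                        "postId": {"type": ["integer", "null"]},
                        "id": {"type": ["integer", "null"]},
                        "email": {"type": ["string", "null"]},
                    },
                },
            }
        ]
    }
    for entity, attributes in list_entities(sample_catalog).items():
        print(entity, attributes)
```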
Add the Custom Extractor Plugin to Your Meltano Project
Once you have tested your custom extractor by installing and running it in its own repository, you might want to add it to a separate Meltano project.
If You Don't Have An Existing Meltano Project
Navigate to the parent directory of your custom extractor and run the following command:
meltano init
The above command will prompt you to enter a project name. You should enter a name like meltano-demo. Afterward, navigate into the newly created project using the below command:
cd meltano-demo
Meltano exposes each custom extractor configuration in the plugin definition, located in the meltano.yml project file.
To test the custom extractor as part of your Meltano project, you will need to add your custom extractor configuration in the meltano.yml file for your project.
There are two ways you can do this:
1. Use The Meltano Add Command
Run the command below to add the extractor as a custom extractor that is not hosted on the MeltanoHub registry:
meltano add --custom extractor tap-jsonplaceholder
You will be prompted to input the namespace URL. Choose tap-jsonplaceholder.
Also, choose -e ../tap-jsonplaceholder as the pip_url since you are working with a local extractor project. The value after -e should be the path to your custom extractor.
Go with the default executable name.
You can leave the capabilities and settings fields blank for now. The command will install the custom extractor to your Meltano project.
Added extractor 'tap-jsonplaceholder' to your Meltano project
2024-01-01T00:25:40.604941Z [info ] Installing extractor 'tap-jsonplaceholder'...
2024-01-01T00:25:53.152127Z [info ] Installed extractor 'tap-jsonplaceholder'
Alternatively, you can create a plugin definition YAML file locally and add it to your project using the --from-ref option:
tap-jsonplaceholder.yml
name: tap-jsonplaceholder
namespace: tap_jsonplaceholder
pip_url: -e ../tap-jsonplaceholder
meltano add --from-ref tap-jsonplaceholder.yml extractor tap-jsonplaceholder
The plugin name must be present in the YAML file to constitute a valid plugin definition; supplying it as a command argument is a no-op in this case.
As you develop your custom extractor, it is possible that its settings will change:
- New functionality is added, requiring new setting(s)
- Existing functionality is modified, requiring setting(s) to be renamed
- Existing functionality is removed, requiring setting(s) to be removed
In this case, you will need to update the extractor in your project. If you keep your plugin definition YAML file in line with changes to the tap as you go, this is a simple matter of running the previous command with the --update flag:
meltano add --update --from-ref tap-jsonplaceholder.yml extractor tap-jsonplaceholder
Add a JSONL target
Run the command below to add the JSONL loader that will write the extracted data stream to JSONL files:
meltano add loader target-jsonl
Run an ELT Pipeline That Loads Data into a JSONL File
The following command will run an ELT pipeline that loads data into a JSONL file:
meltano run tap-jsonplaceholder target-jsonl
Inspect the Loaded Data in the Outputs Directory
Run the following command to get the first five lines of the comments JSONL file:
head -n 5 output/comments.jsonl
2. Edit Your Existing Meltano.yml file
In your existing meltano.yml:
# ...
plugins:
  extractors:
  # Insert a new entry:
  - name: tap-my-custom-source
    namespace: tap_my_custom_source
    # Installs the plugin from a local path
    # in 'editable' mode (https://pip.pypa.io/en/stable/topics/local-project-installs/#editable-installs).
    # Can point to '.' if it's in the same directory as `meltano.yml`
    pip_url: -e /path/to/tap-my-custom-source
    # Name of custom tap that will be invoked.
    # Can be found in the pyproject.toml of your custom tap under CLI declaration
    executable: tap-my-custom-source
    capabilities:
    # For a reference of plugin capabilities, see:
    # https://docs.meltano.com/reference/plugin-definition-syntax#capabilities
    - state
    - catalog
    - discover
    config:
      # Configured values:
      username: me@example.com
      start_date: '2024-01-01'
    settings:
    - name: username
    - name: password
      sensitive: true
    - name: start_date
      # Default value for the plugin:
      value: '2010-01-01T00:00:00Z'
  loaders:
  # your loaders here:
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl
# ...
You can further customize the appearance of your custom extractor using the following options:
label
logo_url
description
Any time you manually add new plugins to meltano.yml, you will need to rerun the install command:
meltano install
Plugin Settings
When creating a new custom extractor plugin, you'll often have to expose some settings to the user so that Meltano can generate the correct configuration to run your custom extractor.
To properly expose and configure your settings, you'll need to define them:
- name: Identifier of this setting in the configuration. The name is the most important field of a setting, as it defines how the value will be passed down to the underlying component. Nesting can be represented using the . separator.
  - foo represents the { foo: VALUE } in the output configuration.
  - foo.a represents the { foo: { a: VALUE } } in the output configuration.
- kind (optional): Represents the type of value this should be (e.g. date_iso8601 for dates). Defaults to string.
- sensitive (optional, defaults to false): Indicates whether the setting is sensitive (e.g. a password, token or code).
- value (optional): Defines a default value for the plugin's setting.
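The nesting rule for setting names can be illustrated with a few lines of Python. This is a sketch of the mapping described above, not Meltano's own code. The expand_settings function is hypothetical, and it does not handle conflicting names such as foo and foo.a being set together.

```python
def expand_settings(flat):
    """Expand dotted setting names into a nested configuration dictionary."""
    nested = {}
    for name, value in flat.items():
        *parents, leaf = name.split(".")
        node = nested
        for key in parents:
            # Create intermediate dictionaries as needed.
            node = node.setdefault(key, {})
        node[leaf] = value
    return nested


if __name__ == "__main__":
    print(expand_settings({"foo": "VALUE"}))    # {'foo': 'VALUE'}
    print(expand_settings({"foo.a": "VALUE"}))  # {'foo': {'a': 'VALUE'}}
```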
Passing sensitive setting values
It is best practice not to store sensitive values directly in meltano.yml.
Note in our example above, we provided values directly for username and start_date but we did not enter a value for password. This was to avoid storing sensitive credentials in clear text within our source code. Instead, make sure the setting is set to sensitive: true and then run meltano config <plugin> set password <value>. You can also set the matching environment variable for this setting by running export TAP_MY_CUSTOM_SOURCE_PASSWORD=<value>.
You may use any of the following to configure setting values (in order of precedence):
- Environment variables
- config section in the plugin
- value of the setting's definition
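The precedence order can be sketched as a small lookup function. This is an illustration of the three sources listed above only, and Meltano itself may consult additional layers. The environment variable naming (plugin name uppercased, with dashes turned into underscores) is an assumption inferred from the TAP_MY_CUSTOM_SOURCE_PASSWORD example, and both function names are hypothetical.

```python
import os


def env_var_name(plugin_name, setting_name):
    """Build the environment variable name, e.g. TAP_MY_CUSTOM_SOURCE_PASSWORD."""
    # Assumption: dashes become underscores and the result is uppercased.
    return f"{plugin_name}_{setting_name}".replace("-", "_").upper()


def resolve_setting(plugin_name, setting_name, config, defaults, environ=None):
    """Return the first value found: environment, then config, then default."""
    environ = os.environ if environ is None else environ
    variable = env_var_name(plugin_name, setting_name)
    if variable in environ:
        return environ[variable]
    if setting_name in config:
        return config[setting_name]
    return defaults.get(setting_name)


if __name__ == "__main__":
    config = {"username": "me@example.com", "start_date": "2024-01-01"}
    defaults = {"start_date": "2010-01-01T00:00:00Z"}
    fake_env = {"TAP_MY_CUSTOM_SOURCE_PASSWORD": "s3cret"}
    for setting in ("username", "password", "start_date"):
        print(setting, resolve_setting("tap-my-custom-source", setting, config, defaults, fake_env))
```

In this example start_date resolves to the value from the config section rather than the default, and password is only available because the (fake) environment supplies it.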
Publishing to the world
Once you've built your tap and it is providing you the data you need, we hope you will consider sharing it with the world! We often find that community members who benefit from your tap also may contribute back their own improvements in the form of pull requests.
Publish to PyPI
If you've built your tap using the SDK, you can take advantage of the streamlined poetry publish command to publish your tap directly to PyPI.
- Create an account with PyPI.
- Create a PyPI API token for use in automated publishing. (Optional but recommended.)
- Run poetry build from within your repo to build.
- Run poetry publish to push your latest version to the PyPI servers.
Test a pip install
We recommend using pipx to avoid dependency conflicts:
pip3 install pipx
pipx ensurepath
python -m pipx install tap-my-custom-source
After restarting your terminal, this should also work without the python -m prefix:
pipx install tap-my-custom-source
Or if you don't want to use pipx:
pip3 install tap-my-custom-source
If you have gotten this far... Congrats! You are now a proud Singer tap developer!
Make it discoverable
Once you have your tap published to PyPI, consider making it discoverable for other users of Meltano.
Updates for production use
Once your repo is installable with pip, you can reference this in your meltano.yml file with three quick steps:
- Add a pip_url property to your extractor definition, for example pip_url: tap-my-custom-source.
  - Alternatively, you can also install the latest from your git repo directly using this syntax: pip_url: git+https://github.com/myusername/tap-my-custom-source@main
- Replace /path/to/tap-my-custom-source.sh with just the executable name: tap-my-custom-source.
- Rerun meltano install to use the version from pip in place of the local test version.