The Integration Runtime (IR) is the compute powering activities in Azure Data Factory (ADF) or Synapse Pipelines. There are a few types of Integration Runtimes:
Azure Integration Runtime – serverless compute that supports Data Flow, Copy and External transformation activities (i.e., activities that run on external compute such as Databricks, with the IR acting only as an orchestrator).
Self-hosted Integration Runtime (SHIR) – the same software as the Azure IR, packaged as an MSI that you install on your own physical or virtual infrastructure. You can have up to 4 nodes in a cluster. It supports Copy and External transformation activities.
Azure SSIS-IR – supports executing SSIS packages in a managed Azure compute environment.
Important! Data Flows execute on Spark clusters, while Copy and External transformation activities execute on Windows OS.
You specify which Integration Runtime a connection uses in the Linked Service definition.
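As a rough illustration of where that setting lives, the sketch below builds a linked service definition as a Python dict and prints it as JSON. The `connectVia` block with type `IntegrationRuntimeReference` is the part that names the IR; the linked service name, the IR name and the placeholder connection string are made up for this example, and the helper `linked_service` is hypothetical rather than part of any SDK.

```python
import json


def linked_service(name, ls_type, type_properties, ir_name=None):
    """Build a linked service definition as a plain dict (illustrative helper).

    If ir_name is None, no connectVia block is emitted and the service
    falls back to the default Azure IR.
    """
    properties = {"type": ls_type, "typeProperties": type_properties}
    if ir_name is not None:
        # This reference is what ties the linked service to a specific IR.
        properties["connectVia"] = {
            "referenceName": ir_name,
            "type": "IntegrationRuntimeReference",
        }
    return {"name": name, "properties": properties}


definition = linked_service(
    name="OnPremSqlServer",            # hypothetical linked service name
    ls_type="SqlServer",
    type_properties={"connectionString": "<your-connection-string>"},
    ir_name="MySelfHostedIR",          # hypothetical IR name
)
print(json.dumps(definition, indent=2))
```

Changing the IR for a data store then comes down to changing `referenceName` in this one place.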
In this blog we will focus on the data movement activity and how to choose the most suitable compute for your scenario.
The first question to ask is whether the data sources that you are connecting to are publicly accessible. If so, then you can use the Azure Integration Runtime to connect to those.
Please note that a single IR connects to both your source and destination data stores. Hence, if either one is in a private network, you need an IR that can reach it, and that is the IR that will be used at runtime.
If your data store is behind a firewall or in a private network (Azure virtual network, on-premises, other cloud virtual networks, etc.), you have a few options:
1. Use a SHIR and allow its IP address(es) on the data source.
2. Use an Azure IR with managed virtual network (VNet) enabled.
3. Allow the IP address ranges of the Azure IR on the data source. This is the simplest and cheapest option to set up, but it is often not desirable in highly secure production environments: it carries a higher operational cost over time and provides the lowest security of the three.
For the managed VNet Integration Runtime, a dedicated virtual network is created only for you and managed by us. This allows you to create a private endpoint between the IR and your data store. No compute is kept idle in the VNet, so when you execute your first activity on this IR there is some queueing time while we inject the VMs into the VNet. A time-to-live (TTL) setting keeps the compute available after an activity finishes, which lets you avoid this queueing for subsequent activities.
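To make the effect of the TTL concrete, here is a toy model, not a description of the service's actual scheduler. It assumes activities are instantaneous, that a cold start costs a fixed `warmup`, and that compute stays warm for `ttl` after each activity starts running. The function name and all numbers are invented for illustration.

```python
def queue_times(start_times, warmup, ttl):
    """Return the warm-up wait each activity pays, in the same time unit
    as the inputs (toy model with a fixed cold-start cost)."""
    waits = []
    warm_until = None  # time until which compute is assumed to stay warm
    for t in sorted(start_times):
        if warm_until is not None and t <= warm_until:
            wait = 0          # compute is still warm, no queueing
        else:
            wait = warmup     # cold start: VMs must be injected first
        waits.append(wait)
        # The TTL window restarts once this activity actually runs.
        warm_until = t + wait + ttl
    return waits


# Three activities submitted at minutes 0, 5 and 30.
print(queue_times([0, 5, 30], warmup=2, ttl=10))  # [2, 0, 2]
print(queue_times([0, 5, 30], warmup=2, ttl=0))   # [2, 2, 2]
```

In this model a TTL only helps when activities arrive closer together than the TTL window, which is worth checking against your own schedule before settling on a value.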
When choosing the managed VNet IR you get a secure, fully isolated, and highly available compute option which allows you to run up to 50 concurrent activities.
For comparison, to get the same computing power with a SHIR as with the managed VNet IR, a rule of thumb is to start by allocating at least 1 vCore per job. Different jobs consume different resources, so to establish what your workload needs it is best to test with a representative dataset and then extrapolate the results. For example, a job that does data type conversion will consume more resources than one that doesn’t.
For most production scenarios where a data store is in a private network, we will be choosing between options 1 and 2. There isn’t a one-size-fits-all approach. Let’s look at some important considerations to help you navigate this decision:
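The rule of thumb can be turned into a quick back-of-the-envelope calculation. The sketch below takes the article's starting point of 1 vCore per concurrent job and the 4-node cluster limit; the function name `size_shir` and the `vcores_per_job` knob are hypothetical, and the result is only a starting estimate to validate against a representative test run.

```python
import math

MAX_SHIR_NODES = 4  # cluster limit mentioned earlier in the article


def size_shir(concurrent_jobs, vcores_per_job=1.0, nodes=2):
    """Return (total_vcores, vcores_per_node) for a SHIR cluster.

    vcores_per_job starts at the 1 vCore rule of thumb; raise it if
    testing shows heavier jobs (e.g. data type conversion).
    """
    if not 1 <= nodes <= MAX_SHIR_NODES:
        raise ValueError(f"nodes must be between 1 and {MAX_SHIR_NODES}")
    if concurrent_jobs < 0 or vcores_per_job <= 0:
        raise ValueError("concurrent_jobs must be >= 0 and vcores_per_job > 0")
    total = math.ceil(concurrent_jobs * vcores_per_job)
    per_node = math.ceil(total / nodes)  # round up so capacity is not short
    return total, per_node


print(size_shir(50))                               # (50, 25)
print(size_shir(50, nodes=4))                      # (50, 13)
print(size_shir(10, vcores_per_job=1.5, nodes=2))  # (15, 8)
```

Note that this splits the load evenly and leaves no headroom for a node failure; if the cluster must survive losing a node, size it as if it had one node fewer.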
How many concurrent activities do you need to execute?
Is your workload predictable?
In general, if your workload is predictable and you don’t run many concurrent jobs, or you can tolerate the latency caused by queueing jobs due to lack of capacity, you can size your SHIR cluster to fit the load.
In scenarios where it is hard to predict the load because multiple teams or projects share the same SHIR, it is better to choose the managed VNet IR. To put it simply, it is difficult to size something for an unknown load. We can of course oversize it, but then we end up paying for and managing an oversized SHIR cluster. The managed VNet IR has built-in high availability and can handle up to 50 concurrent activities. Hence you get a serverless option that can handle very high load.
On the other end of the spectrum, there are scenarios where you don’t have concurrent activities running, and maybe your pipeline contains only small copy jobs and external transformation activities. In such scenarios, it might be better to look at having a small SHIR. We recommend a two-node cluster with VM sizes appropriate for the workload. You could also have a single-node cluster if your workload can tolerate delays caused by hardware or software failure.
Where are your data stores located?
For on-premises data stores, when using the managed VNet IR you need some additional infrastructure to connect to the on-premises environment. This is due to the way private endpoints (and Private Link Service) work. For an example with SQL Server, please see this tutorial.
If this additional infrastructure already exists (e.g. ExpressRoute or site-to-site VPN, the load balancer, etc.), then the managed VNet IR is the best option. Otherwise, setting up a SHIR could be the simpler option.
Do you plan a one-time data migration or periodic data loads?
For one-time migrations, you could leverage existing on-premises commodity hardware. For periodic data loads, however, it might be better not to create dependencies on commodity hardware, so that you can decommission it. Hence running the SHIR on Azure VMs or leveraging the managed VNet IR might be a better solution.
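The considerations above can be condensed into a small decision helper. This is a deliberately simplified sketch of the trade-offs discussed in this post, not an official rule: the `Scenario` fields and the `suggest_ir` function are hypothetical names, and real decisions will also weigh cost, security requirements and team preferences.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    private_network: bool                # is any data store not publicly reachable?
    predictable_load: bool = True
    high_concurrency: bool = False
    on_premises: bool = False
    private_link_infra_exists: bool = False  # e.g. ExpressRoute/VPN, load balancer


def suggest_ir(s: Scenario) -> str:
    """Return a first suggestion for which IR to evaluate."""
    if not s.private_network:
        return "Azure IR"
    if s.on_premises:
        # Managed VNet IR needs extra infrastructure to reach on-premises stores.
        return "Managed VNet IR" if s.private_link_infra_exists else "SHIR"
    if not s.predictable_load or s.high_concurrency:
        # Hard-to-size or heavy load favors the serverless option.
        return "Managed VNet IR"
    return "SHIR"  # small, predictable load: a sized cluster can fit it


print(suggest_ir(Scenario(private_network=False)))
print(suggest_ir(Scenario(private_network=True, predictable_load=False)))
print(suggest_ir(Scenario(private_network=True, on_premises=True)))
```

Treat the returned value as the option to evaluate first, then confirm it with a test on a representative dataset.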
I hope by this time we’re all on the same page that this decision requires careful consideration of the many aspects that influence it. I also hope that you are now better armed to embark on this decision journey.
Let me know in the comments if you have any questions or if there are important considerations that I have missed.