This is going to be a short post, which I am writing as a reminder for my future self as well as any of you reading out there.
If you are troubleshooting an error with your pipeline, especially if you work in a database system that has layered views, make sure you fully dig down through all layers of your data setup before you open a ticket with Microsoft. I learned this the annoying way over the course of a work week spent on a support ticket for a very strange issue I was seeing in a pipeline that uses a Synapse Serverless SQL Pool. We had checked so many things that week with no luck in changing how the pipeline ran, and then the pipeline simply went back to working when I ran it outside of its schedule.
The error I was seeing made it look like the Serverless SQL Pool views, which use an OPENROWSET call to a parquet file, were referencing the wrong source file even though I confirmed multiple times that the view definitions were correct. For example, the view was written to use parquet files under TestDatabase/MyTable/** as the source files, but the error made it seem like the view was instead pulling data from TestDatabase/OtherRandomTable/**, which was confusing to say the least. I thought the Serverless node was broken or had a bug that was making the views look at “OtherRandomTable” files instead of the correct ones.
The Cause
The error happened because multiple views used a CROSS APPLY to another view tied to a parquet file in a data lake, and that parquet file was being deleted and recreated by a parallel pipeline. When the failing pipeline queried its views, it couldn’t resolve that base view because the source file had not yet been recreated by the parallel pipeline. This makes sense and is so obvious in hindsight, but it took Microsoft support directly asking me about nested view definitions to make me realize that I had a view referencing another view, whose definition I also needed to check.
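To make the shape of the problem concrete, here is a minimal sketch of the kind of layered view setup that bit me. The view names, columns, paths, and external data source are hypothetical stand-ins, not my actual definitions:

-- Base view: reads the parquet files that the parallel pipeline deletes and recreates.
CREATE VIEW dbo.vw_OtherRandomTable AS
SELECT *
FROM OPENROWSET(
    BULK 'TestDatabase/OtherRandomTable/**',
    DATA_SOURCE = 'MyDataLake',   -- hypothetical external data source name
    FORMAT = 'PARQUET'
) AS base_rows;
GO

-- The view my pipeline actually queries: its own OPENROWSET path is correct,
-- but the CROSS APPLY to the base view is what surfaced the "wrong" path in the error.
CREATE VIEW dbo.vw_MyTable AS
SELECT src.*, lkp.SomeValue AS LookupValue
FROM OPENROWSET(
    BULK 'TestDatabase/MyTable/**',
    DATA_SOURCE = 'MyDataLake',
    FORMAT = 'PARQUET'
) AS src
CROSS APPLY (
    SELECT TOP 1 b.SomeValue
    FROM dbo.vw_OtherRandomTable AS b
    WHERE b.KeyColumn = src.KeyColumn
) AS lkp;

In a setup like this, if the files under TestDatabase/OtherRandomTable/** are missing at query time, selecting from dbo.vw_MyTable fails with an error pointing at the base view’s path, even though dbo.vw_MyTable’s own definition is correct.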
The change I needed to make was to update the pipeline triggers so that the process deleting and recreating the base parquet files would be finished before the second pipeline ran and tried to use them.
If I had done my due diligence and dug through every layer of the data environment, which I am normally good at in other scenarios, I would have quickly and easily discovered the issue myself. But sometimes we need to learn the hard way because our brains aren’t running at full capacity. (It also helped that I finally had dedicated time set aside for this problem, so I wasn’t trying to multitask multiple work items at once.)
Summary
If you are troubleshooting ETL failures of any kind, make sure you dig down through all layers of the process to ensure you have checked everything possible related to your failure before reaching out to support. They’ll happily help you find what you missed, but it will save everyone time if you can figure it out yourself first.
If you use a Hive Linked Service in your Azure Synapse workspace, you have probably been getting a banner at the top of your workspace every time you sign in telling you that you have linked services that are about to be deprecated so you should update them. If, like me, you go ahead and do that upgrade from version 1.0 to version 2.0, which is touted as being as simple as toggling a switch, you may have been disappointed to find that you now have an error with your connector that prevents your pipelines from running. If this is relatable to you, keep reading because I will tell you how you can resolve the issue.
Fortunately or unfortunately, depending on how you look at it, I was a preview tester for Microsoft for the fix to the Hive V2.0 connector issue because the problem had already been reported and had a fix in progress when I created a ticket with Microsoft support. The solution has now been released and is available for you to implement if you need it.
For a couple of months, my team was greeted by a banner every time we opened one of our workspaces that uses a Hive linked service, telling us that we had a linked service version that was going to be deprecated soon. In August, I finally had enough of the banner and did the requested upgrade of our Hive Linked Service to version 2.0. Making the change was simple enough, just switching the radio button selected in the settings of the linked service (LS), and I didn’t even have to fix any settings of the LS after making the switch like I had to do when upgrading our Oracle connectors. All previously saved settings stayed where they were, so I saved the change and merged it.
A couple days later, on the next run of our pipeline that uses that linked service, we got the following error on the “Copy Data” activity which pulls data from the Hive data source into our data lake.
"ErrorCode=FailedToConnectMSDISource,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failed to connect to the connector. Error code: 'AuthenticationError', message: 'One or more errors occurred., ErrorCode: AuthenticationError',Source=Microsoft.DataTransfer.Connectors.MSDIBridgeAdapter,''Type=Microsoft.DI.Driver.Hive.Exceptions.HiveConnectorException,Message=One or more errors occurred., ErrorCode: AuthenticationError,Source=Microsoft.DI.Driver.Hive,''Type=Apache.Arrow.Adbc.Drivers.Apache.Hive2.HiveServer2Exception,Message=An unexpected error occurred while opening the session. 'Required field 'client_protocol' was not present! Struct: TOpenSessionReq(client_protocol:null)',Source=Apache.Arrow.Adbc.Drivers.Apache,''Type=Thrift.TApplicationException,Message=Required field 'client_protocol' was not present! Struct: TOpenSessionReq(client_protocol:null),Source=Apache.Arrow.Adbc.Drivers.Apache,'",
The important part of that error message is: An unexpected error occurred while opening the session. 'Required field 'client_protocol' was not present!'
As any developer does, I started googling that error but didn’t find any pages with the same error for Azure Synapse-related Hive issues. I did, however, find several similar errors from other uses of Hive that were caused by version mismatches between the client and the server: essentially, the client got upgraded beyond what the server supported, which caused a similar error for others.
For my pipeline, I do not control the Hive data source that we were failing to pull from because that is managed by one of my company’s customers. I tried to get information about their Hive version to see if it was something they could fix, but I quickly hit a dead end and they told me to make a support ticket with Microsoft. Which I did.
Thankfully, Microsoft was immediately helpful and I got to work with a very kind and knowledgeable support woman, who quickly determined that the issue I was having was one the engineering team was already working on resolving. Great! There was a fix already in the works! A couple days after being told I just needed to wait for the fix to be sent out, I received a message from the support agent asking if I would test out their proposed fix to the problem, which was an updated version of the Self-Hosted Integration Runtime (SHIR) that allows us to connect our systems to our customer’s.
How to Resolve the Problem
The Self-Hosted Integration Runtime (SHIR) that Microsoft wanted me to test was easy to install on the server where we normally run our SHIR for the workspace I was upgrading. They provided a download link, so I grabbed the installer, installed the SHIR like you normally would, and it upgraded in place. The version of the SHIR I tested for them was “IntegrationRuntime_5.58.9364.4”.
Once I installed that on our server, I updated the Hive connector again (since I was forced to downgrade while the solution was being developed so I could continue getting our data) and this time, the test connection worked. I then committed the change and ran my pipeline, and it successfully completed the “Copy Data” activity that had been previously failing with the 2.0 version of the connector.
The only problem we have now is that the 2.0 version of the connector seems to have changed how data types are handled for Parquet files in Data Lakes, so now I need to remap data types between our Data Lake and the final data warehouse to get the pipeline completely working again. I had to remove the test SHIR from our production system at that point anyway since they told me we couldn’t use it in production yet, so when I do the official upgrade to the 2.0 Hive connector, I will also need to make that data mapping fix.
As of writing this post, the SHIR that I believe contains the fix is live on the site where you normally download that software, except the version is slightly higher. What’s available on their download site is “IntegrationRuntime_5.58.9377.1”. If you are having the same issue I am with this connector upgrade, give this latest SHIR a try to see if that resolves your issue like the test version did for mine. I will update this post once I have done my final upgrade and have confirmed that the SHIR version available online fixes my problem for good.
Summary
If you started to experience connection failures on your Hive linked service in Synapse after changing the version of the linked service from V1.0 to V2.0, update your Integration Runtime to version 5.58 or later to get the problem fixed. When updating the linked service, there is no mention of needing to upgrade the SHIR unless it is really out of date (when I first tried upgrading the connector I couldn’t because we were on SHIR 5.52), so you will need to be aware of this mismatch yourself.
Let me know down in the comments if you also ran into this issue with the upgrade!
Today at TechCon365 in Atlanta, I did another full-day pre-con session, this time covering everything in the realm of Microsoft Fabric. Before today, I knew hardly anything about Fabric, aside from it being something Microsoft was pushing our team to use for a new project I wasn’t involved with. The workshop, which didn’t have any hands-on work but was still good, was led by John White and Jason Himmelstein. The two speakers did an excellent job of teaching all aspects of Fabric without being boring. This was the introduction to Fabric as a concept that I’ve been needing for months.
Note: The content below is not guaranteed to be accurate, it’s just what I took away from the workshop. In the coming months, I will be learning more about these tools myself to fact check everything and see if the new tools will work for my team.
For me, the biggest takeaway from today’s workshop is that Microsoft intends to migrate its customers off Synapse as the data movement platform and into Fabric and its OneLake. As of now, Synapse as a product no longer has a product manager (according to the presenters), which means that Microsoft does not intend to make any feature updates to the tool. There is still a maintenance team to fix any issues that arise and keep the product running, but there are no long-term plans for it anymore.
Synapse going away is concerning to me because my company only recently migrated to Synapse off of SSIS. My fear is that if we spend a ton of time and energy converting our ETL systems to Fabric, then as soon as we get the work done, we’ll need to start anew with whatever product line Microsoft releases next, just like we did with SSIS to Synapse. I understand that cycles like this are everyday occurrences in the technology sector, but I also know that my boss and the executives at my company are likely not going to be happy when I tell them what I learned in today’s workshop.
If you’re in the same situation as my team, fully invested in Synapse and maybe just wrapped up getting there, and you are wary of the next transition to Fabric, don’t worry too much. The presenters did assure us that Synapse will be supported for a while yet; no concrete plans have been announced to retire it. Based on that, John and Jason recommended that teams stop doing any new development in Synapse and instead start doing new development in Fabric, to serve as real-world test cases of how useful it can be. I haven’t yet done anything like that, so the usefulness is still unknown to me personally.
Hello Fabric
Fabric is being sold as the one-stop shop for all things data in the Microsoft universe, the future of all data analytics and processing. No longer is Fabric just Power BI 2.0; now it will serve as the location for anything you need to do with data. While the sales pitches are great, I am wary of the transition, as I said above. I dislike when companies claim that their new tool is going to solve all your problems, because nothing is ever that great in reality. But here is what I learned today.
OneLake: Put every speck of data here
With each tenant, Microsoft gives you a single OneLake, because it should be your one data lake for your organization, just like there is only a single OneDrive for each user. No more having dozens of storage accounts to act as separate data lakes for different processes. Now you get the one and you will like it. More details I learned:
Parquet files: the reigning champions
A “Shortcut” to the source data
If you want to get data from a source system, like your Oracle database for PeopleSoft finance, you can add a “Shortcut” to the data source without having to pull the data into OneLake with a normal ETL process.
This concept is the most intriguing to me, and I really want to see how it plays out in real life with real data and systems. It could be too good to be true.
Security redefined
They’ve developed a new way to manage security for the data that is supposed to reduce overhead. They want to have you manage security in only one place, just like you’ll be managing all your data in one place with OneLake.
This feature is still in preview, so it remains to be seen whether it works as well as Microsoft claims.
File Explorer: An old but good way to explore your data
A new plugin for Windows has been created that will allow you to view your OneLake data through the Windows File Explorer, like they’re normal files. Which they are, if they’re Delta-Parquet flat files. The new feature is OneLake File Explorer.
You can also view your OneLake data files through Azure Storage Explorer, which is a good second option as well since my team already uses that for other purposes.
For the OneLake concept, I like that it is trying to prioritize reusing what you already have–tools, data, reporting frameworks–so that you don’t have to keep reinventing the wheel for each new data flow process you need to create. The concepts are good, so I’m eager to use them for myself after the conference.
Lakehouse
This is a new way to store data that is offered exclusively by OneLake in the Microsoft ecosystem. It combines the concepts of data lakes and data warehouses by using Delta Parquet format files, giving you ACID-compliant transactions on the data files while still having compressed, cheap storage like a data lake.
Data Factory
This is the part of Fabric that is replacing Synapse pipelines. In the demos during the workshop, I could see that Dataflow Gen 2s and the data pipelines look almost exactly like what is currently available in Synapse, which is promising, because at least I won’t need to learn a new GUI. Like in Synapse, the Dataflows are still backed by a Spark engine to allow for parallel processing and high throughput. Both tools allow you to move data from point A to point B (the OneLake) for the scenarios where Shortcuts won’t work and you still need to literally move data between systems.
Fabric Native Databases
If the other new data storage types in OneLake and Fabric weren’t enough for you, there are now also going to be Fabric-native databases available for use. These will be SQL Databases (and some other database types) that you create directly in Fabric instead of creating them independently and then joining them to the OneLake with a Shortcut. This change is intended to take Fabric from a data analytics platform to a full data platform. However, the presenters of today’s workshop did caution against putting your production systems into these native databases yet, since they’re still new and not well tested in the real world. But maybe in the near future, we truly will be doing 100% of our data integration engineering in Fabric.
Summary
With this technology, it feels like we could be at the cusp of a new big wave of change for how we operate in the data realm, so I am really interested to see how things play out in the next year or two. Using Fabric for all things data could be the next big thing like Microsoft is trying to sell, or maybe it will simply fade into being just another option in the existing sea of options for data manipulation and analytics in the Microsoft ecosystem.
Do you think Fabric is going to be the next big thing in data technology like Microsoft claims?
A task I completed a while ago and wanted to share was creating two different pipelines for a business user to load data from one of their databases to another server, and back, for monthly processing. The business user requested that they receive an email if the pipeline failed so that they would know sooner rather than later that there had been a problem with processing. Unfortunately, there isn’t currently a way to natively send an email from a Synapse Analytics pipeline; there is no built-in activity for that (that I’m aware of). To enable the functionality the business user requested, I had to create a custom system to send an email.
Since there isn’t a native activity within Synapse Analytics that lets you send an email, the workaround I created is to build a Logic App that sends an email when triggered by an HTTP request, then add a Web activity to the pipeline that calls the Logic App and passes through error information so it can generate a useful email. Keep reading to learn the details of how to set up such a process.
Creating a Logic App to Send an Email
1. Click on the component “Add a trigger” in the designer. In the search box, type “request” and then select the option that says “When an HTTP Request is received”.
2. Click on the link to “Use sample payload to generate schema”.
3. Enter a simple JSON-formatted string containing the values you will need to use to send the email (see the sample payload after this list).
4. Click “Done” when you’ve included all the parameters you want to include in your incoming JSON values, and your simple JSON will be automatically formatted in the exact way needed for the payload to handle the request.
5. For the METHOD that will be used for the HTTP request, select “POST”.
6. Click into the open area of the designer to close the editor for the HTTP request component, then click the “+” under the component to add another to the flow.
7. In the search box, search for “outlook”, scroll down until you see the section for “Office 365 Outlook”, and click the component for “Send an Email (v2)”.
8. If you haven’t logged into an email account through the Logic App yet, it will request that you do so in order to start specifying how the email should be sent. Go through the process of signing into an email account. I recommend using a service account email instead of your own email, unless you’re the only person that will ever be working on this process in the future.
9. Click into the “To” box in the properties of the “Send an Email” component, and you’ll see two blue icons pop up to the left of the box that will allow you to dynamically enter information into that text box. Click the top icon that looks like a lightning bolt to pick values from the previous step of the process.
10. In the box that opens, select the “to” value from the HTTP request component to fill into the “To” box of the email properties.
11. Repeat the same steps for the Subject of the email and the Body of the email.
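For step 3, the sample payload only needs to be a flat JSON object listing the values you plan to pass in; the designer then generates the full schema from it. Here is a hypothetical example that matches the parameter names I use later in the pipeline (to, subject, emailbody); the values themselves are placeholders:

{
    "to": "recipient@example.com",
    "subject": "Synapse pipeline failure",
    "emailbody": "Details about the failure go here."
}

The property names you use here are the ones that will show up as dynamic content options in the “Send an Email (v2)” component.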
Adding the Email Capability to your Synapse Pipeline
After saving the Logic App, you will be ready to use it from your Synapse pipeline. Once it has been saved for the first time, you will see that a URL has been generated for the HTTP request component, which you will use to call the Logic App from your pipeline. Copy that URL.
Return to your Synapse pipeline so that we can set it up to call the Logic App to send an email. Since we essentially created an API which we can hit to send an email, we need to add a “Web” component to our pipeline. In my case, I want to send an email to a business user when the pipeline fails since they are the key party interested in the pipeline failing. For that, I have added my “Web” component as the “On Failure” route of my pipeline (red “x”/line).
Open the “Web” activity to set it up to hit your new API. Give the activity whatever name you would like; in my case it is named “Send Failure Email 1”. Then go to the “Settings” tab and paste the URL into the URL field. For “Method”, choose “POST” since we want to send information to the URL from our pipeline.
Next, we need to write the body of text to send to the API. This Body will need to match the formatting that we specified in the Logic App when we configured the JSON Payload. In that payload, I specified the following parameters should be required: To, Subject, and EmailBody. Based on that, I have written my Body for the Web activity as the following:
{
    "to":"@{pipeline().parameters.EmailRecipient}",
    "subject":"@{pipeline().parameters.EmailSubject}",
    "emailbody":"Pipeline failed when trying to copy data for table @{item().TableName}."
}
When working with the expression builder in Synapse, you need to be very particular about formatting when using pipeline parameters or variables in an expression. Pay close attention to the double-quotes, the curly brackets, and the @ symbols. In my HTTP Body, I set the values of two of the parameters I send to the Logic App through pipeline parameters so that I only had to change them in one place and not in every iteration of the email component in my pipeline. If you don’t want to use parameters like that, you could hard-code the values of the key-value pairs in the body to whatever you would like, similar to how I set the value of emailbody.
Click “OK” to close the expression builder panel. The very last thing we need to add to our “Web” activity is a header for the POST HTTP message; we only need one. Click the “+ New” button to add it, then enter the Name as “Content-Type” and the Value as “application/json”.
Testing the Pipeline Email Capability
Once you have finalized the addition of the “Web” activity in your pipeline, you are ready to Debug and test the changes, which you can do in the manner you normally test your Synapse pipelines. For me, I knew my pipeline was set up the way I wanted when I created a false failure scenario (since the email only sends when a component fails), ran a debug, watched the previous component fail, and received an email containing the failure information I wanted. Your testing may look slightly different.
Summary
Although Synapse Analytics doesn’t have a built-in feature for sending an email from a pipeline, you are able to build that functionality yourself fairly easily using a Logic App and a Web activity in your pipeline. I would love to hear if you found this method useful or implemented it yourself. Thanks for reading!