Create Azure Pipeline to Blob Storage on USA Gov Cloud

I have created a Pipeline on my SingleStore v7.8.1 cluster that points to a Storage Account on the Azure Commercial cloud. I was able to load a CSV data file and have SingleStore ingest it. Great for my prototyping.

What I really need to do for my Production use case is the same thing, but have the Pipeline link to a Storage Account on the USA Azure GOVERNMENT Cloud. My Production SingleStore managed Cluster is installed on the Azure GOV Cloud.

When I try to create the Pipeline on the GOV cloud, I get an error. The error occurs because CREATE PIPELINE looks for the Storage Account endpoint on the Commercial cloud. So maybe the solution to my problem is setting the endpoint suffix of the Storage Account URL before I call CREATE PIPELINE.

CREATE PIPELINE library
AS LOAD DATA AZURE 'my-container-name'
CREDENTIALS '{"account_name": "your_account_name", "account_key": "your_account_key"}'
INTO TABLE classic_books
FIELDS TERMINATED BY ',';

ERROR 1933 ER_EXTRACTOR_EXTRACTOR_GET_LATEST_OFFSETS: Cannot get source metadata for pipeline. Get “https://psnadevdiag.blob.core.windows.net/cad-container?comp=list&include=metadata&restype=container”: dial tcp: lookup psnadevdiag.blob.core.windows.net on xx.XXX.XX.XX:XX: no such host

https://psnadevdiag.blob.core.windows.net/cad-container (the Commercial cloud endpoint, which works)

https://psnadevdiag.blob.core.usgovcloudapi.net/cad-container (the GOV cloud endpoint, which fails)

Anyone know how to resolve this issue?

We indeed don't currently give you a way to connect to a custom Azure "service URL". I've filed a feature request to add support.

In the meantime, the best thing I can suggest is (unfortunately) to attempt to work around this at the host/DNS level. Do you have enough control of the system to redirect traffic from core.windows.net to core.usgovcloudapi.net?
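For example (purely as a sketch, and assuming you have root access to the hosts that run the pipeline; the IP address below is a placeholder you would need to look up yourself):

# Find the address of the GOV-cloud endpoint...
nslookup psnadevdiag.blob.core.usgovcloudapi.net

# ...and pin the Commercial-cloud name to that address in /etc/hosts on each node
52.227.x.x   psnadevdiag.blob.core.windows.net

Note that TLS may still object, since the GOV-cloud endpoint will present a certificate for *.blob.core.usgovcloudapi.net rather than *.blob.core.windows.net, so this may be only a partial workaround.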

Thanks for the suggestion. However, that is not something I can do, since I have one customer on the Azure Commercial cloud while all the others are on the GOV cloud. So I cannot make the suggested DNS change.

I do have a workaround idea. I would use the Azure Commercial cloud to create a pipeline, upload my CSV data files to a Blob Storage container, and have the Pipeline ingest the data.

I would want to delete the data files from the Storage account almost immediately. My understanding is that the Pipeline cannot do that. However, I could run the following query against the information_schema database:

SELECT * FROM information_schema.PIPELINES_FILES WHERE FILE_STATE = 'Loaded';

I could then iterate over this list and DROP each data file from the Pipeline. I am not sure whether that also deletes it from the Storage account (which I would want to do).
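If it helps anyone spot a flaw in this plan, the per-file statement I have in mind would presumably look something like the following (the file name is just an example; I have not verified whether DROP FILE touches the blob itself, or what happens if the blob is still in the container afterwards):

ALTER PIPELINE library DROP FILE 'classic_books_2022_05_01.csv';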

As I write this, I am thinking I may need to use a SingleStore client API (or a stored procedure) to execute the DROP FILE iteration, and something separate to delete the files from the Storage account (if those turn out to be two different operations).
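To make my own thinking concrete, here is a rough sketch of what such a script might look like. This is only an outline, assuming Python with a MySQL-compatible driver (pymysql) and the azure-storage-blob SDK; the host names, credentials, and file-name handling are all placeholders, and I have not verified the DROP FILE behaviour against a live cluster:

import pymysql
from azure.storage.blob import BlobServiceClient

# Connect to the SingleStore cluster over the MySQL wire protocol (placeholder details).
conn = pymysql.connect(host="my-singlestore-host", user="admin",
                       password="***", database="mydb")

# Connect to the Commercial-cloud Storage account (placeholder connection string).
blob_service = BlobServiceClient.from_connection_string(
    "DefaultEndpointsProtocol=https;AccountName=psnadevdiag;"
    "AccountKey=***;EndpointSuffix=core.windows.net")
container = blob_service.get_container_client("cad-container")

with conn.cursor() as cur:
    # Find every file the 'library' pipeline has finished loading.
    cur.execute(
        "SELECT FILE_NAME FROM information_schema.PIPELINES_FILES "
        "WHERE PIPELINE_NAME = 'library' AND FILE_STATE = 'Loaded'")
    for (file_name,) in cur.fetchall():
        # Delete the blob first so the pipeline cannot pick the file up again.
        # FILE_NAME may include a path prefix that needs trimming for delete_blob().
        container.delete_blob(file_name)
        # Then drop the file from the pipeline's metadata; as far as I can tell
        # this only forgets the load history and never touches the blob itself.
        cur.execute("ALTER PIPELINE library DROP FILE %s", (file_name,))

conn.close()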

Anyone with knowledge please chime in. I will have to figure this out next.