BizTalk and Certificate Revocation Lists (CRLs) - Part I

(This post follows on from last week's post Web Services and the 15 Second Delay).


Having accounted for 15 secs of the 45 secs latency I'd been observing in my BizTalk application, I started looking at where the additional 30 secs was coming from.

 

When I looked at my log files, I noticed the entire process from start to end (including calling external Web Services) was now taking just over three seconds. However, when invoking the BizTalk Web Service from a test client, the measured latency was still over 30 seconds. Something wasn't right…

 

I put logging code in the BizTalk Web Service around the ServerProxy.Invoke() call (the method call that submits the received SOAP message to BizTalk).
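(For illustration, here's a minimal sketch of the kind of timing code I mean. It's not the generated BizTalk code – the class name is hypothetical, and the real ServerProxy.Invoke() parameters, which the generated Web Service already supplies, are elided.)

```csharp
using System;
using System.Diagnostics;

// Minimal sketch of timing instrumentation around the hand-off to BizTalk.
// The class name is hypothetical; the actual ServerProxy.Invoke() arguments
// are whatever the generated BizTalk Web Service already passes.
public class BizTalkWebServiceTiming
{
    public void InvokeWithTiming()
    {
        Stopwatch timer = Stopwatch.StartNew();
        Trace.WriteLine("Before ServerProxy.Invoke: " + DateTime.Now.ToString("HH:mm:ss.fff"));

        // serverProxy.Invoke(/* parameters supplied by the generated Web Service */);

        Trace.WriteLine("After ServerProxy.Invoke: " + DateTime.Now.ToString("HH:mm:ss.fff") +
                        " (" + timer.ElapsedMilliseconds + " ms)");
    }
}
```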

In my BizTalk application, the BizTalk Web Service is used to submit a message to the MessageBox, via the SOAP Adapter and a static request-response Receive Port using the PassThru pipeline. There is an Orchestration which subscribes to messages coming from this receive port, processes them, and then returns a response back to the receive port (and therefore the Web Service). My logging code logged the time at which the orchestration received a request message, and the time at which it sent a response message.

 

What I saw in my log files was that there was a 15 sec delay between calling ServerProxy.Invoke(), and the orchestration logging that it had received a request, and then another 15 sec delay between the orchestration sending a response, and the return from ServerProxy.Invoke() in the WebService.

Interesting.

 

Our BizTalk Load Test environment is composed of a single BizTalk 2006 Server, a single SQL Server, and SAN disk storage. Given that everything that happens between ServerProxy.Invoke() and an orchestration instance being started is non-user code (i.e. not my code!!), the problem was most likely environmental, i.e. network, disk, database, security, or "other".

 

So my first step was to attach SQL Server Profiler and log all the database calls going between the BizTalk Server and SQL Server.

If you haven't used Profiler before, be aware that it captures a huge amount of data – going through its logs can be very very painful.

However, in this case we're only interested in calls that have originated from BizTalk – specifically, calls made by one of the Service Accounts that BizTalk runs under: the account that the Isolated host instance runs under, and the account that my Orchestration Host's host instance runs under.

 

Why these two? The SOAP adapter runs under the Isolated host, which means that it'll be this Isolated host instance that will be putting the request into the MessageBox. And the Orchestration Host's host instance will be the one which picks up the request from the MessageBox, and then writes the response back to the MessageBox.

For the Orchestration Host's host instance, finding out the service account name is easy – just look in the Services console, find the service named something like BizTalk Service BizTalk Group : <host name>, and then note the value in the Log On As column.

 

For the Isolated Host's host instance, we need to check the identity account for the Application Pool under which the BizTalk Web Service is running (if you're using IIS 6 – if you're using IIS 5, you'd use the service account for IIS itself, usually the ASPNET account).

 

So all we have to do is look for entries in the Profiler log file which have a LoginName property equal to one of these two account names. Fortunately, you can set up a filter in Profiler quite easily – either before you run the trace (so that only events you're interested in are captured) or after.

 

I chose to capture all events and then filtered on entries which had an EventClass of RPC:Completed, and a LoginName value of the Isolated host account name.

(Note: because I'm only interested in the SQL statements being executed by BizTalk to call Stored Procedures, I only need to see RPC events – none of the other events are relevant here as they pertain to Audit events, Keep-Alive requests, or Transaction statements).
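(Incidentally, if you'd rather not click through the Profiler UI, you can do the same filtering offline against a saved trace file. Here's a rough sketch of the idea, assuming the trace was saved to a .trc file – the path, login name and connection string are placeholders for your own environment; fn_trace_gettable is standard SQL Server trace plumbing, and EventClass 10 is RPC:Completed.)

```csharp
using System;
using System.Data.SqlClient;

// Sketch: pull RPC:Completed events (EventClass = 10) for a specific BizTalk
// service account out of a saved Profiler trace file. The trace path, login
// name and connection string are placeholders.
class TraceFilter
{
    static void Main()
    {
        const string connectionString = "Server=MySqlServer;Database=master;Integrated Security=SSPI";
        const string sql =
            @"SELECT StartTime, Duration, TextData
              FROM fn_trace_gettable('C:\Traces\BizTalkLoadTest.trc', DEFAULT)
              WHERE EventClass = 10                 -- RPC:Completed
                AND LoginName  = @LoginName
              ORDER BY StartTime";

        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@LoginName", @"MYDOMAIN\svc-bts-isolated");
            conn.Open();

            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Duration is reported in microseconds when read via fn_trace_gettable.
                    Console.WriteLine("{0:HH:mm:ss.fff}  {1,12}  {2}",
                        reader.GetDateTime(0), reader[1], reader[2]);
                }
            }
        }
    }
}
```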

 

Running through the filtered log means I only have to trawl through 300-odd lines, rather than 10,000…!

What I'm looking for is the time at which the Isolated host submitted the SOAP request to the MessageBox, so I'm looking for a call to a Stored Procedure (SPROC) called bts_InsertMessage:

[Screenshot: Profiler trace showing the call to bts_InsertMessage, preceded by calls to bts_InsertPredicate_<hostname>, bts_InsertProperty and bts_FindSubscription]

(Note: You'll notice in the above image that just before bts_InsertMessage is called, there are calls to bts_InsertPredicate_<hostname> and bts_InsertProperty followed by a call to bts_FindSubscription.

InsertPredicate and InsertProperty populate the MessageProps table, and FindSubscription checks whether the combination of these predicates and properties matches an existing subscription – if so, bts_InsertMessage is called; if not, a "subscription not found" exception is raised).

 

So now I know at what time the SOAP request was placed in the MessageBox: 17 seconds after ServerProxy.Invoke() was called. Now I need to check what time the message was picked up by my Orchestration Host's host instance.

To do this, I filter the Profiler log on the service account name used by the host instance.

However, this time I'm looking for a call to bts_DeQueueMessages_<hostname> which did some work – that is, a call which has a higher duration than the others:

[Screenshot: Profiler trace filtered to the Orchestration Host's service account, showing a bts_DeQueueMessages_<hostname> call with a noticeably higher duration]

However, this doesn't clear up the mystery – this just shows that the message was picked up by the Orchestration Host's host instance 4 secs after being inserted by the Isolated host instance.

 

And then something hits me: what if this was related to the problems I was having earlier, with the Web Service and the CRL?

What if "17 seconds" is really "15 seconds plus some BizTalk processing time"?

 

I started up TCP View, passed a test message through – and sure enough, there was a connection attempt to a CRL host sitting in the SYN_SENT state.

Only this time, the TCP connection was to http://crl.microsoft.com/.

What are the chances that the BizTalk assemblies are digitally signed? Specifically Microsoft.BizTalk.WebServices.ServerProxy.dll, the assembly used by BizTalk Web Services.

Let's have a look:

[Screenshot: digital signature details for Microsoft.BizTalk.WebServices.ServerProxy.dll, showing it is digitally signed]

And if we drill down into the Certificate Details, and look at the CRL Distribution point we see:

[Screenshot: Certificate Details showing the CRL Distribution Point – a crl.microsoft.com URL]

So there we are – because the code is hosted in IIS, and is digitally signed, IIS will attempt to validate the certificate, which involves attempting to download the CRL for the certificate.
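(If you want to reproduce the CRL fetch outside of IIS, something like the following rough sketch will do it. It isn't literally what CAPI/WinVerifyTrust does when IIS verifies the assembly's signature, but building the certificate chain with online revocation checking triggers the same CRL download – the assembly path is a placeholder for wherever the DLL lives on your server.)

```csharp
using System;
using System.Diagnostics;
using System.Security.Cryptography.X509Certificates;

// Sketch: force the same CRL lookup from a console app. The assembly path is a
// placeholder; adjust it to wherever the signed DLL lives on your server.
class CrlCheck
{
    static void Main()
    {
        string path = @"C:\Program Files\Microsoft BizTalk Server 2006\Microsoft.BizTalk.WebServices.ServerProxy.dll";

        X509Certificate2 signer = new X509Certificate2(X509Certificate.CreateFromSignedFile(path));
        Console.WriteLine("Signed by: " + signer.Subject);

        X509Chain chain = new X509Chain();
        chain.ChainPolicy.RevocationMode = X509RevocationMode.Online;   // force a CRL fetch

        Stopwatch timer = Stopwatch.StartNew();
        bool valid = chain.Build(signer);
        timer.Stop();

        // On a server with no route to crl.microsoft.com, expect this to stall
        // and then report an offline/unknown revocation status.
        Console.WriteLine("Chain valid: {0}, took {1} ms", valid, timer.ElapsedMilliseconds);
        foreach (X509ChainStatus status in chain.ChainStatus)
        {
            Console.WriteLine("  " + status.StatusInformation.Trim());
        }
    }
}
```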

 

Unlike my Web Service, I don't think I'd have much luck in convincing Microsoft to *not* digitally sign the BizTalk assemblies… ;-)

 

So, we set the default proxy on the BizTalk server (using the proxycfg command line tool discussed in the previous post).

 

And sure enough, the delay in the BizTalk Web Service went away.

The 45 secs latency dropped to 2 secs… and then dropped to 0.5 secs once I did some low-latency tuning.

 

So there we go. Now, the only reason my client has this issue is that they don't want separate BizTalk servers which just run the BizTalk Web Services: i.e. dedicated BizTalk Servers, most likely in a DMZ, which do nothing but put received SOAP messages in the MessageBox, while all the other BizTalk servers sit behind the firewalls in the secure portion of the network. This is Microsoft's recommended way of doing it, but it's costlier in terms of licensing.

 

We decided we could live with the Load Test server having access to the internet via a proxy server, so applied this change to all our BizTalk servers (both Live and Test).

Saturday, March 17, 2007 10:22:34 PM (GMT Standard Time, UTC+00:00)

 

Web Services and the 15 Second Delay
Background: The large(ish) BizTalk system I've been working on for the last few months consumes a few external Web Services. Whilst doing some early performance tuning, I noticed that I was getting pretty serious request-response latency from the primary BizTalk Web Service... Like over 45 secs, instead of the expected 0.5 – 2 secs...

Fortunately, I routinely put tracing/audit calls in my BizTalk code which include timing info - and I always log the times around orchestration entry/exit... So I was quite surprised to see that a request to one of the external Web Services was taking 15 secs to respond... especially as the web service was hosted (temporarily) on the same server...

Although I didn't have the source code, with the help of Lutz Roeder's trusty .NET Reflector I was poking around the innards within minutes… and couldn't find anything that would cause a delay of 15 secs.

What's more, after further testing I found that the problem only occurred the first time after IIS was restarted. After the first time, it would return within 2 secs (still not very good, but much better). One thing I noticed about the Web Service was that it made use of some code in an unmanaged assembly (old school C++ DLL)...

On a hunch, I wrote a simple test Web Service which simply referenced some functions in the DLL (using the DllImport attribute)... and bang, a 15 sec delay before the Web Service constructor was called.
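(The test service was nothing more elaborate than this sketch – "Vendor.dll" and "VendorInit" are hypothetical stand-ins for the real unmanaged DLL and one of its exported functions.)

```csharp
using System.Runtime.InteropServices;
using System.Web.Services;

// Sketch of the test Web Service. "Vendor.dll" and "VendorInit" are hypothetical
// stand-ins for the actual unmanaged DLL and one of its exported functions.
// On the Load Test server, simply referencing the DLL like this was enough to
// reproduce the 15 sec delay before the constructor ran.
[WebService(Namespace = "http://example.org/crltest/")]
public class CrlTestService : WebService
{
    [DllImport("Vendor.dll")]
    private static extern int VendorInit();

    [WebMethod]
    public int CallUnmanagedCode()
    {
        return VendorInit();
    }
}
```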

I also tried the same trick in a console app... and there was no delay.

By this time I was fairly stumped. This issue only happened on our Load Test server: it didn't happen on the Dev or Test servers. And it seemed to only happen to code hosted under IIS.

I figured it was time to dig a bit deeper and see what IIS was doing: starting up Process Monitor (to monitor file and registry access) and TCP View (to monitor network activity), I called my test Web Service.

Filtering the Process Monitor logs to the w3wp.exe process (the IIS Worker Process for Application Pools) I noticed that w3wp was reading a lot of security and cryptography keys from the registry:
[Screenshot: Process Monitor output showing w3wp.exe reading cryptography-related registry keys]
Hmm. Interesting.

Then I looked at the TCP View log. And I noticed something very interesting coming from the w3wp process: a connection was being attempted to http://crl.verisign.com and was being blocked.

How do I know it was being blocked? Well, the TCP View log showed a SYN_SENT, which appeared and was visible for 15 secs before disappearing:
[Screenshot: TCP View showing the connection to crl.verisign.com sitting in the SYN_SENT state]
Now if you're not familiar with how TCP works, then you might miss the significance of this: When a TCP client attempts to contact a TCP server, the first thing performed is a handshake. This involves the client sending a SYN, the server responding with a SYN-ACK, and the client responding with an ACK.

In the TCP View tool, the rightmost column indicates the state of the local TCP connection end-point. You'll see that in this case, the state is SYN_SENT: this state indicates that the local TCP client is waiting for a SYN-ACK response from the SYN it sent.

Why would this be? Well, there are many possible reasons, but mostly it's because the address can't be reached: e.g. a proxy server is required, or a firewall rule is blocking the call.
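(You can see the same behaviour from code rather than TCP View with a quick sketch like this. On a network that silently drops the outbound SYN, the Connect() call just blocks while TCP keeps retransmitting – which is exactly the SYN_SENT state above. The host name is simply the blocked CRL endpoint.)

```csharp
using System;
using System.Diagnostics;
using System.Net.Sockets;

// Sketch: what a silently dropped outbound connection looks like from code.
// If a firewall drops (rather than rejects) the SYN, Connect() sits in
// SYN_SENT, retransmitting, until it finally times out.
class SynSentDemo
{
    static void Main()
    {
        Stopwatch timer = Stopwatch.StartNew();
        try
        {
            using (TcpClient client = new TcpClient())
            {
                client.Connect("crl.verisign.com", 80);   // hangs in SYN_SENT if the SYN is dropped
                Console.WriteLine("Connected after {0} ms", timer.ElapsedMilliseconds);
            }
        }
        catch (SocketException ex)
        {
            Console.WriteLine("Failed after {0} ms: {1}", timer.ElapsedMilliseconds, ex.SocketErrorCode);
        }
    }
}
```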

And what's the relevance of crl.verisign.com? A CRL (Certificate Revocation List) is used to indicate which Certificates (issued by a Certificate Authority) are no longer valid.

And a CRL check is only performed when you are using an assembly which has been digitally signed.

(Note: Microsoft became really really serious about CRLs back in 2001, when VeriSign accidentally issued two Microsoft certificates to a third party – which meant that the third party could sign code that appeared to have come from Microsoft).

So what does this have to do with my Web Service and a 15 second delay?

Looking through the assemblies in the Web Service, I noticed that the unmanaged assembly was digitally signed. Aha! So when the unmanaged assembly was referenced in a Web Service, IIS would hand off control to the Windows Cryptography API (CAPI) and ask it to verify the certificate. And this involves retrieving a CRL for the certificate.

After doing some more poking around, I found this blog post (from one of the developers at Xceed Software, a company specializing in .NET components).

The thing that drew my attention was this comment:

Some people behind a very strict firewall, ignoring outgoing connections instead of rejecting them, had a timeout while loading our assemblies. That timeout (at least 15 seconds) is hardcoded in WinVerifyTrust!

Eureka! That's exactly my problem!

And it explained why it was only happening on our Load Test server: Our Dev and Test servers have access to the internet, whilst the Load Test server does not: it has no proxy server configured.

To test this theory, I configured the Load Test server to use a proxy server. If you haven't done this before, you can't do it via Internet Explorer settings – that only configures a proxy for applications which go via wininet.dll i.e. the Internet Explorer API, which uses the proxy server settings in the Internet Explorer/Internet Options/Connections/LAN Settings panel.

CAPI, however, uses winhttp.dll (Microsoft Windows HTTP Services). And to set the proxy server for that, you need to use the proxycfg command line tool which sets proxy server settings for winhttp.

If you run proxycfg -u, this will import your Internet Explorer proxy server settings:
[Screenshot: proxycfg -u output showing the imported proxy server settings]
So I did this, restarted IIS and presto! The problem went away.

However, giving the Load Test server access to the Internet is a fairly large security hole (at least it is in our network, where the Load test server is behind both the firewall and DMZ). 

So, we contacted the vendor who supplied the unmanaged DLL, and managed to get them to release a new version which wasn't digitally signed.

Which solved the latency problem in the Web Service (and accounted for 15 secs of the 45 secs latency).

Solving the remaining 30 secs latency is the subject of next week's post.

Saturday, March 10, 2007 12:01:13 PM (GMT Standard Time, UTC+00:00)

 
