Background:
The large(ish) BizTalk system I've been working on for the last few months consumes a few external Web Services. Whilst doing some early performance tuning, I noticed that I was getting pretty serious request-response latency from the primary BizTalk Web Service... like over 45 secs, instead of the expected 0.5 – 2 secs...
Fortunately, I routinely put tracing/audit calls in my BizTalk code which include timing info - and I always log the times around orchestration entry/exit... So I was quite surprised to see that a request to one of the external Web Services was taking 15 secs to respond... especially as the web service was hosted (temporarily) on the same server...
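For what it's worth, those tracing calls are conceptually just this (an illustrative sketch only - the real code writes to a custom audit component, and the names here are made up):

using System;
using System.Diagnostics;

// Illustrative helper: times a call and traces the elapsed time,
// in the spirit of the audit calls described above.
public static class TimedTrace
{
    public static T Audit<T>(string label, Func<T> call)
    {
        Stopwatch timer = Stopwatch.StartNew();
        try
        {
            return call();
        }
        finally
        {
            timer.Stop();
            Trace.WriteLine(string.Format("{0} took {1} ms", label, timer.ElapsedMilliseconds));
        }
    }
}

// e.g. wrapping an external web service call (proxy and GetCustomer are placeholders):
// CustomerResponse response = TimedTrace.Audit("CustomerLookup", () => proxy.GetCustomer(request));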
Although I didn't have the source code, with the help of Lutz Roeder's trusty .NET Reflector I was poking around the innards within minutes… and couldn't find anything that would cause a delay of 15 secs.
What's more, after further testing I found that the problem only occurred the first time after IIS was restarted. After the first time, it would return within 2 secs (still not very good, but much better). One thing I noticed about the Web Service was that it made use of some code in an unmanaged assembly (an old-school C++ DLL)...
On a hunch, I wrote a simple test Web Service which simply referenced some functions in the DLL (using the DllImport attribute)... and bang, a 15 sec delay before the Web Service constructor was called.
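Something along these lines is enough to reproduce it (the DLL and function names below are placeholders, not the vendor's real ones):

using System.Runtime.InteropServices;
using System.Web.Services;

public class UnmanagedDllTestService : WebService
{
    // Placeholder import: any function exported by the signed vendor DLL will do.
    [DllImport("VendorLegacy.dll")]
    private static extern int GetVersion();

    [WebMethod]
    public int CallUnmanagedCode()
    {
        // Merely referencing the signed DLL from the service was enough to
        // trigger the delay - it showed up before the constructor even ran.
        return GetVersion();
    }
}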
I also tried the same trick in a console app... and there was no delay.
By this time I was fairly stumped. This issue only happened on our Load Test server: it didn't happen on the Dev or Test servers. And it seemed to only happen to code hosted under IIS.
I figured it was time to dig a bit deeper and see what IIS was doing: I started up Process Monitor (to monitor file and registry access) and TCP View (to monitor network activity), then called my test Web Service.
Filtering the Process Monitor logs to the w3wp.exe process (the IIS Worker Process for Application Pools), I noticed that w3wp was reading a lot of security and cryptography keys from the registry:

Hmm. Interesting.

Then I looked at the TCP View log. And I noticed something very interesting coming from the w3wp process: a connection was being attempted to http://crl.verisign.com, and it was being blocked.
How do I know it was being blocked? Well, the TCP View log showed a SYN_SENT, which appeared and remained visible for 15 secs before disappearing:

Now if you're not familiar with how TCP works, you might miss the significance of this: when a TCP client attempts to contact a TCP server, the first thing performed is a handshake. This involves the client sending a SYN, the server responding with a SYN-ACK, and the client replying with an ACK.
In the TCP View tool, the rightmost column indicates the state of the local TCP connection end-point. You'll see that in this case the state is SYN_SENT: this state indicates that the local TCP client is still waiting for a SYN-ACK in response to the SYN it sent.
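If you want to see what this looks like from the caller's side, a quick illustrative test is to attempt a connection to a host whose packets are silently dropped - the connect call just sits in SYN_SENT until the TCP stack gives up:

using System;
using System.Net.Sockets;

class SynSentDemo
{
    static void Main()
    {
        TcpClient client = new TcpClient();
        DateTime started = DateTime.UtcNow;
        try
        {
            // If a firewall silently drops the outbound SYN, this call blocks
            // (in SYN_SENT) until the connection attempt times out.
            client.Connect("crl.verisign.com", 80);
            Console.WriteLine("Connected OK");
        }
        catch (SocketException ex)
        {
            Console.WriteLine("Failed after {0:F0} secs: {1}",
                (DateTime.UtcNow - started).TotalSeconds, ex.SocketErrorCode);
        }
    }
}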
Why would this be? Well, for many reasons, but mostly because the address can't be reached, e.g. a proxy server is required, or a firewall rule is blocking the call.
And what's the relevance of crl.verisign.com? A CRL (Certificate Revocation List) is used to indicate which Certificates (issued by a Certificate Authority) are no longer valid.
And a CRL check is only performed when you are using an assembly which has been digitally signed.
(Note: Microsoft became really, really serious about CRLs back in 2001, when VeriSign mistakenly issued two Microsoft certificates to a third party, which meant that the third party could sign code that appeared to have come from Microsoft).
So what does this have to do with my Web Service and a 15 second delay?
Looking through the assemblies in the Web Service, I noticed that the unmanaged assembly was digitally signed. Aha! So when the unmanaged assembly was referenced in a Web Service, IIS would hand off control to the Windows Cryptography API (CAPI) and ask it to verify the certificate. And this involves retrieving a CRL for the certificate.
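As an aside, a quick way to check for yourself whether a DLL carries an Authenticode signature is the X509Certificate.CreateFromSignedFile helper (a sketch only - "VendorLegacy.dll" is a placeholder name):

using System;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

class SignatureCheck
{
    static void Main()
    {
        try
        {
            // Throws a CryptographicException if the file has no embedded
            // Authenticode signature.
            X509Certificate cert = X509Certificate.CreateFromSignedFile("VendorLegacy.dll");
            Console.WriteLine("Signed by: " + cert.Subject);
        }
        catch (CryptographicException)
        {
            Console.WriteLine("No Authenticode signature found.");
        }
    }
}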
After doing some more poking around, I found this blog post (from one of the developers at Xceed Software, a company specializing in .NET components).
The thing that drew my attention was this comment:

Some people behind a very strict firewall, ignoring outgoing connections instead of rejecting them, had a timeout while loading our assemblies. That timeout (at least 15 seconds) is hardcoded in WinVerifyTrust!
Eureka! That's exactly my problem!
And it explained why it was only happening on our Load Test server: our Dev and Test servers have access to the internet, whilst the Load Test server does not (it has no proxy server configured).
To test this theory, I configured the Load Test server to use a proxy server. If you haven't done this before: you can't do it via the Internet Explorer settings, as those only configure a proxy for applications which go via wininet.dll (i.e. the Internet Explorer API), which uses the proxy server settings in the Internet Explorer/Internet Options/Connections/LAN Settings panel.
CAPI, however, uses winhttp.dll (Microsoft Windows HTTP Services). And to set the proxy server for that, you need to use the proxycfg command line tool, which sets proxy server settings for winhttp.
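proxycfg has a handful of switches; the ones worth knowing are below, as far as I'm aware (check its built-in help on your box; myproxy:8080 is just a placeholder):

proxycfg                              show the current winhttp proxy settings
proxycfg -d                           revert to direct access (no proxy)
proxycfg -p myproxy:8080 "<local>"    set an explicit proxy, bypassing local addresses
proxycfg -u                           import the Internet Explorer proxy settings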
If you run proxycfg -u, this will import your Internet Explorer proxy server settings:

So I did this, restarted IIS and presto! The problem went away.
However, giving the Load Test server access to the Internet is a fairly large security hole (at least it is in our network, where the Load Test server sits behind both the firewall and the DMZ).
So, we contacted the vendor who supplied the unmanaged DLL, and managed to get them to release a new version which wasn't digitally signed.
That solved the problem of the latency in the external Web Service call - which accounted for 15 secs of the 45 secs of latency. Solving the remaining 30 secs is the subject of next week's post.