Recently I have been involved in troubleshooting a custom SharePoint application. The application was deployed on a SharePoint 2007 farm, consisting of a database server, an application server and two front ends. At time in the live environment the customer encountered the following error and after this the application did not work until an IIS reset:
The remote server returned an error: (500) Internal Server Error. at System.Net.HttpWebRequest.GetResponse() at System.ServiceModel.Channels.HttpChannelFactory.HttpRequestChannel.HttpChannelRequest.WaitForReply(TimeSpan timeout)
Unfortunately this never happened on the test farm – which was identical to the live one – and to make it even worse it was a random occurance and nobody was able to reproduce it.
Soon, they realised by visiting the following URL http://:8091/CustomServiceName.svc?wsdl, that the problem disappeared and the application worked correctly for a while.
After some digging I found the following event on the application server:
Event ID: 1309
Task Category: Web Event
Process name: w3wp.exe
Exception type: COMException
Exception message: The handle is invalid. (Exception from HRESULT: 0x80070006 (E_HANDLE))
Further information in this message told me that the problem occurred with a custom WCF application hosted on the app server and it belongs to the custom application.
I also opened the IIS logfiles and searched for patterns when the error occurred. The only interesting thing was that normally always a POST method was used during normal behaviour except when the application was “healed” by visiting the WCF service URL using a GET method. However nothing told me what put this application into this “zombie” state.
I turned on the IIS Failed Request Tracing feature without a filter and was waiting for the issue to come up again.
Luckily the FRT files told more:
[COMException (0x80070006): The handle is invalid. (Exception from HRESULT: 0x80070006 (E_HANDLE))]
[FileLoadException: Could not load file or assembly ‘SMDiagnostics, Version=188.8.131.52, Culture=neutral, PublicKeyToken=b77a5c561934e089’ or one of its dependencies. The handle is invalid. (Exception from HRESULT: 0x80070006 (E_HANDLE))]
System.ServiceModel.Activation.HttpModule.ProcessRequest(Object sender, EventArgs e) +0
System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously) +171
To be on the safe side I checked the dll was on the machine and the assembly was registered to the GAC, anyway it would not explain the randomness of the error.
Meanwhile I had the chance to check the server when the error came up again and realised that no worker process belonged to this application by going to the worker process tab of the IIS manager:
Example, one worker process is running and it belongs to the DefaultAppPool application pool.
When I visited the WCF service URL a worker process started up and the application was working for the next couple of hours without any problem. Basically the worker process cannot start when the user tries to use the application but visiting the WCF service URL or starting it up by hand works.
When the error occurred next time and there was no w3wp.exe to host the application. I started up the sysinternals process monitor tool to check what is going on.
Searching for SMDiagnostics.dll realised a w3wp.exe started and also finished after a couple of operations.
First of all it created a failed request tracing xml file and also put some info into it, then read some dlls successfully. However, creating a file in C:\Windows\assembly failed. The reason is because it tried to do in the name of the visiting user who does not have permission for this operation on the server.
After some deliberation I found a difference in the web.config file of the service compared to the test system and it was about impersonation:
After all, the test and the live systems were not identical. Removing this setting and using the default ‘false’ value solved the issue.
The randomness of this problem came from the worker process settings. By default, if a worker process is idle for 20 minutes it shuts down to save resources. If a new request comes to the server and the w3wp.exe is not running, a new one automatically starts up. Because of the impersonation settings the worker process tried to start and make operations in the name of the visiting user and of course it failed.
In real life the troubleshooting process was not straightforward; there were a few dead end tries (creating a new application pool, playing with limits, doing stress tests) however the case is finally solved!