How to Troubleshoot Server Downtime

Server downtime directly impacts business operations and revenue. A calm, systematic troubleshooting approach gets you back online faster than panicking and trying random fixes.

Overview

Most server outages fall into four categories: hardware failure, network problems, operating system or software issues, and resource exhaustion (full disk, out of memory). Working through the categories one at a time usually pinpoints the cause faster than guessing.

Step 1: Immediate Triage

Assess the situation and determine the scope of the outage.

Assess the Impact

Which services are affected? Email, file shares, database, website, applications?
Is the server completely unresponsive or partially working?
How many users are impacted? One person or the whole company?
When exactly did the problem start? Check monitoring alerts and user reports
Did anything change recently? Software updates, configuration changes, new installs?
Can you physically access the server? Are power lights on? Is the screen showing anything?

Check Basic Connectivity

Try pinging the server from another device on the same network
Verify network cables are connected — check for link lights on the network port
Check if the switch port is active (link light on the switch)
Try accessing the server via different methods: RDP, SSH, physical console, management port
If the server has a remote management card (iDRAC, iLO, IPMI), try connecting through that
Check if other devices on the same network switch are working
Verify the network switch itself is powered on and functioning
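
A ping only shows that the host answers ICMP. As a rough sketch of the "try different access methods" check above, the Python snippet below tests whether the usual remote-access ports accept a TCP connection. The function names and the ACCESS_PORTS table are illustrative, and the port numbers are the common defaults (SSH 22, RDP 3389, HTTPS 443), which your environment may have changed.

```python
import socket

# Common default ports; an assumption, adjust to match your environment.
ACCESS_PORTS = {"SSH": 22, "RDP": 3389, "HTTPS": 443}


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe_access_paths(host: str, ports: dict, timeout: float = 3.0) -> dict:
    """Map each access method name to True (reachable) or False."""
    return {name: port_is_open(host, port, timeout) for name, port in ports.items()}
```

Calling probe_access_paths("192.0.2.10", ACCESS_PORTS), where 192.0.2.10 is a placeholder address, returns a dictionary such as {"SSH": True, "RDP": False, "HTTPS": True}. If every port fails but the switch and cabling look fine, the remote management card is the next thing to try.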

Communicate the Outage

Inform affected users that you are aware and investigating
Provide an estimated time for the next update (even if you do not have an ETA for resolution)
Use alternative communication if email is down (Teams, Slack, phone, WhatsApp)
Assign one person to handle communications so the technical team can focus
Log the start time of the incident for your incident report

Step 2: Systematic Diagnosis

Work through the most common causes methodically.

Check Hardware First

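Check server health indicator LEDs: typically Green = OK, Amber = Warning, Red = Failure (exact colours vary by vendor)
Listen for unusual sounds: repeated clicking from a hard disk often signals a failing drive; complete silence (no fans) may mean a power issue
Check RAID controller status: a degraded array usually means a drive has failed or dropped out of the array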
Access hardware monitoring: BIOS/UEFI, Dell iDRAC, HP iLO, Lenovo XClarity
Check temperature readings — overheating causes shutdowns and throttling
Verify UPS status — battery failure can cause sudden power loss
Check memory (RAM) status — failed DIMMs cause crashes and blue screens

Check Disk Space

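A full disk is one of the MOST COMMON causes of server issues, so check it early, right after ruling out obvious hardware faults
Windows: Open File Explorer, or in PowerShell run 'Get-CimInstance Win32_LogicalDisk | Select-Object DeviceID,Size,FreeSpace' (the older 'wmic logicaldisk get size,freespace,caption' still works where WMIC is installed, but WMIC is deprecated on current Windows releases)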
Linux: Run 'df -h' to check all filesystem usage
If any drive is above 95% full, this is likely the cause
Quick fixes: Empty temp folders, clear old logs, move large files
Check if log files have grown unexpectedly (runaway logging)
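Clear the Windows Update cache: stop the Windows Update service ('net stop wuauserv'), delete the contents of C:\Windows\SoftwareDistribution\Download, then start the service again ('net start wuauserv')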
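
When a drive is full, the quickest win is usually finding the handful of files that take up most of the space, such as a runaway log. The Python sketch below walks a directory tree and lists the largest files. The function name largest_files is illustrative, and the script only reports: deciding what is safe to delete remains a manual step.

```python
import heapq
import os


def largest_files(root: str, count: int = 10) -> list:
    """Return up to `count` (size_in_bytes, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so targets are not counted twice
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is unreadable; ignore it
    return heapq.nlargest(count, sizes)
```

For example, largest_files("/var/log", 5) on Linux or largest_files("C:\\Windows\\Temp", 5) on Windows gives a short list to review. On a large volume the walk can take several minutes and adds disk I/O, so point it at a suspect directory instead of the whole drive.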

Pro Tip:

A full C: drive can cause cascading failures: services stop, databases crash, logs cannot be written, and on a domain controller Active Directory replication fails. Always monitor disk space with alerts at 80% and 90%.
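
The 80% and 90% alert levels from the tip above can be sketched in a few lines of Python using the standard library. The names WARN_PERCENT, CRIT_PERCENT, classify, and check_path are illustrative, and a real deployment would normally use a monitoring tool for this; the sketch only shows the threshold logic.

```python
import shutil

WARN_PERCENT = 80.0  # first alert level from the tip above
CRIT_PERCENT = 90.0  # second alert level from the tip above


def percent_used(total: int, used: int) -> float:
    """Return used space as a percentage of total (0.0 if total is zero)."""
    return 100.0 * used / total if total else 0.0


def classify(percent: float) -> str:
    """Map a usage percentage to OK, WARNING, or CRITICAL."""
    if percent >= CRIT_PERCENT:
        return "CRITICAL"
    if percent >= WARN_PERCENT:
        return "WARNING"
    return "OK"


def check_path(path: str) -> tuple:
    """Return (percent_used, status) for the filesystem containing path."""
    usage = shutil.disk_usage(path)
    pct = percent_used(usage.total, usage.used)
    return pct, classify(pct)
```

Running check_path("C:\\") or check_path("/") from a scheduled task and sending a message whenever the status is not "OK" gives a basic early warning. On Linux the percentage can differ slightly from the 'df -h' figure, because df excludes blocks reserved for root from its calculation.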

Check Services and Event Logs

Windows: Open Event Viewer → Windows Logs → System and Application
Look for Critical and Error events around the time of failure
Common culprits: Service crashes, driver failures, disk errors
Check if critical services are running: services.msc or 'Get-Service' in PowerShell
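Linux: Check /var/log/syslog (Debian/Ubuntu) or /var/log/messages (RHEL/CentOS), or run 'journalctl -xe' on systemd-based systems
Look for Out of Memory (OOM) killer messages on Linux, for example with 'dmesg | grep -i "out of memory"' or 'journalctl -k'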
Review application-specific logs for the affected service
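
Searching a large log by eye for OOM killer activity is slow, so a small filter can help. The Python sketch below picks out the process ID and name from kernel OOM messages. The pattern assumes the wording used by common Linux kernels ("Out of memory: Killed process ..." or the older "Out of memory: Kill process ..."); other kernel versions may phrase it differently, and the function name find_oom_kills is illustrative.

```python
import re

# Matches both "Killed process 123 (name)" and the older "Kill process 123 (name)".
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(lines) -> list:
    """Return (pid, process_name) for every OOM killer message in lines."""
    hits = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits
```

The function accepts any iterable of strings, so it can read an open log file directly, or the saved output of 'journalctl -k'. If the same process name appears repeatedly, that service is either leaking memory or the server has too little RAM for its workload.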

Check Resource Utilisation

Open Task Manager (Ctrl+Shift+Esc) or Resource Monitor on Windows
Linux: Run 'top' or 'htop' for per-process CPU and memory, 'free -m' for overall memory and swap, and 'iostat -x' (from the sysstat package) for disk I/O
Check CPU usage — sustained 100% indicates a runaway process or insufficient capacity
Check memory usage — if RAM is full, the system swaps to disk and becomes extremely slow
Check disk I/O — high I/O wait means the storage is the bottleneck
Identify the process consuming the most resources and determine if it is legitimate
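
To put a number on "RAM is full and the system is swapping", the figures that 'free -m' reports can also be read from /proc/meminfo on Linux. The Python sketch below parses that file's text and computes RAM and swap usage as percentages. The function names are illustrative, and it assumes a kernel recent enough to report MemAvailable (3.14 or later).

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo text into {field_name: integer_value}.

    Most values are in kB; a few fields (such as HugePages_Total) are counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_pressure(info: dict) -> dict:
    """Return RAM and swap usage as percentages from parsed meminfo values."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    swap_used = 100.0 * (swap_total - swap_free) / swap_total if swap_total else 0.0
    return {
        "ram_used_percent": 100.0 * (total - available) / total,
        "swap_used_percent": swap_used,
    }
```

On a live server, pass the contents of /proc/meminfo to parse_meminfo and the result to memory_pressure. High RAM usage together with rising swap usage matches the "extremely slow" symptom described above; high RAM usage alone is often just the filesystem cache and is not a problem in itself.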

Step 3: Recovery and Prevention

Get back online and ensure it does not happen again.

Restore Service

Fix the identified root cause (add disk space, restart service, replace hardware)
Try restarting only the affected service first; reboot the entire server only if that does not restore it
If a reboot is needed, warn connected users and perform a clean restart
Verify all services come back online after restart
Test from a user's perspective — can they access what they need?
Monitor closely for the first 24 hours after recovery
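
For web-based services, "test from a user's perspective" can be partly automated by requesting the page a user would open. The Python sketch below returns the HTTP status code for a URL, or None when the server does not answer at all. The function names are illustrative, and treating any status from 200 to 399 as healthy is an assumption that may not suit every application.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, and similar


def is_healthy(status) -> bool:
    """Treat 2xx and 3xx responses as healthy (an assumption, adjust as needed)."""
    return status is not None and 200 <= status < 400
```

Checking a short list of URLs this way right after a restart, and again at intervals during the first 24 hours, catches a service that starts and then fails. It does not replace asking a real user to confirm that they can log in and work.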

Post-Incident Actions

Document everything: What happened, when, how it was diagnosed and fixed
Identify the root cause — not just the symptom
Set up monitoring alerts for the specific failure point so you get early warning
Schedule regular maintenance: disk cleanup, log rotation, Windows Updates
Review and update your disaster recovery procedures based on what you learned
Share findings with the team to improve collective troubleshooting skills

Preventive Measures

Implement proactive monitoring: Disk space, CPU, memory, service uptime
Set up automated alerts BEFORE thresholds reach critical levels
Schedule regular server maintenance windows (monthly or quarterly)
Keep servers patched with the latest security and stability updates
Maintain tested backups — verify you can actually restore from them
Document server configurations so anyone on the team can troubleshoot
Consider redundancy for critical servers: clustering, replication, or cloud failover
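
One simple way to verify a test restore, as recommended in the list above, is to compare a checksum of the restored file with a checksum of the original. The Python sketch below does this with SHA-256. The function names are illustrative, and the comparison only makes sense for files that have not changed since the backup was taken.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original: str, restored: str) -> bool:
    """Return True if the restored file is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(restored)
```

Restoring a few sample files to a scratch location each month and running restore_matches on them gives some evidence that the backups are usable. A full test restore of a complete server is still worth doing periodically.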
