How to Troubleshoot Server Downtime

Server downtime directly impacts business operations and revenue. A calm, systematic troubleshooting approach gets you back online faster than panicking and trying random fixes.

Overview

Server issues fall into four categories: hardware failure, network problems, operating system or software issues, and resource exhaustion (full disk, out of memory). Working through each category systematically pinpoints the cause quickly.

Step 1: Immediate Triage

Assess the situation and determine the scope of the outage.

1. Assess the Impact

  • Which services are affected? Email, file shares, database, website, applications?
  • Is the server completely unresponsive or partially working?
  • How many users are impacted? One person or the whole company?
  • When exactly did the problem start? Check monitoring alerts and user reports
  • Did anything change recently? Software updates, configuration changes, new installs?
  • Can you physically access the server? Are power lights on? Is the screen showing anything?

2. Check Basic Connectivity

  • Try pinging the server from another device on the same network
  • Verify network cables are connected — check for link lights on the network port
  • Check if the switch port is active (link light on the switch)
  • Try accessing the server via different methods: RDP, SSH, physical console, management port
  • If the server has a remote management card (iDRAC, iLO, IPMI), try connecting through that
  • Check if other devices on the same network switch are working
  • Verify the network switch itself is powered on and functioning
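
A ping only shows that the host answers ICMP. To see whether the remote access methods listed above are reachable, a small TCP probe can help. The following is a minimal sketch, not a definitive tool: the function names `port_is_open` and `probe` and the port list are illustrative assumptions, and the ports shown are the usual defaults for SSH, RDP and HTTPS, which your server may not use.

```python
import socket

# Usual default ports; adjust to match how your server is actually configured.
DEFAULT_CHECKS = {"SSH": 22, "RDP": 3389, "HTTPS": 443}


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe(host: str, checks: dict = DEFAULT_CHECKS) -> dict:
    """Probe each named port and report which ones accept connections."""
    return {name: port_is_open(host, port) for name, port in checks.items()}
```

Calling `probe("192.0.2.10")` (a placeholder address from the documentation range) returns a dictionary such as `{"SSH": True, "RDP": False, "HTTPS": False}`. If ping fails but a port answers, the server is up and ICMP is probably being filtered.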

3. Communicate the Outage

  • Inform affected users that you are aware and investigating
  • Provide an estimated time for the next update (even if you do not have an ETA for resolution)
  • Use alternative communication if email is down (Teams, Slack, phone, WhatsApp)
  • Assign one person to handle communications so the technical team can focus
  • Log the start time of the incident for your incident report

Step 2: Systematic Diagnosis

Work through the most common causes methodically.

1. Check Hardware First

  • Check server health indicator LEDs: Green = OK, Amber = Warning, Red = Failure
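  • Listen for unusual sounds: Clicking or grinding often points to a failing hard disk; total silence (no fans spinning) may point to a power problem
  • Check RAID controller status: A degraded array usually means a drive has failed or dropped out of the array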
  • Access hardware monitoring: BIOS/UEFI, Dell iDRAC, HP iLO, Lenovo XClarity
  • Check temperature readings — overheating causes shutdowns and throttling
  • Verify UPS status — battery failure can cause sudden power loss
  • Check memory (RAM) status — failed DIMMs cause crashes and blue screens

2. Check Disk Space

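  • A full disk is one of the most common causes of server issues and takes seconds to rule out, so check it early
  • Windows: Open File Explorer, or run 'Get-PSDrive -PSProvider FileSystem' in PowerShell ('wmic logicaldisk get size,freespace,caption' also works where the deprecated wmic tool is still installed)
  • Linux: Run 'df -h' to check usage on all filesystems, and 'df -i' to check for inode exhaustion
  • If any drive is above 95% full, this is likely the cause
  • Quick fixes: Empty temp folders, clear old logs, move large files
  • Check if log files have grown unexpectedly (runaway logging)
  • Clear Windows Update cache: Stop the Windows Update service (wuauserv), delete the contents of C:\Windows\SoftwareDistribution\Download, then start the service again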
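
When a drive is full and the culprit is not obvious, listing the largest files under a suspect folder (a log directory, for example) usually finds runaway logs quickly. The sketch below is one possible approach using only the Python standard library; the function name `largest_files` is an illustrative assumption.

```python
import heapq
import os


def largest_files(root: str, count: int = 10) -> list:
    """Return the `count` largest files under root as (size_bytes, path), biggest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # skip symlinks so targets are not counted twice
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is not readable; ignore it
    return heapq.nlargest(count, sizes)
```

For example, `largest_files("/var/log", 5)` on Linux or `largest_files(r"C:\inetpub\logs", 5)` on Windows (both paths are examples only) returns the five biggest files with their sizes in bytes. Run it with enough privileges to read the folders you care about.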
Pro Tip:

A full C: drive causes cascading failures: services stop, databases crash, logs cannot write, and Active Directory replication fails. Always monitor disk space with alerts at 80% and 90%.
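
The 80% and 90% alert levels can be sketched as a small check that a scheduled task or monitoring agent could run. This is a minimal illustration, assuming the hypothetical names `percent_used` and `classify`; it is not a replacement for a proper monitoring tool.

```python
import shutil

WARN_PERCENT = 80.0  # first alert level from the tip above
CRIT_PERCENT = 90.0  # second alert level from the tip above


def percent_used(path: str) -> float:
    """Return how full the filesystem containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    # Note: this can differ slightly from 'df' output, which accounts for reserved blocks.
    return usage.used / usage.total * 100


def classify(percent: float, warn: float = WARN_PERCENT, crit: float = CRIT_PERCENT) -> str:
    """Map a usage percentage to 'ok', 'warning' or 'critical'."""
    if percent >= crit:
        return "critical"
    if percent >= warn:
        return "warning"
    return "ok"
```

`classify(percent_used("C:\\"))` on Windows or `classify(percent_used("/"))` on Linux gives the current state of the system drive; anything other than "ok" would be the point to send an alert.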

3. Check Services and Event Logs

  • Windows: Open Event Viewer → Windows Logs → System and Application
  • Look for Critical and Error events around the time of failure
  • Common culprits: Service crashes, driver failures, disk errors
  • Check if critical services are running: services.msc or 'Get-Service' in PowerShell
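  • Linux: Check /var/log/syslog (Debian/Ubuntu) or /var/log/messages (RHEL-based systems), or run 'journalctl -xe' on systemd systems
  • Look for Out of Memory (OOM) killer messages on Linux, for example in the output of 'dmesg -T' or 'journalctl -k'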
  • Review application-specific logs for the affected service
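
Scanning a large kernel log for OOM killer activity by eye is slow. The sketch below pulls out which processes were killed. It assumes the common kernel message wording "Out of memory: Killed process <pid> (<name>)" (older kernels print "Kill process"); the exact text varies between kernel versions, so treat the pattern as a starting point. The name `find_oom_kills` is illustrative.

```python
import re

# Matches both "Killed process" (newer kernels) and "Kill process" (older kernels).
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(lines) -> list:
    """Return (pid, process_name) for every OOM killer message found in `lines`."""
    kills = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

Feed it the lines of a saved log, for example `find_oom_kills(open("/var/log/syslog", errors="replace"))`, or the captured output of 'journalctl -k'. A process name that appears repeatedly is a strong hint about which service is exhausting memory.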

4. Check Resource Utilisation

  • Open Task Manager (Ctrl+Shift+Esc) or Resource Monitor on Windows
  • Linux: Run 'top' or 'htop' for CPU and per-process usage, 'free -m' for memory, and 'iostat' (from the sysstat package) for disk I/O
  • Check CPU usage — sustained 100% indicates a runaway process or insufficient capacity
  • Check memory usage — if RAM is full, the system swaps to disk and becomes extremely slow
  • Check disk I/O — high I/O wait means the storage is the bottleneck
  • Identify the process consuming the most resources and determine if it is legitimate
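
To put numbers on the memory and swap checks above, the sketch below reads the format used by /proc/meminfo on Linux. It is an illustration under assumptions: the function names are hypothetical, and it relies on the MemAvailable field, which is present on kernels from 3.14 onward.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field_name: value}, values mostly in kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_pressure(text: str) -> dict:
    """Return percentage of RAM and swap in use, rounded to one decimal place."""
    info = parse_meminfo(text)
    total = info["MemTotal"]
    available = info["MemAvailable"]
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    swap_used = (swap_total - swap_free) / swap_total * 100 if swap_total else 0.0
    return {
        "mem_used_percent": round((total - available) / total * 100, 1),
        "swap_used_percent": round(swap_used, 1),
    }
```

On a Linux server, `memory_pressure(open("/proc/meminfo").read())` gives both figures in one call. High memory use together with rising swap use matches the "swaps to disk and becomes extremely slow" symptom described above.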

Step 3: Recovery and Prevention

Get back online and ensure it does not happen again.

1. Restore Service

  • Fix the identified root cause (add disk space, restart service, replace hardware)
  • Restart the affected service before restarting the entire server
  • If a reboot is needed, warn connected users and perform a clean restart
  • Verify all services come back online after restart
  • Test from a user's perspective — can they access what they need?
  • Monitor closely for the first 24 hours after recovery
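
Verifying that services come back after a restart can be sketched as a polling loop that retries a health check until it passes or a deadline expires. The names `wait_until_healthy` and `http_ok`, the timeouts, and the idea of using an HTTP request as the "user's perspective" test are all assumptions for illustration; substitute whatever check matches your service.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # urllib's URLError and HTTPError are OSError subclasses.
        return False


def wait_until_healthy(
    check: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` every `interval` seconds until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

For a web service, `wait_until_healthy(lambda: http_ok("https://intranet.example.com/"))` (a placeholder URL) returns True as soon as the site responds, or False after two minutes. The `sleep` parameter exists only so the loop can be tested without real delays.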

2. Post-Incident Actions

  • Document everything: What happened, when, how it was diagnosed and fixed
  • Identify the root cause — not just the symptom
  • Set up monitoring alerts for the specific failure point so you get early warning
  • Schedule regular maintenance: disk cleanup, log rotation, Windows Updates
  • Review and update your disaster recovery procedures based on what you learned
  • Share findings with the team to improve collective troubleshooting skills

3. Preventive Measures

  • Implement proactive monitoring: Disk space, CPU, memory, service uptime
  • Set up automated alerts BEFORE thresholds reach critical levels
  • Schedule regular server maintenance windows (monthly or quarterly)
  • Keep servers patched with the latest security and stability updates
  • Maintain tested backups — verify you can actually restore from them
  • Document server configurations so anyone on the team can troubleshoot
  • Consider redundancy for critical servers: clustering, replication, or cloud failover
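
One simple way to confirm that a test restore really matches the source is to compare checksums of the original and restored files. The sketch below assumes you have restored a sample file to a scratch location; `sha256_of` and `restore_matches` are illustrative names, and this checks file content only, not permissions or application-level consistency.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path: str, restored_path: str) -> bool:
    """Return True if the restored file has identical content to the original."""
    return sha256_of(original_path) == sha256_of(restored_path)
```

Reading in chunks keeps memory use flat even for multi-gigabyte files. For databases and other files that change while in use, compare against a copy taken at backup time rather than the live file.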
