How to Troubleshoot Server Downtime
Server downtime directly impacts business operations and revenue. A calm, systematic troubleshooting approach gets you back online faster than panicking and trying random fixes.
Overview
Most server outages fall into one of four categories: hardware failure, network problems, operating system or software issues, and resource exhaustion (full disk, out of memory). Working through each category in turn narrows down the cause quickly.
Step 1: Immediate Triage
Assess the situation and determine the scope of the outage.
Assess the Impact
- Which services are affected? Email, file shares, database, website, applications?
- Is the server completely unresponsive or partially working?
- How many users are impacted? One person or the whole company?
- When exactly did the problem start? Check monitoring alerts and user reports
- Did anything change recently? Software updates, configuration changes, new installs?
- Can you physically access the server? Are power lights on? Is the screen showing anything?
Check Basic Connectivity
- Try pinging the server from another device on the same network
- Verify network cables are connected — check for link lights on the network port
- Check if the switch port is active (link light on the switch)
- Try accessing the server via different methods: RDP, SSH, physical console, management port
- If the server has a remote management card (iDRAC, iLO, IPMI), try connecting through that
- Check if other devices on the same network switch are working
- Verify the network switch itself is powered on and functioning
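The "try different access methods" check can be scripted from another machine on the same network. The following is a minimal sketch, not a complete monitoring tool: it only tests whether a TCP port accepts connections. The port numbers (22 for SSH, 3389 for RDP) are the usual defaults and are an assumption here, and the function names are illustrative.

```python
import socket

# Assumed default ports: 22 = SSH, 3389 = RDP. Adjust to match your servers.
DEFAULT_PORTS = {"SSH": 22, "RDP": 3389}


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable or unknown hosts.
        return False


def check_access_methods(host: str, ports: dict[str, int] = DEFAULT_PORTS) -> dict[str, bool]:
    """Map each access method name to whether its port accepted a connection."""
    return {name: tcp_port_open(host, port) for name, port in ports.items()}
```

Calling `check_access_methods("server-address")` with your own server's address returns a dictionary such as SSH and RDP mapped to True or False. A closed port on a host that still answers ping points towards a stopped service or firewall rule rather than a dead server.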
Communicate the Outage
- Inform affected users that you are aware and investigating
- Provide an estimated time for the next update (even if you do not have an ETA for resolution)
- Use alternative communication if email is down (Teams, Slack, phone, WhatsApp)
- Assign one person to handle communications so the technical team can focus
- Log the start time of the incident for your incident report
Step 2: Systematic Diagnosis
Work through the most common causes methodically.
Check Hardware First
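- Check server health indicator LEDs: typically Green = OK, Amber = Warning, Red = Failure (exact colours and blink codes vary by vendor, so confirm against the hardware manual)
- Listen for unusual sounds: clicking or grinding often points to a failing hard disk, while complete silence (no fans spinning) may mean a power issue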
- Check RAID controller status: Degraded array means a drive has failed
- Access hardware monitoring: BIOS/UEFI, Dell iDRAC, HP iLO, Lenovo XClarity
- Check temperature readings — overheating causes shutdowns and throttling
- Verify UPS status — battery failure can cause sudden power loss
- Check memory (RAM) status — failed DIMMs cause crashes and blue screens
Check Disk Space
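- A full disk is one of the most common causes of server issues and takes seconds to rule out, so check it early
- Windows: Open File Explorer, or run 'Get-PSDrive -PSProvider FileSystem' in PowerShell ('wmic logicaldisk get size,freespace,caption' still works on older systems, but wmic is deprecated)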
- Linux: Run 'df -h' to check all filesystem usage
- If any drive is above 95% full, this is likely the cause
- Quick fixes: Empty temp folders, clear old logs, move large files
- Check if log files have grown unexpectedly (runaway logging)
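- Clear the Windows Update cache: stop the Windows Update service first ('net stop wuauserv'), delete the contents of C:\Windows\SoftwareDistribution, then start the service again ('net start wuauserv')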
A full C: drive causes cascading failures: services stop, databases crash, logs cannot write, and Active Directory replication fails. Always monitor disk space with alerts at 80% and 90%.
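The thresholds above (alerts at 80% and 90%, likely cause above 95%) can be turned into a simple check. This is a minimal sketch using Python's standard `shutil.disk_usage`; the function names and the labels "warning", "high", and "critical" are illustrative choices, not part of any monitoring product.

```python
import shutil


def percent_used(total: int, used: int) -> float:
    """Return used space as a percentage of total (0.0 if total is zero)."""
    return 100.0 * used / total if total else 0.0


def classify_disk(percent: float) -> str:
    """Classify usage with the thresholds from the text: 80%, 90%, above 95%."""
    if percent > 95:
        return "critical"  # likely cause of the outage
    if percent >= 90:
        return "high"      # second alert level
    if percent >= 80:
        return "warning"   # first alert level
    return "ok"


def check_path(path: str) -> tuple[float, str]:
    """Measure the filesystem holding `path` and classify its usage."""
    usage = shutil.disk_usage(path)
    pct = percent_used(usage.total, usage.used)
    return pct, classify_disk(pct)
```

For example, `check_path("C:\\")` on Windows or `check_path("/var")` on Linux returns the percentage used and its label. The figure can differ slightly from what 'df -h' reports, because df accounts for blocks reserved for root differently.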
Check Services and Event Logs
- Windows: Open Event Viewer → Windows Logs → System and Application
- Look for Critical and Error events around the time of failure
- Common culprits: Service crashes, driver failures, disk errors
- Check if critical services are running: services.msc or 'Get-Service' in PowerShell
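- Linux: Check /var/log/syslog (Debian/Ubuntu) or /var/log/messages (RHEL/CentOS), or run 'journalctl -xe' on systemd-based systems
- Look for Out of Memory (OOM) killer messages on Linux, which appear in the kernel log ('dmesg' or 'journalctl -k')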
- Review application-specific logs for the affected service
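When a log file is too large to read by eye, a short script can pull out the relevant lines. The sketch below assumes a plain-text log such as /var/log/syslog. The two patterns are illustrative guesses at useful keywords, not an exhaustive list of what the kernel or your applications may write.

```python
import re

# Matches typical kernel OOM wording such as "Out of memory" or "oom-killer".
OOM_PATTERN = re.compile(r"out of memory|oom[-_]kill", re.IGNORECASE)

# Assumed generic keywords for serious events; tune these for your applications.
ERROR_PATTERN = re.compile(r"\b(error|critical|failed|failure)\b", re.IGNORECASE)


def find_matches(lines, pattern):
    """Return every line (without its trailing newline) that matches pattern."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def scan_log_file(path, pattern, encoding="utf-8"):
    """Scan a text log file line by line and return the matching lines."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, encoding=encoding, errors="replace") as handle:
        return find_matches(handle, pattern)
```

Running `scan_log_file("/var/log/syslog", OOM_PATTERN)` lists any OOM killer activity. Compare the timestamps on the matching lines with the time the outage started.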
Check Resource Utilisation
- Open Task Manager (Ctrl+Shift+Esc) or Resource Monitor on Windows
- Linux: Run 'top' or 'htop' for CPU and per-process usage, 'free -m' for memory, and 'iostat' (from the sysstat package) for disk I/O
- Check CPU usage — sustained 100% indicates a runaway process or insufficient capacity
- Check memory usage — if RAM is full, the system swaps to disk and becomes extremely slow
- Check disk I/O — high I/O wait means the storage is the bottleneck
- Identify the process consuming the most resources and determine if it is legitimate
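On Linux, the memory and swap check can be reduced to two percentages. This is a minimal sketch that assumes a kernel exposing the `MemAvailable` field in /proc/meminfo (present on modern kernels). The function names are illustrative.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only "Name: <number> [unit]" lines; most values are in kB.
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_pressure(info: dict[str, int]) -> dict[str, float]:
    """Return the percentage of RAM and swap in use."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    return {
        "mem_used_pct": 100.0 * (total - available) / total,
        "swap_used_pct": (
            100.0 * (swap_total - swap_free) / swap_total if swap_total else 0.0
        ),
    }


def read_meminfo(path: str = "/proc/meminfo") -> dict[str, int]:
    """Read and parse the live memory statistics (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

`memory_pressure(read_meminfo())` gives both figures in one call. High RAM usage combined with heavy swap usage matches the "swaps to disk and becomes extremely slow" symptom described above.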
Step 3: Recovery and Prevention
Get back online and ensure it does not happen again.
Restore Service
- Fix the identified root cause (add disk space, restart service, replace hardware)
- Restart the affected service before restarting the entire server
- If a reboot is needed, warn connected users and perform a clean restart
- Verify all services come back online after restart
- Test from a user's perspective — can they access what they need?
- Monitor closely for the first 24 hours after recovery
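Verifying that services come back after a restart is easier with a small retry loop than with repeated manual checks. The sketch below is one possible approach: `check` is any function you supply that returns True when the service is healthy (for example a port test or an HTTP request), and the attempt count and delay are arbitrary defaults.

```python
import time
from typing import Callable


def wait_until_healthy(
    check: Callable[[], bool],
    attempts: int = 10,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `attempts` times, pausing between failed tries.

    Returns True as soon as check() succeeds, or False if every attempt fails.
    The `sleep` parameter exists so the pause can be replaced in tests.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            # No pause after the final failed attempt.
            sleep(delay_seconds)
    return False
```

A False result after the full wait suggests the service did not start cleanly, which is the cue to go back to the event logs before telling users the issue is resolved.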
Post-Incident Actions
- Document everything: What happened, when, how it was diagnosed and fixed
- Identify the root cause — not just the symptom
- Set up monitoring alerts for the specific failure point so you get early warning
- Schedule regular maintenance: disk cleanup, log rotation, Windows Updates
- Review and update your disaster recovery procedures based on what you learned
- Share findings with the team to improve collective troubleshooting skills
Preventive Measures
- Implement proactive monitoring: Disk space, CPU, memory, service uptime
- Set up automated alerts BEFORE thresholds reach critical levels
- Schedule regular server maintenance windows (monthly or quarterly)
- Keep servers patched with the latest security and stability updates
- Maintain tested backups — verify you can actually restore from them
- Document server configurations so anyone on the team can troubleshoot
- Consider redundancy for critical servers: clustering, replication, or cloud failover
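One way to confirm that a test restore actually worked is to compare checksums of an original file and its restored copy. This is a minimal sketch using SHA-256 from Python's standard library. It verifies individual files only and is not a substitute for a full restore test of the application.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large backups do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path: str, restored_path: str) -> bool:
    """Return True if the restored file is byte-for-byte identical."""
    return sha256_of_file(original_path) == sha256_of_file(restored_path)
```

Restore a sample of files to a scratch location during each maintenance window and run `restore_matches` on them. A mismatch means the backup or the restore process needs attention before you rely on it in a real outage.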