Latest revision as of 15:08, 18 October 2024

Smartmontools

https://www.smartmontools.org/
https://en.wikipedia.org/wiki/Smartmontools
https://help.ubuntu.com/community/Smartmontools
https://wiki.archlinux.org/title/S.M.A.R.T.
sudo apt install smartmontools
- By default, smartctl is installed in /usr/sbin.
- If /usr/sbin is not already in your PATH, put export PATH=$PATH:/usr/sbin in the .bashrc file
sudo apt install -y gsmartcontrol
SMART data is not partition-dependent but rather disk-dependent.

NVME

Version

$ smartctl -v | head -1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.19.0-46-generic] (local build)

Keywords to look for: Written, Percentage

$ sudo smartctl -a /dev/nvme0 | grep "Writ"
Data Units Written:                 274,127 [140 GB]
Host Write Commands:                7,499,312

$ sudo smartctl -a /dev/nvme0 | grep "Percentage"
Percentage Used:                    0%

Full output

$ sudo smartctl -a /dev/nvme0 
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.19.0-46-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       CT1000P3SSD8
Serial Number:                      2314E6C4100F
Firmware Version:                   P9CR30A
PCI Vendor/Subsystem ID:            0xc0a9
IEEE OUI Identifier:                0x00a075
Controller ID:                      1
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            6479a7 77f00000c9
Local Time is:                      Sat Jul  1 11:22:30 2023 EDT
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x06):         Cmd_Eff_Lg Ext_Get_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     95 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W  0.0000W       -    0  0  0  0        0       0
 1 +     3.00W  0.0000W       -    0  0  0  0        0       0
 2 +     1.50W  0.0000W       -    0  0  0  0        0       0
 3 -   0.0250W  0.0000W       -    3  3  3  3     5000    1900
 4 -   0.0030W       -        -    4  4  4  4    13000  100000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        26 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    201,206 [103 GB]
Data Units Written:                 274,128 [140 GB]
Host Read Commands:                 4,982,258
Host Write Commands:                7,499,381
Controller Busy Time:               23
Power Cycles:                       13
Power On Hours:                     408
Unsafe Shutdowns:                   9
Media and Data Integrity Errors:    0
Error Information Log Entries:      42
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               26 Celsius
Temperature Sensor 2:               31 Celsius
Temperature Sensor 8:               26 Celsius

Error Information (NVMe Log 0x01, 16 of 16 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0         42     0  0x5007  0x4005  0x028            0     0     -

Wear out

Use the attribute Percentage Used. It is the drive's own estimate of how much of its rated endurance has been consumed.

smartctl -a /dev/nvme0 | grep "Percentage"
Percentage Used:                    2%
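A minimal sketch of turning that grep into a reusable check. The function names nvme_percentage_used and nvme_wear_check are hypothetical, and the default limit of 80 is an arbitrary assumption, not a vendor recommendation.

```shell
# Read `smartctl -a` text on stdin and print the NVMe "Percentage Used"
# value as a bare integer (for example 2 for "2%").
nvme_percentage_used() {
    awk -F: '/^Percentage Used:/ {
        gsub(/[ \t%]/, "", $2)   # strip padding and the percent sign
        print $2
        exit
    }'
}

# Compare the value against a limit. The default of 80 is an assumption
# chosen for illustration only.
nvme_wear_check() {
    used=$1
    limit=${2:-80}
    if [ "$used" -ge "$limit" ]; then
        echo "WARN: ${used}% of rated endurance used"
    else
        echo "OK: ${used}% of rated endurance used"
    fi
}

# Usage (needs root and a real device):
#   nvme_wear_check "$(sudo smartctl -a /dev/nvme0 | nvme_percentage_used)"
```

The parsing is kept separate from the smartctl call so it can be tried on saved output first.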

Difference of /dev/nvme0, /dev/nvme0n1, /dev/nvme0n1p1

See Why is there both character device and block device for nvme? and Interpreting wiki/documentation for an NVMe disk.

/dev/nvme0 represents the raw device and is the “control” device node that you use to configure the hardware. It’s the NVMe device controller.
/dev/nvme0n1, on the other hand, represents the first namespace on that device. The n1 denotes the first namespace of the device. These are the devices you use for actual storage, which will behave essentially as disks. This is the name that the "lsblk" command shows. So /dev/nvme0n1 is like /dev/sda.
/dev/nvme0n1p1 represents a partition on an NVMe storage namespace. So /dev/nvme0n1p1 is like /dev/sda1.
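The naming scheme above can be sketched as a small classifier. This is only an illustration based on the path names; the function names nvme_node_kind and nvme_controller_of are hypothetical, and the patterns are deliberately loose rather than a strict validation of device names.

```shell
# Classify a device path by its name alone: controller, namespace,
# or partition. Order matters: the most specific pattern comes first.
nvme_node_kind() {
    case "$1" in
        /dev/nvme[0-9]*n[0-9]*p[0-9]*) echo "partition" ;;
        /dev/nvme[0-9]*n[0-9]*)        echo "namespace" ;;
        /dev/nvme[0-9]*)               echo "controller" ;;
        *)                             echo "not-nvme" ;;
    esac
}

# Map a namespace or partition path back to its controller node,
# for example /dev/nvme0n1p1 -> /dev/nvme0.
nvme_controller_of() {
    printf '%s\n' "$1" | sed -E 's#^(/dev/nvme[0-9]+).*#\1#'
}
```

For example, nvme_node_kind /dev/nvme0n1 prints namespace, and nvme_controller_of /dev/nvme0n1 prints /dev/nvme0.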

nvme-cli command

SATA SSD

How can I monitor the TBW on my Samsung SSD?
Crucial rates its drive at 220 TB Total Bytes Written (TBW), while Samsung rates its drive at 600 TB TBW. Both come with a 5-year warranty.
Sector size is 512 bytes.
The ID# may be different on different devices.
1 TB is 10^12 bytes; 1 TiB is 1024^4 bytes (about 1.1 × 10^12).

Keywords to look for: Written, Percent

$ sudo smartctl -a /dev/sda | grep "Writ"
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       7384050441

$ sudo smartctl -a /dev/sda | grep "Sector"
Sector Size:      512 bytes logical/physical

$ sudo smartctl -a /dev/sda | grep "Percent" # 99% life remain in this case
202 Percent_Lifetime_Remain 0x0030   099   099   001    Old_age   Offline      -       1
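A hedged sketch of pulling one raw value out of the attribute table and converting it to terabytes written. The function names smart_attr_raw and lbas_to_tb are hypothetical. It assumes the ten-column table layout shown in this page and that the raw value of Total_LBAs_Written counts 512-byte sectors, which is vendor-specific and may not hold for every drive.

```shell
# Print the RAW_VALUE column (10th field) of the attribute named $1.
# Reads `smartctl -A` or `smartctl -a` text on stdin.
smart_attr_raw() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}

# Convert an LBA count to decimal terabytes (10^12 bytes).
# $2 is the sector size in bytes; 512 is assumed when it is omitted.
lbas_to_tb() {
    awk -v lbas="$1" -v sector="${2:-512}" \
        'BEGIN { printf "%.2f\n", lbas * sector / 1e12 }'
}

# Usage (needs root and a real device):
#   lbas_to_tb "$(sudo smartctl -A /dev/sda | smart_attr_raw Total_LBAs_Written)"
```

With the value shown above, lbas_to_tb 7384050441 gives 3.78, which can then be compared against the vendor's TBW rating.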

Full output

$ sudo smartctl --all /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.19.0-46-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron Client SSDs
Device Model:     Crucial_CT525MX300SSD1
Serial Number:    1644148274F7
LU WWN Device Id: 5 00a075 1148274f7
Firmware Version: M0CR031
User Capacity:    525,112,713,216 bytes [525 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jul  1 11:17:53 2023 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		( 1391) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (   7) minutes.
Conveyance self-test routine
recommended polling time: 	 (   3) minutes.
SCT capabilities: 	       (0x0035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       14046
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       118
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   099   099   000    Old_age   Always       -       17
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       78
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   077   058   000    Old_age   Always       -       23 (Min/Max 12/42)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       1
202 Percent_Lifetime_Remain 0x0030   099   099   001    Old_age   Offline      -       1
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       7384050441
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       231070651
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       94337836
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       1940
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Vendor (0xff)       Completed without error       00%     14038         -
# 2  Vendor (0xff)       Completed without error       00%     13741         -
# 3  Vendor (0xff)       Completed without error       00%     13548         -
# 4  Vendor (0xff)       Completed without error       00%     13126         -
# 5  Vendor (0xff)       Completed without error       00%     12915         -
# 6  Vendor (0xff)       Completed without error       00%      5647         -
# 7  Vendor (0xff)       Completed without error       00%      5484         -
# 8  Vendor (0xff)       Completed without error       00%      5312         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Wear out

Use the attribute Media_Wearout_Indicator or Percentage Used or SSD Life Left; the name depends on the vendor.

# Kingston SSD 240 GB
# smartctl -a /dev/sda | grep Left
231 SSD_Life_Left           0x0000   002   002   000    Old_age   Offline      -       98

# Crucial 1T
$ sudo smartctl -a /dev/sda | grep Percent
202 Percent_Lifetime_Remain 0x0030   099   099   001    Old_age   Offline      -       1

# PNY CS900 1T
$ sudo smartctl -a /dev/sda | grep -A 1 -i "lifetime" 
  # '-A 1' is to include one line of context after the match
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      8360   
  # This match comes from the self-test log, not from a wear attribute.
  # The "00% Remaining" in this context means the test completed 
  # with no remaining time left, which is expected for a completed test.

USB adapter

See man smartctl and search for "-d TYPE".
The message Unknown USB bridge [Please specify device type with the -d option.] indicates that smartctl is unable to automatically detect the type of USB bridge used by your external drive.
You can try the -d sat option, which tells smartctl that the device is a SATA drive behind a SCSI-to-ATA Translation (SAT) layer. This is exactly the case for the Vantec adapter: GSmartControl also shows it as /dev/sdc (scsi) in the Drive information. Oddly, with the Ugreen adapter, dmesg also reports scsi but GSmartControl does not. That said, it does not hurt to add the -d sat parameter to the smartctl command.
```
$ sudo dmesg
...
[997217.895800] scsi 5:0:0:0: Direct-Access     Crucial_ CT525MX300SSD1   1414 PQ: 0 ANSI: 6
[997217.899060] sd 5:0:0:0: Attached scsi generic sg1 type 0
...
```
```
sudo smartctl -a -d sat /dev/sdX
```
The -d option specifies the device type explicitly, which is useful when smartctl cannot guess it correctly. On some systems smartctl correctly guesses that a drive is a SATA drive, while on others it does not; in such cases, -d sat states explicitly that the device is a SATA drive.
GSmartControl displays SMART supported is Yes on the Ugreen adapter but No on the Vantec adapter.
On the Samsung PSSD T7, I need to use sudo smartctl -a -d scsi /dev/sdc.
If this doesn’t work, you can try other device types such as -d sat,12, -d usbcypress, -d usbjmicron, -d usbprolific, or -d usbsunplus. You can find more information about these options in the smartctl man page or by running smartctl --help.
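A rough sketch of trying the device types one after another. The function name smart_probe_type and the SMARTCTL override variable are hypothetical; the list only contains types mentioned on this page. Treating a zero exit status from smartctl -i as "this type works" is a heuristic assumption: -d scsi in particular may succeed while returning only limited data, which is why it is tried last.

```shell
# Try each -d type until `smartctl -i` exits successfully, then print it.
# Set SMARTCTL="sudo smartctl" if root is required.
smart_probe_type() {
    dev=$1
    for t in sat sat,12 sntasmedia usbjmicron usbsunplus usbcypress usbprolific scsi; do
        if ${SMARTCTL:-smartctl} -i -d "$t" "$dev" >/dev/null 2>&1; then
            echo "$t"
            return 0
        fi
    done
    return 1   # none of the listed types worked
}

# Usage:
#   SMARTCTL="sudo smartctl"
#   type=$(smart_probe_type /dev/sdc) && sudo smartctl -a -d "$type" /dev/sdc
```

The printed type should still be checked by hand against the full smartctl -a output before relying on it.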

eMMC

eMMC storage is typically accessed via an SD/MMC interface, which is not directly supported by smartctl. Therefore, it is not possible to use smartctl to check the health of eMMC storage by specifying a device type.
How to Check eMMC info from linux - depends on supports from Kernel Driver

dmesg | grep mmc

Calculation

$ sudo apt install calc
$ calc "274127*512000/10^9" # one NVMe data unit is 1000 x 512 bytes
	140.353024

> 274127 * 512000 / 10^9 # sudo smartctl -a /dev/nvme0 | grep "Data Units Written"
[1] 140.353 # GB, matching the [140 GB] printed by smartctl

> 7384050441 * 512/1024^3 # sudo smartctl -a /dev/sda | grep "Total_LBAs_Written"
[1] 3520.99 # GiB (about 3781 GB)
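The same arithmetic as a small shell sketch, keeping decimal (GB) and binary (GiB) units apart. NVMe reports data units in thousands of 512-byte sectors, so one unit is 512,000 bytes. The function names are hypothetical.

```shell
# NVMe "Data Units" -> bytes (1 unit = 1000 * 512 bytes).
nvme_units_to_bytes() {
    awk -v u="$1" 'BEGIN { printf "%.0f\n", u * 512000 }'
}

# Bytes -> decimal gigabytes (10^9 bytes), as printed by smartctl.
bytes_to_gb() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1e9 }'
}

# Bytes -> binary gibibytes (1024^3 bytes).
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 ^ 3) }'
}

# Usage:
#   bytes_to_gb "$(nvme_units_to_bytes 274127)"
```

For 274127 data units this gives 140.35 GB, or 130.71 GiB.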

Understanding smartctl -a output

SMART overall-health self-assessment test result

You can run SMART tests on a mounted disk. However, it's generally recommended to run long tests on unmounted disks to prevent any potential issues, especially during read/write operations.
You can run the command smartctl -H /dev/sdX to check the overall-health self-assessment test result of the drive.
If the test result is PASSED, it means that the drive is considered healthy according to the SMART system. If the test result is FAILED, it means that the drive is considered to be in a pre-failure condition and may fail soon.

Where is the log file

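The test results are displayed directly in the terminal. They are also stored in the drive's firmware and can be viewed as long as the drive is operational. The self-test log records the power-on hour count (the LifeTime(hours) column) rather than a calendar date, so by default we can't find the test date/time.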
Use "smartctl -a /dev/sda > smartctl_results.txt" to save the results to a file.
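Since the drive itself does not keep a calendar date, one option is to put the date into the file name when saving. A minimal sketch; the function name smart_snapshot_name and the file name pattern are hypothetical choices.

```shell
# Build a file name such as smartctl_sda_20230701-111753.txt.
# $1 is the device path; $2 is an optional timestamp (defaults to now).
smart_snapshot_name() {
    dev=$1
    stamp=${2:-$(date +%Y%m%d-%H%M%S)}
    echo "smartctl_$(basename "$dev")_${stamp}.txt"
}

# Usage (needs root and a real device):
#   sudo smartctl -a /dev/sda > "$(smart_snapshot_name /dev/sda)"
```

Keeping one file per run also makes it possible to diff two snapshots and see which attributes changed.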

Passed status

The "PASSED" status indicates that the drive's overall SMART health checks have been completed, and no attributes have crossed their critical thresholds at the time of the assessment.

Low risk Error

A few reallocated sectors
Drives with occasional read or write errors (like CRC errors)

High risk error

Rapidly increasing reallocated sectors count: This indicates the drive is actively deteriorating. High or rapidly increasing values (e.g., > 5–10) can indicate potential failure.
Current pending sectors that are not getting reallocated can mean data on those sectors is already corrupted. Non-zero values suggest potential data loss if these sectors are found to be unreadable.
Uncorrectable sectors count or errors during read/write operations suggest potential data loss. A value greater than zero is alarming and indicates significant drive issues.
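A hedged sketch of checking the three attributes above in the output of smartctl -A. It matches by the conventional IDs 5, 197 and 198 because the names vary between vendors (the Crucial SSD on this page calls them Reallocate_NAND_Blk_Cnt and Current_Pending_ECC_Cnt); that these IDs carry the same meaning on a given drive is an assumption. The function name smart_risk_report is hypothetical.

```shell
# Read `smartctl -A` text on stdin and report non-zero raw values for
# attribute IDs 5, 197 and 198. The check on $2 skips unrelated table
# rows (such as the selective self-test spans) that start with a number.
smart_risk_report() {
    awk '
        ($1 == 5 || $1 == 197 || $1 == 198) && $2 ~ /^[A-Za-z]/ {
            seen++
            if ($10 + 0 > 0) { printf "CHECK %s raw=%s\n", $2, $10; bad++ }
        }
        END {
            if (!seen)     print "no matching attributes found"
            else if (!bad) print "OK: watched attributes are all zero"
        }'
}

# Usage (needs root and a real device):
#   sudo smartctl -A /dev/sda | smart_risk_report
```

A single non-zero value is only a prompt to look closer; the trend over time matters more than one reading.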

Non-Critical Attributes

Attributes like Power-On Hours, Load Cycle Count, and Temperature.

Raw_Read_Error_Rate

https://unix.stackexchange.com/a/384833

The THRESH column tells you the lowest value that the vendor still considers healthy.
If the WORST column shows a value below THRESH in the same row, the drive is considered unhealthy. It also implies that VALUE has been below THRESH at some point, of course. Note that only attributes of type Pre-fail matter when evaluating health.
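The comparison described above can be sketched with awk over the attribute table. Two details are assumptions on my part rather than something stated on this page: the comparison uses "less than or equal", and a THRESH of 000 is treated as "never fails". The function name smart_failed_attrs is hypothetical.

```shell
# Read `smartctl -A` text on stdin and list attributes whose normalized
# VALUE or WORST has reached THRESH. Columns: 1=ID 2=NAME 3=FLAG
# 4=VALUE 5=WORST 6=THRESH 7=TYPE.
smart_failed_attrs() {
    awk '
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ {
            value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
            if (thresh == 0) next     # assumed: threshold 0 never fails
            if (value <= thresh)      print $2, "FAILING_NOW", $7
            else if (worst <= thresh) print $2, "In_the_past", $7
        }'
}

# Usage (needs root and a real device):
#   sudo smartctl -A /dev/sda | smart_failed_attrs
```

The third column of the output is the attribute type, so Pre-fail entries can be told apart from Old_age ones. No output means no attribute has reached its threshold.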

Current Pending Sector

This has been identified by UNRAID on my 3.5" WD blue HDD.
If the Current Pending Sector Count increases, it suggests that the drive may be failing. Pending sectors are candidates for reallocation, so a growing count can also be a strong indicator that the hard drive is dying. What to Do When Encountering Current Pending Sector Count?
How should I understand "Current Pending Sector Count" in CrystalDiskInfo reports?

Offline uncorrectable

This has been identified by UNRAID on my 3.5" WD blue HDD.
What Does Uncorrectable Sector Count Mean & How to Fix It
Smartctl utility giving uncorrectable and unreadable sectors error on HDD

UDMA CRC error count

This has been identified by UNRAID on my Crucial CT525MX300 525G SSD, but the overall-health result is still PASSED.

Current pending ECC count

This has been identified by UNRAID on my Crucial CT1000MX500 1T SSD. I still added it to the array. After a while, the error was gone.

Output from a brand new disk

PNY 1T SSD

SMART support/capability

Use sudo smartctl -a /dev/sdb, that is, the whole disk. If I use a partition such as sudo smartctl -a /dev/sdb1, it shows SMART support is: Unavailable - device lacks SMART capability.

GSmartControl

https://gsmartcontrol.shaduri.dev/downloads
GSmartControl is part of Gparted Live.
In GSmartControl 1.1.3, Options -> Update Drive Database failed. After I downloaded and installed 1.1.4, a new window opened but showed no progress.
For some reason, GSmartControl shows my WD Black NVMe as Unknown model, but "sudo smartctl -a /dev/nvme0n1 | grep -i model" displays the model. So the command-line tool is better.

gsmartcontrol can show the command it used though it does not print everything. Select a disk and click Options -> View Execution Log.

UGREEN adapter

$ sudo smartctl --info --health --capabilities  /dev/sdb
=== START OF INFORMATION SECTION ===
...
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
...

VANTEC adapter

$ sudo smartctl --info --health --capabilities  /dev/sdb
=== START OF INFORMATION SECTION ===
...
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
...

But the rest of the output is very similar. This bridge may not pass on all SMART commands.

UGREEN SATA/USB adapter and VANTEC.

smartd: SMART Disk Monitoring Daemon

How to configure smartd and be notified of hard disk problems via email
- Create a configuration file /etc/smartd.conf
```
/dev/sdX -H -l error -l selftest -m <email_address>
```
where -H monitors the health status, -l error and -l selftest monitor the error log and the self-test log of the /dev/sdX device, and -m sends an email to the given address if any issues are detected.
- Run sudo systemctl enable smartd.service && sudo systemctl start smartd.service
The threshold for the temperature of a disk is typically determined by the manufacturer and is often not directly changeable by the user.
https://wiki.archlinux.org/title/S.M.A.R.T.#smartd
Monitoring hard disk health with smartd under Linux or UNIX operating systems

Monitor temperature

sudo apt install hddtemp
hddtemp
hddtemp /dev/sda

hddtemp -d /dev/sd[abcd]
telnet remotebox 7634
# OR  nc 192.168.1.100 7634

SAMSUNG 980 SSD 1TB PCIe 3.0x4, NVMe and search "temperature"
- smartd reports warnings: Device: /dev/nvme0, Critical Warning (0x02): Temperature. It seems this is a common problem.
- smartctl -a /dev/nvme0 can show the current temperature
- It seems 970 EVO plus is better.

Monitor dashboard: scrutiny

https://github.com/AnalogJ/scrutiny WebUI for smartd S.M.A.R.T monitoring. The dashboard shows the health of all disks at a glance.

The following is the docker compose file docker-compose.yml for my case. Note that although "lsblk" shows my NVMe drive as "nvme0n1", I need to use "nvme0" to make it work. To view the dashboard, go to http://localhost:8082.

version: '3.5'

services:
  scrutiny:
    container_name: scrutiny
    image: ghcr.io/analogj/scrutiny:master-omnibus
    cap_add:
      - SYS_ADMIN
      - SYS_RAWIO
    ports:
      - "8082:8080" # webapp
      - "8086:8086" # influxDB admin
    volumes:
      - /run/udev:/run/udev:ro
      - ./config:/opt/scrutiny/config
      - ./influxdb:/opt/scrutiny/influxdb
    devices:
      - "/dev/sda"
      - "/dev/nvme0"

As the README describes, smartd does not record S.M.A.R.T attribute history, so it can be hard to determine whether an attribute is degrading slowly over time.

Reviews

Samsung 980 M.2 NVMe SSD Review: Going DRAMless with V6 V-NAND (Updated)
Crucial P3 Plus PCIe NVMe M.2 SSD also has no DRAM.
TEAMGROUP T-Force CARDEA A440 Pro Graphene Heatsink 1TB DRAM SLC Cache
Crucial P5 Plus 1TB PCIe Gen4 looks good, with a 5-year warranty.

Some disks

Best SSDs

Best SSDs of 2024: Reviews and buying advice

Samsung Portable SSD T7

The Samsung Portable SSD T7 (PSSD T7) supports S.M.A.R.T. data reporting. However, Linux tools have had issues working with it. This has been fixed with a pull request to the `drivedb` of `smartmontools`, so it works correctly if your `drivedb` is current. If it is not current, you can manually add the `-d sntasmedia` argument to `smartctl` or update the `drivedb` independently of `smartmontools` by using the update-smart-drivedb command; see Linux tools don't work with Samsung PSSD T7 & NVMe pass-through support for Samsung T7 SSD.

@@ Line 5: / Line 5: @@
 * https://wiki.archlinux.org/title/S.M.A.R.T.
 * '''sudo apt install smartmontools'''
+** By default, smartctl was installed in /usr/sbin.
+** Put '''export PATH=$PATH:/usr/sbin''' in the .bashrc file
 * '''sudo apt install -y gsmartcontrol '''
 * SMART data is not partition-dependent but rather disk-dependent.
@@ Line 99: / Line 101: @@
 == Wear out ==
 Use the attribute [https://unix.stackexchange.com/a/652631 Percentage].
-<pre>
+<syntaxhighlight lang='sh'>
 smartctl -a /dev/nvme0 | grep "Percentage"
 Percentage Used:                    2%
-</pre>
+</syntaxhighlight>
 == Difference of /dev/nvme0, /dev/nvme0n1, /dev/nvme0n1p1 ==
@@ Line 121: / Line 123: @@
 <li>Sector size is 512 bytes.
 <li>The ID# may be different on different devices.
+<li>1TB is 1024^4 bytes (~10^12).
 <li>Keyboards to look for: '''Written''', [https://askubuntu.com/a/1090231 Percent]
 <pre>
@@ Line 249: / Line 252: @@
 == Wear out ==
 Use the attribute '''Media_Wearout_Indicator''' or '''Percentage Used''' or '''SSD Life Left'''.
-<pre>
+<syntaxhighlight lang='sh'>
 # Kingston SSD 240 GB
 # smartctl -a /dev/sda | grep Left
@@ Line 257: / Line 260: @@
 $ sudo smartctl -a /dev/sda | grep Percent
 Percent_Lifetime_Remain 0x0030   099   099   001    Old_age   Offline      -       1
-</pre>
+# PNY CS900 1T
+$ sudo smartctl -a /dev/sda | grep -A 1 -i "lifetime"
+  # '-A 1' is to include one line of context after the match
+Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
+# 1  Short offline       Completed without error       00%      8360
+  # The "00% Remaining" in this context means the test completed
+  # with no remaining time left, which is expected for a completed test.
+</syntaxhighlight>
 = USB adapter =
@@ Line 307: / Line 318: @@
 == SMART overall-health self-assessment test result ==
-You can run the command '''smartctl -H /dev/sdX''' to check the overall-health self-assessment test result of the drive. If the test result is PASSED, it means that the drive is considered healthy according to the SMART system. If the test result is FAILED, it means that the drive is considered to be in a pre-failure condition and may fail soon.
+* You can run SMART tests on a mounted disk. However, it's generally recommended to run long tests on unmounted disks to prevent any potential issues, especially during read/write operations.
+* You can run the command '''smartctl -H /dev/sdX''' to check the overall-health self-assessment test result of the drive.
+* If the test result is PASSED, it means that the drive is considered healthy according to the SMART system. If the test result is FAILED, it means that the drive is considered to be in a pre-failure condition and may fail soon.
+== Where is the log file ==
+* The test results are displayed directly in the terminal and stored in the drive's firmware and can be viewed as long as the drive is operational. So by default we can't find the test date/time.
+* Use "smartctl -a /dev/sda > smartctl_results.txt" to save the results to a file.
+== Passed status ==
+* The "PASSED" status indicates that the drive's overall SMART health checks have been completed, and no attributes have crossed their critical thresholds at the time of the assessment.
+== Low risk Error ==
+* A few reallocated sectors
+* Drives with occasional read or write errors (like CRC errors)
+== High risk error ==
+* Rapidly increasing '''reallocated sectors count''': This indicates the drive is actively deteriorating. High or rapidly increasing values (e.g., > 5–10) can indicate potential failure.
+* '''Current pending sectors''' that are not getting reallocated can mean data on those sectors is already corrupted. Non-zero values suggest potential data loss if these sectors are found to be unreadable.
+* '''Uncorrectable sectors count''' or errors during read/write operations suggest potential data loss. A value greater than zero is alarming and indicates significant drive issues.
+== Non-Critical Attributes ==
+Attributes like '''Power-On Hours''', '''Load Cycle Count''', and '''Temperature'''.
 == Raw_Read_Error_Rate ==
@@ Line 313: / Line 345: @@
 * The THRESH column tells you what the vendors considers as lowest possible value considered as healthy.
 * If the WORST column shows values below THRESH in same row, the drive is considered as not healthy. It also implies that VALUE has been seen below THRESH, of course. You can also see that only the attributes of type Pre-fail matter when evaluating health.
+== Current Pending Sector ==
+* This has been identified by UNRAID from my 3.5" WD blue HDD.
+* However, if the Current Pending Sector Count increases, it indicates that drive failure is imminent. Pending Sectors are the prediction of reallocated sectors which can also be a strong indicator of dead of the hard drive. [https://www.minitool.com/backup-tips/current-pending-sector-count.html What to Do When Encountering Current Pending Sector Count?]
+* [https://superuser.com/questions/1058592/how-should-i-understand-current-pending-sector-count-in-crystaldiskinfo-report How should I understand "Current Pending Sector Count" in CrystalDiskInfo reports?]
+== Offline uncorrectable ==
+* This has been identified by UNRAID from my 3.5" WD blue HDD.
+* [https://www.minitool.com/lib/uncorrectable-sector-count.html What Does Uncorrectable Sector Count Mean & How to Fix It]
+* [https://unix.stackexchange.com/a/549863 Smartctl utility giving uncorrectable and unreadable sectors error on HDD]
+== 	UDMA CRC error count ==
+* This has been identified by UNRAID from my Crucial CT525MX300 525G SSD. But overall-health is passed.
+== Current pending ECC count ==
+* This has been identified by UNRAID from my Crucial CT1000MX500 1T SSD. I still add it to the array. After a while, the error was gone.
 == Output from a brand new disk ==
@@ Line 320: / Line 368: @@
 <ul>
 <li>Use '''sudo smartctl -a /dev/sdb''' . If I use '''sudo smartctl -a /dev/sdb1''', it will show ''SMART support is:     Unavailable - device lacks SMART capability.''
+</ul>
+== GSmartControl ==
+<ul>
+<li>https://gsmartcontrol.shaduri.dev/downloads
+<li>GSmartControl is part of [https://gparted.org/livecd.php Gparted Live].
+<li>GSmartControl 1.1.3 -> Options -> Update Drive Database (failed). Download and install 1.1.4. A new window is open but no progress.
+<li>For some reason, GSmartControl show my WD black nvme as Unknown model. But "sudo smartctl -a /dev/nvme0n1 | grep -i model" can display the model. So the command line tool is better.
 <li>gsmartcontrol can show the command it used though it does not print everything. Select a disk and click '''Options''' -> '''View Execution Log'.</br>
@@ Line 353: / Line 409: @@
 </ul>
-= smartd: MART Disk Monitoring Daemon =
+= smartd: SMART Disk Monitoring Daemon =
-* [https://linuxconfig.org/how-to-configure-smartd-and-be-notified-of-hard-disk-problems-via-email How to configure smartd and be notified of hard disk problems via email]
+<ul>
-* https://wiki.archlinux.org/title/S.M.A.R.T.#smartd
+<li>[https://linuxconfig.org/how-to-configure-smartd-and-be-notified-of-hard-disk-problems-via-email How to configure smartd and be notified of hard disk problems via email]
-* [https://www.cyberciti.biz/tips/monitoring-hard-disk-health-with-smartd-under-linux-or-unix-operating-systems.html Monitoring hard disk health with smartd under Linux or UNIX operating systems]
+* Create a configuration file '''/etc/smartd.conf'''
+:<syntaxhighlight lang='sh'>
+/dev/sdX -H -l error -l selftest -m <email_address>
+</syntaxhighlight>where "-H" means to monitor the health status, error log (-l error), and self-test log (-l selftest) of the /dev/sdX device, and to send an email if any issues are detected.
+* Run sudo systemctl enable smartd.service && sudo systemctl start smartd.service
+<li>The threshold for the temperature of a disk is typically determined by the manufacturer and is often not directly changeable by the user.
+<li>https://wiki.archlinux.org/title/S.M.A.R.T.#smartd
+<li>[https://www.cyberciti.biz/tips/monitoring-hard-disk-health-with-smartd-under-linux-or-unix-operating-systems.html Monitoring hard disk health with smartd under Linux or UNIX operating systems]
+</ul>
 = Monitor temperature =
 * [https://www.baeldung.com/linux/hdd-ssd-temperature How to Check HDD/SSD Temperature in Linux]
 * [https://www.cyberciti.biz/tips/howto-monitor-hard-drive-temperature.html Linux Monitor Hard Disks Temperature With hddtemp]
+:<syntaxhighlight lang='bash'>
+sudo apt install hddtemp
+hddtemp
+hddtemp /dev/sda
+hddtemp -d /dev/sd[abcd]
+telnet remotebox 7634
+# OR  nc 192.168.1.100 7634
+</syntaxhighlight>
 * [https://www.amazon.com/SAMSUNG-MZ-V8V1T0B-AM-980-SSD/dp/B08V83JZH4/ SAMSUNG 980 SSD 1TB PCle 3.0x4, NVMe] and search "temperature"
 ** smartd reports warnings: Device: /dev/nvme0, Critical Warning (0x02): Temperature. It seems this is a common problem.
-** It is DRAMless. See [https://www.tomshardware.com/reviews/samsung-980-m2-nvme-ssd-review Samsung 980 M.2 NVMe SSD Review: Going DRAMless with V6 V-NAND (Updated)]. [https://slickdeals.net/f/17020420-crucial-p3-plus-2tb-pcie-gen4-3d-nand-nvme-m-2-ssd-up-to-5000mb-s-ct2000p3pssd8-74-99 Crucial P3 Plus PCIe NVMe M.2 SSD] also has no DRAM. [https://www.amazon.com/gp/product/B09JCBRBP2/ TEAMGROUP T-Force CARDEA A440 Pro Graphene Heatsink 1TB DRAM SLC Cache] & [https://www.amazon.com/gp/product/B098WL46RS/ Crucial P5 Plus 1TB PCIe Gen4] look good with 5y warranty.
 ** '''smartctl -a /dev/nvme0''' can show the current temperature
 ** It seems 970 EVO plus is better.
+= Monitor dashboard: scrutiny =
+<ul>
+<li>https://github.com/AnalogJ/scrutiny WebUI for smartd S.M.A.R.T monitoring. The dashboard shows the health of all disks at a glance.
+<li>The following is the Docker Compose file '''docker-compose.yml''' for my setup. Note that although "lsblk" shows my NVMe drive as "nvme0n1", I need to use "nvme0" to make it work. To view the dashboard, go to http://localhost:8082.
+<pre>
+version: '3.5'
+services:
+  scrutiny:
+    container_name: scrutiny
+    image: ghcr.io/analogj/scrutiny:master-omnibus
+    cap_add:
+      - SYS_ADMIN
+      - SYS_RAWIO
+    ports:
+      - "8082:8080" # webapp
+      - "8086:8086" # influxDB admin
+    volumes:
+      - /run/udev:/run/udev:ro
+      - ./config:/opt/scrutiny/config
+      - ./influxdb:/opt/scrutiny/influxdb
+    devices:
+      - "/dev/sda"
+      - "/dev/nvme0"
+</pre>
+<li>As the README describes, smartd does not record S.M.A.R.T. attribute history, so it can be hard to tell whether an attribute is degrading slowly over time.
+</ul>
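Because lsblk reports the namespace ("nvme0n1") while the compose file above needs the controller node ("nvme0"), the mapping can be scripted. This is a minimal sketch assuming the standard Linux naming scheme nvme<controller>n<namespace>p<partition>; the function name nvme_controller_of is made up for illustration.

```bash
# Hypothetical helper: maps an NVMe namespace or partition node
# (e.g. /dev/nvme0n1 or /dev/nvme0n1p1) to its controller node (/dev/nvme0).
# Other device names (e.g. /dev/sda) are printed unchanged.
nvme_controller_of() {
    printf '%s\n' "$1" | sed -E 's#(nvme[0-9]+)n[0-9]+(p[0-9]+)?$#\1#'
}

# Example: print the node to list under "devices:" for every disk lsblk reports.
# lsblk -dno NAME | while read -r name; do nvme_controller_of "/dev/$name"; done
```

For the setup described above, this would turn /dev/nvme0n1 into /dev/nvme0 and leave /dev/sda as it is.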
+= Reviews =
+* [https://www.tomshardware.com/reviews/samsung-980-m2-nvme-ssd-review Samsung 980 M.2 NVMe SSD Review: Going DRAMless with V6 V-NAND (Updated)]
+* [https://www.tomshardware.com/reviews/crucial-p3-plus-ssd-review-capacity-on-the-cheap Crucial P3 Plus PCIe NVMe M.2 SSD] also has no DRAM.
+* [https://www.amazon.com/gp/product/B09JCBRBP2/ TEAMGROUP T-Force CARDEA A440 Pro Graphene Heatsink 1TB DRAM SLC Cache]
+* [https://www.amazon.com/gp/product/B098WL46RS/ Crucial P5 Plus 1TB PCIe Gen4]. Both this drive and the TEAMGROUP drive above look good, with a 5-year warranty.
 = Some disks =
+== Best SSDs ==
+[https://www.pcworld.com/article/407542/best-ssds.html Best SSDs of 2024: Reviews and buying advice]
 == Samsung Portable SSD T7 ==
 The [https://semiconductor.samsung.com/consumer-storage/portable-ssd/t7/ Samsung Portable SSD T7] (PSSD T7) supports S.M.A.R.T. data reporting. Linux tools initially did not work with it, but this was fixed by a pull request to the `drivedb` of `smartmontools`, so a current `drivedb` handles the drive correctly. If yours is out of date, either pass the `-d sntasmedia` argument to `smartctl` manually or update the `drivedb` independently of `smartmontools` with the [https://www.smartmontools.org/wiki/Download update-smart-drivedb] command; see [https://superuser.com/questions/1649054/linux-tools-dont-work-with-samsung-pssd-t7 Linux tools don't work with Samsung PSSD T7] & [https://www.smartmontools.org/ticket/1403?cversion=0&cnum_hist=12 NVMe pass-through support for Samsung T7 SSD].
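The two options can be combined in a small wrapper that tries autodetection first and only then forces the device type. This is a sketch under assumptions: the function name smart_read_usb_nvme is made up, /dev/sdX is a placeholder, and the check relies on smartctl's exit status bits 0 to 2 signalling that the command line, the device open, or the SMART command failed.

```bash
# Hypothetical helper: print SMART data for a USB-attached NVMe SSD.
# Tries smartctl's autodetection first; if smartctl could not parse the
# command line, open the device, or run the SMART command (exit status
# bits 0-2), it retries with the device type forced to sntasmedia.
smart_read_usb_nvme() {
    local dev out rc
    dev=$1
    out=$(smartctl -a "$dev") && rc=0 || rc=$?
    if [ $((rc & 7)) -eq 0 ]; then
        printf '%s\n' "$out"
        return 0
    fi
    smartctl -a -d sntasmedia "$dev"
}

# Example usage (run as root; left commented out because it needs the drive):
# smart_read_usb_nvme /dev/sdX
# Alternatively, refresh the drive database once and use plain smartctl:
# sudo update-smart-drivedb
```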

Smartctl: Difference between revisions
