Smartctl

Smartmontools

NVME

  • Version
    $ smartctl -v | head -1
    smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.19.0-46-generic] (local build)
    
  • Keywords to look for: Written, Percentage
    $ sudo smartctl -a /dev/nvme0 | grep "Writ"
    Data Units Written:                 274,127 [140 GB]
    Host Write Commands:                7,499,312
    
    $ sudo smartctl -a /dev/nvme0 | grep "Percentage"
    Percentage Used:                    0%
    
  • Full output
    $ sudo smartctl -a /dev/nvme0 
    smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.19.0-46-generic] (local build)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF INFORMATION SECTION ===
    Model Number:                       CT1000P3SSD8
    Serial Number:                      2314E6C4100F
    Firmware Version:                   P9CR30A
    PCI Vendor/Subsystem ID:            0xc0a9
    IEEE OUI Identifier:                0x00a075
    Controller ID:                      1
    NVMe Version:                       1.4
    Number of Namespaces:               1
    Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
    Namespace 1 Formatted LBA Size:     512
    Namespace 1 IEEE EUI-64:            6479a7 77f00000c9
    Local Time is:                      Sat Jul  1 11:22:30 2023 EDT
    Firmware Updates (0x12):            1 Slot, no Reset required
    Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
    Optional NVM Commands (0x005e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
    Log Page Attributes (0x06):         Cmd_Eff_Lg Ext_Get_Lg
    Maximum Data Transfer Size:         64 Pages
    Warning  Comp. Temp. Threshold:     85 Celsius
    Critical Comp. Temp. Threshold:     95 Celsius
    
    Supported Power States
    St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
     0 +     6.00W  0.0000W       -    0  0  0  0        0       0
     1 +     3.00W  0.0000W       -    0  0  0  0        0       0
     2 +     1.50W  0.0000W       -    0  0  0  0        0       0
     3 -   0.0250W  0.0000W       -    3  3  3  3     5000    1900
     4 -   0.0030W       -        -    4  4  4  4    13000  100000
    
    Supported LBA Sizes (NSID 0x1)
    Id Fmt  Data  Metadt  Rel_Perf
     0 +     512       0         1
     1 -    4096       0         0
    
    === START OF SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    SMART/Health Information (NVMe Log 0x02)
    Critical Warning:                   0x00
    Temperature:                        26 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          5%
    Percentage Used:                    0%
    Data Units Read:                    201,206 [103 GB]
    Data Units Written:                 274,128 [140 GB]
    Host Read Commands:                 4,982,258
    Host Write Commands:                7,499,381
    Controller Busy Time:               23
    Power Cycles:                       13
    Power On Hours:                     408
    Unsafe Shutdowns:                   9
    Media and Data Integrity Errors:    0
    Error Information Log Entries:      42
    Warning  Comp. Temperature Time:    0
    Critical Comp. Temperature Time:    0
    Temperature Sensor 1:               26 Celsius
    Temperature Sensor 2:               31 Celsius
    Temperature Sensor 8:               26 Celsius
    
    Error Information (NVMe Log 0x01, 16 of 16 entries)
    Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
      0         42     0  0x5007  0x4005  0x028            0     0     -
    

Wear out

Use the Percentage Used attribute.

smartctl -a /dev/nvme0 | grep "Percentage"
Percentage Used:                    2%
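For scripting, the percentage can be pulled out as a bare number. The following is only a minimal sketch: the function name nvme_percent_used is hypothetical, and it assumes the "Percentage Used:" line has the format shown above.

```shell
# Hypothetical helper: read `smartctl -a` output on stdin and print
# the NVMe "Percentage Used" value as a bare integer (e.g. 2).
nvme_percent_used() {
    awk -F: '/^[ \t]*Percentage Used:/ {
        gsub(/[ %]/, "", $2)   # strip spaces and the percent sign
        print $2
        exit
    }'
}

# Usage (needs root):
#   sudo smartctl -a /dev/nvme0 | nvme_percent_used
```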

Difference between /dev/nvme0, /dev/nvme0n1, and /dev/nvme0n1p1

Why is there both character device and block device for nvme?, Interpreting wiki/documentation for an NVMe disk

  • /dev/nvme0 represents the raw device and is the “control” device node that you use to configure the hardware. It’s the NVMe device controller.
  • /dev/nvme0n1, on the other hand, represents the first namespace on that device. The n1 denotes the first namespace of the device. These are the devices you use for actual storage, which will behave essentially as disks. This is the name that the "lsblk" command shows. So /dev/nvme0n1 is like /dev/sda.
  • /dev/nvme0n1p1 represents a partition on an NVMe storage namespace. So /dev/nvme0n1p1 is like /dev/sda1.
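The naming scheme above can be sketched as a small classifier. This is an illustration only: the function name nvme_node_kind is hypothetical, and it looks at the name alone without checking that the device exists.

```shell
# Hypothetical helper: classify an NVMe device path by its name alone.
# The glob patterns are deliberately loose; this is a heuristic, not a validator.
nvme_node_kind() {
    case "${1##*/}" in                        # strip the leading directory
        nvme[0-9]*n[0-9]*p[0-9]*) echo "partition" ;;
        nvme[0-9]*n[0-9]*)        echo "namespace" ;;
        nvme[0-9]*)               echo "controller" ;;
        *)                        echo "unknown" ;;
    esac
}

# Usage:
#   nvme_node_kind /dev/nvme0n1p1
```

On a live system, [ -c /dev/nvme0 ] and [ -b /dev/nvme0n1 ] should confirm that the controller is a character device and the namespace is a block device.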

nvme-cli command

SATA SSD

  • How can I monitor the TBW on my Samsung SSD?
  • Crucial shows rated as 220TB Total Bytes Written (TBW) while Samsung shows as 600 TB TBW. Both 5 year warranty.
  • Sector size is 512 bytes.
  • The ID# may be different on different devices.
  • 1 TiB is 1024^4 bytes (about 1.1 x 10^12); 1 TB (decimal) is 10^12 bytes.
  • Keywords to look for: Written, Percent
    $ sudo smartctl -a /dev/sda | grep "Writ"
    206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
    246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       7384050441
    
    $ sudo smartctl -a /dev/sda | grep "Sector"
    Sector Size:      512 bytes logical/physical
    
    $ sudo smartctl -a /dev/sda | grep "Percent" # 99% of life remains in this case
    202 Percent_Lifetime_Remain 0x0030   099   099   001    Old_age   Offline      -       1
    
  • Full output
    $ sudo smartctl --all /dev/sda
    smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.19.0-46-generic] (local build)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF INFORMATION SECTION ===
    Model Family:     Crucial/Micron Client SSDs
    Device Model:     Crucial_CT525MX300SSD1
    Serial Number:    1644148274F7
    LU WWN Device Id: 5 00a075 1148274f7
    Firmware Version: M0CR031
    User Capacity:    525,112,713,216 bytes [525 GB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    Solid State Device
    Form Factor:      2.5 inches
    TRIM Command:     Available, deterministic, zeroed
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ACS-3 T13/2161-D revision 5
    SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Sat Jul  1 11:17:53 2023 EDT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    General SMART Values:
    Offline data collection status:  (0x00)	Offline data collection activity
    					was never started.
    					Auto Offline Data Collection: Disabled.
    Self-test execution status:      (   0)	The previous self-test routine completed
    					without error or no self-test has ever 
    					been run.
    Total time to complete Offline 
    data collection: 		( 1391) seconds.
    Offline data collection
    capabilities: 			 (0x7b) SMART execute Offline immediate.
    					Auto Offline data collection on/off support.
    					Suspend Offline collection upon new
    					command.
    					Offline surface scan supported.
    					Self-test supported.
    					Conveyance Self-test supported.
    					Selective Self-test supported.
    SMART capabilities:            (0x0003)	Saves SMART data before entering
    					power-saving mode.
    					Supports SMART auto save timer.
    Error logging capability:        (0x01)	Error logging supported.
    					General Purpose Logging supported.
    Short self-test routine 
    recommended polling time: 	 (   2) minutes.
    Extended self-test routine
    recommended polling time: 	 (   7) minutes.
    Conveyance self-test routine
    recommended polling time: 	 (   3) minutes.
    SCT capabilities: 	       (0x0035)	SCT Status supported.
    					SCT Feature Control supported.
    					SCT Data Table supported.
    
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
      5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       14046
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       118
    171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
    172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
    173 Ave_Block-Erase_Count   0x0032   099   099   000    Old_age   Always       -       17
    174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       78
    183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
    184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    194 Temperature_Celsius     0x0022   077   058   000    Old_age   Always       -       23 (Min/Max 12/42)
    196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
    197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       1
    202 Percent_Lifetime_Remain 0x0030   099   099   001    Old_age   Offline      -       1
    206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
    246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       7384050441
    247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       231070651
    248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       94337836
    180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       1940
    210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
    
    SMART Error Log Version: 1
    No Errors Logged
    
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Vendor (0xff)       Completed without error       00%     14038         -
    # 2  Vendor (0xff)       Completed without error       00%     13741         -
    # 3  Vendor (0xff)       Completed without error       00%     13548         -
    # 4  Vendor (0xff)       Completed without error       00%     13126         -
    # 5  Vendor (0xff)       Completed without error       00%     12915         -
    # 6  Vendor (0xff)       Completed without error       00%      5647         -
    # 7  Vendor (0xff)       Completed without error       00%      5484         -
    # 8  Vendor (0xff)       Completed without error       00%      5312         -
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
    

Wear out

Use the attribute Media_Wearout_Indicator, Percentage Used, or SSD_Life_Left, whichever the drive reports.

# Kingston SSD 240 GB
# smartctl -a /dev/sda | grep Left
231 SSD_Life_Left           0x0000   002   002   000    Old_age   Offline      -       98

# Crucial 1T
$ sudo smartctl -a /dev/sda | grep Percent
202 Percent_Lifetime_Remain 0x0030   099   099   001    Old_age   Offline      -       1

# PNY CS900 1T
$ sudo smartctl -a /dev/sda | grep -A 1 -i "lifetime" 
  # '-A 1' is to include one line of context after the match
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      8360   
  # The "00% Remaining" in this context means the test completed 
  # with no remaining time left, which is expected for a completed test.
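Because the attribute name differs by vendor, a single filter can look for all of them at once. The sketch below is an assumption-laden illustration: the function name sata_wear_attrs is hypothetical and the name list covers only the attributes mentioned in this section. It prints both VALUE and RAW because, as the Kingston and Crucial samples show, vendors do not agree on which column carries the remaining life.

```shell
# Hypothetical helper: read `smartctl -a` output on stdin and print the
# name, normalized VALUE and RAW_VALUE of common SSD wear attributes.
sata_wear_attrs() {
    awk '$2 ~ /^(Media_Wearout_Indicator|Percent_Lifetime_Remain|SSD_Life_Left)$/ {
        print $2, "value=" $4, "raw=" $10
    }'
}

# Usage (needs root):
#   sudo smartctl -a /dev/sda | sata_wear_attrs
```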

USB adapter

  • man smartctl and search for "-d TYPE".
  • The message Unknown USB bridge [Please specify device type with the -d option.] indicates that smartctl is unable to automatically detect the type of USB bridge used by your external drive.
  • You can try the -d sat option to specify that the device is a SATA drive behind a SCSI-to-ATA Translation (SAT) layer. This is exactly the case for the Vantec adapter: GSmartControl also shows it as /dev/sdc (scsi) in the Drive information. Oddly, with the Ugreen adapter, dmesg also reports scsi but GSmartControl does not show scsi. That said, it does not hurt to add the -d sat parameter to the smartctl command.
    $ sudo dmesg
    ...
    [997217.895800] scsi 5:0:0:0: Direct-Access     Crucial_ CT525MX300SSD1   1414 PQ: 0 ANSI: 6
    [997217.899060] sd 5:0:0:0: Attached scsi generic sg1 type 0
    ...
    
    sudo smartctl -a -d sat /dev/sdX
    

    The -d option is used to specify the device type, which can be useful when smartctl is unable to correctly guess the device type. For example, on some systems, smartctl may correctly guess that a drive is a SATA drive, while on other systems it may not. In such cases, the -d sat option can be used to explicitly specify that the device is a SATA drive.

  • GSmartControl shows SMART supported as Yes with the Ugreen adapter but No with the Vantec adapter.
  • On Samsung PSSD T7, I need to use sudo smartctl -a -d scsi /dev/sdc
  • If this doesn’t work, you can try other device types such as -d sat,12, -d usbcypress, -d usbjmicron, -d usbprolific, or -d usbsunplus. You can find more information about these options in the smartctl man page or by running smartctl --help.
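Trying the device types one by one can be automated. The following is a rough sketch, not a definitive method: the function name find_smart_type and the SMARTCTL override variable are hypothetical, and treating exit status 0 of smartctl -i as "this type works" is an assumption that may not hold for every bridge. The scsi type is tried last since USB drives generally appear as SCSI devices.

```shell
# Hypothetical helper: try several -d types until `smartctl -i` exits with 0.
# SMARTCTL can be overridden, e.g. SMARTCTL="sudo smartctl".
find_smart_type() {
    dev=$1
    for t in sat sat,12 usbcypress usbjmicron usbprolific usbsunplus scsi; do
        if ${SMARTCTL:-smartctl} -i -d "$t" "$dev" >/dev/null 2>&1; then
            echo "$t"        # first type that smartctl accepted
            return 0
        fi
    done
    return 1                 # none of the types worked
}

# Usage:
#   SMARTCTL="sudo smartctl" find_smart_type /dev/sdX
```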

eMMC

dmesg | grep mmc

Calculation

$ sudo apt install calc
$ calc 274127*512*1000/10^9
	140.353024
> 274127 * 512 * 1000 / 10^9 # sudo smartctl -a /dev/nvme0 | grep "Data Units Written"
[1] 140.353 # GB; one NVMe data unit is 1000 x 512 bytes, which matches the [140 GB] that smartctl prints

> 7384050441 * 512 / 1024^3 # sudo smartctl -a /dev/sda | grep "Total_LBAs_Written"
[1] 3520.99 # GiB (about 3.78 TB)
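The same conversions can be done without calc or R. This is a minimal sketch with hypothetical function names; it assumes that one NVMe data unit is 1000 blocks of 512 bytes and that the SATA drive uses 512-byte sectors unless a second argument says otherwise. Results are in decimal TB (10^12 bytes), the unit used for TBW ratings.

```shell
# Hypothetical helpers: convert smartctl write counters to decimal terabytes.
# Assumption: one NVMe data unit = 1000 * 512 bytes.
nvme_units_to_tb() {
    awk -v n="$1" 'BEGIN { printf "%.3f\n", n * 512000 / 1e12 }'
}

# Assumption: 512-byte sectors unless a sector size is given as $2.
sata_lbas_to_tb() {
    awk -v n="$1" -v s="${2:-512}" 'BEGIN { printf "%.3f\n", n * s / 1e12 }'
}

# Usage:
#   nvme_units_to_tb 274127       # Data Units Written
#   sata_lbas_to_tb 7384050441    # Total_LBAs_Written
```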

Understanding smartctl -a output

SMART overall-health self-assessment test result

  • You can run SMART tests on a mounted disk. However, it's generally recommended to run long tests on unmounted disks to prevent any potential issues, especially during read/write operations.
  • You can run the command smartctl -H /dev/sdX to check the overall-health self-assessment test result of the drive.
  • If the test result is PASSED, it means that the drive is considered healthy according to the SMART system. If the test result is FAILED, it means that the drive is considered to be in a pre-failure condition and may fail soon.
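In a script, the result line can be turned into an exit status. The sketch below is an illustration under stated assumptions: the function name smart_health_ok is hypothetical, and it only recognizes the "overall-health self-assessment test result" line shown on this page, which other device types may word differently.

```shell
# Hypothetical helper: read `smartctl -H` output on stdin and
# succeed (exit status 0) only if the overall-health result is PASSED.
smart_health_ok() {
    grep -q 'overall-health self-assessment test result: PASSED'
}

# Usage (needs root):
#   sudo smartctl -H /dev/sdX | smart_health_ok || echo "check this drive"
```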

Where is the log file

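  • The test results are displayed directly in the terminal and are also stored in the drive's firmware, where they can be viewed as long as the drive is operational. The self-test log records only the power-on hour of each test (the LifeTime(hours) column), so by default we can't find the test date/time.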
  • Use "smartctl -a /dev/sda > smartctl_results.txt" to save the results to a file.
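Since the drive itself does not keep the calendar date, one option is to put it in the file name. A minimal sketch, with a hypothetical function name and file name layout:

```shell
# Hypothetical helper: build a dated file name for a device path,
# in the form smartctl_<device>_<YYYY-MM-DD_HHMMSS>.txt
smart_log_name() {
    printf 'smartctl_%s_%s.txt\n' "${1##*/}" "$(date +%Y-%m-%d_%H%M%S)"
}

# Usage (needs root):
#   sudo smartctl -a /dev/sda > "$(smart_log_name /dev/sda)"
```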

Passed status

  • The "PASSED" status indicates that the drive's overall SMART health checks have been completed, and no attributes have crossed their critical thresholds at the time of the assessment.

Low risk error

  • A few reallocated sectors
  • Drives with occasional read or write errors (like CRC errors)

High risk error

  • Rapidly increasing reallocated sectors count: This indicates the drive is actively deteriorating. High or rapidly increasing values (e.g., > 5–10) can indicate potential failure.
  • Current pending sectors that are not getting reallocated can mean data on those sectors is already corrupted. Non-zero values suggest potential data loss if these sectors are found to be unreadable.
  • Uncorrectable sectors count or errors during read/write operations suggest potential data loss. A value greater than zero is alarming and indicates significant drive issues.
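These three checks can be scripted as a filter over the attribute table. This is a sketch under assumptions: the function name risky_attrs is hypothetical, and it assumes the common IDs 5 (reallocated), 197 (current pending) and 198 (offline uncorrectable), which may differ on some devices.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# sector-health attributes (assumed IDs 5, 197, 198) with a non-zero raw value.
risky_attrs() {
    awk '($1 == 5 || $1 == 197 || $1 == 198) && $10 + 0 > 0 {
        print $1, $2, $10      # ID, attribute name, raw value
    }'
}

# Usage (needs root):
#   sudo smartctl -A /dev/sda | risky_attrs
```

No output means all three raw values are zero. Running it periodically and comparing the numbers shows whether a count is growing.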

Non-Critical Attributes

Attributes like Power-On Hours, Load Cycle Count, and Temperature.

Raw_Read_Error_Rate

https://unix.stackexchange.com/a/384833

  • The THRESH column tells you the lowest value that the vendor considers healthy.
  • If the WORST column shows a value below THRESH in the same row, the drive is considered unhealthy; it also implies that VALUE has been below THRESH at some point. Note that only attributes of type Pre-fail matter when evaluating health.
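The rule can be applied mechanically to the attribute table. The sketch below is an illustration only: the function name failed_prefail_attrs is hypothetical, and it follows the rule exactly as stated above (WORST strictly below THRESH, Pre-fail attributes only). As far as I know, the smartctl man page describes an attribute as failed when the normalized value is less than or equal to the threshold, so treat this as an approximation.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# Pre-fail attributes whose WORST value is below THRESH.
failed_prefail_attrs() {
    awk '$7 == "Pre-fail" && $5 + 0 < $6 + 0 {
        print $1, $2, "worst=" $5, "thresh=" $6
    }'
}

# Usage (needs root):
#   sudo smartctl -A /dev/sda | failed_prefail_attrs
```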

Current Pending Sector

Offline uncorrectable

UDMA CRC error count

  • UNRAID flagged this on my Crucial CT525MX300 525G SSD, but the overall-health result is still PASSED.

Current pending ECC count

  • UNRAID flagged this on my Crucial CT1000MX500 1T SSD. I still added it to the array. After a while, the error was gone.

Output from a brand new disk

SMART support/capability

  • Use the whole-disk device: sudo smartctl -a /dev/sdb. With a partition, as in sudo smartctl -a /dev/sdb1, it shows SMART support is: Unavailable - device lacks SMART capability.

GSmartControl

  • https://gsmartcontrol.shaduri.dev/downloads
  • GSmartControl is part of GParted Live.
  • GSmartControl 1.1.3 -> Options -> Update Drive Database (failed). I downloaded and installed 1.1.4; a new window opens but shows no progress.
  • For some reason, GSmartControl shows my WD Black NVMe as Unknown model, but "sudo smartctl -a /dev/nvme0n1 | grep -i model" can display the model. So the command line tool is better.
  • GSmartControl can show the command it used, though it does not print everything. Select a disk and click Options -> View Execution Log.
    UGREEN adapter
    $ sudo smartctl --info --health --capabilities  /dev/sdb
    === START OF INFORMATION SECTION ===
    ...
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    ...
    

    VANTEC adapter

    $ sudo smartctl --info --health --capabilities  /dev/sdb
    === START OF INFORMATION SECTION ===
    ...
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART Status not supported: Incomplete response, ATA output registers missing
    SMART overall-health self-assessment test result: PASSED
    Warning: This result is based on an Attribute check.
    ...
    

    The rest of the output is very similar. This bridge may not pass on all SMART commands.

  • UGREEN SATA/USB adapter and VANTEC.

smartd: SMART Disk Monitoring Daemon

Monitor temperature

sudo apt install hddtemp
hddtemp
hddtemp /dev/sda

hddtemp -d /dev/sd[abcd]
telnet remotebox 7634
# OR  nc 192.168.1.100 7634
  • SAMSUNG 980 SSD 1TB PCIe 3.0x4, NVMe and search "temperature"
    • smartd reports warnings: Device: /dev/nvme0, Critical Warning (0x02): Temperature. It seems this is a common problem.
    • smartctl -a /dev/nvme0 can show the current temperature
    • It seems 970 EVO plus is better.
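For NVMe drives the temperature can be read from the smartctl output directly. A minimal sketch, assuming the "Temperature:" line format shown in the NVMe full output on this page; the function name nvme_temp_c is hypothetical.

```shell
# Hypothetical helper: read `smartctl -a` output for an NVMe drive on stdin
# and print the composite temperature in Celsius as a bare number.
nvme_temp_c() {
    awk '/^[ \t]*Temperature:/ { print $2; exit }'
}

# Usage (needs root):
#   sudo smartctl -a /dev/nvme0 | nvme_temp_c
```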

Monitor dashboard: scrutiny

  • https://github.com/AnalogJ/scrutiny WebUI for smartd S.M.A.R.T monitoring. The dashboard shows all disks health at a glance.
  • Following is the docker compose file docker-compose.yml for my case. Note that although "lsblk" shows my NVMe drive as "nvme0n1", I need to use "nvme0" to make it work. To view the dashboard, go to http://localhost:8082.
    version: '3.5'
    
    services:
      scrutiny:
        container_name: scrutiny
        image: ghcr.io/analogj/scrutiny:master-omnibus
        cap_add:
          - SYS_ADMIN
          - SYS_RAWIO
        ports:
          - "8082:8080" # webapp
          - "8086:8086" # influxDB admin
        volumes:
          - /run/udev:/run/udev:ro
          - ./config:/opt/scrutiny/config
          - ./influxdb:/opt/scrutiny/influxdb
        devices:
          - "/dev/sda"
          - "/dev/nvme0"
    
  • As the README describes, smartd does not record S.M.A.R.T attribute history, so it can be hard to determine if an attribute is degrading slowly over time.

Reviews

Some disks

Best SSDs

Best SSDs of 2024: Reviews and buying advice

Samsung Portable SSD T7

Yes, the Samsung Portable SSD T7 (PSSD T7) supports S.M.A.R.T. data reporting. However, Linux tools have had trouble with it. This was fixed by a pull request to the drivedb of smartmontools: if your drivedb is current, it works correctly. If it is not, you can manually add the -d sntasmedia argument to smartctl, or update the drivedb independently of smartmontools with the update-smart-drivedb command; see Linux tools don't work with Samsung PSSD T7 & NVMe pass-through support for Samsung T7 SSD.