Curl
Latest revision as of 11:33, 10 February 2024

curl vs wget

  • http://daniel.haxx.se/docs/curl-vs-wget.html
  • How to Use curl to Download Files From the Linux Command Line: https://www.howtogeek.com/447033/how-to-use-curl-to-download-files-from-the-linux-command-line/

sudo apt-get install curl

For example, the Download link at the National Geographic Travel Photo Contest 2014 works with curl but not with wget. curl's -o option works here, while wget's -o (which writes a log file, not the download) does not. Note that curl also has a -O (capital O) option, which writes the output to a local file named like the remote file.

curl \
 http://travel.nationalgeographic.com/u/TvyamNb-BivtNwcoxtkc5xGBuGkIMh_nj4UJHQKuoXEsSpOVjL0t9P0vY7CvlbxSYeJUAZrEdZUAnSJk2-sJd-XIwQ_nYA/ \
 -o owl.jpg

Should I Use Curl Or Wget? and curl vs Wget

  • The main benefit of using the wget command is that it can be used to recursively download files.
  • The curl command lets you use wildcards to specify the URLs you wish to retrieve. curl also supports more protocols than wget, which handles only HTTP, HTTPS, and FTP.
  • The wget command can recover when a download fails whereas the curl command cannot.

Actually, curl supports resuming downloads too (with -C -), but not every FTP server supports resuming. The following examples show that a resumed download works with wget/curl against the NCBI FTP server but not against the Illumina FTP server.

$ wget -c ftp://igenome:[email protected]/Drosophila_melanogaster/Ensembl/BDGP6/Drosophila_melanogaster_Ensembl_BDGP6.tar.gz
--2017-04-13 10:46:16--  ftp://igenome:*password*@ussd-ftp.illumina.com/Drosophila_melanogaster/Ensembl/BDGP6/Drosophila_melanogaster_Ensembl_BDGP6.tar.gz
           => ‘Drosophila_melanogaster_Ensembl_BDGP6.tar.gz’
Resolving ussd-ftp.illumina.com (ussd-ftp.illumina.com)... 66.192.10.36
Connecting to ussd-ftp.illumina.com (ussd-ftp.illumina.com)|66.192.10.36|:21... connected.
Logging in as igenome ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /Drosophila_melanogaster/Ensembl/BDGP6 ... done.
==> SIZE Drosophila_melanogaster_Ensembl_BDGP6.tar.gz ... 762893718
==> PASV ... done.    ==> REST 1706053 ... 
REST failed, starting from scratch.
 
==> RETR Drosophila_melanogaster_Ensembl_BDGP6.tar.gz ... done.
Length: 762893718 (728M), 761187665 (726M) remaining (unauthoritative)
 
 0% [                                                                                                                   ] 374,832     79.7KB/s  eta 2h 35m ^C
 
$ curl -L -O -C - ftp://igenome:[email protected]/Drosophila_melanogaster/Ensembl/BDGP6/Drosophila_melanogaster_Ensembl_BDGP6.tar.gz
** Resuming transfer from byte position 1706053
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  727M    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
curl: (31) Couldn't use REST

$ wget -c ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/common_all_20160601.vcf.gz
--2017-04-13 10:52:02--  ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/common_all_20160601.vcf.gz
           => ‘common_all_20160601.vcf.gz’
Resolving ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)... 2607:f220:41e:250::7, 130.14.250.10
Connecting to ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)|2607:f220:41e:250::7|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /snp/organisms/human_9606_b147_GRCh37p13/VCF ... done.
==> SIZE common_all_20160601.vcf.gz ... 1023469198
==> EPSV ... done.    ==> RETR common_all_20160601.vcf.gz ... done.
Length: 1023469198 (976M) (unauthoritative)
 
24% [===========================>                                                                                       ] 255,800,120 55.2MB/s  eta 15s    ^C
 
$ wget -c ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/common_all_20160601.vcf.gz
--2017-04-13 10:52:11--  ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/common_all_20160601.vcf.gz
           => ‘common_all_20160601.vcf.gz’
Resolving ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)... 2607:f220:41e:250::7, 130.14.250.10
Connecting to ftp.ncbi.nih.gov (ftp.ncbi.nih.gov)|2607:f220:41e:250::7|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /snp/organisms/human_9606_b147_GRCh37p13/VCF ... done.
==> SIZE common_all_20160601.vcf.gz ... 1023469198
==> EPSV ... done.    ==> REST 267759996 ... done.    
==> RETR common_all_20160601.vcf.gz ... done.
Length: 1023469198 (976M), 755709202 (721M) remaining (unauthoritative)
 
47% [++++++++++++++++++++++++++++++========================>                                                            ] 491,152,032 50.6MB/s  eta 12s    ^C

$ curl -L -O -C - ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/common_all_20160601.vcf.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 65  976M   65  639M    0     0  83.7M      0  0:00:11  0:00:07  0:00:04 90.4M^C

curl man page, supported protocols

https://curl.haxx.se/docs/manpage.html

curl complete guide

wget overwrites the existing file

Use the -N or --timestamping option to turn on time-stamping: files are not re-retrieved unless the remote copy is newer than the local one. For example: wget -N URL .

wget to specify the output directory

Use the -P prefix or --directory-prefix=prefix option. For example, wget URL -P /tmp or wget URL -P /tmp/ .

Hide progress bar output

curl hide progress bar output on Linux/Unix shell scripts: https://www.cyberciti.biz/faq/curl-hide-progress-bar-output-linux-unix-macos/

In short, curl's -s (--silent) option hides the progress meter (add -S to keep error messages), and wget's -q option suppresses its output.
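As a quick offline check, curl's -s/--silent flag is what hides the progress meter, and -S restores error reporting; this can be exercised with a file:// URL, assuming curl is installed (the file paths are scratch placeholders):

```shell
# create a small local file to "download"
printf 'hello\n' > /tmp/quiet-demo.txt

# default behaviour: curl prints a progress meter/table on stderr
curl -o /tmp/quiet-demo.out "file:///tmp/quiet-demo.txt"

# -s silences the meter; -S re-enables error messages on failure
out=$(curl -sS "file:///tmp/quiet-demo.txt")
echo "$out"
```

wget's counterpart is -q (quiet) or -nv (non-verbose).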

wget and username/password

http://www.cyberciti.biz/faq/wget-command-with-username-password/

For example (the credentials are placeholders): wget --user=alice --password=secret https://example.com/file.txt . For HTTP or FTP specifically, wget also accepts --http-user/--http-password and --ftp-user/--ftp-password.

Download and Un-tar(Extract) in One Step

If we want to avoid saving a temporary file, we can use a single piped statement.

curl http://download.osgeo.org/geos/geos-3.5.0.tar.bz2 | tar xvj
# OR
wget http://download.osgeo.org/geos/geos-3.5.0.tar.bz2 -O - | tar jx

# For .gz file
wget -O - ftp://ftp.direcory/file.gz | gunzip -c > gunzip.out

See shellhacks.com. The magic part is the wget option "-O -": it writes the document to standard output instead of to a file.

The "-c" option tells gunzip to write to standard output. (Since gunzip is already reading from a pipe and its output is redirected, "-c" may not be strictly necessary here.)
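The streaming behaviour is easy to verify offline: build a small archive and feed it to tar/gunzip on standard input, exactly as the piped downloads above do (all paths are scratch files, not from the original examples):

```shell
# set up a scratch directory with one file and pack it as .tar.bz2
workdir=$(mktemp -d)
mkdir -p "$workdir/src" "$workdir/out"
echo "hello from the archive" > "$workdir/src/a.txt"
tar cjf "$workdir/pkg.tar.bz2" -C "$workdir" src

# 'cat archive | tar xj' mimics 'curl URL | tar xj': tar reads the stream from stdin
cat "$workdir/pkg.tar.bz2" | tar xj -C "$workdir/out"

# same principle for a single .gz file: gunzip reads stdin, writes stdout
gzip -c "$workdir/src/a.txt" > "$workdir/a.txt.gz"
cat "$workdir/a.txt.gz" | gunzip -c > "$workdir/a.copy"

cat "$workdir/out/src/a.txt"
```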

Download and execute the script in one step

See Execute bash script from URL. The "-s" option puts curl in silent mode.

curl -s https://server/path/script.sh | sudo sh

curl -s http://server/path/script.sh | sudo bash /dev/stdin arg1 arg2

sudo -v && wget -nv -O- https://download.calibre-ebook.com/linux-installer.sh | sudo sh /dev/stdin
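A minimal offline check of the `sh /dev/stdin arg1 arg2` pattern above, with a made-up one-line script standing in for the downloaded one:

```shell
# the piped text becomes the script body; words after /dev/stdin become $1, $2, ...
result=$(printf 'echo "hello $1 $2"\n' | sh /dev/stdin alpha beta)
echo "$result"
```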

Download and install binary software using sudo

One example (Calibre) is like

sudo -v && wget -nv -O- https://raw.githubusercontent.com/kovidgoyal/calibre/master/setup/linux-installer.py | \
sudo python -c "import sys; main=lambda:sys.stderr.write('Download failed\n'); exec(sys.stdin.read()); main()"

Note that in wget the option "-O-" means writing to standard output (so the file from the URL is NOT written to disk) and "-nv" means non-verbose.

If the option "-O-" is not used, it is better to add the "-N" option so wget replaces an existing local file only when the remote copy is newer.

Another example is adding the GPG key.

# https://docs.docker.com/install/linux/docker-ce/ubuntu/
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

See the Logging and Download options in wget's manual.

       -O file
       --output-document=file
           The documents will not be written to the appropriate files, but all
           will be concatenated together and written to file.  If - is used as
           file, documents will be printed to standard output, disabling link
           conversion.  (Use ./- to print to a file literally named -.)

curl and POST request
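Common curl POST forms look like the following (the URLs and field names are invented); the offline check at the end uses Python's built-in http.server, which answers 501 to a POST because it only implements GET and HEAD:

```shell
# typical POST forms (example.com and the field names are placeholders):
#   curl -d 'name=alice&role=admin' https://example.com/api/login        # form-encoded POST
#   curl -H 'Content-Type: application/json' -d '{"name":"alice"}' https://example.com/api
#   curl -F 'file=@report.pdf' https://example.com/upload                # multipart upload

# offline check: -d implies POST; http.server has no do_POST, so it returns 501
python3 -m http.server 8037 --bind 127.0.0.1 >/dev/null 2>&1 &
srv=$!
sleep 1
code=$(curl -s -o /dev/null -w '%{http_code}' -d 'name=alice' http://127.0.0.1:8037/)
kill "$srv" 2>/dev/null
echo "$code"
```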

curl and proxy

How to use curl command with proxy username/password on Linux/Unix. In short: curl -x http://proxy.example.com:3128 -U user:password URL , where -x/--proxy sets the proxy and -U/--proxy-user supplies the credentials (the proxy host here is a placeholder).

Website performance

httpstat – A Curl Statistics Tool to Check Website Performance

wget/curl a file with correct name when redirected

wget --trust-server-names <url>   # name the file after the final, redirected URL
# Or
wget --content-disposition <url>  # honor the Content-Disposition header
# Or
curl -JLO <url>   # -L follow redirects, -O use remote name, -J prefer Content-Disposition name

wget to download a folder

https://stackoverflow.com/questions/8755229/how-to-download-all-files-but-not-html-from-a-website-using-wget

wget -A pdf,jpg,PDF,JPG -m -p -E -k -K -np http://site/path/
   # -A accept only these suffixes, -m mirror, -p page requisites, -E adjust
   # extensions, -k convert links, -K keep .orig backups, -np don't ascend to the parent
wget -r ftp://server-address.com/directory

Download a website

  • 6 Tools to Download an Entire Website for Offline Reading: https://www.makeuseof.com/tag/how-do-i-download-an-entire-website-for-offline-reading/
  • ArchiveBox (open-source, self-hosted web archiving): https://github.com/ArchiveBox/ArchiveBox

wget

  • https://www.gnu.org/software/wget/manual/wget.html
  • How To Download A Website With Wget The Right Way: https://simpleit.rocks/linux/how-to-download-a-website-with-wget-the-right-way/
  • Downloading an Entire Web Site with wget (Linux Journal): https://www.linuxjournal.com/content/downloading-entire-web-site-wget
  • How to Use wget, the Ultimate Command Line Downloading Tool: https://www.howtogeek.com/281663/how-to-use-wget-the-ultimate-command-line-downloading-tool/
  • How to ignore specific type of files to download in wget?: https://stackoverflow.com/a/17638586

To download a copy of a complete web site, use the recursive option ('-r'). By default it goes up to five levels deep; change the depth with the '-l' option.

All files linked to in the documents are downloaded to enable complete offline viewing ('-p' and '--convert-links' options). Instead of having the progress messages displayed on standard output, you can save them to a log file with the -o option.

wget -p --convert-links -r -l2 linux.about.com -o logfile
wget -p --convert-links -r -l1 https://csgillespie.github.io/efficientR # create csgillespie/efficientR
wget -p --convert-links -r -l2 --reject WMA,doc,mp4,ppt,pdf,zip,exe,vcf  https://xxx.xxx 
   # Exclude certain file types. Takes only 10 sec, for example.

2 Ways to Download Files From Linux Terminal: https://itsfoss.com/download-files-from-linux-terminal/

wget -m --convert-links --page-requisites website_address
  • --convert-links: rewrites links so internal links point to the downloaded resources instead of the web
  • --page-requisites: downloads additional assets such as style sheets so the pages look right offline

Note

  • The index.html file downloaded by the command above still differs from the live website (its hyperlinks). This appears to be unrelated to the --convert-links and -m options.
  • We can use wget to download the original index.html and place it in the downloaded website folder; the result then looks perfect in the browser. The index.html file can also be modified this way. (By contrast, this does not seem to work when the index.html file is placed inside a folder downloaded with HTTrack.)
  • Use --no-parent to avoid downloading folders and files above the current level.
  • The links in CSS/HTML files are rewritten, so they are not identical to the originals.

HTTrack Website Copier

  • https://www.httrack.com/
  • WebHTTrack Website Copier! sudo apt install webhttrack . On Ubuntu, the app runs as a web application (http://HOSTNAME:8080); launch it by typing 'webhttrack' and it opens in the default browser. See Grabbing Websites with WebHTTrack in Linux Magazine.
  • When run on Ubuntu, it starts an HTTP server on port 8080 with its interface in the browser. After the download finishes, the mirrored website can be browsed through the same HTTP server.
  • Use the recursive file-type statistics tip (in the Linux notes) to find file types we don't need to download. Excluding them can save a lot of time when the site has big files such as *.zip, *.ZIP, *.vcf, *.WMA, *.tar, *.tar.gz, *.ova, *.mp4, *.exe, *.jar, *.ogg, *.pdf, *.ppt.

Steps

  1. Select an existing project or create a new project. Click Next.
  2. Action: Update existing download. Add a URL. Click "Set Options...".
    • Click "Scan Rules" and enter the following (one long line) in the box. Click OK.
      -*.ova -*.doc -*.mp4 -*.ppt -*.pdf -*.WMA -*.zip -*.exe +*.png +*.gif +*.jpg +*.jpeg +*.css +*.js -ad.doubleclick.net/*
      
  3. Click Next.
  4. Click Start.

Note

  • It looks like we can modify the local directory name, so we can keep a time-stamped backup.
  • Even when 'update' is chosen, it seems to download all the files again.
  • It appears to rename *.htm files to *.html.

httrack: command-line program

  • http://www.httrack.com/html/fcguide.html, http://www.httrack.com/html/httrack.man.html
  • Example from Grabbing Websites with WebHTTrack in Linux Magazine:

httrack http://www.documentfoundation.org -* +*.htm* +*.pdf -O /home/floeff/websites

  • Create a Local Copy of a Website with HTTrack: https://spin.atomicobject.com/2016/02/12/create-a-website-copy-with-httrack/
  • How to copy website using HTTrack: https://sleeplessbeastie.eu/2019/06/24/how-to-copy-website-using-httrack/

Save Web Pages As Single HTML Files With Monolith

Save Web Pages As Single HTML Files For Offline Use With Monolith (Console)

Internet application: wttr.in, check weather from console/terminal

For example: curl wttr.in or curl wttr.in/London .

Internet application: cheat.sh

See man -> Cheat.sh. For example: curl cheat.sh/tar .

Cookies

My automatic NYT crossword downloading script: https://www.reddit.com/r/crossword/comments/dqtnca/my_automatic_nyt_crossword_downloading_script/

With curl, -c cookies.txt saves received cookies to a file (the "cookie jar") and -b cookies.txt sends them with later requests.

Files downloaded from a browser and wget

The same file downloaded through a browser and through the wget command can have a different file size and behavior.

$ ls -lh biotrip*.gz
-rw-r--r-- 1 brb brb 198M May 15 09:11 biotrip_0.1.0_may19.tar.gz
-rw-rw-r-- 1 brb brb 195M May 14 16:57 biotrip_0.1.0.tar.gz

$ file biotrip_0.1.0_may19.tar.gz # downloaded from a browser (chrome browser, Mac or Linux)
biotrip_0.1.0_may19.tar.gz: POSIX tar archive

$ file biotrip_0.1.0.tar.gz       # downloaded from the wget command
biotrip_0.1.0.tar.gz: gzip compressed data, from HPFS filesystem (OS/2, NT)

$ tar xzvf biotrip_0.1.0_may19.tar.gz 
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
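The "not in gzip format" failure above occurs whenever a file's .tar.gz name does not match its actual bytes. The mismatch is easy to reproduce and diagnose offline (the file names are scratch examples):

```shell
workdir=$(mktemp -d)
echo "payload" > "$workdir/a.txt"

# make a PLAIN tar archive but give it a misleading .tar.gz name
tar cf "$workdir/mislabeled.tar.gz" -C "$workdir" a.txt

# forcing gzip decompression fails, just like in the transcript above
if tar xzf "$workdir/mislabeled.tar.gz" -C "$workdir" 2>/dev/null; then
  verdict="gzip ok"
else
  verdict="not gzip"
fi
echo "$verdict"

# GNU tar detects the real format when reading, so plain 'tar xf' extracts it fine
mkdir "$workdir/out"
tar xf "$workdir/mislabeled.tar.gz" -C "$workdir/out"
cat "$workdir/out/a.txt"
```

Running `file` on the archive (as the transcript does) is another quick way to see what the bytes really are.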