Q: Multithreaded downloading with a shell script


Suppose I have a file containing a large number of URLs, and I want to download them in parallel using an arbitrary number of processes. How can I do that with bash?


4
2018-03-16 15:02






Answers:


Take a look at man xargs:

-P max-procs, --max-procs=max-procs

         Run up to max-procs processes at a time; the default is 1.  If
         max-procs is 0, xargs will run as many processes as possible
         at a time.

Solution:

xargs -P 20 -n 1 wget -nv <urls.txt
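
The same pattern works with curl if wget is unavailable; a minimal sketch, assuming the same urls.txt with one URL per line:

xargs -P 20 -n 1 curl -sS -O <urls.txt

Here -O saves each file under its remote name, and -sS hides the progress bar while still reporting errors.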

9
2018-03-16 15:17



Oh, that's slick. I didn't know about -P. - Richard June
In case the original link goes away, the suggested command (useless cat removed) is: xargs -P 20 -n 1 wget -nv <urls.txt - Gordon Davisson


If you just want to grab every URL (however many there are), then the answer is easy:

#!/bin/bash
URL_LIST="http://url1/ http://url2/"

for url in $URL_LIST ; do
    wget "${url}" >/dev/null &    # redirect first, then background the job
done
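
If the script should also block until every download has finished, a single wait after the loop does it; a minimal sketch:

for url in $URL_LIST ; do
    wget "${url}" >/dev/null &
done
wait    # returns once all background wget jobs have exited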

If you only want to spawn a limited number of downloads, say 10, then you would do something like this:

#!/bin/bash
URL_LIST="http://url1/ http://url2/"

function download() {
    # mark slot $1 busy, fetch the URL in $2, then free the slot
    touch "/tmp/dl-${1}.lck"
    wget "${2}" >/dev/null
    rm -f "/tmp/dl-${1}.lck"
}

for url in $URL_LIST ; do
    while true ; do
        iter=0
        while [ $iter -lt 10 ] ; do
            # a slot is free when its lock file is absent
            if [ ! -f "/tmp/dl-${iter}.lck" ] ; then
                download $iter "$url" &
                break 2    # this URL is handled; move on to the next one
            fi
            let iter++
        done
        sleep 10s    # all 10 slots busy; wait before rescanning
    done
done

Note that I haven't actually tested this; I just knocked it out in about 15 minutes. But you should get the general idea.
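
On a newer bash (4.3+), the same throttling can be done without lock files using wait -n; a rough sketch of the idea, assuming a urls.txt with one URL per line:

#!/bin/bash
MAX_JOBS=10
while read -r url ; do
    # with 10 downloads already running, block until any one of them exits
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ] ; do
        wait -n
    done
    wget -nv "$url" >/dev/null &
done < urls.txt
wait    # pick up the stragglers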


1
2018-03-16 15:19





You could use something like puf, which is designed for exactly that sort of thing, or you could combine wget/curl/lynx with GNU parallel.


1
2018-03-16 15:20



With GNU parallel it would look like this: cat urlfile | parallel -j50 wget - Ole Tange
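
The useless cat can be dropped here as well, since GNU parallel reads its arguments from stdin:

parallel -j50 wget <urlfile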


puf (http://puf.sourceforge.net/) does this sort of thing "for a living", and gives a nice running status of the whole process.
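
Assuming puf's basic usage of taking URLs as command-line arguments, a URL file can be fed to it through xargs; a sketch, with urls.txt as an assumed file name:

xargs puf <urls.txt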


0
2018-03-16 15:20





I do stuff like this a lot. I suggest two scripts:
the parent only determines the appropriate loading factors and
launches a new child when there is
1. more work to do
2. headroom left under the various limits on load average or bandwidth

# my pref lang is tcsh, so this is just a rough approximation
# I think with just a few debug runs, this could work fine.

# presumes a file with one URL to download per line
#
WORKFILE=urls.txt   # the presumed URL list; the name is an assumption
NUMPARALLEL=4       # controls how many at once
# ^ tune the above number to control CPU and bandwidth load; you
# will not finish fastest by doing 100 at once.
# Wed Mar 16 08:35:30 PDT 2011 , dianevm at gmail

while : ; do
    WORKLEFT=`wc -l < $WORKFILE`
    if [ $WORKLEFT -eq 0 ]; then
        echo finished | write sysadmin
        echo finished | Mail sysadmin
        exit 0
    fi
    NUMWORKERS=`ps auxwwf | grep WORKER | grep -v grep | wc -l`
    if [ $NUMWORKERS -lt $NUMPARALLEL ]; then   # time to fire off another one
        WORKTODO=`head -1 $WORKFILE`
        WORKER $WORKTODO &    # worker could just be wget "$1", ncftp, curl
        tail -n +2 $WORKFILE > TMP
        SECSEPOCH=`date +%s`
        mv $WORKFILE $WORKFILE.$SECSEPOCH   # keep a timestamped copy of the old list
        mv TMP $WORKFILE
    else   # we have NUMWORKERS or more running
        sleep 5   # suggest this be close to ~1/4 of the script's run time
    fi
done
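
For completeness, the WORKER referenced above could be the trivial wrapper the comment suggests; a hypothetical sketch (the script name WORKER is this answer's placeholder):

#!/bin/bash
# WORKER: quietly fetch the single URL passed as $1
wget -nv "$1" >/dev/null 2>&1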

0
2018-03-16 15:38



Oh, also: unless you have separate ISPs, per-connection bandwidth caps, or something like that, you generally won't download any faster by doing it in parallel. - dianevm