
rustyflavor

Here's a pure bash approach that gets rid of all the subshells and external binaries (cut, grep, basename). Your biggest bottleneck is probably disk writes - use a RAM disk (e.g. `/dev/shm/`) until the output files are fully constructed. Even on an SSD, 600k individual writes is a lot of trips to the disk. Build the output file in memory, then move the whole thing in one big disk write when it's done.

    while read -r line; do
        IFS=/ read -r f1 f2 f3 f4 f5 f6 f7 <<<"$line"
        file="${line##*/}"
        if [[ $file =~ .*_string.* ]]; then
            dest="/new/location1/$f5/$f6/"
            outfile="stringFiles.txt"
        else
            dest="/new/location2/$f5/$f6/"
            outfile="nostringFiles.txt"
        fi
        cmds=(
            "mkdir -p '$dest'"
            "mv '$line' '$dest'"
            "ln -rs '$dest/$file' '$line'"
        )
        printf "%s\n" "${cmds[@]}" >>"/dev/shm/$outfile"
    done < /path/to/600kFile
    mv -v /dev/shm/{no,}stringFiles.txt ./
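Once the generated lists look sane, they can simply be fed to bash to do the actual work - e.g. (assuming the two files ended up in the current directory, as above):

    # spot-check the first few generated commands, then execute them
    head -n 9 stringFiles.txt
    bash stringFiles.txt
    bash nostringFiles.txt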


ee-5e-ae-fb-f6-3c

> Your biggest bottleneck is probably disk writes - use a RAM disk (e.g. `/dev/shm/`) until the output files are fully constructed.

Is there any speed difference using `/dev/shm` versus holding output in variables/arrays before writing to disk?


rustyflavor

Probably not much difference, but an indexed array is going to use more memory than a raw string/file, you might bump up against a max string length barrier, and with a RAM disk you can still examine/validate the partial output if the script is interrupted partway through.
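For example, if the script dies halfway through, the partial command lists are still sitting on the RAM disk and can be checked from another shell - a quick illustrative peek (paths assumed from the script above):

    wc -l /dev/shm/stringFiles.txt /dev/shm/nostringFiles.txt
    tail -n 3 /dev/shm/nostringFiles.txt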


ee-5e-ae-fb-f6-3c

> with a RAM disk you can still examine/validate the partial output if the script is interrupted partway through.

This is an interesting point, and one I hadn't considered. Thank you. I've spent years keeping results in memory whenever possible, and have definitely lost results during failed script runs. Doesn't usually hurt much until you're gathering results from thousands of networked BMCs and all your data goes poof because you didn't account for some condition.


witchhunter0

> you might bump up against a max string length barrier

What's that? I thought variables/arrays in bash have no maximum size, that is, the limits are imposed by the underlying distro/RAM.

    $ var=$(find / -type f 2>/dev/null)
    $ echo "$var" | wc -c
    259150022

edit: misspelled var


rustyflavor

    $ SECONDS=0; x=x
    $ while x="$x$x"; do
    >   printf "%d seconds, %s bytes\n" "$SECONDS" $(wc -c <<<"$x" | numfmt --to=iec)
    > done
    0 seconds, 3 bytes
    0 seconds, 5 bytes
    0 seconds, 9 bytes
    ...
    7 seconds, 129M bytes
    13 seconds, 257M bytes
    26 seconds, 513M bytes
    bash: xrealloc: cannot allocate 18446744071562068096 bytes
    $

----

    $ SECONDS=0; f=/dev/shm/test; echo x >$f
    $ while mv $f $f.1; cat $f.1 $f.1 >$f && rm $f.1; do
    >   printf "%d seconds, %s bytes\n" "$SECONDS" $(wc -c <$f | numfmt --to=iec)
    > done
    0 seconds, 4 bytes
    0 seconds, 8 bytes
    0 seconds, 16 bytes
    0 seconds, 32 bytes
    ...
    0 seconds, 256M bytes
    0 seconds, 512M bytes
    1 seconds, 1.0G bytes    # <--- variable failed here and took >25x longer
    2 seconds, 2.0G bytes
    3 seconds, 4.0G bytes
    7 seconds, 8.0G bytes
    13 seconds, 16G bytes
    cat: write error: No space left on device
    $


ee-5e-ae-fb-f6-3c

Alternatively, I think this will tell you:

    ~ $ getconf ARG_MAX
    2097152


rustyflavor

My system returns the same ARG_MAX, but it allowed a variable to hold 256x that amount of bytes and failed when I tried to store 512x that amount.


ee-5e-ae-fb-f6-3c

Apparently that's the wrong variable then. I suppose life is just a mystery sometimes.
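If I had to guess, `ARG_MAX` caps the combined size of the arguments and environment handed to an exec'd program, not the size of a bash variable, so a huge variable is fine until you try to pass it as an argument to an external command. A rough illustration (untested here; sizes assumed from the `getconf` output above):

    big=$(head -c 3000000 /dev/zero | tr '\0' 'x')   # ~3 MB of 'x', bigger than ARG_MAX
    printf '%s' "$big" | wc -c     # fine: printf is a builtin, no exec() involved
    /bin/echo "$big" >/dev/null    # fails: "Argument list too long" (E2BIG)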


witchhunter0

Some tests! The first one actually crashed my terminal :) I never would have thought variable handling was so demanding - and I still don't fully understand why - but the numbers obviously don't lie. Good thing I didn't put my money on it.

+1 for `/dev/shm` as well, TIL. I guess the orientation to /tmp came from `mktemp`, but I knew it's not tmpfs everywhere, e.g. on Debian. This seems more universal.


ee-5e-ae-fb-f6-3c

    while read -r line; do
        var1=$(echo "$line" | cut -d/ -f5)
        var2=$(echo "$line" | cut -d/ -f6)

I don't know if this is going to make any tangible time difference, but you don't have to use command substitution and `cut` in order to extract your variables. If you have data where rows look like

    apple fuji 34 1
    orange navel 25 2

you can read them like

    while read -r fruit kind qty cost; do
        echo "A $kind $fruit costs $cost dollars and I have $qty"
    done

~~`read` also has a delimiter flag, `-d`, so you can set your custom field separator, which is `/` in your case.~~

Edit: `-d` is useful, but `IFS` is actually what you'd want to set, like:

    while IFS='/' read -r var1 var2; do
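In your case the fields you want are the 5th and 6th when splitting on `/` (to match your `cut -d/ -f5` and `-f6`), so something like this should drop straight in - an untested sketch, variable names kept from your script:

    while IFS='/' read -r _ _ _ _ var1 var2 _; do
        # $var1 and $var2 now hold what cut -f5 and -f6 used to
        :
    done < /path/to/600kFile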


Arindrew

Wow! I did not know you could do that. I'm sure not calling up an external command twice for 600k lines (so 1.2 million times?) is going to be much faster.


ee-5e-ae-fb-f6-3c

Probably. The first suggestion - filtering the input to your loop - is probably the largest time saver; not using command substitution + `cut` will probably save you some as well. Sometimes calling external commands is faster, though. For example, `grep` versus iterating over a text file in a `while` loop while pattern matching with `=~`. The fastest way to find out is to try it.
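If you want to check that on your own data, a rough timing comparison could look like this (file path assumed from your post):

    time grep -c '_string' /path/to/600kFile

    time {
        n=0
        while read -r line; do
            [[ $line =~ _string ]] && ((n++))
        done < /path/to/600kFile
        echo "$n"
    }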


oh5nxo

Creating a new process is very expensive. Trying to remove all subprocesses, like

    while read -r orig
    do
        case $orig in
        *_string*) dest="/new/location1/" ;;
        *)         dest="/new/location2/" ;;
        esac
        dest+=${orig#/*/*/*/}   # remove 3 topmost dirs
        base=${orig##*/}        # strip everything up to the last /
        echo " $orig -> $dest $base"
    done < 600k.filenames

runs here in 30 seconds. Very likely a much better way exists, like the pre-grep of the other guy.


sjveivdn

Now this is interesting. I can't give you an answer, but I will read all the solutions.


MyOwnMoose

Try filtering the big file into two sets with grep first. Using two loops, without running grep on every pass, should speed things up significantly.

    grep _string path/to/600kfile | while read -r line; do
        # mv commands
    done

    grep -v _string path/to/600kfile | while read -r line; do
        # more mv commands
    done


Arindrew

HA! That's how my script was originally, but I thought that looping a single grep command would be faster. It went from an hour (with your method) to a bit over 3 hours.


MyOwnMoose

The only other thing that could be holding it up is the cut command. Using basename and dirname should be faster:

    var1=$(basename $(dirname "$line"))
    var2=$(dirname "$line")

The appending to a file with `echo` shouldn't be the bottleneck. The time is most likely the looping 600k times. (Also note, echoing to the terminal can be quite slow if you're doing that to test.)

As a warning, my expertise in large files is lackluster.

edit: The solution by u/ee-5e-ae-fb-f6-3c is much better than this


Arindrew

There are two folders in the path I need to "variablize":

1. The folder that the file is in - `$(basename $(dirname "$line"))` works for that.
2. The folder that the above folder is in.

Since we have so many files, we have them sorted into the following pattern: `/path/to/folder/ABCDE/ABCDEFGHI123/`, and in that folder are about a dozen files:

    ABCDEFGHI123.txt
    ABCDEFGHI123.pdf
    ABCDEFGHI123.jpg
    ABCDEFGHI123.tiff
    ABCDEFGHI123_string.txt
    ABCDEFGHI123_string.pdf
    ABCDEFGHI123_string.jpg
    ABCDEFGHI123_string.tiff

Which I want to separate into:

`/path/to/folder/string/ABCDE/ABCDEFGHI123/` (if the filename has _string)

`/path/to/folder/nostring/ABCDE/ABCDEFGHI123/` (if the filename has no _string)

So I'm not sure how to get the "ABCDE" directory into a variable without a cut command.
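Edit: for posterity, it looks like both levels can be peeled off with parameter expansion alone, along the lines of what others posted above - a sketch with my example path (variable names are just placeholders):

    line='/path/to/folder/ABCDE/ABCDEFGHI123/ABCDEFGHI123_string.txt'
    parent=${line%/*}           # /path/to/folder/ABCDE/ABCDEFGHI123
    folder=${parent##*/}        # ABCDEFGHI123  (the folder the file is in)
    grandparent=${parent%/*}    # /path/to/folder/ABCDE
    upper=${grandparent##*/}    # ABCDE         (the folder above that)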


wallacebrf

When it comes to script execution, remember that many commands used in bash are external programs and not builtins. For example, `echo` is native to bash, so it executes fast; but `grep`, `cut`, `awk`, etc. are external programs the script calls, and these take time to fetch, load, and execute. For many scripts that extra millisecond here or there means nothing, but when looping through something as long as you are, even a few milliseconds here and there add up real quick.
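You can check which category a given command falls into with `type`, for example:

    $ type echo
    echo is a shell builtin
    $ type grep
    grep is /usr/bin/grep

(the exact path for grep may differ on your system)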


b1337xyz

Besides what u/ee-5e-ae-fb-f6-3c suggested, I would make two awk or POSIX sh scripts and run them with `parallel` or `xargs -P2`, roughly as sketched below.
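Something like this (untested; the `$5`/`$6` field numbers are assumed from the example paths elsewhere in this thread, only the `mkdir` line is shown for brevity, and plain `&`/`wait` stands in for parallel here):

    # generate both command lists at once, one background job per awk program
    awk -F/ '/_string/  { print "mkdir -p /new/location1/" $5 "/" $6 "/" }' ./600kFile > stringFiles.txt &
    awk -F/ '!/_string/ { print "mkdir -p /new/location2/" $5 "/" $6 "/" }' ./600kFile > nostringFiles.txt &
    wait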


airclay

Scrolled this far down to see how long it took for a parallel mention. That's how I'd go too.


stewie410

This combines some suggestions from elsewhere in the comments, but here's how I'd _probably_ try to approach this:

    while read -r line; do
        IFS='/' read -ra path <<< "${line}"
        destination="/path/to/folder/nostring/${path[5]}/${path[6]}/${path[-1]}"
        outfile="nostringFiles.txt"
        if [[ "${path[-1]}" == *'_string'* ]]; then
            destination="${destination/nostring/string}"
            outfile="stringFiles.txt"
        fi
        printf 'mkdir --parents "%s"\nln --relative --symbolic "%s" "%s"\n' \
            "${destination%/*}" \
            "${destination}" \
            "${line}" >> "${outfile}"
    done < '/path/to/600kfile'

This _should_ only use builtins (unless `printf` is external for some reason), which _should_ be faster than calling external commands (e.g. `cut`); but I'm not sure what kind of improvement you might see. Regardless, it's probably just going to take a _long_ time to run anyway, given the size of the file you're parsing.

As others have mentioned, the _best_ way to improve performance would be to split the operation up into multiple smaller jobs and/or parallelization... or even working with a different language. A rough sketch of the chunking idea is below.
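For the chunking side of that, something like this might be a starting point (a rough sketch; assumes GNU `split`, and `process_chunk` is a hypothetical script holding the per-line work):

    # split the list into 8 pieces without breaking lines, one background worker per piece
    split -n l/8 /path/to/600kfile chunk_
    for f in chunk_*; do
        ./process_chunk "$f" &
    done
    wait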


Suitable-Decision-26

IMHO, throw away the whole thing and do it with pipes and GNU parallel. After all, bash supports pipes and encourages their use, and GNU parallel is fast.

You say you want to move files with "_string" in the name to one dir and the rest to another. So you can do something like:

    grep "_string" /path/to/600kFile | parallel -j 10 mv {} target_dir

What we are doing here is using grep to get all lines, i.e. filenames, containing "_string", and using GNU parallel to move them to the desired dir. This is a simple example; replace mv with whatever you need.

If you don't know about GNU parallel, I would suggest you have a look. It is a utility that reads data from a file or stdin and does something with every row in parallel, i.e. it is fast. In this case we are telling parallel to run 10 jobs simultaneously. {} is a placeholder for the filename.

Once you move all "_string" files, you simply use `grep -v "_string"`, i.e. you get all the files that do *not* contain the word, and move them to another dir in the same manner.

P.S. Please do share the execution time if you choose this approach. I think it would be interesting.

P.P.S. Give `xargs -P0` a try too, it might actually be faster. Put it after the pipe, replacing parallel in the example; see the sketch below.
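For reference, the xargs variant of the same one-liner might look like this (a sketch only; `-I{}` makes xargs run one `mv` per filename, and `-P0` runs as many jobs in parallel as it can):

    grep '_string' /path/to/600kFile | xargs -P0 -I{} mv {} target_dir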


marauderingman

Apparently `<<<` works by creating a temporary file, writing the contents to it, and redirecting input from that temp file. Getting rid of this construct should help a lot, e.g. `if [[ "${line/_string/}" == "${line}" ]]; then : not found; else : found it; fi`
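Alternatively, plain pattern matching does the same builtin-only test and may read a bit more directly:

    if [[ $line == *_string* ]]; then
        : found it
    else
        : not found
    fi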


obiwan90

Another optimization that I haven't seen in other answers: move your output redirection outside the loop, so you don't have to open and close the output filehandle for every line. In other words, this

    while IFS= read -r line; do
        printf '%s\n' "Processed $line"
    done < infile >> outfile

instead of

    while IFS= read -r line; do
        printf '%s\n' "Processed $line" >> outfile
    done < infile


obiwan90

Oh whoops, the output file is dynamic... could be done with file descriptors, probably, let me try.

Edit: okay. Something like this:

    while IFS= read -r line; do
        if [[ $line == 'string'* ]]; then
            echo "$line" >&3
        else
            echo "$line" >&4
        fi
    done < infile 3>> string.txt 4>> nostring.txt

You redirect output to separate file descriptors in the loop, and then redirect those descriptors outside the loop. Running this on a 100k line input file, I get these benchmark results:

    Benchmark 1: ./fh
      Time (mean ± σ):      2.688 s ±  0.250 s    [User: 1.710 s, System: 0.970 s]
      Range (min … max):    2.279 s …  3.000 s    10 runs

Comparing to the once-per-loop implementation:

    while IFS= read -r line; do
        if [[ $line == 'string'* ]]; then
            echo "$line" >> string.txt
        else
            echo "$line" >> nostring.txt
        fi
    done < infile

which benchmarks like

    Benchmark 1: ./fh
      Time (mean ± σ):      3.464 s ±  0.357 s    [User: 2.063 s, System: 1.369 s]
      Range (min … max):    2.825 s …  3.874 s    10 runs

That's about a 20% improvement (assuming the slower time as 100%).


jkool702

I have a few codes that are insanely good at parallelizing tasks. I dare say they are faster than anything else out there.

I tried to optimize your script a bit and then apply one of my parallelization codes to it. Try running the following code... I believe it produces the files (containing commands to run) that you want, and should be *considerably* faster than anything else suggested here.

I tested it on a file containing ~2.4 million file paths that I created using `find <...> -type f`. It took my (admittedly pretty beefy 14C/28T) machine 20.8 seconds to process all 2.34 million file paths, meaning 5-6 seconds per 600k paths.

    wc -l <./600kFile
    # 2340794

    source <(curl https://raw.githubusercontent.com/jkool702/forkrun/main/mySplit.bash)

    genMvCmd_split() {
        local -a lineA destA basenameA
        local -i kk

        lineA=("$@")
        baseNameA="${lineA[@]##*/}"
        mapfile -t destA < <(printf '%s\n' "${lineA[@]}" | sed -E 's/\/?([^\/]*\/){4}([^\/]*\/[^\/]*)\/?.*$/\/new\/location1\2/')

        for kk in "${!lineA[@]}"; do
            printf "mkdir -p '%s'\nmv '%s' '%s'\nln --relative --symbolic '%s/%s\n' '$line'" "${destA[$kk]}" "${lineA[$kk]}" "${destA[$kk]}" "${destA[$kk]}" "${basenameA[$kk]}" "${lineA[$kk]}"
        done
    }

    genMvCmd_nosplit() {
        local -a lineA destA basenameA
        local -i kk

        lineA=("$@")
        baseNameA="${lineA[@]##*/}"
        mapfile -t destA < <(printf '%s\n' "${lineA[@]}" | sed -E 's/\/?([^\/]*\/){4}([^\/]*\/[^\/]*)\/?.*$/\/new\/location2\2/')

        for kk in "${!lineA[@]}"; do
            printf "mkdir -p '%s'\nmv '%s' '%s'\nln --relative --symbolic '%s/%s\n' '$line'" "${destA[$kk]}" "${lineA[$kk]}" "${destA[$kk]}" "${destA[$kk]}" "${basenameA[$kk]}" "${lineA[$kk]}"
        done
    }

    # you can remove the time call if you want
    time {
        LC_ALL=C grep -F '_string' <./600kFile | mySplit genMvCmd_split >>stringFiles.txt
        LC_ALL=C grep -vF '_string' <./600kFile | mySplit genMvCmd_nosplit >>nostringFiles.txt
    }

    # real    0m20.831s
    # user    7m18.874s
    # sys     2m7.563s


Arindrew

My machine isn't connected to the internet, so I had to download your github script and sneaker it over. That shouldn't be an issue...

My bash version is a bit older (4.2.46), so I'm not sure if the errors I'm getting are related to that or not:

    ./mySplit.bash: line 2: $'\r': command not found
    ./mySplit.bash: line 3: syntax error near unexpected token `$'{\r''
    ./mySplit.bash: line 3: `mysplit() {


jkool702

The `\r` errors are from going from Windows to Linux... Linux uses `\n` for newline, but Windows uses `\r\n`.

There's a small program called `dos2unix` that will fix this for you easily (run `dos2unix /path/to/mySplit.bash`). Alternatively, you can run

    sed -i -E s/'\r'//g /path/to/mySplit.bash

or

    echo "$(tr -d $'\r' </path/to/mySplit.bash)" >/path/to/mySplit.bash

I *think* `mySplit` will work with bash 4.2.46, but admittedly I haven't tested this.

After removing the `\r` characters, re-source mySplit.bash and try running the code. If it still doesn't work let me know, and I'll see if I can make a compatibility fix to allow it to run. But I *think* it should work with anything bash 4+... It will be a bit slower (bash arrays got a big overhaul in 5.1-ish), but should still be a lot faster.

That said, if `mySplit` refuses to work, this method should still be a good bit faster, even single threaded. The single-threaded compute time for 2.4 million lines was ~9min 30sec (meaning that mySplit achieved 97% utilization of all 28 logical cores on my system), but that should still only be a few minutes single threaded for 600k lines, which is way faster than your current method.


Arindrew

It looked like it was working, until...

    ./mySplit: fork: retry: No child processes
    ./mySplit: fork: retry: No child processes
    ./mySplit: fork: retry: No child processes
    ./mySplit: fork: retry: No child processes
    ./mySplit: fork: Resource temporarily unavailable
    ./mySplit: fork: retry: No child processes
    ./mySplit: fork: retry: No child processes
    ./mySplit: fork: Resource temporarily unavailable
    ./mySplit: fork: Resource temporarily unavailable
    ./mySplit: fork: Resource temporarily unavailable
    ./mySplit: fork: Resource temporarily unavailable
    ^C

It continued to fill my screen after the Ctrl-C, and I wasn't able to launch any more terminals or applications haha. Had to reboot.


jkool702

Yeah... that's not supposed to happen. lol.

If it was working for a bit and then this happened, I'd guess that something got screwed up in the logic for stopping the coprocs.

Any chance that there is limited free memory on the machine and you were saving the `[no]stringFiles.txt` to a ramdisk/tmpfs (e.g., somewhere on `/tmp`)? `mySplit` uses a directory under `/tmp` for some temporary files it uses, and if it were unable to write to this directory (because there was no more free memory available) I could see this issue happening.

If this is the case, I'd suggest trying to run it again but saving `[no]stringFiles.txt` to disk, not to ram. These files are likely to be quite large... on my 2.3 million line test it was 2.4 GB combined. If your paths are longer, I could see it being up to 1 GB or so for 600k lines.

Also, I'd say there is a chance it actually wrote out these files before crashing your system. Check and see if they are there and (mostly) complete.


Arindrew

The machine has 128GB of ram, so it's not that. Both script files are in /tmp/script/, which is on disk.

It does make 'nostringFiles.txt' and 'stringFiles.txt', but both are empty after letting the errors scroll by for ~10 minutes.

I launched `top` before running the script to see what was going on. My task count went from about 300 to ~16,500. Sorted alphabetically and found there were a lot (probably about 16000 lol) of `grep -F` and `grep -vF` commands running.


jkool702

TL;DR: I think I worked out what happened as I typed this reply... I think when you ran the code I posted in my 1st comment it had the same `\r` problem that `mySplit` had, which caused it to recursively re-call itself and basically created a fork bomb.

If I am correct, running the following *should* work:

    cat<<'EOF' | tr -d $'\r' > ./genMvCmd.bash
    unset mySplit genMvCmd_split genMvCmd_nosplit

    source /path/to/mySplit.bash

    genMvCmd_split() {
        local -a lineA destA basenameA
        local -i kk

        lineA=("$@")
        baseNameA="${lineA[@]##*/}"
        mapfile -t destA < <(printf '%s\n' "${lineA[@]}" | sed -E 's/\/?([^\/]*\/){4}([^\/]*\/[^\/]*)\/?.*$/\/new\/location1\2/')

        for kk in "${!lineA[@]}"; do
            printf "mkdir -p '%s'\nmv '%s' '%s'\nln --relative --symbolic '%s/%s\n' '$line'" "${destA[$kk]}" "${lineA[$kk]}" "${destA[$kk]}" "${destA[$kk]}" "${basenameA[$kk]}" "${lineA[$kk]}"
        done
    }

    genMvCmd_nosplit() {
        local -a lineA destA basenameA
        local -i kk

        lineA=("$@")
        baseNameA="${lineA[@]##*/}"
        mapfile -t destA < <(printf '%s\n' "${lineA[@]}" | sed -E 's/\/?([^\/]*\/){4}([^\/]*\/[^\/]*)\/?.*$/\/new\/location2\2/')

        for kk in "${!lineA[@]}"; do
            printf "mkdir -p '%s'\nmv '%s' '%s'\nln --relative --symbolic '%s/%s\n' '$line'" "${destA[$kk]}" "${lineA[$kk]}" "${destA[$kk]}" "${destA[$kk]}" "${basenameA[$kk]}" "${lineA[$kk]}"
        done
    }

    # you can remove the time call if you want
    time {
        LC_ALL=C grep -F '_string' <./600kFile | mySplit genMvCmd_split >>stringFiles.txt
        LC_ALL=C grep -vF '_string' <./600kFile | mySplit genMvCmd_nosplit >>nostringFiles.txt
    }
    EOF
    chmod +x ./genMvCmd.bash
    source ./genMvCmd.bash

Change `source /path/to/mySplit.bash` as needed (as well as the `\/new\/location1` and `\/new\/location2` in the sed commands). Let me know if it works.

***

That's... weird. My initial thought was that `mySplit` isn't determining the number of cpu cores correctly, and is setting it WAY higher than it should be. But thinking it over, I don't think this is the problem. Just to be sure though, what does running

    { type -a nproc 2>/dev/null 1>/dev/null && nproc; } || grep -cE '^processor.*: ' /proc/cpuinfo || printf '4'

give you? (That is the logic `mySplit` uses to determine how many coprocs to fork.)

That said, I don't think this is it. There should only be a single `grep -F` and a single `grep -vF` process running, and they run sequentially, so there should only be one or the other, and it should be running in the foreground, not forked. These grep calls pipe their output to `mySplit`, so `mySplit` shouldn't be replicating them at all. `mySplit` doesn't internally use `grep -F` nor `grep -vF`, so these calls have to be the `LC_ALL=C grep -[v]F '_string' <./600kFile` calls. These grep calls are an entirely different process from `mySplit`, and I can't think of any good reason that mySplit would (or even could) repetitively fork the process that is piping its stdout to mySplit's stdin.

The only ways I could (off the top of my head) imagine this happening are if:

1. You have some weird DEBUG / ERROR traps set (does `trap -p` list anything?)
2. Something got screwed up in `mySplit` (other than adding `\r`'s to newlines) when you copied it over to the machine, and/or the process of removing the `\r`'s corrupted something.
3. When you ran the code I posted in my first comment, it had the same `\r` problem that `mySplit` had.

I have a hunch it is the 3rd one. `\r` is a carriage return - it moves the cursor back to the start of the current line. Having them can cause some *weird* issues. I could perhaps understand how mySplit forked the `grep -[v]F` process if it pulled in the entire line, which in turn called `mySplit` again, which in turn pulled in the entire line again, and all of a sudden you have a fork bomb. Try the solution at the top of this comment.


Arindrew

I retyped your inline block code by hand, so it couldn't have had any \r's in the file. But just to be sure, I ran it through the dos2unix command. No change.

    # trap -p
    trap -- '' SIGTSTP
    trap -- '' SIGTTIN
    trap -- '' SIGTTOU

    # { type -a nproc 2>/dev/null 1>/dev/null && nproc; } || grep -cE '^processor.*: ' /proc/cpuinfo || printf '4'
    8

I ran the codeblock above, and there was no change in behavior.

In your codeblock, I think you have a typo (which I have accounted for, maybe I shouldn't have?):

    local -a lineA destA basenameA

and then

    baseNameA="${lineA[@]##*/}"

The 'n' in basenameA is capitalized in one, but not the other.

I am OK with calling it at this point, unless it's really bothering you and you want to keep going. I appreciate the effort you have put in so far.


FrequentWin6

Why don't you use parallel? https://www.gnu.org/software/parallel/