Performance

Disk IO

Should files be read from multiple threads, even when the disk is the bottleneck? By having multiple concurrent reads, the operating system might be able to optimize the disk access pattern, and schedule reads more efficiently for higher throughput. Let’s measure.
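
The multi-threaded variant measured here is essentially a pool of worker threads that each pull the next path from a shared source and read that file. Below is a minimal sketch of that shape, using std threads and a channel; index_file and index_concurrently are hypothetical names standing in for the actual indexing code, and path discovery is simplified to an existing vec of paths.

use std::path::{Path, PathBuf};
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

// Placeholder for the per-file work: the real program opens the file and
// reads its tags.
fn index_file(path: &Path) {
    let _ = std::fs::read(path);
}

fn index_concurrently(paths: Vec<PathBuf>, num_threads: usize) {
    let (tx, rx) = mpsc::channel::<PathBuf>();
    // The std receiver cannot be shared between threads directly,
    // so wrap it in a mutex.
    let rx = Arc::new(Mutex::new(rx));

    let workers: Vec<_> = (0..num_threads).map(|_| {
        let rx = Arc::clone(&rx);
        thread::spawn(move || loop {
            // Take the next path; stop once the sender is gone and the
            // channel is drained. The lock is released before reading.
            let path = match rx.lock().unwrap().recv() {
                Ok(p) => p,
                Err(_) => break,
            };
            index_file(&path);
        })
    }).collect();

    for path in paths {
        tx.send(path).unwrap();
    }
    // Dropping the sender closes the channel, so the workers exit.
    drop(tx);

    for worker in workers {
        worker.join().unwrap();
    }
}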

Disk cache  Threads  Time (seconds)
Cold        64       106.632476927
Cold        64       106.155341479
Cold        64       104.968957864
Warm        64       0.065452067
Warm        64       0.065966143
Warm        64       0.067338459
Cold        6        109.032390370
Cold        6        108.156613210
Cold        6        110.175107966
Warm        6        0.056552910
Warm        6        0.051793717
Warm        6        0.057326269
Warm        6        0.056153033
Cold        1        131.265989187
Cold        1        130.512200200
Cold        1        130.496186066
Warm        1        0.145899503
Warm        1        0.140550669
Warm        1        0.140376767
Warm        1        0.146533344

This is for roughly 11500 files; program output was redirected to /dev/null. The single-threaded numbers are from commit 4a5982ceb94b6a3dc575abce3c47e148dd28aa9f, the multi-threaded numbers from commit cc06a48af7c8ea8b8b647443db1a26f77374f9e4.

Conclusion: multi-threaded ingestion is advantageous, both when indexing from disk and when indexing from memory (the warm-cache case), so there must be a CPU-bound part as well. (On my system, for my workload, that is.) The next question is how many threads to use. 64 threads probably already captures all of the parallel gains, and the returns diminish quickly. It could be worth optimizing the number of threads for running time with a warm disk cache; that setting would likely also be close to optimal for a cold cache.

Some more results after reducing the thread queue size and doing non-blocking pushes, to keep the queue sizes more even:

Disk cache  Threads  Queue size  Time (seconds)
Cold        128      16          105.591609927
Cold        512      0           97.509055644
Cold        512      0           96.345510293
Cold        128      1           94.403741744
Cold        128      0           85.897972147
Cold        64       0           82.595254011
Cold        64       0           83.793832797
Cold        48       0           80.877349368
Cold        32       0           80.913407455
Cold        24       0           82.893433723
Cold        16       0           83.807142608
Cold        16       0           83.967152892
Warm        128      16          0.075636796
Warm        128      1           0.072041480
Warm        128      0           0.075571860
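
The non-blocking push measured above follows, roughly, this shape: every worker gets its own small bounded queue, and when a queue is full the producer moves on to the next worker's queue instead of blocking, so no single queue runs far ahead of the others. The sketch below works under that assumption, using the SyncSender half of std's mpsc::sync_channel; distribute is a hypothetical name, and the program's own queue type, sizes, and fallback behaviour may differ.

use std::path::PathBuf;
use std::sync::mpsc::{SyncSender, TrySendError};

// Distribute paths over per-worker bounded queues with non-blocking pushes.
fn distribute(paths: impl Iterator<Item = PathBuf>, senders: &[SyncSender<PathBuf>]) {
    let n = senders.len();
    let mut target = 0;
    for path in paths {
        let mut item = path;
        let mut attempts = 0;
        loop {
            match senders[target % n].try_send(item) {
                Ok(()) => break,
                Err(TrySendError::Full(p)) if attempts + 1 < n => {
                    // This queue is full; try the next worker instead of
                    // blocking, which keeps the queue lengths more even.
                    item = p;
                    target += 1;
                    attempts += 1;
                }
                Err(TrySendError::Full(p)) => {
                    // Every queue is full; fall back to a blocking send.
                    if senders[target % n].send(p).is_err() {
                        return;
                    }
                    break;
                }
                Err(TrySendError::Disconnected(_)) => return,
            }
        }
        target += 1;
    }
}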

And without queues or channels, protecting the directory iterator with a mutex instead:

Disk cache  Threads  Time (seconds)
Cold        48       83.731602753
Cold        48       83.806947689
Cold        24       81.919455988
Cold        24       80.765494864
Cold        12       82.537088779
Cold        12       83.135829488
Warm        48       0.056744610
Warm        24       0.059594100
Warm        24       0.054264233
Warm        12       0.056491306
Warm        12       0.056685518
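
The queue-less variant shares the directory iterator itself between the worker threads, protected by a mutex; each thread locks it just long enough to pull the next entry. A sketch of that pattern with walkdir, again with hypothetical names (index_file, index_with_shared_iter) standing in for the real code:

use std::path::Path;
use std::sync::Mutex;
use std::thread;

use walkdir::WalkDir;

// Placeholder for the per-file work, as in the earlier sketch.
fn index_file(path: &Path) {
    let _ = std::fs::read(path);
}

fn index_with_shared_iter(dir: &Path, num_threads: usize) {
    // The walkdir iterator itself is the shared work source.
    let iter = Mutex::new(WalkDir::new(dir).into_iter());

    thread::scope(|s| {
        for _ in 0..num_threads {
            s.spawn(|| loop {
                // Hold the lock only long enough to pull the next entry.
                let entry = match iter.lock().unwrap().next() {
                    Some(Ok(entry)) => entry,
                    Some(Err(_)) => continue,
                    None => break,
                };
                if entry.file_type().is_file() {
                    index_file(entry.path());
                }
            });
        }
    });
}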

Precollect

These measurements are at commit c6c611be9179d939dc5646dc43ab8bdf5ddc2962, with 24 threads. The program first collects the discovered paths into a vec, and then constructs the index by iterating over the paths in that vec. Is this the right thing to do, or should we put the paths iterator in a mutex directly? Measurement setup:

echo 3 | sudo tee /proc/sys/vm/drop_caches
perf stat target/release/musium ~/music

Note that the server was disabled, so the program terminates after indexing. Also, these results are not comparable to the previous numbers, as the library has grown and more data is processed. Furthermore, I did not redirect stdout to /dev/null in this case, but for a cold disk cache that does not make much of a difference anyway.
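
For reference, the vec precollect side of the comparison looks roughly like the sketch below (collect_paths is a hypothetical name, and error handling is simplified); the extra copy of the path is what the copy-versus-no-copy measurement further down is about.

use std::path::{Path, PathBuf};

use walkdir::WalkDir;

// Walk the library and collect every file path up front ("vec precollect").
fn collect_paths(dir: &Path) -> Vec<PathBuf> {
    WalkDir::new(dir)
        .into_iter()
        .filter_map(|entry| entry.ok())
        .filter(|entry| entry.file_type().is_file())
        // entry.path() borrows, so to_path_buf() copies the path; avoiding
        // this copy is the "single alloc" case discussed below.
        .map(|entry| entry.path().to_path_buf())
        .collect()
}

The alternative posed by the question above is to skip the vec and let the worker threads pull from the walkdir iterator behind a mutex directly, as in the earlier mutex sketch.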

Precollect          Time (seconds)
Vec precollect 1    91.870704962
Vec precollect 1    90.106878818
Vec precollect 1    90.031705480
Vec precollect 2    86.926306901
Vec precollect 2    86.876997701
Vec precollect 2    89.131675265
Iter, double alloc  93.370680604
Iter, double alloc  93.180283609
Iter, double alloc  93.259494622
Iter, single alloc  94.026253229
Iter, single alloc  94.147137607
Iter, single alloc  94.352803977

Note that I upgraded Walkdir when switching from vec precollect to the iterator-based version, so the comparison may be unfair. The data collected before the switch is labelled “Vec precollect 1”; the version after upgrading to Walkdir 2.1.4 is labelled “Vec precollect 2”. Furthermore, Walkdir 2.1.4 requires copying the path (labelled “double alloc”). I made a small change to the crate to be able to avoid the copy and extra allocation (labelled “single alloc”).

Counterintuitively, copying the path returned by the iterator is faster than not copying it. It might have something to do with ordering; perhaps spending more time in the iterator lock is actually a good thing? Or maybe I should collect more data, and this is just a statistical fluctuation. Just storing the paths, without processing them, is definitely faster when the copy is avoided:

copy    <- c(0.023223363, 0.022365082, 0.022318216, 0.022584837,
             0.020660742, 0.023839308, 0.022084252, 0.021812114,
             0.022180668, 0.019982074, 0.020979151, 0.023186709,
             0.024758619, 0.022889618, 0.024148854, 0.024708654)
noncopy <- c(0.022403112, 0.021863389, 0.019650964, 0.020984869,
             0.021901483, 0.021376926, 0.021668108, 0.021504715,
             0.023730031, 0.021861766, 0.021060567, 0.021986531,
             0.022680138, 0.019719019, 0.020053399, 0.021137137)
t.test(copy, noncopy)

#     Welch Two Sample t-test
#
# data:  copy and noncopy
# t = 2.6055, df = 28.297, p-value = 0.01447
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#  0.000242829 0.002024684
# sample estimates:
#  mean of x  mean of y
# 0.02260764 0.02147388

So it is preferable to read many paths at once before processing them, perhaps due to better branch prediction. The gains are so big that the extra allocations and reallocations for storing the PathBufs in a vec are totally worth it. It might be even better to alternate between scanning paths and processing them, to reduce peak memory usage, but let’s not worry about that at this point.

Fadvise

Command:

echo 3 | sudo tee /proc/sys/vm/drop_caches
perf stat target/release/musium cache /pool/music /pool/volatile/covers dummy

Measurements were performed with the disks spinning. If the disks needed to spin up first, I restarted the measurement as soon as they were spinning.
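
The experiments below vary when files are opened and in what order they are read. The fadvise idea is to tell the kernel ahead of time which files are going to be read, so it can schedule the disk access itself. The following is a sketch of such a readahead hint via the libc crate (open_with_readahead is a hypothetical name); it shows the general technique, not necessarily the exact calls made in the commits referenced below.

use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;
use std::path::Path;

// Open a file and hint the kernel that all of it will be read soon, so it
// can schedule the disk reads ahead of the actual read calls.
fn open_with_readahead(path: &Path) -> io::Result<File> {
    let file = File::open(path)?;
    // Offset 0 with length 0 means "from the start to the end of the file".
    let result = unsafe {
        libc::posix_fadvise(file.as_raw_fd(), 0, 0, libc::POSIX_FADV_WILLNEED)
    };
    if result != 0 {
        // posix_fadvise returns the error number directly, not via errno.
        return Err(io::Error::from_raw_os_error(result));
    }
    Ok(file)
}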

Baseline, commit bcb01aac03b72c6250823d44d2b4dd71887e387c:

Disk cache  Tracks  Wall time (seconds)  User time (seconds)  Sys time (seconds)
Cold        15931   142.662233261        3.283129000          8.579975000
Cold        15931   147.348811539        3.236058000          8.641414000
Cold        15931   145.916103563        3.376106000          8.547039000
Warm        15931   0.346267741          0.987189000          0.427480000
Warm        15931   0.369951824          0.886352000          0.523628000
Warm        15931   0.372806305          0.929290000          0.480558000

Open files first, read later, commit 0f2d00be7ef2009fe19af79ae02ac29d11c766cf:

Disk cache  Tracks  Wall time (seconds)  User time (seconds)  Sys time (seconds)
Cold        15931   200.334320084        4.513103000          10.766578000
Warm        15931   0.835945466          2.131593000          2.420703000

Use “frontier” read pattern, commit 64371ff0aa834add77185531bae7160cfd6134ad:

Disk cache  Tracks  Wall time (seconds)  User time (seconds)  Sys time (seconds)
Cold        15931   148.444013742        4.524398000          10.423234000
Cold        15931   147.144940804        4.670934000          10.321421000
Warm        15931   1.134759625          2.831797000          4.271261000
Warm        15931   1.204304762          3.183732000          4.562911000

After changing the IO queue size (and also tuning the internal queue a bit), this could be brought down to 93 seconds, which suggests that the win is really more in the IO access pattern, and that for the warm case, simpler is probably better.

$ cat /sys/block/sd{b,c,d}/queue/nr_requests
4
4
4
$ echo 2048 | sudo tee /sys/block/sd{b,c,d}/queue/nr_requests
2048

Just regular blocking IO again, but tuning the disk queue size, commit a1ac4d2100f6d0bd61f0be99fc31b7da260be6af:

Disk cache  Tracks  nr_requests  Threads  Wall time (seconds)  User time (seconds)  Sys time (seconds)
Cold        15931   2048         24       91.411324389         3.086841000          6.717432000
Cold        15931   2048         24       93.145262004         3.103839000          7.002719000
Cold        15931   2048         128      72.864833181         2.616870000          5.983099000
Cold        15931   2048         128      73.701789903         2.642680000          5.953727000
Cold        15931   2048         256      72.805136103         2.503932000          5.793247000
Cold        15931   2048         256      72.108359036         2.625222000          5.811989000
Warm        15931   2048         24       0.320772413          0.952222000          0.421805000
Warm        15931   2048         24       0.359001663          0.986530000          0.390287000
Warm        15931   2048         128      0.370368452          0.927427000          0.480482000
Warm        15931   2048         128      0.454763022          0.930003000          0.776343000
Warm        15931   2048         256      0.410941427          0.922298000          0.711005000
Warm        15931   2048         256      0.365117437          0.928634000          0.472226000