Python – Are Threads or Processes more appropriate for disk I/O-bound operations?

For a task that involves reading and writing a lot of data to and from a lot of files, would threads or processes improve performance?

Currently I’m delegating the workload to a series of worker processes using the multiprocessing module, but I got to thinking: this isn’t a CPU-heavy operation, so could I do the same with threads, or would the GIL get in the way? Managing subprocesses is kind of a pain, so keeping it all in one process would ease administration.

I checked around Google and Stack Overflow and couldn’t find a coherent answer that actually addresses the question, so I wrote a script to figure it out for myself. It writes 10,000,000 lines to 10 files at the same time. Just change file_location to somewhere with lots of space (I left it as the Mongo default directory just in case).
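The script itself isn’t reproduced here, but the harness is essentially the following sketch. The path in file_location, the line format, and the per-file line count are assumptions (and scaled down here); the idea is simply to launch one writer per file as either a Thread or a Process and time the whole batch.

```python
# Hypothetical reconstruction of the benchmark harness: one writer per
# file, run once with threads and once with processes, timing each batch.
import os
import time
from threading import Thread
from multiprocessing import Process

file_location = "/tmp/iotest"  # assumption: point this at a disk with space
num_files = 10
lines_per_file = 100000        # scaled down; the post wrote far more lines

if not os.path.isdir(file_location):
    os.makedirs(file_location)

def write_lines(path, n):
    # Worker: write n lines to its own file.
    with open(path, "w") as f:
        for i in range(n):
            f.write("this is line %d\n" % i)

def run(worker_cls, label):
    # Time num_files concurrent workers (worker_cls is Thread or Process).
    workers = [
        worker_cls(target=write_lines,
                   args=(os.path.join(file_location, "%s_%d.txt" % (label, i)),
                         lines_per_file))
        for i in range(num_files)
    ]
    start = time.time()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.time() - start

if __name__ == "__main__":
    print("threads:   %.2f s" % run(Thread, "thread"))
    print("processes: %.2f s" % run(Process, "process"))
```

Thread and Process expose the same start/join interface, which is what makes the two runs directly comparable.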

Testing hardware was Amazon EC2 m4.4xlarge, 64 GB RAM. General Purpose SSD (gp2).

(Results may differ on spinning disks; I don’t have any to test on!)

PyPy 5.6.0 (Python 2.7.12)

Python 2.7.12 (aff251e543859ce4508159dd9f1a82a2f553de00, Nov 12 2016, 08:50:18)
[PyPy 5.6.0 with GCC 6.2.0] on linux2

(PYPY) $ python iotest.py 
Starting Multithreaded ...
Multithreaded attempt 1 of 10 took 15.2312870026 seconds to write 10000000 lines to 10 files.
Multithreaded attempt 2 of 10 took 15.3182249069 seconds to write 10000000 lines to 10 files.
Multithreaded attempt 3 of 10 took 15.2006270885 seconds to write 10000000 lines to 10 files.
Multithreaded attempt 4 of 10 took 15.7974839211 seconds to write 10000000 lines to 10 files.
Multithreaded attempt 5 of 10 took 15.3022320271 seconds to write 10000000 lines to 10 files.
Multithreaded attempt 6 of 10 took 15.4110209942 seconds to write 10000000 lines to 10 files.
Multithreaded attempt 7 of 10 took 15.2489261627 seconds to write 10000000 lines to 10 files.
Multithreaded attempt 8 of 10 took 15.2728102207 seconds to write 10000000 lines to 10 files.
Multithreaded attempt 9 of 10 took 15.5378439426 seconds to write 10000000 lines to 10 files.
Multithreaded attempt 10 of 10 took 15.6445670128 seconds to write 10000000 lines to 10 files.
--------------------------------------------------------------------------------
Multithreaded results: [15.23, 15.32, 15.2, 15.8, 15.3, 15.41, 15.25, 15.27, 15.54, 15.64]
avg: 15.3965023279
--------------------------------------------------------------------------------
Starting Multiprocessed ...
Multiprocessed attempt 1 of 10 took 2.72080206871 seconds to write 10000000 lines to 10 files.
Multiprocessed attempt 2 of 10 took 2.78000712395 seconds to write 10000000 lines to 10 files.
Multiprocessed attempt 3 of 10 took 2.83391690254 seconds to write 10000000 lines to 10 files.
Multiprocessed attempt 4 of 10 took 2.97232913971 seconds to write 10000000 lines to 10 files.
Multiprocessed attempt 5 of 10 took 2.73234081268 seconds to write 10000000 lines to 10 files.
Multiprocessed attempt 6 of 10 took 2.72563314438 seconds to write 10000000 lines to 10 files.
Multiprocessed attempt 7 of 10 took 2.69105911255 seconds to write 10000000 lines to 10 files.
Multiprocessed attempt 8 of 10 took 2.70203900337 seconds to write 10000000 lines to 10 files.
Multiprocessed attempt 9 of 10 took 2.84333300591 seconds to write 10000000 lines to 10 files.
Multiprocessed attempt 10 of 10 took 2.72442889214 seconds to write 10000000 lines to 10 files.
--------------------------------------------------------------------------------
Multiprocessed results: [2.72, 2.78, 2.83, 2.97, 2.73, 2.73, 2.69, 2.7, 2.84, 2.72]
avg: 2.77258892059
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Multiprocessing is faster.

…so concurrent file I/O is faster using multiprocessing, but that’s with PyPy– how about standard CPython 2.7?

Python 2.7.12 (CPython)

Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2

I ran fewer tests for this one because I don’t have all night, but the result was similarly disparate.

$ python iotest.py 
Starting Multithreaded ...
Multithreaded attempt 1 of 1 took 102.89225316 seconds to write 10000000 lines to 10 files.
--------------------------------------------------------------------------------
Multithreaded results: [102.89]
avg: 102.89225316
--------------------------------------------------------------------------------
Starting Multiprocessed ...
Multiprocessed attempt 1 of 1 took 4.37144494057 seconds to write 10000000 lines to 10 files.
--------------------------------------------------------------------------------
Multiprocessed results: [4.37]
avg: 4.37144494057
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Multiprocessing is faster.

One More Test…

I ran the test again on PyPy, changing two parameters: write 1,000,000 lines to 30 files.

(PYPY) $ python iotest.py 
Starting Multiprocessed ...
Multiprocessed attempt 1 of 1 took 0.658452987671 seconds to write 1000000 lines to 30 files.
--------------------------------------------------------------------------------
Multiprocessed results: [0.66]
avg: 0.658452987671
--------------------------------------------------------------------------------
Starting Multithreaded ...
Multithreaded attempt 1 of 1 took 4.59706401825 seconds to write 1000000 lines to 30 files.
--------------------------------------------------------------------------------
Multithreaded results: [4.6]
avg: 4.59706401825
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Multiprocessing is faster.

Conclusion

I’m sure any number of highly educated people can find any number of problems with my test methodology, but I’m trusting my eyes on this one. It appears that if you’re doing heavy file I/O, you’re best served by delegating tasks to worker processes rather than threads.
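If juggling Process objects by hand is the administrative pain, multiprocessing.Pool keeps the process model while handling spawning, dispatch, and cleanup for you. This is a minimal sketch, not the post’s actual code; the write_chunk helper and the file-naming scheme are assumptions.

```python
# Sketch: delegate file-writing tasks to a pool of worker processes.
# Pool works in both Python 2.7 and 3.x.
import os
import tempfile
from multiprocessing import Pool

def write_chunk(args):
    # One task = (output path, line count). Must be a top-level
    # function so the pool can pickle it for the workers.
    path, n = args
    with open(path, "w") as f:
        for i in range(n):
            f.write("line %d\n" % i)
    return path

def write_all(out_dir, num_files=10, lines_per_file=100000):
    tasks = [(os.path.join(out_dir, "out_%d.txt" % i), lines_per_file)
             for i in range(num_files)]
    pool = Pool()  # defaults to one worker per CPU
    try:
        return pool.map(write_chunk, tasks)  # blocks until every task is done
    finally:
        pool.close()
        pool.join()

if __name__ == "__main__":
    written = write_all(tempfile.mkdtemp())
    print("wrote %d files" % len(written))
```

pool.map replaces the manual start/join loop, so the process-based approach ends up about as short as the threaded one.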