Speed Up Your Data Collection Efforts With Multithreading in Python | by Aashish Nair | Jun, 2022

Data retrieval doesn’t have to be a one-person job

Photo by Randy Fath on Unsplash

Many projects require a form of data collection, which entails procuring information from external sources like databases and websites.

Unfortunately, data collection procedures can become tedious and time-consuming as the number of data retrieval tasks scales up.

Fortunately, users can spare themselves the long run times by incorporating multithreading when performing their data pulls.

Here, we highlight the benefits of multithreading and discuss how it can be incorporated into programs to expedite data collection operations in Python.

Multithreading enables users to execute operations quicker by running different parts of a process (known as threads) concurrently.

Consider the following function, which takes around five seconds to run.

Code Output (Created By Author)

If the function requires five seconds to execute, how long would it take to run it three times?

1*g9CH64lhdlIHcuN DlHGOg
Code Output (Created By Author)

It’s a no-brainer. The whole operation would require about 15 seconds altogether.

However, such an arrangement can be unbearably inefficient when there is a high number of tasks. How cumbersome would this approach be if the function had to be iterated hundreds or thousands of times?

Fortunately, if the tasks of interest are independent of each other, users can reduce the time needed to complete the operation by utilizing multithreading, which distributes the tasks across multiple threads.

Python allows users to incorporate multithreading with the concurrent.futures package, which offers the ThreadPoolExecutor class.

One of the most powerful features of this class is the map method which, much like the built-in map method, maps the function of interest with the given collection of arguments.

We can use this class to expedite the previous operation by performing the tasks concurrently across multiple threads.

Code Output (Created By Author)

Instead of carrying out tasks one at a time, the ThreadPoolExecutor executes multiple tasks in the same period of time, thereby reducing run time considerably.

Multithreading is suited for I/O bound tasks, which spend a lot of time interacting with some form of an external system, such as a database or a web server. As data collection procedures often entail receiving data from an external source, it is practical to improve such processes with multithreading.

Multithreading is commonly used to execute operations like scrapping data from websites, reading files, or making calls with an API with much greater efficiency.

As a simple example, suppose that we are looking to collect data using the University Domains and Names API, which provides information on the universities in a given country.

Using the following function, one can obtain data on the university names of any given country.

The goal is to procure information on universities from 248 different countries. Without multithreading, this function can be used to collect data on universities for one country at a time using a simple loop.

Code Output (Created By Author)

Collecting data for all countries in this manner takes about 14 seconds. While the approach gets the job done, it is highly inefficient.

Let’s see how much time is taken to procure the same data by executing the function across ten threads with the ThreadPoolExecutor class.

1*Bboos9qmBVaQeISImz NDg
Code Output (Created By Author)

Carrying out data retrievals with the API concurrently through the use of multithreading shaves off a significant amount of time.

Although you might be assessing multithreading for its ability to decrease run times for data collection procedures, you might be wondering: why stop at just using it for data collection? Why don’t we use multithreading to speed up every operation of interest?

Simply put, this technique has its own limitations.

While multithreading is suitable for I/O-bound tasks, it is inadequate when it comes to dealing with CPU-bound tasks. CPU-bound tasks primarily rely on the processing power of the CPU.

Incorporating multithreading in such operations may be ineffective and could even yield longer run times.

So, if you’re dealing with computationally intensive processes like training machine learning models or processing images, multithreading is not a suitable solution. For such cases, consider exploring more viable alternatives like multiprocessing.

Photo by Prateek Katyal on Unsplash

Multithreading is an invaluable asset for data collection efforts.

That being said, this article mainly serves to advocate the use of the technique and doesn’t delve too much into the best Python modules for carrying out multithreading.

I personally prefer using the ThreadPoolExecutor class from the concurrent.futures package due to its simple interface. However, there are other options one can utilize for carrying out multithreading (e.g., the threading module).

Feel free to explore the various methods for incorporating multithreading in Python and experiment with them to see which ones suit you best.

Happy coding!

News Credit

%d bloggers like this: