A Data Scientist's Perspective on Model Building: From Scratch or Imports?
When it comes to modeling and machine learning in the real world, data scientists and machine learning engineers often find themselves weighing two options: building models from scratch or using existing packages and libraries. This article explores the practical considerations behind choosing one approach over the other.
The Reality of Model Building in the Real World
Contrary to popular belief, it is rare for data scientists to build models from scratch. In my experience, the vast majority of data scientists and machine learning engineers working in industry lack the technical expertise to develop their own models. The advanced programming skills required are a significant barrier, and most professionals simply do not have the knowledge needed to write custom models.
The main reason is that companies want data science professionals who can apply existing models to specific business problems. Building models from scratch is not only time-consuming but also risky, especially when industry-tested models already exist. Employers want to see results, and they want to see them quickly. A custom build can lead to missed deadlines or financial loss at a time when the company is trying to gain a competitive edge.
Industry Standards and Practicality
So what does this mean for data scientists and machine learning engineers? It means that in most cases, you should prioritize using pre-existing packages and libraries over building everything from scratch. The reason is straightforward: industry standards and practicality.
Using pre-built models not only saves time and resources but also means you are working with well-established, tested solutions. When you rely on industry-standard packages, you benefit from the collective knowledge and experience of the community. Libraries like TensorFlow, PyTorch, scikit-learn, and Keras are widely adopted and continuously maintained, providing robust and efficient implementations of many machine learning algorithms.
For example, consider a simple linear regression model. In seven years in the industry, I've met only two people who could write one from the ground up, which shows how specialized model building is. Most data scientists prefer an existing package such as scikit-learn for linear regression because of the simplicity and reliability it offers. By choosing such packages, you can focus on the unique aspects of your project rather than reinventing the wheel.
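To make the comparison concrete, here is a minimal sketch of both routes for a one-feature linear regression. The function name `fit_simple_linear_regression` and the toy data are assumptions made up for illustration. The from-scratch version uses the closed-form least-squares formulas, and the library version runs only if scikit-learn happens to be installed.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data lying exactly on y = 2x + 1 (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

# Route 1: from scratch.
slope, intercept = fit_simple_linear_regression(xs, ys)

# Route 2: the library, skipped if scikit-learn is not available.
try:
    from sklearn.linear_model import LinearRegression

    model = LinearRegression().fit([[x] for x in xs], ys)
    sk_slope, sk_intercept = float(model.coef_[0]), float(model.intercept_)
except ImportError:
    sk_slope = sk_intercept = None

print("from scratch:", slope, intercept)
print("scikit-learn:", sk_slope, sk_intercept)
```

Both routes should agree on this data. The hand-written version already needs input checks that the library handles for you, and it covers only a single feature.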
When to Build from Scratch
Despite the prevalence of pre-existing packages, there are scenarios where building a model from scratch may be necessary. This typically occurs when you are tackling a highly specific problem or conducting cutting-edge research. In such cases, off-the-shelf solutions may not meet your requirements, and you may need to develop your own. This is especially true in academic or research settings, where the goal is to push the boundaries of existing algorithms and approaches.
While building models from scratch can be a rewarding endeavor, it requires a significant investment of time and effort. You must be prepared to spend time understanding the nuances of the algorithms, debugging, and optimizing the code. This is not a task for the faint of heart and may not align with the expectations of a business setting where faster and more reliable solutions are valued.
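As a rough illustration of the tuning and debugging effort involved, the sketch below fits a one-feature linear model by batch gradient descent in plain Python. The function names, the toy data, and the two learning rates are assumptions chosen for this example. The point is that a hand-written optimizer works with one step size and diverges with a slightly larger one, which is exactly the kind of nuance you have to diagnose yourself.

```python
def mean_squared_error(xs, ys, w, b):
    """Average squared error of the line y = w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_by_gradient_descent(xs, ys, learning_rate, steps):
    """Minimise mean squared error for y = w * x + b, starting from w = b = 0."""
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data lying exactly on y = 2x + 1 (illustrative only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

loss_start = mean_squared_error(xs, ys, 0.0, 0.0)

# A step size that is small enough for this data: the fit converges.
w_ok, b_ok = fit_by_gradient_descent(xs, ys, learning_rate=0.05, steps=2000)
loss_ok = mean_squared_error(xs, ys, w_ok, b_ok)

# A step size that is too large for this data: each update overshoots.
w_bad, b_bad = fit_by_gradient_descent(xs, ys, learning_rate=0.1, steps=50)
loss_bad = mean_squared_error(xs, ys, w_bad, b_bad)

print("start loss:", loss_start)
print("lr=0.05  :", w_ok, b_ok, loss_ok)
print("lr=0.1   :", w_bad, b_bad, loss_bad)
```

With the smaller step size the fit settles near slope 2 and intercept 1, while the larger one ends with a loss worse than the starting point. A maintained library hides this kind of choice behind tested defaults.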
Conclusion
In the real world, it is more practical and efficient to use pre-existing packages and libraries for model building. This approach aligns with industry standards and leverages the collective expertise of the community. While building custom models can be valuable in certain research or specialized contexts, it is rarely the preferred method in the day-to-day work of data scientists and machine learning engineers. By focusing on using proven tools, you can deliver results more efficiently and effectively, meeting the demands of your company and clients.