A “hypothesis” is an informed guess of the kind we make every day. In our everyday lives, we form and act on many hypotheses. Some common examples are “It has been raining for the past several days, so it might rain today as well. I should carry an umbrella to work” or “Our class teacher told us that this topic is crucial from an exam perspective, so she might give us a surprise test next week.” Although such statements are generic, they influence the decisions that people make every day. Needless to say, a hypothesis backs almost every decision taken in business.
HYPOTHESIS GENERATION vs. HYPOTHESIS TESTING
We also come across specific hypotheses in data science projects, which we then validate using statistical techniques on the given data. The hypothesis should be set up by the client or the stakeholders’ team before it reaches the data scientists. Once it reaches the data science team, they perform statistical analysis to validate the statements.
Put simply, there are two phases in this process. The first phase is “hypothesis generation,” which should be done on the client or business side. It is usually carried out by analysts or subject matter experts in the specific domain.
The second phase is “hypothesis testing,” carried out by the team of data scientists. Data scientists spend a substantial amount of time understanding the data before performing feature and model engineering.
One can use exploratory data analysis (EDA) techniques to check whether the business’s hypothesis makes sense or whether the data has substantial gaps. This article will focus on the first phase of the process, i.e., “hypothesis generation.”
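As a rough illustration of what checking the data for gaps can look like, here is a minimal Python sketch that reports the share of missing values per column. The records, the column names, and the `missing_share` helper are all made up for this example; a real project would load its own data instead.

```python
# Hypothetical trip records; None marks a missing value.
trips = [
    {"distance_km": 4.2, "avg_speed_kmh": 21.0, "driver_age": 34},
    {"distance_km": 11.5, "avg_speed_kmh": None, "driver_age": 27},
    {"distance_km": None, "avg_speed_kmh": 30.0, "driver_age": None},
    {"distance_km": 7.8, "avg_speed_kmh": 26.0, "driver_age": None},
]


def missing_share(records):
    """Return the fraction of missing (None) values for each column."""
    columns = records[0].keys()
    return {
        col: sum(1 for row in records if row.get(col) is None) / len(records)
        for col in columns
    }


gaps = missing_share(trips)
for column, share in gaps.items():
    print(f"{column}: {share:.0%} missing")
```

A column with a large share of missing values may not be able to support a hypothesis that depends on it, which is worth flagging back to the business early.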
IMPORTANCE OF HYPOTHESIS FOR DATA SCIENCE PROJECTS
Setting up a hypothesis in data science projects has several positive implications, especially when you have many variables. A few of them are:
- It helps to understand the factors that will affect the target variable, allowing data scientists to start with the right set of independent variables.
- Having a strong hypothesis helps in getting a good start on EDA, i.e., exploratory data analysis, which is the foundation for feature and model engineering.
Let’s take an example to understand how businesses should ideally frame the hypothesis. A company named “heels on wheels” is a taxi service operating in Mexico. The company is trying to understand the time taken by a taxi for each trip. The organization’s key members met last Thursday for a quick brainstorming session. Linda prepared notes of the meeting, which list the hypotheses raised in the discussion. A few of them are below:
- Distance: Trip duration is directly proportional to the distance travelled. The farther the drop location, the longer the trip.
- Speed: Trip duration is inversely proportional to the speed of the car. A higher speed means a lower trip duration.
- Car condition: Cars in good condition break down rarely or not at all. Hence, the better the state of the vehicle, the lower its trip duration. One vital factor here is the servicing of the car.
- Car size: Smaller cars pass through traffic more easily and take less time, while large SUVs take longer.
- Driver’s age: Younger drivers tend to drive faster. Hence, the younger the driver, the lower the trip duration.
Framing hypotheses up front in this way has several benefits:
- It saves both time and money: A strong hypothesis helps minimize the back-and-forth between analysts and data scientists.
- It lays the groundwork for robust EDA: Framing the right hypothesis generates a meaningful set of predictor variables, which in turn lays the foundation for robust exploratory data analysis (EDA). Strong EDA, in turn, builds a solid foundation for both feature and model engineering.
- It aids in foreseeing significant predictor variables: The business should ideally brainstorm the predictor variables that can significantly impact the target variable. When this task is handed to data scientists, they have to step into the business’s shoes, and the overall data science project turns into a research project.
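Returning to the taxi example: once trip data is available, the directions brainstormed for distance and speed can be compared against it. The sketch below is one simple way to do that, assuming a Pearson correlation is an acceptable first check. The trip values are invented for illustration, and the names `pearson`, `expected_sign`, and `results` are hypothetical, not part of any library.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up trip records, for illustration only.
distance_km = [5, 5, 10, 10, 15, 15, 20, 20]
avg_speed_kmh = [20, 40, 20, 40, 20, 40, 20, 40]
duration_min = [16, 8, 29, 15, 46, 22, 61, 31]

# Direction each factor is expected to push trip duration,
# taken from the brainstormed hypotheses (+1 longer, -1 shorter).
expected_sign = {"distance_km": +1, "avg_speed_kmh": -1}
features = {"distance_km": distance_km, "avg_speed_kmh": avg_speed_kmh}

results = {}
for name, values in features.items():
    r = pearson(values, duration_min)
    observed_sign = 1 if r > 0 else -1
    results[name] = (r, observed_sign == expected_sign[name])
    print(f"{name}: r = {r:+.2f}, matches hypothesis: {results[name][1]}")
```

A matching sign only says the data does not contradict the hypothesis. It does not establish proportionality or causation, and factors such as car condition or driver age would need their own variables and checks.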
This article discussed the importance of setting up the right hypothesis to reach a solution more effectively. We also discussed the difference between hypothesis generation and hypothesis testing. With the help of a case study, we saw how identifying the right data set is of utmost importance. Finally, we bridged the topic with our real-time findings.
About the author: I am currently working as a data scientist with around 2-3 years of experience in analytics. Alongside my job, I own a blog and also write for business websites. I like to read books, listen to some good music, explore places to travel, and, most importantly, “dream” in my free time. One can connect with me: here