Normalizing and Scaling Data
Concept: In machine learning, features (input data) can have very different ranges. For instance, one feature may vary from 0 to 1 while another ranges from 0 to 10,000. Left unscaled, the wide-range feature can dominate distance calculations and gradient updates simply because its numbers are bigger. Normalization and scaling put features on comparable footing so that all of them can contribute meaningfully to the model.
Simple Example: Imagine we are comparing scores from different exams. One exam is scored out of 100, while another is scored out of 10. To compare students fairly, we rescale the scores so they sit on a comparable scale:
from sklearn.preprocessing import normalize

scores = [[95, 8], [85, 9], [74, 6], [70, 7]]

# normalize() rescales each row (one student's pair of scores) to unit L2 norm
normalized_scores = normalize(scores)
print(normalized_scores)
In this example, normalize() rescales each row, so every student's score vector ends up with the same overall length. The exam with the larger raw range no longer carries extra weight just because its numbers are bigger.
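One subtlety worth knowing: scikit-learn's normalize() works row-wise by default, rescaling each sample rather than each exam column. If the goal is literally to map each exam onto the same 0-to-1 range, a per-feature rescaler such as MinMaxScaler is the more direct tool. A minimal sketch, using the same scores as above:

from sklearn.preprocessing import MinMaxScaler

scores = [[95, 8], [85, 9], [74, 6], [70, 7]]

# MinMaxScaler rescales each column (each exam) independently to [0, 1],
# so an exam scored out of 100 and one scored out of 10 become directly comparable
scaler = MinMaxScaler()
scaled_scores = scaler.fit_transform(scores)
print(scaled_scores)

Both approaches remove the range imbalance; they just slice the data differently (per sample versus per feature).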
Exoplanet Data: In the exoplanet dataset, flux values vary greatly from star to star because some stars are intrinsically much brighter than others. To keep the model from being swayed by these brightness differences, we normalize each light curve:
from sklearn.preprocessing import normalize

# Rescale each light curve (each row) to unit norm, so a star's overall
# brightness does not overshadow the shape of its flux pattern
x_train = normalize(x_train)
x_test = normalize(x_test)
This step ensures that a star's raw brightness does not dominate the learning process; the model sees the shape of each light curve rather than its magnitude.
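A quick way to verify what this step does: after normalize(), every row has unit L2 norm. The check below uses a small synthetic array as a stand-in for the real x_train:

import numpy as np
from sklearn.preprocessing import normalize

# Synthetic stand-in for x_train: three "light curves" on very different brightness scales
x_train = np.array([[120.0, 80.0, 100.0],
                    [5000.0, 4800.0, 5100.0],
                    [0.9, 1.1, 1.0]])

x_train = normalize(x_train)

# Every row now has length 1, so bright and dim stars are on equal footing
print(np.linalg.norm(x_train, axis=1))  # -> [1. 1. 1.]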
After normalizing, we also apply feature scaling to standardize the data, shifting each feature to zero mean and unit variance:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the training data only, then reuse its learned statistics
# on the test data so that no information leaks from the test set
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
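To confirm the effect, you can check that each feature (column) of the transformed training set now has roughly zero mean and unit variance. A small sanity check, assuming x_train is the NumPy array returned by fit_transform above:

import numpy as np

# Each column of the standardized training set should have mean ~0 and std ~1.
# (The test set will only be approximately standardized, since it reuses the
# statistics learned from the training data.)
print(x_train.mean(axis=0))  # close to 0 for every feature
print(x_train.std(axis=0))   # close to 1 for every feature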
Scaling the data ensures that the machine learning model treats all features equally, regardless of their original scale.
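In practice, the two steps can be chained with a scikit-learn Pipeline, which keeps the fit-on-train, transform-on-test discipline automatic. One way to write it, as a sketch rather than a required structure (Normalizer is the transformer counterpart of the normalize() function used above):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, StandardScaler

# Chain row-wise normalization and per-feature standardization into one object
preprocessing = Pipeline([
    ("normalize", Normalizer()),
    ("scale", StandardScaler()),
])

# fit_transform learns the scaling statistics from the training data only;
# transform reuses them on the test data, avoiding leakage
x_train = preprocessing.fit_transform(x_train)
x_test = preprocessing.transform(x_test)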