BigQuery DataFrames ML

Just as BigQuery DataFrames implements the pandas API on top of BigQuery, BigQuery DataFrames ML implements the scikit-learn API on top of BigQuery Machine Learning.

Tutorial

Start a session and initialize a dataframe for a BigQuery table

import bigframes.pandas

df = bigframes.pandas.read_gbq("bigquery-public-data.ml_datasets.penguins")
df

Clean and prepare the data

# filter down to the data we want to analyze
adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# drop the columns we don't care about
adelie_data = adelie_data.drop(columns=["species"])

# drop rows with nulls to get our training data
training_data = adelie_data.dropna()

# take a peek at the training data
training_data

# pick feature columns and label column
X = training_data[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex']]
y = training_data[['body_mass_g']]

Use train_test_split to create train and test datasets

from bigframes.ml.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2)
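To make the effect of test_size=0.2 concrete, here is a minimal plain-Python sketch of a random 80/20 split. It is only an illustration: split_rows and its seed parameter are hypothetical names, and the bigframes implementation runs inside BigQuery and may sample rows differently.

```python
import random


def split_rows(rows, test_size=0.2, seed=0):
    """Shuffle row positions and hold out a fraction of them as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)

    # The first n_test shuffled positions become the test set.
    n_test = round(len(rows) * test_size)
    test_idx = set(indices[:n_test])

    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Ten toy "rows" stand in for the penguin records.
rows = list(range(10))
train_rows, test_rows = split_rows(rows, test_size=0.2)
```

Every row lands in exactly one of the two sets, so the test rows never leak into training.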

Define the model training pipeline

from bigframes.ml.linear_model import LinearRegression
from bigframes.ml.pipeline import Pipeline
from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.preprocessing import StandardScaler, OneHotEncoder

preprocessing = ColumnTransformer([
    # "species" was dropped during data preparation, so it is not encoded here
    ("onehot", OneHotEncoder(), ["island", "sex"]),
    ("scaler", StandardScaler(), ["culmen_depth_mm", "culmen_length_mm", "flipper_length_mm"]),
])

model = LinearRegression(fit_intercept=False)

pipeline = Pipeline([
    ('preproc', preprocessing),
    ('linreg', model)
])

# view the pipeline
pipeline
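For intuition about the two preprocessing steps, the following plain-Python sketch shows what standard scaling and one-hot encoding do to a single column. The helper names standard_scale and one_hot are hypothetical, and the sketch assumes the scikit-learn conventions (population standard deviation, one output column per category); the transformations that BigQuery ML actually applies may differ in detail, for example in how encoded values are represented.

```python
from statistics import fmean, pstdev


def standard_scale(values):
    """Shift a numeric column to mean 0 and rescale it to unit variance."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (an assumption)
    return [(v - mean) / std for v in values]


def one_hot(values):
    """Turn a categorical column into one 0/1 indicator per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


# Small sample columns, for illustration only.
scaled = standard_scale([39.5, 38.5, 37.9])
encoded = one_hot(["Torgersen", "Torgersen", "Dream"])
```

Scaling keeps features measured in different units comparable, and one-hot encoding lets a linear model use string-valued columns such as island and sex.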

Train the pipeline

pipeline.fit(X_train, y_train)

Evaluate the model's performance on the test data

from bigframes.ml.metrics import r2_score

y_pred = pipeline.predict(X_test)

r2_score(y_test, y_pred)
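As a reminder of what this score means, here is a short plain-Python sketch of the standard coefficient of determination, R² = 1 - SS_res / SS_tot. The function name r2 and the body-mass values are made up for illustration; this is not the bigframes implementation, which computes the metric in BigQuery.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative body masses in grams (not real results).
actual = [3700.0, 3450.0, 4100.0]

# Predicting every value exactly gives the best possible score.
perfect = r2(actual, [3700.0, 3450.0, 4100.0])

# Always predicting the mean of the actual values scores zero.
baseline = r2(actual, [3750.0, 3750.0, 3750.0])
```

A score near 1 means the model explains most of the variance in body mass, while a score at or below 0 means it does no better than always predicting the mean.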

Make predictions on new data

import pandas

new_penguins = bigframes.pandas.read_pandas(
    pandas.DataFrame(
        {
            "tag_number": [1633, 1672, 1690],
            "species": [
                "Adelie Penguin (Pygoscelis adeliae)",
                "Adelie Penguin (Pygoscelis adeliae)",
                "Adelie Penguin (Pygoscelis adeliae)",
            ],
            "island": ["Torgersen", "Torgersen", "Dream"],
            "culmen_length_mm": [39.5, 38.5, 37.9],
            "culmen_depth_mm": [18.8, 17.2, 18.1],
            "flipper_length_mm": [196.0, 181.0, 188.0],
            "sex": ["MALE", "FEMALE", "FEMALE"],
        }
    ).set_index("tag_number")
)

# view the new data
new_penguins

pipeline.predict(new_penguins)

Save the trained model to BigQuery, so we can load it later

pipeline.to_gbq("bqml_tutorial.penguins_model", replace=True)
