• Abstract

    This document presents the reference program to be used for the data challenge "Sobriété numérique : développer un modèle prédictif éco-responsable" (digital sobriety: developing an eco-responsible predictive model). It describes the functional topic to be addressed, specifying the knowledge-graph context and the data used in the challenge.

    Citation: Laude, H. (Oct 2022). Comment développer un modèle de prédiction éco-responsable ? https://management-datascience.org/scripts/21395/.
    Author:
    • Henri Laude
       (henri.laude@ar-p.com) - Advanced Research Partners
    Copyright: © 2022 the author. Published under a Creative Commons CC BY-ND license.
    Conflicts of interest: The author(s) declare that they are not aware of any conflicts of interest arising from the writing of this article.
    Funding: The author(s) declare that they received no funding for the work involved in this article.

    The reference program

    You are free to choose the language (Python or R) and the algorithms and packages you want to call among those proposed (TensorFlow 2.x, PyTorch, …). However, you must use the same input data (the "cora" file), perform a random 50/50 split of the data, and the last line produced by your program must be formatted as follows:

    Accuracy = 79.0%, Time = 30.1 s

    Accuracy is the prediction success rate on the test graph, and Time is the elapsed time between the start of the model-creation code and the computation of the accuracy on the test data.
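
    For orientation, here is a minimal sketch of the timing and output contract (illustrative only; the variable names are placeholders). The timer must cover everything from model creation to the computation of the accuracy on the test data:

    import time

    start_time = time.time()
    # ... build, train, and evaluate your model here ...
    test_accuracy = 0.790  # placeholder for the accuracy your evaluation returns
    interval = time.time() - start_time

    # The exact final line required for automatic processing:
    print(f"Accuracy = {test_accuracy*100:.1f}%, Time = {interval:.1f} s")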

    
    # Graph attention network (GAT) for node classification (adapted from a keras.io tutorial)
    
    # Author: akensert
    # Date created: 2021/09/13
    # Last modified: 2021/12/26
    # Description: An implementation of a Graph Attention Network (GAT)
    # for node classification.
    
    # Minor changes for our challenge: H. Laude
    # Date: 2022/10/05
    
    # Import packages 
    
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers
    import numpy as np
    import pandas as pd
    import os
    import warnings
    import time
    
    warnings.filterwarnings("ignore")
    pd.set_option("display.max_columns", 6)
    pd.set_option("display.max_rows", 6)
    np.random.seed(2)
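    # Note: the NumPy seed above fixes the 50/50 split across runs; TensorFlow's
    # own random seed is not set, so weight initialization (and therefore the
    # final accuracy) still varies from run to run.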
    # Load the data
    
    zip_file = keras.utils.get_file(
        fname="cora.tgz",
        origin="https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz",
        extract=True,
    )
    
    data_dir = os.path.join(os.path.dirname(zip_file), "cora")
    
    citations = pd.read_csv(
        os.path.join(data_dir, "cora.cites"),
        sep="\t",
        header=None,
        names=["target", "source"],
    )
    
    papers = pd.read_csv(
        os.path.join(data_dir, "cora.content"),
        sep="\t",
        header=None,
        names=["paper_id"] + [f"term_{idx}" for idx in range(1433)] + ["subject"],
    )
    
    
    class_values = sorted(papers["subject"].unique())
    class_idx = {name: idx for idx, name in enumerate(class_values)}
    paper_idx = {name: idx for idx, name in enumerate(sorted(papers["paper_id"].unique()))}
    
    papers["paper_id"] = papers["paper_id"].apply(lambda name: paper_idx[name])
    citations["source"] = citations["source"].apply(lambda name: paper_idx[name])
    citations["target"] = citations["target"].apply(lambda name: paper_idx[name])
    papers["subject"] = papers["subject"].apply(lambda value: class_idx[value])

    Excerpt of the “citations” dataframe:

    
    print(citations)
          target  source
    0          0      21
    1          0     905
    2          0     906
    ...      ...     ...
    5426    1874    2586
    5427    1876    1874
    5428    1897    2707
    
    [5429 rows x 2 columns]

    Excerpt of the “papers” dataframe:

    print(papers)
          paper_id  term_0  term_1  ...  term_1431  term_1432  subject
    0          462       0       0  ...          0          0        2
    1         1911       0       0  ...          0          0        5
    2         2002       0       0  ...          0          0        4
    ...        ...     ...     ...  ...        ...        ...      ...
    2705      2372       0       0  ...          0          0        1
    2706       955       0       0  ...          0          0        0
    2707       376       0       0  ...          0          0        2
    
    [2708 rows x 1435 columns]

    Preparation work

    # split 
    
    # Obtain random indices
    random_indices = np.random.permutation(range(papers.shape[0]))
    
    # 50/50 split
    train_data = papers.iloc[random_indices[: len(random_indices) // 2]]
    test_data = papers.iloc[random_indices[len(random_indices) // 2 :]]
    
    # graph construction 
    
    # Obtain paper indices which will be used to gather node states
    # from the graph later on when training the model
    train_indices = train_data["paper_id"].to_numpy()
    test_indices = test_data["paper_id"].to_numpy()
    
    # Obtain ground truth labels corresponding to each paper_id
    train_labels = train_data["subject"].to_numpy()
    test_labels = test_data["subject"].to_numpy()
    
    # Define graph, namely an edge tensor and a node feature tensor
    edges = tf.convert_to_tensor(citations[["target", "source"]])
    node_states = tf.convert_to_tensor(papers.sort_values("paper_id").iloc[:, 1:-1])

    Checking the shapes of the tensors involved

    # Print shapes of the graph
    print("Edges shape:\t\t", edges.shape)
    Edges shape:         (5429, 2)
    print("Node features shape:", node_states.shape)
    Node features shape: (2708, 1433)

    The real challenge starts here, with the timer launched

    # timer
    
    start_time = time.time()
    
    # model -----------------------------------------------------------------------
    
    class GraphAttention(layers.Layer):
        def __init__(
            self,
            units,
            kernel_initializer="glorot_uniform",
            kernel_regularizer=None,
            **kwargs,
        ):
            super().__init__(**kwargs)
            self.units = units
            self.kernel_initializer = keras.initializers.get(kernel_initializer)
            self.kernel_regularizer = keras.regularizers.get(kernel_regularizer)
    
        def build(self, input_shape):
    
            self.kernel = self.add_weight(
                shape=(input_shape[0][-1], self.units),
                trainable=True,
                initializer=self.kernel_initializer,
                regularizer=self.kernel_regularizer,
                name="kernel",
            )
            self.kernel_attention = self.add_weight(
                shape=(self.units * 2, 1),
                trainable=True,
                initializer=self.kernel_initializer,
                regularizer=self.kernel_regularizer,
                name="kernel_attention",
            )
            self.built = True
    
        def call(self, inputs):
            node_states, edges = inputs
    
            # Linearly transform node states
            node_states_transformed = tf.matmul(node_states, self.kernel)
    
            # (1) Compute pair-wise attention scores
            node_states_expanded = tf.gather(node_states_transformed, edges)
            node_states_expanded = tf.reshape(
                node_states_expanded, (tf.shape(edges)[0], -1)
            )
            attention_scores = tf.nn.leaky_relu(
                tf.matmul(node_states_expanded, self.kernel_attention)
            )
            attention_scores = tf.squeeze(attention_scores, -1)
    
            # (2) Normalize attention scores
            attention_scores = tf.math.exp(tf.clip_by_value(attention_scores, -2, 2))
            attention_scores_sum = tf.math.unsorted_segment_sum(
                data=attention_scores,
                segment_ids=edges[:, 0],
                num_segments=tf.reduce_max(edges[:, 0]) + 1,
            )
            attention_scores_sum = tf.repeat(
                attention_scores_sum, tf.math.bincount(tf.cast(edges[:, 0], "int32"))
            )
            attention_scores_norm = attention_scores / attention_scores_sum
    
            # (3) Gather node states of neighbors, apply attention scores and aggregate
            node_states_neighbors = tf.gather(node_states_transformed, edges[:, 1])
            out = tf.math.unsorted_segment_sum(
                data=node_states_neighbors * attention_scores_norm[:, tf.newaxis],
                segment_ids=edges[:, 0],
                num_segments=tf.shape(node_states)[0],
            )
            return out
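
    In conventional notation, the layer's call() computes, for each node $i$ with neighborhood $\mathcal{N}(i)$ (the clipping of the exponent in step (2) is a numerical-stability tweak and is omitted here):

    $$e_{ij} = \mathrm{LeakyReLU}\!\left(a^{\top}\left[W h_i \,\Vert\, W h_j\right]\right), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}, \qquad h_i' = \sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W h_j$$

    where $W$ is kernel, $a$ is kernel_attention, and $\Vert$ denotes concatenation.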
    
    
    class MultiHeadGraphAttention(layers.Layer):
        def __init__(self, units, num_heads=8, merge_type="concat", **kwargs):
            super().__init__(**kwargs)
            self.num_heads = num_heads
            self.merge_type = merge_type
            self.attention_layers = [GraphAttention(units) for _ in range(num_heads)]
    
        def call(self, inputs):
            node_states, edges = inputs

            # Obtain outputs from each attention head
            outputs = [
                attention_layer([node_states, edges])
                for attention_layer in self.attention_layers
            ]
            # Concatenate or average the node states from each head
            if self.merge_type == "concat":
                outputs = tf.concat(outputs, axis=-1)
            else:
                outputs = tf.reduce_mean(tf.stack(outputs, axis=-1), axis=-1)
            # Activate and return node states
            return tf.nn.relu(outputs)
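
    As a quick way to see the shapes involved, a hypothetical smoke test of a single multi-head layer on the graph tensors built earlier (a sketch, not part of the reference program) might look like this:

    # Hypothetical smoke test: one multi-head attention layer on the full graph.
    mha = MultiHeadGraphAttention(units=100, num_heads=8)
    out = mha([tf.cast(node_states, tf.float32), edges])  # explicit cast for safety
    print(out.shape)  # (2708, 800): 8 heads of 100 units with merge_type="concat"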
    
    class GraphAttentionNetwork(keras.Model):
        def __init__(
            self,
            node_states,
            edges,
            hidden_units,
            num_heads,
            num_layers,
            output_dim,
            **kwargs,
        ):
            super().__init__(**kwargs)
            self.node_states = node_states
            self.edges = edges
            self.preprocess = layers.Dense(hidden_units * num_heads, activation="relu")
            self.attention_layers = [
                MultiHeadGraphAttention(hidden_units, num_heads) for _ in range(num_layers)
            ]
            self.output_layer = layers.Dense(output_dim)
    
        def call(self, inputs):
            node_states, edges = inputs
            x = self.preprocess(node_states)
            for attention_layer in self.attention_layers:
                x = attention_layer([x, edges]) + x
            outputs = self.output_layer(x)
            return outputs
    
        def train_step(self, data):
            indices, labels = data
    
            with tf.GradientTape() as tape:
                # Forward pass
                outputs = self([self.node_states, self.edges])
                # Compute loss
                loss = self.compiled_loss(labels, tf.gather(outputs, indices))
            # Compute gradients
            grads = tape.gradient(loss, self.trainable_weights)
            # Apply gradients (update weights)
            self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
            # Update metric(s)
            self.compiled_metrics.update_state(labels, tf.gather(outputs, indices))
    
            return {m.name: m.result() for m in self.metrics}
    
        def predict_step(self, data):
            indices = data
            # Forward pass
            outputs = self([self.node_states, self.edges])
            # Compute probabilities
            return tf.nn.softmax(tf.gather(outputs, indices))
    
        def test_step(self, data):
            indices, labels = data
            # Forward pass
            outputs = self([self.node_states, self.edges])
            # Compute loss
            loss = self.compiled_loss(labels, tf.gather(outputs, indices))
            # Update metric(s)
            self.compiled_metrics.update_state(labels, tf.gather(outputs, indices))
    
            return {m.name: m.result() for m in self.metrics}
    
    
    # train 
    
    # Define hyper-parameters
    HIDDEN_UNITS = 100
    NUM_HEADS = 8
    NUM_LAYERS = 3
    OUTPUT_DIM = len(class_values)
    
    NUM_EPOCHS = 100
    BATCH_SIZE = 256
    VALIDATION_SPLIT = 0.1
    LEARNING_RATE = 3e-1
    MOMENTUM = 0.9
    
    loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    optimizer = keras.optimizers.SGD(LEARNING_RATE, momentum=MOMENTUM)
    accuracy_fn = keras.metrics.SparseCategoricalAccuracy(name="acc")
    early_stopping = keras.callbacks.EarlyStopping(
        monitor="val_acc", min_delta=1e-5, patience=5, restore_best_weights=True
    )
    
    # Build model
    gat_model = GraphAttentionNetwork(
        node_states, edges, HIDDEN_UNITS, NUM_HEADS, NUM_LAYERS, OUTPUT_DIM
    )
    
    # Compile model
    gat_model.compile(loss=loss_fn, optimizer=optimizer, metrics=[accuracy_fn])
    
    gat_model.fit(
        x=train_indices,
        y=train_labels,
        validation_split=VALIDATION_SPLIT,
        batch_size=BATCH_SIZE,
        epochs=NUM_EPOCHS,
        callbacks=[early_stopping],
        verbose=2,
    )
    Epoch 1/100
    5/5 - 8s - loss: 1.8487 - acc: 0.2783 - val_loss: 1.5296 - val_acc: 0.4485
    Epoch 2/100
    5/5 - 0s - loss: 1.2256 - acc: 0.5829 - val_loss: 1.0110 - val_acc: 0.7279
    Epoch 3/100
    5/5 - 0s - loss: 0.6838 - acc: 0.8120 - val_loss: 0.7001 - val_acc: 0.8015
    Epoch 4/100
    5/5 - 0s - loss: 0.3986 - acc: 0.8842 - val_loss: 0.7902 - val_acc: 0.7941
    Epoch 5/100
    5/5 - 0s - loss: 0.2429 - acc: 0.9343 - val_loss: 0.7304 - val_acc: 0.8162
    Epoch 6/100
    5/5 - 0s - loss: 0.1470 - acc: 0.9655 - val_loss: 0.8260 - val_acc: 0.8015
    Epoch 7/100
    5/5 - 0s - loss: 0.0896 - acc: 0.9852 - val_loss: 0.7618 - val_acc: 0.8235
    Epoch 8/100
    5/5 - 0s - loss: 0.0566 - acc: 0.9959 - val_loss: 0.6827 - val_acc: 0.8235
    Epoch 9/100
    5/5 - 0s - loss: 0.0403 - acc: 0.9951 - val_loss: 0.7350 - val_acc: 0.8456
    Epoch 10/100
    5/5 - 0s - loss: 0.0313 - acc: 0.9959 - val_loss: 0.7944 - val_acc: 0.8235
    Epoch 11/100
    5/5 - 0s - loss: 0.0275 - acc: 0.9975 - val_loss: 0.8226 - val_acc: 0.8162
    Epoch 12/100
    5/5 - 0s - loss: 0.0221 - acc: 0.9975 - val_loss: 0.8895 - val_acc: 0.8235
    Epoch 13/100
    5/5 - 0s - loss: 0.0168 - acc: 0.9992 - val_loss: 0.8565 - val_acc: 0.8309
    Epoch 14/100
    5/5 - 0s - loss: 0.0125 - acc: 0.9992 - val_loss: 0.8511 - val_acc: 0.8309
    <keras.callbacks.History object at 0x000000006A46BD68>
    _, test_accuracy = gat_model.evaluate(x=test_indices, y=test_labels, verbose=0)
    
    # prediction ------------------------------------------------------------------
    
    test_probs = gat_model.predict(x=test_indices)
    
    mapping = {v: k for (k, v) in class_idx.items()}
    
    # END OF TIME MEASUREMENT, AFTER HAVING MADE PREDICTIONS ON THE
    # ENTIRE TEST SET
    
    interval = time.time() - start_time

    The work is done, and we have collected both the accuracy and the processing time needed to create, train, and test our model.

    To demonstrate that we can actually use this model, we will now make 10 predictions.

    # 10 predictions 
    
    for i, (probs, label) in enumerate(zip(test_probs[:10], test_labels[:10])):
        print(f"Example {i+1}: {mapping[label]}")
        for j, c in zip(probs, class_idx.keys()):
            print(f"\tProbability of {c: <24} = {j*100:7.3f}%")
        print("---" * 20)
        
    # end -------------------------------------------------------------------------
    Example 1: Probabilistic_Methods
        Probability of Case_Based               =   0.994%
        Probability of Genetic_Algorithms       =   0.052%
        Probability of Neural_Networks          =  10.639%
        Probability of Probabilistic_Methods    =  87.432%
        Probability of Reinforcement_Learning   =   0.188%
        Probability of Rule_Learning            =   0.007%
        Probability of Theory                   =   0.689%
    ------------------------------------------------------------
    Example 2: Genetic_Algorithms
        Probability of Case_Based               =   0.000%
        Probability of Genetic_Algorithms       = 100.000%
        Probability of Neural_Networks          =   0.000%
        Probability of Probabilistic_Methods    =   0.000%
        Probability of Reinforcement_Learning   =   0.000%
        Probability of Rule_Learning            =   0.000%
        Probability of Theory                   =   0.000%
    ------------------------------------------------------------
    Example 3: Theory
        Probability of Case_Based               =   4.472%
        Probability of Genetic_Algorithms       =   0.190%
        Probability of Neural_Networks          =   0.019%
        Probability of Probabilistic_Methods    =  16.739%
        Probability of Reinforcement_Learning   =   0.444%
        Probability of Rule_Learning            =   1.738%
        Probability of Theory                   =  76.398%
    ------------------------------------------------------------
    Example 4: Neural_Networks
        Probability of Case_Based               =   0.000%
        Probability of Genetic_Algorithms       =   0.000%
        Probability of Neural_Networks          =  99.963%
        Probability of Probabilistic_Methods    =   0.031%
        Probability of Reinforcement_Learning   =   0.000%
        Probability of Rule_Learning            =   0.000%
        Probability of Theory                   =   0.005%
    ------------------------------------------------------------
    Example 5: Theory
        Probability of Case_Based               =  12.009%
        Probability of Genetic_Algorithms       =   7.963%
        Probability of Neural_Networks          =   3.157%
        Probability of Probabilistic_Methods    =  27.334%
        Probability of Reinforcement_Learning   =   0.657%
        Probability of Rule_Learning            =  30.371%
        Probability of Theory                   =  18.510%
    ------------------------------------------------------------
    Example 6: Genetic_Algorithms
        Probability of Case_Based               =   0.000%
        Probability of Genetic_Algorithms       = 100.000%
        Probability of Neural_Networks          =   0.000%
        Probability of Probabilistic_Methods    =   0.000%
        Probability of Reinforcement_Learning   =   0.000%
        Probability of Rule_Learning            =   0.000%
        Probability of Theory                   =   0.000%
    ------------------------------------------------------------
    Example 7: Neural_Networks
        Probability of Case_Based               =   0.005%
        Probability of Genetic_Algorithms       =   0.007%
        Probability of Neural_Networks          =  98.378%
        Probability of Probabilistic_Methods    =   1.453%
        Probability of Reinforcement_Learning   =   0.000%
        Probability of Rule_Learning            =   0.002%
        Probability of Theory                   =   0.156%
    ------------------------------------------------------------
    Example 8: Genetic_Algorithms
        Probability of Case_Based               =   0.000%
        Probability of Genetic_Algorithms       = 100.000%
        Probability of Neural_Networks          =   0.000%
        Probability of Probabilistic_Methods    =   0.000%
        Probability of Reinforcement_Learning   =   0.000%
        Probability of Rule_Learning            =   0.000%
        Probability of Theory                   =   0.000%
    ------------------------------------------------------------
    Example 9: Theory
        Probability of Case_Based               =   1.714%
        Probability of Genetic_Algorithms       =   1.794%
        Probability of Neural_Networks          =  19.014%
        Probability of Probabilistic_Methods    =  72.576%
        Probability of Reinforcement_Learning   =   0.905%
        Probability of Rule_Learning            =   0.504%
        Probability of Theory                   =   3.493%
    ------------------------------------------------------------
    Example 10: Case_Based
        Probability of Case_Based               =  99.873%
        Probability of Genetic_Algorithms       =   0.001%
        Probability of Neural_Networks          =   0.001%
        Probability of Probabilistic_Methods    =   0.085%
        Probability of Reinforcement_Learning   =   0.001%
        Probability of Rule_Learning            =   0.030%
        Probability of Theory                   =   0.009%
    ------------------------------------------------------------

    Here is the line you must produce with your own values. Please follow this wording scrupulously so that we can process it automatically (remember that we will run your program 10 times and average the results of those 10 runs).

    print("--" *20 +f"\nAccuracy = {test_accuracy*100:.1f}%, Time = {interval:.1f} s")
    ----------------------------------------
    Accuracy = 78.7%, Time = 15.2 s
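
    For completeness, a hypothetical evaluation harness (illustrative only; this is not the organizers' actual tooling, and the script name is a placeholder) could run the program 10 times, parse the final line, and average the results:

    import re
    import subprocess

    pattern = re.compile(r"Accuracy = ([\d.]+)%, Time = ([\d.]+) s")
    accuracies, times = [], []
    for _ in range(10):
        stdout = subprocess.run(
            ["python", "reference_program.py"],  # placeholder script name
            capture_output=True, text=True, check=True,
        ).stdout
        acc, t = map(float, pattern.search(stdout).groups())
        accuracies.append(acc)
        times.append(t)

    print(f"Mean accuracy = {sum(accuracies)/len(accuracies):.1f}%, "
          f"mean time = {sum(times)/len(times):.1f} s")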