SMM4H’s documentation!

Welcome to SMM4H’s documentation. Check out our ReadMe for more details.

CNN

class smm4h.cnn.CNN(x_train, y_train, embedding_matrix, x_val, y_val, labels, dim, maxlen, maxwords, filter_length, cross_val, weights, weight_ratios, test=False, x_test=None, x_test_path=None, export_path=None, x_col_name=None, y_col_name=None, id_col_name=None)

This is the class that contains the CNN.

Parameters:
  • x_train (List) – X train data pre-processed
  • y_train (List) – Y train data pre-processed
  • embedding_matrix – embedding ready for embedding layer
  • x_val (List) – X validation data pre-processed
  • y_val (List) – Y validation data pre-processed
  • labels (List) – list of unique labels
  • dim (Int) – dimension of word embedding
  • maxlen (Int) – maximum input length of a tweet
  • maxwords (Int) – maximum words
  • filter_length (Int) – length of filter
  • cross_val (Bool) – flag for cross validation
  • weights (Bool) – flag for keras class weights
  • weight_ratios (List) – list of weights
  • test (Bool) – flag for test data
  • x_test (List) – X test data pre-processed
  • x_test_path (Str) – path to X test data
  • export_path (Str) – path to where results should be exported to
  • x_col_name (Str) – name of the column with the X data
  • y_col_name (Str) – name of the column with the Y data
  • id_col_name (Str) – name of the column with the IDs
cv()

This function does the cross validation.

cv_evaluation_fold(y_pred, y_true, labels)

Evaluation metrics for each fold.

Parameters:
  • y_pred (List) – predicted y data
  • y_true (List) – correct y data
  • labels (List) – list of possible labels
Returns:

fold stats
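
The source does not say which statistics make up the fold stats. As a minimal sketch, assuming per-label precision, recall and F1, the evaluation of one fold could look like this. `fold_stats` is a hypothetical helper and not part of smm4h.

```python
def fold_stats(y_pred, y_true, labels):
    """Hypothetical helper: per-label precision, recall and F1 for one fold."""
    stats = {}
    pairs = list(zip(y_pred, y_true))
    for label in labels:
        tp = sum(1 for p, t in pairs if p == label and t == label)
        fp = sum(1 for p, t in pairs if p == label and t != label)
        fn = sum(1 for p, t in pairs if p != label and t == label)
        # Guard against division by zero when a label never occurs.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        stats[label] = {"precision": precision, "recall": recall, "f1": f1}
    return stats


# Toy fold with two labels (0 and 1).
stats = fold_stats(y_pred=[1, 0, 1, 1], y_true=[1, 0, 0, 1], labels=[0, 1])
```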

fit_Model(model, x_train, y_train)

Fits the defined model to the training data.

Parameters:
  • model – defined model to be trained
  • x_train (List) – training data
  • y_train (List) – training labels
Returns:

model and loss & accuracy stats

predict_model(model, x_test, y_test, encoder_classes)

Takes the model's predictions and uses numpy's argmax function to return the indices of the maximum values along an axis as labels, then evaluates the trained model against the true labels.

Parameters:
  • model – trained model
  • x_test (List) – test data
  • y_test (List) – test true labels
  • encoder_classes (List) – labels
Returns:

predicted and true labels

Return type:

List
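
To illustrate the argmax step described for predict_model, here is a small sketch that turns per-class scores into labels. It uses plain Python as a stand-in for numpy's argmax along axis 1. The helper name `scores_to_labels` and the label names are hypothetical.

```python
def scores_to_labels(scores, encoder_classes):
    """Hypothetical helper: pick the highest-scoring class for each row."""
    labels = []
    for row in scores:
        # Index of the largest score in this row (first one wins on ties).
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(encoder_classes[best])
    return labels


# One row of class scores per tweet; the label names are made up.
scores = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]
predicted = scores_to_labels(scores, ["no_adr", "adr"])
```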

prediction_to_label(prediction)

Turns the prediction into a label.

Parameters:prediction – prediction for X data
Returns:labels in dictionary form
Return type:dict
test_data(model)

This function exports test data. You will likely have to modify it slightly.

train_test()

This function does train-test.

Model

class smm4h.model.Model(Xdata_train, Ydata_train, Xdata_val, Ydata_val, maxwords, maxlen, test=False, data_test=None)

Prepares data for CNN

Parameters:
  • Xdata_train (List) – preprocessed X train data
  • Ydata_train (List) – preprocessed Y train data
  • Xdata_val (List) – preprocessed X validation data
  • Ydata_val (List) – preprocessed Y validation data
  • maxwords (Int) – maximum words to use
  • maxlen (Int) – maximum input length for tweet
  • test (Bool) – test data flag
  • data_test (List) – preprocessed test data
get_features(text_series, tokenizer)

Transforms text data into feature vectors that can be used in the ML model. A fitted tokenizer must be available.

Parameters:
  • text_series – text to create sequences from
  • tokenizer – scikit learn tokenizer that has been fitted to text
Returns:

padded sequences
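
As a rough illustration of what "padded sequences" means here, the sketch below maps words to integer ids and pads each sequence to a fixed length. It is written in plain Python and makes several assumptions not confirmed by the source: whitespace tokenisation, left-padding with zeros, and keeping the last `maxlen` tokens of longer texts. `texts_to_padded_sequences` is a hypothetical helper.

```python
def texts_to_padded_sequences(texts, word_index, maxlen):
    """Hypothetical helper: turn texts into fixed-length lists of word ids."""
    padded = []
    for text in texts:
        # Words missing from word_index are dropped (an assumption).
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[-maxlen:]  # keep at most the last maxlen ids
        padded.append([0] * (maxlen - len(ids)) + ids)  # left-pad with zeros
    return padded


word_index = {"drug": 1, "made": 2, "me": 3, "sleepy": 4}
features = texts_to_padded_sequences(
    ["Drug made me sleepy", "me sleepy"], word_index, maxlen=3
)
```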

process_test(data_test)

Prepares test data for model.

Parameters:
  • data_test (List) – x test data
Returns:

X test data prepared for model

Return type:

List

process_train(Xdata_train, Ydata_train)

Reads in X data and formats it correctly.

Parameters:
  • Xdata_train – CSV file of X train data read in via the read_from_file function
  • Ydata_train – CSV file of Y train data read in via the read_from_file function
Returns:

X & Y train data, word_index, labels

process_val(x_data_val, y_data_val, tok)

Prepares validation data for model.

Parameters:
  • x_data_val (List) – x validation data
  • y_data_val (List) – y validation data
Returns:

X&Y validation data ready for model

Embedding

class smm4h.embedding.MakeEmbedding(word_index, embedding, dim, maxwords, error_handling=False)

Creates embedding for CNN

Parameters:
  • word_index – word_index from model file
  • embedding (Str) – path to embedding file
  • dim (Int) – dimension of the embedding
  • maxwords (Int) – maximum number of words to use
  • error_handling (Bool) – flag for embedding read function that handles errors
init_embedding()

Creates the embedding matrix from word_index.

Parameters:word_index (Dict) – word index of X data
Returns:embedding matrix
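
A minimal sketch of building an embedding matrix from a word index, assuming row i holds the vector of the word with index i, that words without a pretrained vector keep an all-zero row, and that indices at or beyond maxwords are skipped. `build_embedding_matrix` is a hypothetical helper that uses nested lists where a real implementation would more likely use a numpy array.

```python
def build_embedding_matrix(word_index, vectors, dim, maxwords):
    """Hypothetical helper: one row of size dim per word index."""
    rows = min(maxwords, len(word_index) + 1)  # index 0 is left for padding
    matrix = [[0.0] * dim for _ in range(rows)]
    for word, i in word_index.items():
        if i < rows and word in vectors:
            matrix[i] = list(vectors[word])
    return matrix


word_index = {"drug": 1, "sleepy": 2, "rare": 3}
vectors = {"drug": [0.1, 0.2], "sleepy": [0.3, 0.4]}
matrix = build_embedding_matrix(word_index, vectors, dim=2, maxwords=3)
```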
read_embeddings_from_file(path)

Function to read external embedding files to build an index mapping words (as strings) to their vector representation (as number vectors).

Parameters:path (Str) – path to embedding
Returns:word vectors
Return type:dict
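
The sketch below shows one way such a word-to-vector index could be read, assuming a plain-text embedding file with one word per line followed by space-separated numbers (the layout used by GloVe text files). The source does not state the file format, so treat this as an assumption. `read_word_vectors` and the toy file are hypothetical.

```python
import os
import tempfile


def read_word_vectors(path):
    """Hypothetical helper: map each word to its vector as a list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or header-like lines
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Write a two-word toy embedding file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_vectors.txt")
    with open(toy_path, "w", encoding="utf-8") as handle:
        handle.write("drug 0.1 0.2\nsleepy 0.3 0.4\n")
    word_vectors = read_word_vectors(toy_path)
```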
read_embeddings_from_file_error_handling(path)

Function to read external embedding files to build an index mapping words (as strings) to their vector representation (as number vectors). Works exactly the same as read_embeddings_from_file, except that it has additional error handling for some embeddings that are tricky to read.

Parameters:path (Str) – path to embedding
Returns:word vectors

Preprocessing

class smm4h.preprocessing.Preprocessing(file, test, x_col_name, y_col_name)

This class preprocesses and cleans the data.

Parameters:
  • file (Str) – path to data file
  • test (Bool) – flag for if this is processing the test data
  • x_col_name (Str) – Name of the column that has the X data
  • y_col_name (Str) – Name of the column that has the Y data
remove_drug_names(tweets)

Replaces drug names with the word drug. This is an experimental feature.

Parameters:tweets (List) – X data
Returns:tweets in list with drug names replaced
Return type:List
remove_html(tweets)

Removes &amp

Parameters:tweets (List) – X data
Returns:tweets in list with &amp removed
Return type:List
remove_punctuation(tweets)

Removes double quotes

Parameters:tweets (List) – X data
Returns:tweets in list with double quotes removed
Return type:List
replace_emojis(tweets)

Removes emojis and replaces them with a word that represents them

Parameters:tweets (List) – X data
Returns:tweets in list with emojis replaced
Return type:List
replace_hashtags(tweets)

Replaces hashtags with the word hashtag

Parameters:tweets (List) – X data
Returns:tweets in list with hashtags replaced
Return type:List

Replaces links with the word hyperlink

Parameters:tweets (List) – X data
Returns:tweets in list with links replaced
Return type:List
replace_usernames(tweets)

Replaces usernames with the word username

Parameters:tweets (List) – X data
Returns:tweets in list with usernames replaced
Return type:List
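
The replacement methods above can be approximated with regular expressions. This is a sketch only: the placeholder words (username, hashtag, hyperlink) come from the descriptions above, while the patterns and the helper name `replace_tokens` are assumptions and may differ from what smm4h actually matches.

```python
import re

# Assumed patterns; the real library may use different ones.
LINK = re.compile(r"https?://\S+")
USERNAME = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")


def replace_tokens(tweets):
    """Hypothetical helper: swap links, usernames and hashtags for words."""
    cleaned = []
    for tweet in tweets:
        # Links go first so that '#' or '@' inside a URL is not touched.
        tweet = LINK.sub("hyperlink", tweet)
        tweet = USERNAME.sub("username", tweet)
        tweet = HASHTAG.sub("hashtag", tweet)
        cleaned.append(tweet)
    return cleaned


cleaned = replace_tokens(["@sam this med works #relief https://example.com/a"])
```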

Unbalanced

class smm4h.unbalanced.Unbalanced(X, Y, unbalanced, multiplier=None, ratio1=None, ratio2=None, ratio1_label=None, ratio2_label=None)

Oversamples or desamples data. Only works with 2 classes.

Parameters:
  • X (List) – X data
  • Y (List) – Y data
  • unbalanced (Str) – strategy for handling unbalanced data. Four options: desample, oversample, weights, none.
  • multiplier (Int) – multiplication factor for oversampling; 1 more than the number of duplications desired.
  • ratio1 (Int) – ratio desired for first label
  • ratio2 (Int) – ratio desired for second label
  • ratio1_label (Int) – first label
  • ratio2_label (Int) – second label
desample(sentences, labels)

Desamples.

Parameters:
  • labels (List) – y data
  • sentences (List) – x data
Returns:

desampled sentences, desampled labels

Return type:

List
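
The source does not spell out how ratio1 and ratio2 are applied. One plausible reading, sketched below, is to randomly drop samples until the two classes occur in the proportion ratio1:ratio2. `desample_to_ratio` and its `seed` argument are hypothetical and not part of smm4h.

```python
import random


def desample_to_ratio(sentences, labels, ratio1, ratio2,
                      ratio1_label, ratio2_label, seed=0):
    """Hypothetical helper: downsample two classes to a ratio1:ratio2 mix."""
    first = [s for s, y in zip(sentences, labels) if y == ratio1_label]
    second = [s for s, y in zip(sentences, labels) if y == ratio2_label]
    # Largest multiple of the ratio that both classes can still supply.
    scale = min(len(first) // ratio1, len(second) // ratio2)
    rng = random.Random(seed)
    kept_first = rng.sample(first, scale * ratio1)
    kept_second = rng.sample(second, scale * ratio2)
    new_sentences = kept_first + kept_second
    new_labels = ([ratio1_label] * len(kept_first)
                  + [ratio2_label] * len(kept_second))
    return new_sentences, new_labels


sentences = ["a", "b", "c", "d", "e", "f", "g", "h"]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
new_sentences, new_labels = desample_to_ratio(
    sentences, labels, ratio1=1, ratio2=1, ratio1_label=0, ratio2_label=1
)
```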

oversample(labels, sentences)

Oversamples.

Parameters:
  • labels (List) – y data
  • sentences (List) – x data
Returns:

oversampled sentences, oversampled labels

Return type:

List
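
A minimal sketch of oversampling by duplication, following the description of multiplier as one more than the number of duplications desired: with a multiplier of 3, each sample of the chosen class appears three times. Which class is duplicated is an assumption here, passed in as the hypothetical `target_label` argument. `oversample_label` is not part of smm4h.

```python
def oversample_label(labels, sentences, target_label, multiplier):
    """Hypothetical helper: repeat every sample of target_label."""
    new_sentences, new_labels = [], []
    for sentence, label in zip(sentences, labels):
        # Samples of other classes are kept once, unchanged.
        copies = multiplier if label == target_label else 1
        new_sentences.extend([sentence] * copies)
        new_labels.extend([label] * copies)
    return new_sentences, new_labels


over_sentences, over_labels = oversample_label(
    labels=[0, 0, 0, 1], sentences=["a", "b", "c", "d"],
    target_label=1, multiplier=3,
)
```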

File

class smm4h.file.File
read_from_file(file)

Reads an external file and inserts its content into a list. It also removes whitespace characters, such as the newline, at the end of each line.

Parameters:file (Str) – name of the input file.
Returns:content of the file
Return type:List
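
The behaviour described for read_from_file can be sketched as follows, assuming a UTF-8 text file with one entry per line. `read_lines` and the toy file are hypothetical stand-ins and not the library's own code.

```python
import os
import tempfile


def read_lines(path):
    """Hypothetical helper: one list item per line, trailing whitespace removed."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip() for line in handle]


# Write a small toy file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    toy_file = os.path.join(tmp, "tweets.txt")
    with open(toy_file, "w", encoding="utf-8") as handle:
        handle.write("first tweet  \nsecond tweet\n")
    content = read_lines(toy_file)
```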
write_from_list(list, file_path)

Creates a csv file from a list

Parameters:
  • list (List) – list to become a csv file
  • file_path (Str) – path/name to the new file.
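
As a sketch of writing a list out as CSV, the example below uses Python's csv module and assumes the list holds one sequence per row. The source does not say whether the list is flat or nested, so this is an assumption. `write_rows` is a hypothetical helper.

```python
import csv
import os
import tempfile


def write_rows(rows, file_path):
    """Hypothetical helper: write each inner list as one CSV row."""
    # newline="" lets the csv module control line endings itself.
    with open(file_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


# Round trip: write two rows, then read them back.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "results.csv")
    write_rows([["id", "label"], ["1", "0"]], out_path)
    with open(out_path, newline="", encoding="utf-8") as handle:
        written = list(csv.reader(handle))
```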