SMM4H’s documentation!¶

Welcome to SMM4H’s documentation. Check out our ReadMe for more details.

Contents

SMM4H’s documentation!
CNN
Model
Embedding
Preprocessing
Unbalanced
File

CNN ¶

class smm4h.cnn.CNN(x_train, y_train, embedding_matrix, x_val, y_val, labels, dim, maxlen, maxwords, filter_length, cross_val, weights, weight_ratios, test=False, x_test=None, x_test_path=None, export_path=None, x_col_name=None, y_col_name=None, id_col_name=None)¶

This is the class that contains the CNN.

Parameters:

x_train (List) – X train data pre-processed
y_train (List) – Y train data pre-processed
embedding_matrix – embedding ready for embedding layer
x_val (List) – X validation data pre-processed
y_val (List) – Y validation data pre-processed
labels (List) – list of unique labels
dim (Int) – dimension of word embedding
maxlen (Int) – maximum input length of a tweet
maxwords (Int) – maximum words
filter_length (Int) – length of filter
cross_val (Bool) – flag for cross validation
weights (Bool) – flag for keras class weights
weight_ratios (List) – list of weights
test (Bool) – flag for test data
x_test (List) – X test data pre-processed
x_test_path (Str) – path to X test data
export_path (Str) – path to where results should be exported to
x_col_name (Str) – name of the column with the X data
y_col_name (Str) – name of the column with the Y data
id_col_name (Str) – name of the column with the IDs
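
Keras accepts per-class weights as a dictionary that maps class indices to weights. The sketch below shows one way a list such as weight_ratios could be turned into that dictionary. The helper name to_class_weight and the assumption that the list is ordered by class index are illustrative only and are not taken from the package.

```python
def to_class_weight(weight_ratios):
    """Map class index -> weight, assuming the list is ordered by class index."""
    return {index: float(weight) for index, weight in enumerate(weight_ratios)}


# Hypothetical ratios: errors on class 1 count five times as much as on class 0.
class_weight = to_class_weight([1, 5])
print(class_weight)
```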

cv()¶: This function does the cross validation.

cv_evaluation_fold(y_pred, y_true, labels)¶

Evaluation metrics for each fold.

Parameters:

y_pred (List) – predicted y data
y_true (List) – correct y data
labels (List) – list of possible labels

Returns:	fold stats
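
The exact contents of the fold stats are not listed here. As a rough illustration of what per-fold evaluation can look like, the sketch below computes precision, recall and F1 per label plus overall accuracy. The function name fold_stats and the shape of the returned dictionary are assumptions, not the package's actual output.

```python
from collections import Counter


def fold_stats(y_pred, y_true, labels):
    """Return per-label precision, recall and F1 plus overall accuracy."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, true in zip(y_pred, y_true):
        if pred == true:
            tp[true] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[true] += 1   # missed the correct label

    stats = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        stats[label] = {"precision": precision, "recall": recall, "f1": f1}

    stats["accuracy"] = sum(tp.values()) / len(y_true) if y_true else 0.0
    return stats


example = fold_stats(y_pred=[1, 0, 1, 1], y_true=[1, 0, 0, 1], labels=[0, 1])
print(example)
```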

fit_Model(model, x_train, y_train)¶

Fits the defined model to the training data.

Parameters:

model – defined model to be fitted
x_train (List) – training data
y_train (List) – training labels

Returns:	model and loss & accuracy stats

predict_model(model, x_test, y_test, encoder_classes)¶

Derives the true labels by taking the indices of the maximum values along an axis with NumPy's argmax function, then evaluates the trained model's predictions against them.

Parameters:

model – trained model
x_test (List) – test data
y_test (List) – test true labels
encoder_classes (List) – labels

Returns:	predicted and true labels
Return type:	List
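
The argmax step can be illustrated without NumPy. The sketch below picks the index of the largest value in each row and maps it to a label; it behaves like argmax along the last axis, including returning the first index on ties. The helper names, the sample scores and the label names are hypothetical.

```python
def argmax_rows(rows):
    """Return, for each row, the index of its largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


def decode(rows, encoder_classes):
    """Map each row of scores to the label at its argmax position."""
    return [encoder_classes[i] for i in argmax_rows(rows)]


classes = ["negative", "positive"]                     # hypothetical label names
probabilities = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]   # hypothetical model output
encoded_truth = [[0, 1], [1, 0], [1, 0]]               # hypothetical encoded y_test

predicted = decode(probabilities, classes)
true = decode(encoded_truth, classes)
print(predicted, true)
```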

prediction_to_label(prediction)¶

Turns the prediction into a label.

Parameters:	prediction – prediction for X data
Returns:	labels in dictionary form
Return type:	dict

test_data(model)¶: This function exports the test data. You will likely have to modify it slightly.

train_test()¶: This function does train-test.

Model ¶

class smm4h.model.Model(Xdata_train, Ydata_train, Xdata_val, Ydata_val, maxwords, maxlen, test=False, data_test=None)¶

Prepares data for CNN

Parameters:

Xdata_train (List) – preprocessed X train data
Ydata_train (List) – preprocessed Y train data
Xdata_val (List) – preprocessed X validation data
Ydata_val (List) – preprocessed Y validation data
maxwords (Int) – maximum words to use
maxlen (Int) – maximum input length for tweet
test (Bool) – test data flag
data_test (List) – preprocessed test data

get_features(text_series, tokenizer)¶

Transforms text data into feature vectors that can be used in the ML model. A fitted tokenizer must be available.

Parameters:

text_series – text to create sequences from
tokenizer – scikit-learn tokenizer that has been fitted to text

Returns:	padded sequences
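
To make "padded sequences" concrete, here is a minimal pure-Python sketch of the idea: words are mapped to integer indices and every sequence is brought to exactly maxlen entries. The helper names are hypothetical, and padding and truncating at the front is an assumption (it mirrors a common default), not a documented property of this package.

```python
from collections import Counter


def fit_word_index(texts, maxwords):
    """Build a 1-based index of the `maxwords` most frequent words."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = counts.most_common(maxwords)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


def texts_to_padded(texts, word_index, maxlen):
    """Convert texts to integer sequences of exactly `maxlen` entries."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        seq = seq[-maxlen:]                             # keep the last maxlen tokens
        padded.append([0] * (maxlen - len(seq)) + seq)  # left-pad with zeros
    return padded


tweets = ["this drug made me dizzy", "this drug works"]
index = fit_word_index(tweets, maxwords=10)
features = texts_to_padded(tweets, index, maxlen=4)
print(features)
```

Index 0 is reserved for padding here, which is why the word index starts at 1.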

process_test(data_test)¶

Prepares test data for model.

Parameters:	data_test (List) – X test data
Returns:	X test data prepared for model
Return type:	List

process_train(Xdata_train, Ydata_train)¶

Reads in X data and formats it correctly.

Parameters:

Xdata_train – CSV file of X train data read in via the read_from_file function
Ydata_train – CSV file of Y train data read in via the read_from_file function

Returns:	X & Y train data, word_index, labels

process_val(x_data_val, y_data_val, tok)¶

Prepares validation data for model.

Parameters:

x_data_val (List) – X validation data
y_data_val (List) – Y validation data

Returns:	X & Y validation data ready for model

Embedding ¶

class smm4h.embedding.MakeEmbedding(word_index, embedding, dim, maxwords, error_handling=False)¶

Creates embedding for CNN

Parameters:

word_index – word_index from model file
embedding (Str) – path to embedding file
dim (Int) – dimension of the embedding
maxwords (Int) – maximum number of words to use
error_handling (Bool) – flag for embedding read function that handles errors

init_embedding()¶

Creates the embedding matrix from word_index.

Parameters:	word_index (Dict) – word index of X data
Returns:	embedding matrix
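
A common way to build such a matrix is to place each word's vector in the row given by its index and to leave words without a vector as zeros. The sketch below follows that pattern with plain lists. The function name build_embedding_matrix, the zero fallback and the handling of maxwords are assumptions for illustration.

```python
def build_embedding_matrix(word_index, word_vectors, dim, maxwords):
    """Return a list of rows; row i holds the vector for the word with index i."""
    n_rows = min(maxwords, len(word_index) + 1)   # +1 because index 0 is padding
    matrix = [[0.0] * dim for _ in range(n_rows)]
    for word, i in word_index.items():
        if i >= n_rows:
            continue                              # word falls outside the maxwords limit
        vector = word_vectors.get(word)
        if vector is not None and len(vector) == dim:
            matrix[i] = list(vector)              # words without a vector stay all-zero
    return matrix


word_index = {"drug": 1, "dizzy": 2, "zzz": 3}
word_vectors = {"drug": [0.1, 0.2], "dizzy": [0.3, 0.4]}
matrix = build_embedding_matrix(word_index, word_vectors, dim=2, maxwords=100)
print(matrix)
```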

read_embeddings_from_file(path)¶

Function to read external embedding files to build an index mapping words (as strings) to their vector representation (as number vectors).

Parameters:	path (Str) – path to embedding file
Returns:	word vectors
Return type:	dict
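
The embedding file format is not specified here. Assuming a plain-text layout with one word per line followed by its space-separated vector values (as used by several publicly available embeddings), the word-to-vector index could be built roughly like this. The function name read_word_vectors is hypothetical.

```python
import os
import tempfile


def read_word_vectors(path):
    """Map each word to its vector, assuming `word v1 v2 ...` on every line."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Write a tiny two-word embedding file to demonstrate the assumed format.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("drug 0.1 0.2\ndizzy 0.3 0.4\n")

vectors = read_word_vectors(tmp.name)
os.remove(tmp.name)
print(vectors)
```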

read_embeddings_from_file_error_handling(path)¶

Function to read external embedding files to build an index mapping words (as strings) to their vector representation (as number vectors). Works exactly the same as read_embeddings_from_file, except that it has additional error handling for some embeddings that are tricky to read.

Parameters:	path (Str) – path to embedding file
Returns:	word vectors
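
Which errors the package handles is not documented here. As one plausible illustration, the sketch below skips lines that cannot be parsed (for example a header line, or a multi-word entry whose extra tokens are not numbers) instead of failing. The function name, the dim check and the skipped counter are all assumptions.

```python
import io


def parse_word_vectors_tolerant(lines, dim):
    """Parse `word v1 ... v_dim` lines, skipping any line that does not fit."""
    vectors, skipped = {}, 0
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        try:
            vector = [float(value) for value in values]
        except ValueError:
            skipped += 1          # a token in the vector was not a number
            continue
        if not word or len(vector) != dim:
            skipped += 1          # header line, blank line, or wrong dimension
            continue
        vectors[word] = vector
    return vectors, skipped


sample = io.StringIO("2 2\ndrug 0.1 0.2\nnew york 0.5 0.6\ndizzy 0.3 0.4\n")
vectors, skipped = parse_word_vectors_tolerant(sample, dim=2)
print(vectors, skipped)
```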

Preprocessing ¶

class smm4h.preprocessing.Preprocessing(file, test, x_col_name, y_col_name)¶

This class preprocesses and cleans the data.

Parameters:

file (Str) – path to data file
test (Bool) – flag for whether this is processing the test data
x_col_name (Str) – name of the column that has the X data
y_col_name (Str) – name of the column that has the Y data

remove_drug_names(tweets)¶

Replaces drug names with the word drug. This is an experimental feature.

Parameters:	tweets (List) – X data
Returns:	tweets in list with drug names replaced
Return type:	List
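
How the package recognizes drug names is not described here. One simple approach, sketched below, matches a supplied list of names as whole words, ignoring case. The function name replace_drug_names and the drug_names argument are hypothetical.

```python
import re


def replace_drug_names(tweets, drug_names):
    """Replace whole-word, case-insensitive matches of any drug name with 'drug'."""
    if not drug_names:
        return list(tweets)       # nothing to replace
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in drug_names) + r")\b",
        flags=re.IGNORECASE,
    )
    return [pattern.sub("drug", tweet) for tweet in tweets]


cleaned = replace_drug_names(
    ["Ibuprofen gave me a headache", "took aspirin today"],
    drug_names=["ibuprofen", "aspirin"],   # hypothetical list for illustration
)
print(cleaned)
```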

remove_html(tweets)¶

Removes &amp

Parameters:	tweets (List) – X data
Returns:	tweets in list with &amp removed
Return type:	List

remove_punctuation(tweets)¶

Removes double quotes

Parameters:	tweets (List) – X data
Returns:	tweets in list with double quotes removed
Return type:	List
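
The two clean-up steps above (remove_html and remove_punctuation) amount to simple string replacements. A minimal sketch of both applied together follows; the combined helper strip_amp_and_quotes is hypothetical and only mirrors the documented behavior.

```python
def strip_amp_and_quotes(tweets):
    """Drop every '&amp' and every double quote from each tweet."""
    return [tweet.replace("&amp", "").replace('"', "") for tweet in tweets]


cleaned = strip_amp_and_quotes(['he said "rest" &amp sleep'])
print(cleaned)
```

Note that removing a token this way leaves the surrounding spaces in place, so two spaces can end up next to each other.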

replace_emojis(tweets)¶

Replaces emojis with the word that represents them.

Parameters:	tweets (List) – X data
Returns:	tweets in list with emojis replaced
Return type:	List
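
The source of the emoji-to-word mapping is not stated here. Assuming a dictionary from emoji characters to words, the replacement could look like the sketch below. The function name and the two-entry mapping are hypothetical.

```python
def replace_emojis_with_words(tweets, emoji_words):
    """Swap each emoji found in `emoji_words` for its word, padded with spaces."""
    result = []
    for tweet in tweets:
        for emoji, word in emoji_words.items():
            tweet = tweet.replace(emoji, f" {word} ")
        result.append(" ".join(tweet.split()))   # collapse the extra spaces
    return result


# Hypothetical mapping for illustration.
emoji_words = {"\U0001F600": "grinning_face", "\U0001F622": "crying_face"}
cleaned = replace_emojis_with_words(["so dizzy\U0001F622"], emoji_words)
print(cleaned)
```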

replace_hashtags(tweets)¶

Replaces hashtags with the word hashtag.

Parameters:	tweets (List) – X data
Returns:	tweets in list with hashtags replaced
Return type:	List

replace_links(tweets)¶

Replaces links with the word hyperlink.

Parameters:	tweets (List) – X data
Returns:	tweets in list with links replaced
Return type:	List

replace_usernames(tweets)¶

Replaces usernames with the word username.

Parameters:	tweets (List) – X data
Returns:	tweets in list with usernames replaced
Return type:	List
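
The three replacements above (hashtags, links, usernames) can be sketched with regular expressions. The patterns below are assumptions about what counts as a link, username or hashtag and may differ from the package's own rules.

```python
import re

LINK = re.compile(r"https?://\S+|www\.\S+")
USERNAME = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")


def replace_tokens(tweets):
    """Replace links, @usernames and #hashtags with placeholder words."""
    cleaned = []
    for tweet in tweets:
        # Links go first so that a '#' inside a URL is not treated as a hashtag.
        tweet = LINK.sub("hyperlink", tweet)
        tweet = USERNAME.sub("username", tweet)
        tweet = HASHTAG.sub("hashtag", tweet)
        cleaned.append(tweet)
    return cleaned


cleaned = replace_tokens(["@bob try this https://example.com/a#b #advice"])
print(cleaned)
```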

Unbalanced ¶

class smm4h.unbalanced.Unbalanced(X, Y, unbalanced, multiplier=None, ratio1=None, ratio2=None, ratio1_label=None, ratio2_label=None)¶

Oversamples or desamples data. Only works with 2 classes.

Parameters:

X (List) – X data
Y (List) – Y data
unbalanced (Str) – balancing strategy to apply. Four options: desample, oversample, weights, none.
multiplier (Int) – number to duplicate by for oversampling. 1 more than duplications desired.
ratio1 (Int) – ratio desired for first label
ratio2 (Int) – ratio desired for second label
ratio1_label (Int) – first label
ratio2_label (Int) – second label

desample(sentences, labels)¶

Desamples.

Parameters:

labels (List) – y data
sentences (List) – x data

Returns:	desampled sentences, desampled labels
Return type:	List
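
How ratio1 and ratio2 are applied is not spelled out here. One reading, sketched below, randomly drops examples until the two labels occur in the proportion ratio1:ratio2. The function name downsample, the seed argument and this interpretation of the ratios are assumptions.

```python
import random


def downsample(sentences, labels, ratio1, ratio2, ratio1_label, ratio2_label, seed=0):
    """Randomly drop examples so the two labels occur in the ratio ratio1:ratio2."""
    first = [i for i, label in enumerate(labels) if label == ratio1_label]
    second = [i for i, label in enumerate(labels) if label == ratio2_label]
    # Largest block size that both classes can still fill.
    unit = min(len(first) // ratio1, len(second) // ratio2)
    rng = random.Random(seed)     # seed is a hypothetical extra for reproducibility
    keep = sorted(rng.sample(first, unit * ratio1) + rng.sample(second, unit * ratio2))
    return [sentences[i] for i in keep], [labels[i] for i in keep]


sentences = [f"tweet {i}" for i in range(10)]
labels = [1, 1] + [0] * 8
x, y = downsample(sentences, labels, ratio1=1, ratio2=2, ratio1_label=1, ratio2_label=0)
print(len(x), y.count(1), y.count(0))
```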

oversample(labels, sentences)¶

Oversamples.

Parameters:

labels (List) – y data
sentences (List) – x data

Returns:	oversampled sentences, oversampled labels
Return type:	List
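
The multiplier is described as one more than the number of duplications desired, so a multiplier of 3 means each affected example appears three times in total. A minimal sketch of that behavior follows; the function name oversample_label and the target_label argument (which class gets duplicated) are assumptions.

```python
def oversample_label(sentences, labels, target_label, multiplier):
    """Repeat every example of `target_label` so it appears `multiplier` times."""
    out_sentences, out_labels = [], []
    for sentence, label in zip(sentences, labels):
        copies = multiplier if label == target_label else 1
        out_sentences.extend([sentence] * copies)
        out_labels.extend([label] * copies)
    return out_sentences, out_labels


x, y = oversample_label(["a", "b", "c"], [0, 1, 0], target_label=1, multiplier=3)
print(x, y)
```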

File ¶

class smm4h.file.File¶

read_from_file(file)¶

Reads an external file and inserts its content into a list. It also removes whitespace characters, such as the newline, at the end of each line.

Parameters:	file (Str) – name of the input file.
Returns:	content of the file
Return type:	List

write_from_list(list, file_path)¶

Creates a csv file from a list

Parameters:

list (List) – list to become a csv file
file_path (Str) – path/name to the new file.
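
The two File helpers can be approximated with the standard library. The sketch below writes a list to a CSV file and reads it back with trailing whitespace stripped. The function names read_lines and write_rows, and the choice of one list item per row, are assumptions rather than the package's implementation.

```python
import csv
import os
import tempfile


def read_lines(path):
    """Return the file's lines as a list with trailing whitespace removed."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip() for line in handle]


def write_rows(rows, path):
    """Write each item of `rows` as one row of a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for row in rows:
            # Wrap plain strings so they become a single-column row.
            writer.writerow([row] if isinstance(row, str) else row)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.csv")
    write_rows(["first tweet", "second tweet"], path)
    lines = read_lines(path)

print(lines)
```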
