SMM4H’s documentation!
Welcome to SMM4H’s documentation. Check out our ReadMe for more details.
CNN
-
class smm4h.cnn.CNN(x_train, y_train, embedding_matrix, x_val, y_val, labels, dim, maxlen, maxwords, filter_length, cross_val, weights, weight_ratios, test=False, x_test=None, x_test_path=None, export_path=None, x_col_name=None, y_col_name=None, id_col_name=None)
This is the class that contains the CNN.
Parameters: - x_train (List) – X train data pre-processed
- y_train (List) – Y train data pre-processed
- embedding_matrix – embedding ready for embedding layer
- x_val (List) – X validation data pre-processed
- y_val (List) – Y validation data pre-processed
- labels (List) – list of unique labels
- dim (Int) – dimension of word embedding
- maxlen (Int) – maximum input length of a tweet
- maxwords (Int) – maximum words
- filter_length (Int) – length of filter
- cross_val (Bool) – flag for cross validation
- weights (Bool) – flag for keras class weights
- weight_ratios (List) – list of weights
- test (Bool) – flag for test data
- x_test (List) – X test data pre-processed
- x_test_path (Str) – path to X test data
- export_path (Str) – path to where results should be exported to
- x_col_name (Str) – name of the column with the X data
- y_col_name (Str) – name of the column with the Y data
- id_col_name (Str) – name of the column with the IDs
-
cv()
Runs the cross-validation.
-
cv_evaluation_fold(y_pred, y_true, labels)
Computes evaluation metrics for each fold.
Parameters: - y_pred (List) – predicted y data
- y_true (List) – correct y data
- labels (List) – list of possible labels
Returns: fold stats
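The documentation does not say which metrics make up the fold stats. As a minimal sketch, assuming per-label precision, recall and F1 are what is wanted, the computation could look like the following. The helper name fold_stats and the returned dictionary layout are hypothetical and not part of smm4h.

```python
from collections import Counter


def fold_stats(y_pred, y_true, labels):
    """Per-label precision, recall and F1 for one fold (hypothetical helper)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, true in zip(y_pred, y_true):
        if pred == true:
            tp[true] += 1
        else:
            fp[pred] += 1  # predicted this label, but it was wrong
            fn[true] += 1  # missed the correct label
    stats = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        stats[label] = {"precision": precision, "recall": recall, "f1": f1}
    return stats


stats = fold_stats([1, 0, 1, 1], [1, 0, 0, 1], labels=[0, 1])
```

Collecting one such dictionary per fold and averaging the values is one simple way to summarise a cross-validation run.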
-
fit_Model(model, x_train, y_train)
Fits the defined model to the training data.
Parameters: - model – the defined model to fit
- x_train (List) – training data
- y_train (List) – training labels
Returns: model and loss & accuracy stats
-
predict_model(model, x_test, y_test, encoder_classes)
Predicts on the test data with the trained model, converts each prediction to a label index with the NumPy argmax function (the index of the maximum value along an axis), and evaluates the predicted labels against the true labels.
Parameters: - model – trained model
- x_test (List) – test data
- y_test (List) – test true labels
- encoder_classes (List) – labels
Returns: predicted and true labels
Return type: List
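The argmax step can be illustrated without NumPy. This is a minimal sketch, assuming each prediction is a row of per-class scores in the same order as encoder_classes. The helper name argmax_labels and the example class names are hypothetical.

```python
def argmax_labels(probabilities, encoder_classes):
    """Map each row of class scores to the label with the highest score.

    Hypothetical helper. Ties resolve to the first maximum, as numpy.argmax does.
    """
    labels = []
    for row in probabilities:
        best_index = max(range(len(row)), key=row.__getitem__)
        labels.append(encoder_classes[best_index])
    return labels


# Example scores for three tweets and two classes (made-up values).
probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
predicted = argmax_labels(probs, encoder_classes=["neg", "pos"])
```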
-
prediction_to_label(prediction)
Turns the prediction into a label.
Parameters: prediction – prediction for X data
Returns: labels in dictionary form
Return type: dict
-
test_data(model)
Exports test data. You will likely have to modify it slightly.
-
train_test()
This function does train-test.
Model
-
class smm4h.model.Model(Xdata_train, Ydata_train, Xdata_val, Ydata_val, maxwords, maxlen, test=False, data_test=None)
Prepares data for CNN
Parameters: - Xdata_train (List) – preprocessed X train data
- Ydata_train (List) – preprocessed Y train data
- Xdata_val (List) – preprocessed X validation data
- Ydata_val (List) – preprocessed Y validation data
- maxwords (Int) – maximum words to use
- maxlen (Int) – maximum input length for tweet
- test (Bool) – test data flag
- data_test (List) – preprocessed test data
-
get_features(text_series, tokenizer)
Transforms text data into feature vectors that can be used in the ML model. A fitted tokenizer must be available.
Parameters: - text_series – text to create sequences from
- tokenizer – scikit learn tokenizer that has been fitted to text
Returns: padded sequences
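To make "padded sequences" concrete, here is a minimal pure-Python sketch of the idea: words are mapped to integer indices, and each sequence is padded or truncated to maxlen. The helper names, the whitespace tokenisation, and the choice to pad and truncate at the front are assumptions for illustration only. The real method relies on the fitted tokenizer passed in.

```python
def fit_word_index(texts, maxwords):
    """Rank words by frequency and assign indices from 1 (hypothetical helper)."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # Most frequent first; ties broken alphabetically so the result is stable.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ranked[:maxwords])}


def to_padded_sequences(texts, word_index, maxlen):
    """Turn texts into fixed-length index lists; 0 is the padding value."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        seq = seq[-maxlen:]  # assumption: keep the last maxlen tokens
        padded.append([0] * (maxlen - len(seq)) + seq)  # assumption: pad at the front
    return padded


tweets = ["this drug made me dizzy", "this drug works"]
word_index = fit_word_index(tweets, maxwords=10)
features = to_padded_sequences(tweets, word_index, maxlen=4)
```

Every row in features has exactly maxlen entries, which is what a fixed-size input layer needs.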
-
process_test(data_test)
Prepares test data for model.
Parameters: data_test (List) – x test data
Returns: X test data prepared for model
Return type: List
-
process_train(Xdata_train, Ydata_train)
Reads in X data and formats it correctly.
Parameters: - Xdata_train – CSV file of X train data read in via the read_from_file function
- Ydata_train – CSV file of Y train data read in via the read_from_file function
Returns: X & Y train data, word_index, labels
-
process_val(x_data_val, y_data_val, tok)
Prepares validation data for model.
Parameters: - x_data_val (List) – x validation data
- y_data_val (List) – y validation data
Returns: X & Y validation data ready for model
Embedding
-
class smm4h.embedding.MakeEmbedding(word_index, embedding, dim, maxwords, error_handling=False)
Creates embedding for CNN
Parameters: - word_index – word_index from model file
- embedding (Str) – path to embedding file
- dim (Int) – dimension of the embedding
- maxwords (Int) – maximum number of words to use
- error_handling (Bool) – flag for embedding read function that handles errors
-
init_embedding()
Creates the embedding matrix from word_index.
Parameters: word_index (Dict) – word index of X data
Returns: embedding matrix
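A common way to build such a matrix is to give each word index one row and copy the pretrained vector into it when one exists. The sketch below assumes that approach. The helper name build_embedding_matrix, the zero rows for unknown words, and the row-count rule are assumptions and may differ from what init_embedding does.

```python
def build_embedding_matrix(word_index, word_vectors, dim, maxwords):
    """Return a list of rows, one per word index (hypothetical helper).

    Row 0 is reserved for padding. Words without a pretrained vector
    keep an all-zero row.
    """
    num_rows = min(maxwords, len(word_index) + 1)
    matrix = [[0.0] * dim for _ in range(num_rows)]
    for word, index in word_index.items():
        if index >= num_rows:
            continue  # word falls outside the maxwords limit
        vector = word_vectors.get(word)
        if vector is not None:
            matrix[index] = list(vector)
    return matrix


word_index = {"drug": 1, "dizzy": 2, "unknownword": 3}
word_vectors = {"drug": [0.1, 0.2], "dizzy": [0.3, 0.4]}
matrix = build_embedding_matrix(word_index, word_vectors, dim=2, maxwords=10)
```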
-
read_embeddings_from_file(path)
Reads an external embedding file and builds an index mapping words (as strings) to their vector representations (as number vectors).
Parameters: path (Str) – path to embedding file
Returns: dictionary of word vectors
Return type: dict
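The file format is not specified here. As a sketch, assuming a plain-text format with one word per line followed by its space-separated vector components, the mapping could be built as follows. The helper name read_word_vectors is hypothetical, and it takes an open file handle rather than a path so the example can run on in-memory data.

```python
import io


def read_word_vectors(handle):
    """Build {word: [float, ...]} from 'word v1 v2 ...' lines (hypothetical helper)."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank lines and lines without any vector components
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Two made-up 2-dimensional vectors standing in for a real embedding file.
sample = io.StringIO("drug 0.1 0.2\ndizzy 0.3 0.4\n")
word_vectors = read_word_vectors(sample)
```

With a real file, the same function can be called inside a `with open(path, encoding="utf-8") as handle:` block.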
-
read_embeddings_from_file_error_handling(path)
Reads an external embedding file and builds an index mapping words (as strings) to their vector representations (as number vectors). Works exactly the same as read_embeddings_from_file, except that it has additional error handling for some embeddings that are tricky to read.
Parameters: path (Str) – path to embedding file
Returns: dictionary of word vectors
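What the additional error handling covers is not documented. One plausible form, shown here purely as an assumption, is to skip lines whose components cannot be parsed as numbers or whose length does not match the expected dimension. The helper name read_word_vectors_tolerant and its dim argument are hypothetical.

```python
import io


def read_word_vectors_tolerant(handle, dim):
    """Like a plain reader, but skips malformed lines (hypothetical helper).

    Returns the parsed vectors and the number of skipped lines.
    """
    vectors = {}
    skipped = 0
    for line in handle:
        parts = line.rstrip().split(" ")
        try:
            values = [float(x) for x in parts[1:]]
        except ValueError:
            skipped += 1  # e.g. a token that itself contains a space
            continue
        if len(values) != dim:
            skipped += 1  # wrong number of components for this embedding
            continue
        vectors[parts[0]] = values
    return vectors, skipped


sample = io.StringIO("drug 0.1 0.2\nbad line 0.3\ndizzy 0.3 0.4\nshort 0.5\n")
vectors, skipped = read_word_vectors_tolerant(sample, dim=2)
```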
Preprocessing
-
class smm4h.preprocessing.Preprocessing(file, test, x_col_name, y_col_name)
Preprocesses and cleans the data.
Parameters: - file (Str) – path to data file
- test (Bool) – flag for if this is processing the test data
- x_col_name (Str) – Name of the column that has the X data
- y_col_name (Str) – Name of the column that has the Y data
-
remove_drug_names(tweets)
Replaces drug names with the word drug. This is an experimental feature.
Parameters: tweets (List) – X data
Returns: tweets in list with drug names replaced
Return type: List
-
remove_html(tweets)
Removes &.
Parameters: tweets (List) – X data
Returns: tweets in list with & removed
Return type: List
-
remove_punctuation(tweets)
Removes double quotes.
Parameters: tweets (List) – X data
Returns: tweets in list with double quotes removed
Return type: List
-
replace_emojis(tweets)
Replaces emojis with a word that represents them.
Parameters: tweets (List) – X data
Returns: tweets in list with emojis replaced
Return type: List
Replaces hashtags with the word hashtag.
Parameters: tweets (List) – X data
Returns: tweets in list with hashtags replaced
Return type: List
-
replace_links(tweets)
Replaces links with the word hyperlink.
Parameters: tweets (List) – X data
Returns: tweets in list with links replaced
Return type: List
-
replace_usernames(tweets)
Replaces usernames with the word username.
Parameters: tweets (List) – X data
Returns: tweets in list with usernames replaced
Return type: List
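The link, username and hashtag replacements described above can be sketched with regular expressions. The patterns below are assumptions chosen for illustration and are not taken from the smm4h source. The function name replace_tokens is hypothetical.

```python
import re

# Assumed patterns: a link starts with http:// or https://,
# a username with @ and a hashtag with #.
LINK_RE = re.compile(r"https?://\S+")
USERNAME_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def replace_tokens(tweets):
    """Replace links, usernames and hashtags with placeholder words."""
    cleaned = []
    for tweet in tweets:
        # Links go first so that a '#' or '@' inside a URL is not rewritten separately.
        tweet = LINK_RE.sub("hyperlink", tweet)
        tweet = USERNAME_RE.sub("username", tweet)
        tweet = HASHTAG_RE.sub("hashtag", tweet)
        cleaned.append(tweet)
    return cleaned


cleaned = replace_tokens(["@anna this made me dizzy #sideeffects https://example.com/a#top"])
```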
Unbalanced
-
class smm4h.unbalanced.Unbalanced(X, Y, unbalanced, multiplier=None, ratio1=None, ratio2=None, ratio1_label=None, ratio2_label=None)
Oversamples or desamples data. Only works with 2 classes.
Parameters: - X (List) – X data
- Y (List) – Y data
- unbalanced (Str) – how to handle the class imbalance. One of four options: desample, oversample, weights, none.
- multiplier (Int) – factor to multiply by when oversampling. Must be 1 more than the number of duplications desired.
- ratio1 (Int) – ratio desired for first label
- ratio2 (Int) – ratio desired for second label
- ratio1_label (Int) – first label
- ratio2_label (Int) – second label
-
desample(sentences, labels)
Desamples.
Parameters: - labels (List) – y data
- sentences (List) – x data
Returns: desampled sentences, desampled labels
Return type: List
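How desample uses ratio1 and ratio2 is not spelled out. As one possible reading, the sketch below keeps every minority example and randomly drops majority examples until the classes are in the requested ratio. The function name downsample, its arguments and the fixed seed are all assumptions made for illustration.

```python
import random


def downsample(sentences, labels, majority_label, ratio_majority, ratio_minority, seed=0):
    """Drop majority examples to reach ratio_majority:ratio_minority (hypothetical helper)."""
    majority = [i for i, label in enumerate(labels) if label == majority_label]
    minority = [i for i, label in enumerate(labels) if label != majority_label]
    # Number of majority examples allowed for the requested ratio.
    target = min(len(majority), len(minority) * ratio_majority // ratio_minority)
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    kept = sorted(rng.sample(majority, target) + minority)
    return [sentences[i] for i in kept], [labels[i] for i in kept]


x, y = downsample(
    list("abcdefgh"), [0, 0, 0, 0, 0, 0, 1, 1],
    majority_label=0, ratio_majority=2, ratio_minority=1,
)
```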
-
oversample(labels, sentences)
Oversamples.
Parameters: - labels (List) – y data
- sentences (List) – x data
Returns: oversampled sentences, oversampled labels
Return type: List
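The multiplier description ("1 more than duplications desired") suggests that each oversampled example ends up appearing multiplier times in total. A minimal sketch under that assumption follows. The function name oversample_minority and the minority_label argument are hypothetical.

```python
def oversample_minority(sentences, labels, minority_label, multiplier):
    """Repeat each minority example so it appears `multiplier` times in total."""
    out_sentences, out_labels = [], []
    for sentence, label in zip(sentences, labels):
        copies = multiplier if label == minority_label else 1
        out_sentences.extend([sentence] * copies)
        out_labels.extend([label] * copies)
    return out_sentences, out_labels


# multiplier=3 keeps the original and adds two duplicates of each minority example.
x, y = oversample_minority(["a", "b", "c"], [0, 0, 1], minority_label=1, multiplier=3)
```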
File
-
class smm4h.file.File
-
read_from_file(file)
Reads an external file and inserts its content into a list. It also removes whitespace characters, such as newlines, at the end of each line.
Parameters: file (Str) – name of the input file
Returns: content of the file
Return type: List
-
write_from_list(list, file_path)
Creates a csv file from a list.
Parameters: - list (List) – list to become a csv file
- file_path (Str) – path/name to the new file.
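The two file helpers can be approximated with the standard library. This is a sketch under the assumption that the list passed to write_from_list is a list of rows. The function names write_rows and read_lines are hypothetical stand-ins, not the smm4h implementations.

```python
import csv
import os
import tempfile


def write_rows(rows, file_path):
    """Write a list of rows to a CSV file (hypothetical stand-in)."""
    with open(file_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


def read_lines(file_path):
    """Read a file into a list, stripping trailing whitespace from each line."""
    with open(file_path, encoding="utf-8") as handle:
        return [line.rstrip() for line in handle]


# Round trip through a temporary file.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
write_rows([["id", "tweet"], ["1", "this drug made me dizzy"]], path)
lines = read_lines(path)
```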