Fake News Detection Using Machine Learning

Kevin W
Jun 28, 2021 · 9 min read


Fake news has remained a consistent topic of concern in the current era of politics and media, and it only becomes clearer over time that it is an issue that needs to be addressed. While machine learning is not, and cannot be, the entire solution to this social problem, it can nonetheless play a vital role in addressing the phenomenon. In this post, I am going to explain how I used scikit-learn to build a fake news classifier. Stay tuned for a later post on using deep learning techniques for the same task!

Part 1: The Data

As with any data science project, let’s begin with some EDA (Exploratory Data Analysis). I used this dataset from Kaggle. After reading in the CSV file as a pandas DataFrame, we can see that the data contains five columns: id, title, author, text, and label. There are approximately 20,800 instances in the dataset; however, there are some null values in the title, author, and text fields. For simplicity, I filled these NA values with spaces. While this approach gave me acceptable results, future work could explore how different strategies affect the results (in essence, treating the method used to handle NA values as an additional hyperparameter to tune).

import pandas as pd

# Load the training data and inspect its structure and missing values
df = pd.read_csv("fake_news_train.csv")
df.head()
df.info()
df.isnull().sum()

# Replace null title/author/text values with a single space
df = df.fillna(" ")

Next, I wanted to create some simple data visualizations to get a better sense of my data. First, I created a bar chart to check if the dataset was balanced or not. By plotting the distribution of instances labeled as reliable and unreliable (i.e., not fake and fake, respectively), I was able to confirm that both labels have about 10,000 instances each.

import matplotlib.pyplot as plt

# Plot the label distribution (0 = reliable, 1 = unreliable/fake)
plt.hist(df.label, bins=3)
plt.xlabel("Fake versus Real News")
plt.ylabel("Count")
plt.title("Distribution of Fake and Real News in Training Set")
plt.show()

I also wanted to create some wordclouds to visualize the most common words in both real and fake news in order to see if there were any obvious differences between the two. To do so, I first subset the data based on its label. I then created a generate_wordcloud function that accepted as its arguments some text and a max_words parameter. I used this function to visualize the top 50 most common words in both the real and fake news subsets. Unsurprisingly, there is some overlap between the two. Both feature references to former President Donald Trump, to the United States, and to the government in general. However, it is interesting to note that only the fake news wordcloud contains terms like “Hillary Clinton,” “election,” and “Russia.” This makes sense, given that the dataset was collected approximately 3 years ago when there was greater focus on the 2016 election and potential Russian interference.

from wordcloud import WordCloud

# Split the article text by label (0 = reliable, 1 = unreliable/fake)
textdf_real = df.text[df.label == 0]
textdf_fake = df.text[df.label == 1]
text_real = " ".join(text for text in textdf_real)
text_fake = " ".join(text for text in textdf_fake)

def generate_wordcloud(text, max_words):
    # Build and display a word cloud of the most common words in the text
    wordcloud = WordCloud(max_font_size=50, max_words=max_words, background_color='white').generate(text)
    plt.figure()
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()

generate_wordcloud(text_real, 50)
generate_wordcloud(text_fake, 50)
Word cloud for ‘reliable’ news
Word cloud for ‘unreliable’ (‘fake’) news

Part 2: Topic Modeling using Unsupervised Learning

Next, I’ll use an unsupervised learning technique called Latent Dirichlet Allocation (LDA) to perform what’s known as topic modeling. LDA can be thought of as a dimensionality reduction technique for text: by representing each document as a mixture of a small number of topics, we can discover the most common themes in the corpus. For the sake of brevity, I won’t dig into the details too much in this article, but I found this explanation helpful in understanding what LDA is and why it can be useful.

Before building the LDA model, however, the text needs to be preprocessed. To do so, I’ll write a few preprocessing functions that 1) strip the text of its punctuation, 2) tokenize the text, 3) lemmatize the text, and 4) remove stop words. If you want more information on preprocessing techniques for textual data, I highly recommend articles such as this one. For this project, I used WordNet to lemmatize the text; however, the Porter stemming algorithm has also been shown to work well for English NLP tasks and is often faster than lemmatization.

import re
import gensim
import nltk
from spacy.lang.en import English
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer

def sent_to_words(sentences):
    for sent in sentences:
        sent = re.sub(r'\S*@\S*\s?', '', sent)  # remove email addresses
        sent = re.sub(r'\s+', ' ', sent)        # collapse whitespace/newline chars
        sent = re.sub(r"\'", "", sent)          # remove single quotes
        sent = gensim.utils.simple_preprocess(str(sent), deacc=True)
        yield sent

# Use spaCy's blank English tokenizer (no full language model needed for tokenization)
parser = English()

def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue
        elif token.like_url:
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens

nltk.download('wordnet')

def get_lemma(word):
    # Map a word to its base form via WordNet; fall back to the word itself
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)

nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))
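
As a side note on the stemming alternative mentioned above, here is a minimal sketch of what swapping in NLTK’s Porter stemmer could look like (get_stem is a hypothetical helper, not part of the original project):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def get_stem(word):
    # Rule-based Porter stemming: faster than WordNet lookups, but can produce non-words
    return stemmer.stem(word)

print(get_stem("running"), get_stem("studies"))  # -> run studi

Stemming trades some readability of the resulting tokens for speed, which can matter when preprocessing tens of thousands of articles.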

I then created a prepare_text_for_lda function that tokenizes and lemmatizes the text using the functions created above. I applied it while re-reading the CSV file line by line (sampling roughly 1% of lines) to create a list of token lists. These tokens are then used to build the dictionary and corpus that will, in turn, be used to generate the LDA models.

import random
import pickle
from gensim import corpora

def prepare_text_for_lda(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens

# Sample roughly 1% of the lines in the file to keep the corpus manageable
text_data = []
with open('train.csv') as f:
    for line in f:
        tokens = prepare_text_for_lda(line)
        if random.random() > .99:
            text_data.append(tokens)

# Build the gensim dictionary and bag-of-words corpus, then persist them
dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]
pickle.dump(corpus, open('corpus.pkl', 'wb'))
dictionary.save('fn_dictionary.gensim')

Now it’s time to create the LDA models! I created three LDA models: one model to group data into 3 topics, one model to group into 5 topics, and one for 10.

dictionary = gensim.corpora.Dictionary.load('fn_dictionary.gensim')
corpus = pickle.load(open('corpus.pkl', 'rb'))

# Train and save an LDA model for each topic count, printing its top words
for num_topics in (3, 5, 10):
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)
    ldamodel.save(f'model{num_topics}.gensim')
    topics = ldamodel.print_topics(num_words=4)
    for topic in topics:
        print(topic)

I then visualized these models using the pyLDAvis package. These visualizations are interactive inside Jupyter Notebooks, and you can click on different topic ‘bubbles’ to see which words appear in each topic and at what frequency. A static example of one of these visualizations is shown below.
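
For reference, a minimal sketch of the pyLDAvis call might look like the following (assuming a recent pyLDAvis release, where the gensim helper lives in pyLDAvis.gensim_models; older releases expose it as pyLDAvis.gensim instead):

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()  # render the interactive visualization inline in Jupyter
lda_display = gensimvis.prepare(ldamodel, corpus, dictionary)
pyLDAvis.display(lda_display)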

Part 3: Text Preprocessing

Before building the model, I first had to do some preprocessing. I began by creating an additional column for the dataframe, df[‘total’], that combined the entries for title, author, and text. This column will serve as our input variable, and the target variable will be the ‘label’ column. Using scikit-learn’s train_test_split function, I then split our data into training and testing sets.

from sklearn.model_selection import train_test_split

# Combine title, author, and text into a single input feature
df['total'] = df['title'] + " " + df['author'] + " " + df['text']
X_train, X_test, y_train, y_test = train_test_split(df['total'], df.label, test_size=0.20, random_state=42)

Next, I wanted to see whether there would be a significant difference between using a count vectorizer and a TF-IDF vectorizer on the data. Using scikit-learn’s CountVectorizer and TfidfVectorizer, I created separate train and test feature matrices for each. It’s important to note that each vectorizer should be fit only on the training set to avoid information leakage from the test set.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Fit each vectorizer on the training data only, then transform the test data
count_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)

tfidf_vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)

Part 4: Model Creation

Now comes the fun part: building the models! For this project, I built three different kinds of models: logistic regression, random forest, and an SGD classifier. For each, I trained one model on the count-vectorized text and one on the TF-IDF-vectorized text.

from sklearn.linear_model import LogisticRegressionCV, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Logistic regression (with built-in cross-validation), on both feature sets
LogReg = LogisticRegressionCV(cv=5, random_state=42, scoring='precision', penalty='l2')
LogReg.fit(count_train, y_train)
y_pred = LogReg.predict(count_test)
print(classification_report(y_test, y_pred))
LogReg.fit(tfidf_train, y_train)
y_pred = LogReg.predict(tfidf_test)
print(classification_report(y_test, y_pred))

# Random forest
rf = RandomForestClassifier()
rf.fit(count_train, y_train)
y_pred = rf.predict(count_test)
print(classification_report(y_test, y_pred))
rf.fit(tfidf_train, y_train)
y_pred = rf.predict(tfidf_test)
print(classification_report(y_test, y_pred))

# SGD classifier
SGD = SGDClassifier()
SGD.fit(count_train, y_train)
y_pred = SGD.predict(count_test)
print(classification_report(y_test, y_pred))
SGD.fit(tfidf_train, y_train)
y_pred = SGD.predict(tfidf_test)
print(classification_report(y_test, y_pred))

Part 5: Conclusions

For the original Kaggle competition that used this dataset, all that was required was to submit a CSV file containing predictions on an additional test set. I’d likely choose to submit predictions made with the SGD classifier, since that model appears to have the best overall performance. However, if I were tasked with choosing a final model to put into production, there are a few factors I’d want to consider: 1) explainability; 2) computational complexity for training and prediction; and 3) precision versus recall.
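
As a rough sketch of what that submission step could look like (the test file name and column layout here are assumptions based on the usual Kaggle format, not taken from the original code):

# Hypothetical submission step: file name and columns are assumed, not from the original project
test_df = pd.read_csv("fake_news_test.csv").fillna(" ")
test_df['total'] = test_df['title'] + " " + test_df['author'] + " " + test_df['text']
test_tfidf = tfidf_vectorizer.transform(test_df['total'])  # reuse the vectorizer fitted on training data
submission = pd.DataFrame({'id': test_df['id'], 'label': SGD.predict(test_tfidf)})
submission.to_csv("submission.csv", index=False)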

Let’s start with explainability. Since fake news is such a sensitive topic, it might be important to select a model that’s easily explainable. Of these three models, the random forest models provide the greatest explainability and might be appealing for this reason. On the other hand, the SGD classifier (with its default hinge loss) does not output probabilities for each class, so it would be a less likely fit if explainability is a high priority. However, there are elements other than just the model that may need explaining; for instance, it may be important to be able to explain how the dataset was curated. For this project, explaining that the data came from Kaggle is probably enough, but a production-capable pipeline would likely require more rigorous dataset curation and documentation.
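
To make the explainability point a bit more concrete, here is a minimal sketch of how one might inspect the most influential n-grams in the TF-IDF random forest model (assuming the rf and tfidf_vectorizer objects trained above; on older scikit-learn versions, use get_feature_names() instead of get_feature_names_out()):

import numpy as np

# Look up the n-grams the random forest leaned on most heavily
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())
top_idx = np.argsort(rf.feature_importances_)[::-1][:20]
for name, importance in zip(feature_names[top_idx], rf.feature_importances_[top_idx]):
    print(f"{name}: {importance:.4f}")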

Next, let’s consider computational complexity. There might be a few different reasons to choose a model with a quicker training time but slower prediction time, or vice versa. Say you wanted to feed your model new data on a periodic basis in order to keep up with the changing nature and content of fake news; for this use case, you may prefer a model with a quicker training time or one that is better suited for out-of-core learning. On the other hand, if you are deploying your model to a web application where users can submit a link to a news article and see whether that article is fake news, you may want to prioritize a model that makes quicker predictions.
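
To illustrate the out-of-core option, here is a minimal sketch (under the assumption that new labeled batches arrive over time; new_batches is a hypothetical stream, not part of the original project) using a stateless HashingVectorizer with SGDClassifier.partial_fit:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

hasher = HashingVectorizer(stop_words='english', ngram_range=(1, 2))  # stateless, so no fitting required
online_clf = SGDClassifier()
for batch_texts, batch_labels in new_batches:  # hypothetical stream of labeled article batches
    X_batch = hasher.transform(batch_texts)
    online_clf.partial_fit(X_batch, batch_labels, classes=[0, 1])  # classes needed on the first call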

Let’s conclude by examining the trade-off between precision and recall. Again, because of the relatively sensitive nature of the topic, it’s worth considering whether to prioritize one over the other. The central question is: which is worse, a fake news article labeled as real news, or a real news article labeled as fake? Both scenarios could be dangerous; a fake article labeled as reliable could accelerate the propagation of disinformation, while a reliable article labeled as fake could cause further confusion and distrust over what exactly the ‘truth’ is. There are several ways to address this trade-off, such as building an ensemble model. However, for our purposes here, let’s consider who would potentially be using, say, a web application built on this model. If our imagined audience is someone who is already deep into the world of fake news, it would likely be more harmful to have fake news articles labeled as real than vice versa.
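
One hedged way to act on that preference, without changing the models themselves, is to lower the decision threshold of a probability-producing model such as the logistic regression above so that more borderline articles get flagged as fake; a minimal sketch (the 0.3 threshold is purely illustrative):

# Shift the decision threshold to favor recall on the 'unreliable' (fake) class
y_prob = LogReg.predict_proba(tfidf_test)[:, 1]  # probability of label 1 (fake)
threshold = 0.3  # illustrative value; lower than 0.5 catches more fake articles
y_pred_custom = (y_prob >= threshold).astype(int)
print(classification_report(y_test, y_pred_custom))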

Overall, the best model here is still the SGD Classifier. Stay tuned for my follow-up article on using deep-learning methods on the same task!

All code for this project can be found here.
