# Stop the spread … Part II


Source of the image: https://www.theguardian.com/environment/2020

This is Part II of the exploration and analysis of #stopthespread. If you have not read Part I yet, I recommend checking that first!

Inferring attributes that are private, protected, or missing may reveal useful information. Generally speaking, many attributes can be inferred from the network, such as age, political affiliation, location, gender, and religion. In this exercise, I focus on the ‘Country’ attribute and on PageRank.

The motivation behind inferring the country from the location attribute using googlemaps.geocode() is that having information and analysis by country gives an idea of the level of awareness across different places. That inference could also be used to check whether there is a correlation between the tweet count by country and the infection rate. (That correlation is not part of this exercise, but it shows how the ‘Country’ attribute can add a new dimension.) The motivation for PageRank is to infer influencer nodes (vertices) from a different perspective: in Assignment II, I evaluated nodes based on in-degree, while PageRank also accounts for who a link comes from, giving more weight to links received from nodes that are themselves well connected. According to NodeXL’s manual, PageRank is the core metric used by the Google search engine. So the motivation for computing PageRank is for learning purposes and to compare its results with the in-degree results.

The data set includes attributes about users: their description, tweet, followers, following, relationship type (tweet/retweet/mention), relationship date, hashtag in the tweet, tweet date, language, location, and other features that were not used. The location attribute is missing for almost 30% of the records, and for the remaining 70%, most entries have either a city only, a state only, a state abbreviation, or a building number and street with no ZIP code or state. Also, some locations are random words not related to a place at all (e.g., one user has the location “Not where I’d like to be”, and another has “Finding My Way”).

Implementation steps for the ‘Country’ attribute:

  1. I obtained a Google Maps API key, then used the Geocoding library.
  2. I retrieved 1,000 tweets from #stopthespread again, starting from 3/11/2020, using tweepy, because the NodeXL file gave errors when used with googlemaps.geocode().
  3. I used googlemaps.geocode() on the location attribute. geocode() converts an address into geographic coordinates (latitude and longitude). (A minimal code sketch of these steps follows Fig 2.)
Fig 1: Google Maps geocode()
Fig 2: Sample of geocode() result
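
Since Figs 1 and 2 are screenshots, the sketch below shows roughly what steps 1–3 look like in code. The credentials, the query string, and the column layout of tweet_df are assumptions, not the exact values used in the original run.

```python
import tweepy
import googlemaps
import pandas as pd

# Placeholder credentials -- not the actual keys used in this exercise.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Step 2: retrieve ~1,000 English #stopthespread tweets starting from 3/11/2020.
tweets = list(tweepy.Cursor(api.search,
                            q="#stopthespread since:2020-03-11",
                            lang="en",
                            tweet_mode="extended").items(1000))

# Keep the user-entered location string alongside each tweet.
tweet_df = pd.DataFrame({
    "user": [t.user.screen_name for t in tweets],
    "location": [t.user.location for t in tweets],
    "text": [t.full_text for t in tweets],
})

# Step 3: geocode a free-text location into coordinates and address components.
gmaps = googlemaps.Client(key="GOOGLE_MAPS_API_KEY")
sample_result = gmaps.geocode(tweet_df.loc[0, "location"])  # a list of matches
```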

4. The function get_country() is built on top of geocode() and returns the country using the following logic: if the country is given by the user in the location string, it is usually listed at the end, so the function splits on the last “,” and returns whatever follows it, reading from right to left. If that is not available in the data, the function performs a reverse geocode and obtains the closest readable address. It returns an error if nothing can be inferred from the written text. (A reconstruction of this logic is sketched below Fig 3.)

Fig 3: get_country()
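
Fig 3 shows the actual function; the version below is only a reconstruction of the logic described in step 4, so the parsing details and the returned error value are assumptions.

```python
def get_country(location, gmaps_client):
    """Infer a country from a free-text Twitter location string.

    Reconstruction of the logic described in step 4; details are assumed.
    """
    if not location or not location.strip():
        return "error"
    try:
        results = gmaps_client.geocode(location)
        if not results:
            return "error"
        # The formatted address usually lists the country last,
        # so take whatever follows the final comma.
        formatted = results[0]["formatted_address"]
        country = formatted.rsplit(",", 1)[-1].strip()
        if country:
            return country
        # Otherwise, reverse-geocode the coordinates and try again.
        coords = results[0]["geometry"]["location"]
        reverse = gmaps_client.reverse_geocode((coords["lat"], coords["lng"]))
        if reverse:
            return reverse[0]["formatted_address"].rsplit(",", 1)[-1].strip()
        return "error"
    except Exception:
        return "error"
```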

5. Then I counted the countries with tweet_df[‘country’].value_counts(). The top four countries are the USA, the UK, Canada, and India. The USA and India are the top two countries in terms of the total number of COVID-19 cases according to Johns Hopkins University.

Fig 4: count of tweets by country
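
Applying the (reconstructed) get_country() across the dataframe and tallying the countries would look roughly like this:

```python
# Infer a country for every tweet, then count tweets per country (Fig 4).
tweet_df["country"] = tweet_df["location"].apply(lambda loc: get_country(loc, gmaps))
print(tweet_df["country"].value_counts())
```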

Implementation steps for the ‘PageRank’ attribute: The PageRank was computed using NodeXL. PageRank in NodeXL is calculated based on (1) the number of vertices connected to a target vertex, (2) the PageRank of those vertices, and (3) the link propensity.
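
NodeXL computes PageRank from its graph metrics menu, so there is no code for this step. Purely as a cross-check outside NodeXL (my own addition, not part of the original workflow), the same metric can be approximated with networkx on the exported edge list; the file name and column names below are assumptions.

```python
import networkx as nx
import pandas as pd

# Edge list exported from NodeXL: one row per relationship (tweet/retweet/mention).
edges = pd.read_csv("stopthespread_edges.csv")  # assumed columns: Vertex1, Vertex2

G = nx.DiGraph()
G.add_edges_from(zip(edges["Vertex1"], edges["Vertex2"]))

# PageRank with the usual damping factor, to compare with NodeXL's column.
pagerank = nx.pagerank(G, alpha=0.85)
print(sorted(pagerank.items(), key=lambda kv: kv[1], reverse=True)[:5])

# In-degree for the same graph, to compare the two rankings (Table 1).
print(sorted(G.in_degree(), key=lambda kv: kv[1], reverse=True)[:5])
```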

Table 1 shows the top nodes by PageRank and by in-degree; the ordering is not the same for the two metrics. Fig 5 shows the top 5 nodes by PageRank, where nodes with a higher PageRank are drawn with larger diameters. Fig 6 shows where the top nodes sit with respect to the whole network. NodeXL provides an option to change the layout of the network so that this can be seen clearly.

Table 1: PageRank versus in-degree
Fig 5: PageRank for Top 5 Nodes
Fig 6: PageRank for Top 5 nodes versus the rest of the Network

Validation

  • Validation is typically done by splitting the data into train and test sets and comparing the accuracy, precision, and recall of the model based on the four categories: true positives, true negatives, false positives, and false negatives. In my case, the data set has no ground truth that could be used for evaluation: the tweets were retrieved through tweepy and NodeXL, so all the locations were written by the users themselves. Having a target value for calculating TP, TN, FP, and FN would require manually entering a location for each tweet and comparing it against what Google Maps returns. (A sketch of that hypothetical evaluation follows Table 2.)
  • Also, the functions used for Country and PageRank were developed by Google Maps and NodeXL; I only incorporated them, so validating a ready-made function would not add value to this setup. Accordingly, I evaluated the output manually by checking a sample to learn and understand how accurate the results are.
  • Looking at the results in Fig 4, the USA appears as two records (USA and United States).
  • By examining the USA/United States cases, I found that any location written as USA or United States was returned by geocode() as United States, while any address that falls inside the USA but was written with a street and state and without the country was coded as USA.
  • Table 2 shows the 30 tweets that were unclassified. Surprisingly, a couple of tweets that mentioned England were not detected; after looking at a sample of UK cases, it turned out that the problem was the place “North West”, since “Nottingham, England” and “London, England” were correctly found.
  • 7 of the 30 cases below should have returned a country, while the rest are correctly classified as errors.
Table 2: 30 tweets with an unclassified location (error)
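
If a manually entered ground-truth country were added for a sample of tweets (which, as noted above, this data set does not have), the evaluation could be sketched as follows; the true_country column is hypothetical.

```python
# Hypothetical evaluation: 'true_country' is a hand-entered label,
# not a column that exists in the original data set.
labeled = tweet_df.dropna(subset=["true_country"]).copy()

labeled["correct"] = labeled["country"] == labeled["true_country"]
print("Accuracy on the labeled sample:", labeled["correct"].mean())

# Per-country precision: of the tweets coded as country X,
# the share that truly belong to country X.
print(labeled.groupby("country")["correct"].mean())
```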

Ethical issues

Since the data is collected from Twitter, there are almost no ethical issues here: by default, Twitter turns tweet location off, and users have to opt in themselves. Users can also remove past location information from their displayed tweets. So the source data is collected from users with their consent, or at least by their choice.

Limitations

Twitter does not represent the whole population and all of its segments. Also, the majority of tweets are not geotagged.

Also, only tweets in English were used; tweets in any other language were excluded. So other countries may have an equivalent level of awareness, but their tweets and hashtags are in another language.

Another limitation is cities that share the same name across more than one country. That causes a problem when the location contains a city name only and longitude and latitude are missing.

Some locations were written in a different language even though the language of the tweet was English. googlemaps geocode() was still able to find the country based on the longitude and latitude. Table 3 shows a sample from my data set with this case. To validate, I looked at the usernames, and some had names that sounded like they were from that region.

Table 3: Cases with an English tweet and a non-English address
