{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## SQL Analysis ##" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 1" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext sql" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%sql mysql://admin:sql_2021@lmu-dev-011.cwuw28ktwpbp.us-east-2.rds.amazonaws.com/PortfolioProject" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A simple overview of the technologies being assessed, and the category that they fall under." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * mysql://admin:***@lmu-dev-011.cwuw28ktwpbp.us-east-2.rds.amazonaws.com/PortfolioProject\n", "18 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TechnologyCategory
Angularframework
Djangoframework
Vueframework
Reactframework
Flutterframework
jQueryframework
JavaScriptlanguage
Rustlanguage
GoLanglanguage
Javalanguage
Pythonlanguage
C++language
C#language
HTMLlanguage
CSSlanguage
SQLlanguage
Pearllanguage
PHPlanguage
" ], "text/plain": [ "[('Angular', 'framework'),\n", " ('Django', 'framework'),\n", " ('Vue', 'framework'),\n", " ('React', 'framework'),\n", " ('Flutter', 'framework'),\n", " ('jQuery', 'framework'),\n", " ('JavaScript', 'language'),\n", " ('Rust', 'language'),\n", " ('GoLang', 'language'),\n", " ('Java', 'language'),\n", " ('Python', 'language'),\n", " ('C++', 'language'),\n", " ('C#', 'language'),\n", " ('HTML', 'language'),\n", " ('CSS', 'language'),\n", " ('SQL', 'language'),\n", " ('Pearl', 'language'),\n", " ('PHP', 'language')]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT\n", "\t`name` AS Technology,\n", "\t`type` AS Category\n", "FROM\n", "\ttechnology\n", "\tJOIN technology_type ON technology.technology_type_id = technology_type.id\n", "WHERE\n", "\ttechnology.id IN(\n", "\t\tSELECT\n", "\t\t\ttechnology_id FROM tweet);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pretty standard output. 6 frameworks and 12 languages being analyzed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----------------------------------------------------------------------------------------------------------------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This query provides counts of tweets in the database broken down by each of the technologies. This will provide context for future queries that look at engagment." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * mysql://admin:***@lmu-dev-011.cwuw28ktwpbp.us-east-2.rds.amazonaws.com/PortfolioProject\n", "18 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TechnologyTotalTweets
HTML5299
C#5299
C++5299
Java5299
PHP5298
Python5298
JavaScript5298
CSS3574
React2927
SQL1446
Flutter1254
Angular1115
Django666
Rust650
jQuery551
GoLang532
Vue343
Pearl126
" ], "text/plain": [ "[('HTML', 5299),\n", " ('C#', 5299),\n", " ('C++', 5299),\n", " ('Java', 5299),\n", " ('PHP', 5298),\n", " ('Python', 5298),\n", " ('JavaScript', 5298),\n", " ('CSS', 3574),\n", " ('React', 2927),\n", " ('SQL', 1446),\n", " ('Flutter', 1254),\n", " ('Angular', 1115),\n", " ('Django', 666),\n", " ('Rust', 650),\n", " ('jQuery', 551),\n", " ('GoLang', 532),\n", " ('Vue', 343),\n", " ('Pearl', 126)]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT\n", "\ttechnology.name AS Technology,\n", "\tcount(*) AS TotalTweets\n", "FROM\n", "\ttweet\n", "\tJOIN technology ON tweet.technology_id = technology.id\n", "GROUP BY\n", "\ttechnology_id\n", "ORDER BY\n", "\tTotalTweets DESC;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As to be expected, the categories with the most tweets are some of the most prevelant programming languages, with many of the frameworks being less talked about." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----------------------------------------------------------------------------------------------------------------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This similar query breaks down the number of tweets by the technology type the tweet represents. This is more helpful context to keep in mind when it comes to answering the the primary question." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * mysql://admin:***@lmu-dev-011.cwuw28ktwpbp.us-east-2.rds.amazonaws.com/PortfolioProject\n", "2 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CategoryTotalTweets
language43418
framework6856
" ], "text/plain": [ "[('language', 43418), ('framework', 6856)]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT\n", "\ttechnology_type. `type` as Category,\n", "\tcount(*) AS TotalTweets\n", "FROM\n", "\ttweet\n", "\tJOIN technology ON technology.id = tweet.technology_id\n", "\tJOIN technology_type ON technology_type.id = technology.technology_type_id\n", "GROUP BY\n", "\tCategory\n", "ORDER BY\n", "\tTotalTweets DESC;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This result confirms that was resonably clear from the previous query. Specific web frameworks are less talked about than the most popular programming languages. This makes sense, especially when considering that the basis for almost all of these frameworks is JavaScript. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----------------------------------------------------------------------------------------------------------------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This query creates a view that acts as an identical copy of the tweet table, but contains the full URL of the tweet. All tweets have a unique ID which is part of the URL to the tweet. This query forms the URL from the tweet ID, so a tweet can be looked up manualy if needed. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * mysql://admin:***@lmu-dev-011.cwuw28ktwpbp.us-east-2.rds.amazonaws.com/PortfolioProject\n", "(MySQLdb._exceptions.OperationalError) (1050, \"Table 'tweet_with_url' already exists\")\n", "[SQL: CREATE VIEW tweet_with_url AS (\n", "\n", "\tSELECT\n", "\t\tid,\n", "\t\tusername,\n", "\t\tcontent AS raw_text,\n", "\t\tclean_text,\n", "\t\treplies,\n", "\t\tlikes,\n", "\t\tquote_tweets,\n", "\t\tretweets,\n", "\t\tCONCAT('https://twitter.com/t/status/', url) as url,\n", "\t\tsentiment,\n", "\t\tsubjectivity,\n", "\t\ttechnology_id\n", "\tFROM\n", "\t\ttweet\n", "\n", ");]\n", "(Background on this error at: https://sqlalche.me/e/14/e3q8)\n" ] } ], "source": [ "%%sql\n", "CREATE VIEW tweet_with_url AS (\n", "\n", "\tSELECT\n", "\t\tid,\n", "\t\tusername,\n", "\t\tcontent AS raw_text,\n", "\t\tclean_text,\n", "\t\treplies,\n", "\t\tlikes,\n", "\t\tquote_tweets,\n", "\t\tretweets,\n", "\t\tCONCAT('https://twitter.com/t/status/', url) as url,\n", "\t\tsentiment,\n", "\t\tsubjectivity,\n", "\t\ttechnology_id\n", "\tFROM\n", "\t\ttweet\n", "\n", ");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----------------------------------------------------------------------------------------------------------------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This query identifies overlapping tweets. Overlapping tweets being defined as tweets identical tweets that were pulled in muptiple searches for muptiple different technologies." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * mysql://admin:***@lmu-dev-011.cwuw28ktwpbp.us-east-2.rds.amazonaws.com/PortfolioProject\n", "1 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
OverlappingTweets
25587
" ], "text/plain": [ "[(25587,)]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "WITH `overlaps` AS (\n", "\n", "SELECT\n", "\tid,\n", "\tusername,\n", "\turl AS path,\n", "\tCASE WHEN (SELECT COUNT(*) FROM tweet WHERE url = path) > 1 THEN 1 ELSE 0 END AS overlap\n", "FROM\n", "\ttweet\n", ")\n", "\n", "SELECT count(*) AS OverlappingTweets FROM `overlaps` WHERE overlap = 1;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ALTER TABLE `PortfolioProject`.`tweet`\n", "ADD INDEX `tweet_url_idx` (`url`) USING BTREE;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This query shows that nearly half of the tweets in the database reference two or more different technologies. This is insight into how data collection could be improved for a project like this. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This query will show the top 10 authors of tweets in terms of how many of their tweets appear in the database. This is helpful for identifying potential spam tweets." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * mysql://admin:***@lmu-dev-011.cwuw28ktwpbp.us-east-2.rds.amazonaws.com/PortfolioProject\n", "10 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
usernameNumberOfTweets
10115658121581
1490214721131540480856
183984039696
1401391924754325512622
1507969499483746305614
1332714745871421443476
2420443250413
718463394278477824306
1279550512040341504300
134434122278
" ], "text/plain": [ "[('1011565812', 1581),\n", " ('1490214721131540480', 856),\n", " ('183984039', 696),\n", " ('1401391924754325512', 622),\n", " ('1507969499483746305', 614),\n", " ('1332714745871421443', 476),\n", " ('2420443250', 413),\n", " ('718463394278477824', 306),\n", " ('1279550512040341504', 300),\n", " ('134434122', 278)]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "SELECT\n", "\tusername,\n", "\tcount(*) AS NumberOfTweets\n", "FROM\n", "\ttweet\n", "GROUP BY\n", "\tusername\n", "ORDER BY\n", "\tNumberOfTweets DESC\n", "LIMIT\n", "\t10;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the author with ID: 1011565812 is responsible for over 1500 tweets alone. This information is helpful for the data cleaning process, as these are automated tweets that do not express a particular sentiment. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Business question: What is the most popular framework and language?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following query returns the top framework, and the top language ranked by the number of engagments per tweet pulled. This metric aims to measure how active the community surrounding a particular technology is. A more active community means more active development for open source projects, including additional features and security patches.\n", "\n", "The query uses multiple common table expressions, a dense rank window function, aggregate functions with a group by and multiple joins." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * mysql://admin:***@lmu-dev-011.cwuw28ktwpbp.us-east-2.rds.amazonaws.com/PortfolioProject\n", "2 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nametypeEngagmentsPerTweetTechnologyTypeRank
Flutterframework20.31021
GoLanglanguage16.59401
" ], "text/plain": [ "[('Flutter', 'framework', Decimal('20.3102'), 1),\n", " ('GoLang', 'language', Decimal('16.5940'), 1)]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "WITH engagment_metrics AS (\n", "\tSELECT\n", "\t\ttechnology. `name`,\n", "\t\ttechnology_type.`type`,\n", "\t\tcount(*) AS Tweets,\n", "\t\tSUM(likes) AS TotalLikes,\n", "\t\tSUM(replies) AS TotalReplies,\n", "\t\tSUM(quote_tweets) AS TotalQuotes,\n", "\t\tSUM(retweets) AS TotalRetweets,\n", "\t\tSUM(likes) + SUM(replies) + SUM(quote_tweets) + SUM(retweets) AS TotalEngagements\n", "\tFROM\n", "\t\ttweet\n", "\t\tJOIN technology ON technology.id = tweet.technology_id\n", "\t\tJOIN technology_type ON technology_type.id = technology.technology_type_id\n", "\tGROUP BY\n", "\t\ttechnology. `name`\n", "),\n", "PerTweetEngagment AS (\n", "SELECT\n", "\t*,\n", "\t(TotalEngagements / Tweets) AS EngagmentsPerTweet\n", "FROM \n", "\tengagment_metrics\n", "ORDER BY \n", "\tEngagmentsPerTweet DESC\n", "),\n", "EngagementRanking AS(\n", "SELECT `name`, `type`, EngagmentsPerTweet,\n", "\tDENSE_RANK() OVER ( PARTITION BY `type` ORDER BY EngagmentsPerTweet DESC) AS TechnologyTypeRank\n", "\n", "FROM PerTweetEngagment\n", ")\n", "\n", "SELECT \n", "\t* \n", "FROM \n", "\tEngagementRanking\n", "WHERE \n", "\tTechnologyTypeRank = 1;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results show Flutter has having the highest engegments per tweet of all the frameworks, and GoLand as having the highest engegments per tweet of all the languages. This could be helpful information for a startup, or existing company who is chosing a tech stack for a new product. Of couse many other factors should be taken into account, apart from Twitter engagments, when making a choice like this, but this data may also factor into the decision, as it is an indication of how active the community is. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This query identifies potiental spam tweets that are being counted in the aggregate functions. The spam tweets are tweets from authors identified preciously as automated accounts that post tweets that do not contain actual sentiment toward the associated technology." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * mysql://admin:***@lmu-dev-011.cwuw28ktwpbp.us-east-2.rds.amazonaws.com/PortfolioProject\n", "2 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
techPossibleSpam
Flutter276
GoLang7
" ], "text/plain": [ "[('Flutter', 276), ('GoLang', 7)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "WITH spam_tweets AS(\n", "\n", "SELECT\n", "\t*,\n", "\tCASE WHEN tweet.technology_id = 6 THEN 'Flutter' WHEN tweet.technology_id = 16 THEN 'GoLang' ELSE '' END AS tech \n", "FROM\n", "\ttweet\n", "WHERE\n", "\ttweet.username IN('1011565812', '1490214721131540480', '183984039', '1401391924754325512')\n", "\n", ")\n", "\n", "SELECT\n", "\ttech,\n", "\tcount(*) AS PossibleSpam\n", "FROM \n", "\tspam_tweets\n", "WHERE \n", "\ttech != ''\n", "GROUP BY \n", "\ttech;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This result indicates that a sizable portion of the tweets (around 20 percent for flutter) are from automated authors. Further analysis should filter these results out of the result set. The same is not true for GoLang however, meaning the engagments per tweet metric for GoLang is likley more accurate. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next query analyzes overlapping tweets being counted towards the aggregate metrics for Flutter and GoLang. Tweets referencing multiple different languages or frameworks should not be fully counted towards a single technology, and thus their inclusing could skew results. This query identifies if this effect is significant or not based on how many overlapping tweets are in each set." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * mysql://admin:***@lmu-dev-011.cwuw28ktwpbp.us-east-2.rds.amazonaws.com/PortfolioProject\n", "2 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TechnologyOverlappingTweets
Flutter600
GoLang222
" ], "text/plain": [ "[('Flutter', 600), ('GoLang', 222)]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "WITH `overlaps` AS (\n", "\n", "SELECT\n", "\ttweet.id,\n", "\t`name` AS Technology,\n", "\tusername,\n", "\turl AS path,\n", "\tCASE WHEN (SELECT COUNT(*) FROM tweet WHERE url = path) > 1 THEN 1 ELSE 0 END AS overlap\n", "FROM\n", "\ttweet JOIN technology on tweet.technology_id = technology.id\n", ")\n", "\n", "SELECT Technology, count(*) AS OverlappingTweets \n", "\tFROM \n", "\t\toverlaps \n", "\tWHERE \n", "\t\tTechnology IN ('Flutter', 'GoLang') AND overlap = 1 \n", "\tGROUP BY \n", "\t\tTechnology;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results show that there is considerable overlap in tweets being analysed for Flutter and GoLang. This could possibly be effecting the engagments per tweet metric, as engagments are being counter on tweets that mention more than just Flutter and GoLang. For future analysis, this data should be filtered out or otherwise handled. " ] } ], "metadata": { "interpreter": { "hash": "a854d57973609d053ac5dbf40cf71754a95903bcdddedbf10f0c1a0500665e38" }, "kernelspec": { "display_name": "Python 3.9.10 ('env_tensorflow')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.10" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }