For this project we used RStudio and Python. RStudio handled most of the data processing. First, it was used to clean the data and append columns, such as percent change in price, for later analysis. It was then used to retrieve sentiment and append scores using the VADER package. Finally, it was used to run the final regression, with ggplot2 used for visualization. Regression and visualization were run in RStudio for Windows; all other processes were run on macOS. Python was used to obtain date data from the URL using regex.
├── DATA
│ ├── LCID Stock Trends .gsheet
│ ├── LCID_Stock_Trends.csv
│ ├── LCID_initial_raw_query.xlsx
│ ├── LCID_updated_with_date.xlsx
│ ├── NVDA_Stock_Trends.csv
│ ├── NVDA_initial_raw_query.xlsx
│ ├── NVDA_with_dates.xlsx
│ ├── Sentiment_score.xlsx
│ ├── daily_stock_percent_changes.csv
│ ├── lcid_stock_data_sentimentVpercChange.xlsx
│ ├── nvda_stock_data_sentimentVpercChange.xlsx
│ └── Data Appendix.pdf
├── OUTPUT
│ ├── Analysis_plan_final.png
│ ├── Compound_Sentiment_Score.png
│ ├── regression_results.png
│ ├── Sentiment_by_date.png
│ └── Stock_Information_Comparison.png
├── SCRIPTS
│ ├── R
│ │ ├── R.Rproj
│ │ ├── correlationLm.R
│ │ ├── desktop.ini
│ │ ├── lcid_stock_data.csv
│ │ ├── mergingdailypercents.Rmd
│ │ └── nvda_stock_data.csv
│ └── rawDataProcess
│ ├── extractDatesFromUrl.py
│ ├── sentimentVsScoreClean.py
│ └── sentiment_score.Rmd
This project takes 4 base data files and performs sentiment analysis and linear regression to achieve the given results.
NVDA_initial_raw_query.xlsx: original raw data query from databar.ai. Link to data: https://databar.ai/public-table-full/9419fc15-9bd5-4050-9253-155b4dc56111
LCID_initial_raw_query.xlsx: original raw data query from databar.ai. Link to data: https://databar.ai/public-table-full/361fe669-c6b6-4f89-b7cc-9561d7bc2314
- To extract date information from the raw data, run extractDatesFromUrl.py on NVDA_initial_raw_query.xlsx and LCID_initial_raw_query.xlsx.
- The result files are NVDA_with_dates.xlsx and LCID_updated_with_date.xlsx.
- To combine sentiment scores and daily percentage change by date for each stock, run sentimentVsScoreClean.py on NVDA_with_dates.xlsx and LCID_updated_with_date.xlsx.
- The result files are nvda_stock_data_sentimentVpercChange.xlsx and lcid_stock_data_sentimentVpercChange.xlsx.
First, take the base data files NVDA_initial_raw_query.xlsx and LCID_initial_raw_query.xlsx. Running extractDatesFromUrl.py pulls the date for each sentiment entry from its URL, which is needed for merging later on. The resulting datasets with dates are stored in the DATA folder as LCID_updated_with_date.xlsx and NVDA_with_dates.xlsx.
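The date extraction step can be sketched roughly as follows. This is a minimal illustration, not the contents of extractDatesFromUrl.py: it assumes the URLs embed the date as YYYY/MM/DD or YYYY-MM-DD, and the function name, pattern, and example URL are hypothetical.

```python
import re
from datetime import date
from typing import Optional

# Assumption: the publication date appears in the URL as YYYY/MM/DD or YYYY-MM-DD.
# The real URLs in the raw query files may use a different layout.
DATE_PATTERN = re.compile(r"(?<!\d)(20\d{2})[/-](\d{1,2})[/-](\d{1,2})(?!\d)")


def extract_date_from_url(url: str) -> Optional[date]:
    """Return the first valid calendar date found in the URL, or None."""
    for match in DATE_PATTERN.finditer(url):
        year, month, day = (int(part) for part in match.groups())
        try:
            return date(year, month, day)
        except ValueError:
            # Digits matched the pattern but are not a real date (e.g. month 13).
            continue
    return None


# Hypothetical URL, used only to show the call.
print(extract_date_from_url("https://example.com/news/2023/05/17/nvda-earnings"))
```

Rows where no date can be found return None, so they can be inspected or dropped before merging.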
Second, run the sentiment_score.Rmd file in RStudio to append the VADER compound sentiment scores to the text data and to combine the NVDA sentiment scores with the LCID sentiment scores by published date. The result file is named Sentiment_score.xlsx.
NVDA_Stock_Trends.csv and LCID_Stock_Trends.csv must be cleaned to create daily_stock_percent_changes.csv. Their columns must first be converted to date and integer data types so the analysis can run. Then, the daily percent price change is found for each stock by subtracting the opening value from the closing value, dividing the result by the opening value, and multiplying by 100. The daily percent change is added as a column, and the two data sets are merged on date. Finally, the merged data set is exported as daily_stock_percent_changes.csv.
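The percent change formula and the merge on date can be sketched as below. The project performs this step in R (mergingdailypercents.Rmd); this Python version only illustrates the arithmetic, and the function names, row layout, and prices are made-up assumptions rather than values from the data files.

```python
def percent_change(open_price: float, close_price: float) -> float:
    """Daily percent change: (close - open) / open * 100."""
    return (close_price - open_price) / open_price * 100


def daily_changes(rows):
    """Map each date to its percent change, given (date, open, close) rows."""
    return {day: percent_change(o, c) for day, o, c in rows}


def merge_on_date(nvda, lcid):
    """Inner join: keep only dates present in both stocks, sorted by date."""
    shared = sorted(nvda.keys() & lcid.keys())
    return [(day, nvda[day], lcid[day]) for day in shared]


# Illustrative prices only, not taken from the project data.
nvda = daily_changes([("2023-05-01", 100.0, 105.0), ("2023-05-02", 105.0, 102.9)])
lcid = daily_changes([("2023-05-02", 50.0, 45.0), ("2023-05-03", 45.0, 46.0)])
merged = merge_on_date(nvda, lcid)  # only 2023-05-02 appears in both
```

An inner join is assumed here; dates that appear for only one stock are dropped from the merged result.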
- Correlation analysis and linear regression:
- Run correlationLm.R after loading nvda_stock_data_sentimentVpercChange.xlsx and lcid_stock_data_sentimentVpercChange.xlsx into your R workspace.
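For readers unfamiliar with the final step, here is a small sketch of what a correlation and a simple linear regression compute. It is not the code in correlationLm.R, which is written in R. It assumes sentiment score is the predictor and daily percent change is the response, and the numbers are toy values chosen only to make the arithmetic easy to follow.

```python
from math import sqrt


def _sums(x, y):
    """Return means and centered sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return mean_x, mean_y, sxy, sxx, syy


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    _, _, sxy, sxx, syy = _sums(x, y)
    return sxy / sqrt(sxx * syy)


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x, mean_y, sxy, sxx, _ = _sums(x, y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy values: x stands in for sentiment scores, y for daily percent changes.
sentiment = [0.0, 1.0, 2.0, 3.0]
pct_change = [1.0, 3.0, 5.0, 7.0]
r = pearson_r(sentiment, pct_change)
slope, intercept = fit_line(sentiment, pct_change)
```

In R, the equivalent calculations are typically done with `cor()` and `lm()`.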