Group Project 1

Alec Mills
4 min read · Aug 2, 2022

In this blog post I am going to take you through my first group project. The purpose of the project was to apply and strengthen our exploratory data analysis, data cleansing, and data visualization skills on a real-world scenario. We started with the premise: Microsoft has decided to enter the entertainment industry by creating a new movie studio. However, the new head of the studio does not have significant experience in this space and needs our help understanding what types of films are doing best at the box office. We decided to look into three key questions. Which movie genres are the most profitable at the box office? Do movie ratings have any relationship with how well movies do in theaters? And lastly, what release timing should Microsoft aim for? In answering these questions, we give three concrete, actionable recommendations for Microsoft.

When getting started with our project, we were provided with five different datasets, all set up in different ways: CSV and TSV files, as well as one SQL file. The data came from Box Office Mojo, IMDB, Rotten Tomatoes, TheMovieDB, and The Numbers. We performed exploratory data analysis on each of them. To start, we loaded the CSV and TSV files into data frames using pandas' read_csv function. We then reviewed the contents of each dataset to identify relevant financial metrics, such as budget and sales.
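As a rough illustration of that loading step, here is a minimal sketch. The inline text stands in for the real files, and the column names and values are made up for the example rather than taken from our actual data.

```python
import io

import pandas as pd

# Tiny inline stand-ins for the real files. Contents and column names
# are invented for illustration only.
csv_text = """movie,production_budget,domestic_gross
Example Film,20000000,35000000
"""
tsv_text = "movie_id\taveragerating\tnumvotes\nm1\t7.1\t1200\n"

# read_csv parses comma-separated data by default ...
budgets = pd.read_csv(io.StringIO(csv_text))
# ... and tab-separated data when sep="\t" is passed.
ratings = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# A quick first look: first rows, column types, and missing values.
print(budgets.head())
budgets.info()
print(ratings.isna().sum())
```

With real files, the `io.StringIO(...)` argument would simply be replaced by the file path.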

We also reviewed the contents of each dataset to identify other potentially relevant metrics, such as genre, release date, runtime, and directors/cast. After looking through all of the data, we decided to focus on three tables. The Numbers: includes production budget, domestic and worldwide sales, and release date. IMDB Movie Basics: includes unique movie id, original title, start year (release), runtime minutes, and genres. IMDB Ratings: includes unique movie id, average rating, and number of votes.

Cleaning the data consisted of dropping records that didn’t have all the data we needed. For IMDB Movie Basics, we dropped records with null values using .dropna(), then removed duplicate entries by first sorting the dataframe on runtime minutes (highest to lowest) and then dropping duplicates based on original title and release year. We then performed a left merge between the cleansed IMDB Movie Basics dataset and the IMDB Ratings dataset on the movie id field. After merging the IMDB tables, we performed an inner merge between the merged IMDB table and the cleansed The Numbers dataset (‘TN_clean’) on two columns: movie title and release year.
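The cleaning and merging steps above can be sketched roughly as follows. The rows are invented and the column names are assumptions chosen to match the description, so treat this as an outline of the approach rather than our exact notebook code.

```python
import pandas as pd

# Made-up sample rows; column names are assumptions for illustration.
basics = pd.DataFrame({
    "movie_id": ["m1", "m2", "m3", "m4"],
    "original_title": ["Alpha", "Alpha", "Beta", "Gamma"],
    "start_year": [2015, 2015, 2016, 2017],
    "runtime_minutes": [90.0, 120.0, 100.0, None],
    "genres": ["Action", "Action", "Comedy", "Drama"],
})
ratings = pd.DataFrame({
    "movie_id": ["m1", "m2"],
    "averagerating": [6.8, 7.5],
    "numvotes": [400, 1000],
})
tn_clean = pd.DataFrame({
    "movie": ["Alpha", "Beta", "Delta"],
    "release_year": [2015, 2016, 2018],
    "production_budget": [50_000_000, 20_000_000, 5_000_000],
})

# Drop rows with nulls, then keep the longest-runtime entry for each
# (title, year) pair.
basics_clean = (
    basics.dropna()
    .sort_values("runtime_minutes", ascending=False)
    .drop_duplicates(subset=["original_title", "start_year"], keep="first")
)

# Left merge: every cleaned movie is kept, even if it has no rating.
imdb = basics_clean.merge(ratings, on="movie_id", how="left")

# Inner merge: only movies found in both tables survive.
movies_df = imdb.merge(
    tn_clean,
    left_on=["original_title", "start_year"],
    right_on=["movie", "release_year"],
    how="inner",
)
print(movies_df)
```

Sorting before `drop_duplicates(keep="first")` is what decides which duplicate survives, which is why the sort on runtime comes first.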

After analyzing the makeup of the merged Movies_DF data frame, we further reduced the dataset by applying the following filters. Movies with a production budget of over $10,000,000: we believe Microsoft will want to focus on bigger-budget films, and this removes independent films and any movies that had “0” as the production budget (i.e. incomplete records). Movies with domestic sales over $0: some of the movies were international releases with $0 in domestic sales, and we wanted to remove these because Microsoft will presumably want a local, English-language box office release as its first film. Movies released after 2012: we felt we should focus our analysis on the most recent 10-year period.
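Here is a small sketch of how those three filters might look as a single boolean mask. The sample rows and column names are hypothetical, and the money columns are assumed to already be numeric.

```python
import pandas as pd

# Invented sample data; column names are assumptions for illustration.
movies_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "production_budget": [50_000_000, 0, 30_000_000, 80_000_000],
    "domestic_gross": [120_000_000, 1_000_000, 0, 200_000_000],
    "release_year": [2015, 2016, 2018, 2010],
})

# Combine the three conditions; each comparison needs its own
# parentheses because & binds tighter than > in Python.
mask = (
    (movies_df["production_budget"] > 10_000_000)  # bigger-budget films only
    & (movies_df["domestic_gross"] > 0)            # had a domestic release
    & (movies_df["release_year"] > 2012)           # released after 2012
)
filtered = movies_df[mask].reset_index(drop=True)
print(filtered)
```

In this toy data only movie "A" passes all three conditions: "B" fails on budget, "C" on domestic sales, and "D" on release year.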

We then analyzed the resulting dataset for trends and relationships to profitability metrics with genres, the month of release, and ratings.
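To give a feel for this kind of analysis, here is a hedged sketch of grouping profitability by release month and by genre. The formulas for net income and ROI, the use of the median, the comma-separated genre strings, and all of the sample values are assumptions made for the example.

```python
import pandas as pd

# Invented sample data; column names and values are illustrative only.
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Action,Sci-Fi", "Comedy", "Action"],
    "release_date": ["2015-06-12", "2016-11-04", "2017-06-30"],
    "production_budget": [100_000_000, 20_000_000, 50_000_000],
    "worldwide_gross": [300_000_000, 80_000_000, 100_000_000],
})

# Assumed definitions: net income = gross - budget, ROI = net income / budget.
movies["net_income"] = movies["worldwide_gross"] - movies["production_budget"]
movies["roi"] = movies["net_income"] / movies["production_budget"]
movies["release_month"] = pd.to_datetime(movies["release_date"]).dt.month

# Median profitability per release month.
by_month = movies.groupby("release_month")[["net_income", "roi"]].median()

# A movie can list several genres, so split the string and give each
# genre its own row before grouping.
by_genre = (
    movies.assign(genre=movies["genres"].str.split(","))
    .explode("genre")
    .groupby("genre")[["net_income", "roi"]]
    .median()
    .sort_values("roi", ascending=False)
)
print(by_month)
print(by_genre)
```

The median is used here because a few blockbuster outliers can pull a mean far away from what a typical movie earns; a mean would be an equally valid choice depending on the question.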

Our analysis resulted in the following visualizations and underlying observations:

Genres vs. Profitability image:

Release Month vs. Profitability image:

Based on our data analysis and the visualizations above, our three recommendations are as follows. First, a Sci-Fi / Action-Adventure movie with a release date in June or July. Microsoft should produce a big, action-packed sci-fi / action-adventure movie based on existing IP. We think “Gears of War” would be a great first movie. They could consider giving it a slight horror element, a genre which also does well overall. This would also allow Microsoft to feature one of its own very successful IP products.

The movie should be released in or around June or July (the big US summer season). This is when many of the big Marvel, DC, and Star Wars movies are released, so Microsoft may want to stay away from holiday weekends to avoid competing with those established franchises.

Second, we recommend an Animated / Family movie with a December release. Microsoft should produce a family-friendly animated movie that will expand its audience to kids and others who are not interested in the “Gears of War” franchise. This movie should be released in December, when families are together for the holidays.

Lastly, we recommend a Comedy (possibly romantic) movie with a November release. Microsoft should invest in a comedy, potentially a rom-com. This would again expand its following and possibly attract new viewers (more women, couples, etc.). This movie should be released in November, which is a very successful month in terms of ROI and net income.
