Teaching Data Analysis with Real Estate Data

文章總覽

Teaching Data Analysis with Real Estate Data

Tingting (Rachel) Chung, PhD | William & Mary | 撰文 2022-07-14

#EN #175期

The good students push themselves through a dreadful collection of Greek letters, memorize an array of formulas, and pass an exam. They may have even earned an A. How much they actually remember a few years later, not to mention how much they can actually apply their statistical knowledge personally or professionally, is another story.

（封面出處：Unsplash）

In this era of big data, everyone wants to have some basic knowledge in data analysis. Big data, data analysis, artificial intelligence, neural networks, and deep learning are all increasingly important skills in this data intensive world. Serving as professor of the MBA core course on data analysis at College of William & Mary, the oldest public university in the United States, I often need to wrestle with ways of introducing core data analysis concepts to my students from diverse professional backgrounds in the most time efficient manner.

Statistics is a branch of mathematics that derives its beauty from elegant and concise mathematical equations. It is, however, also an academic topic about which many people feel ambivalent. They are told it is important, and many are required to take a statistics course. The good students push themselves through a dreadful collection of Greek letters, memorize an array of formulas, and pass an exam. They may have even earned an A. How much they actually remember a few years later, not to mention how much they can actually apply their statistical knowledge personally or professionally, is another story.

These abstract representations can also be so terse that most people have trouble decoding them, if the concepts are presented as equations and symbols only. Is deviation the same as standard deviation ?? How is standard deviation different from standard score Z? What about standard error? Is that a special type of error ?? Isn’t error also called residual? Why does everything also have a Greek letter name? Most importantly, what is the point of standard deviation? Do businesses actually make use standard deviation to make money? As a result, much of the effort in learning statistics may be spent on decoding and applying mathematical equations and very little on its practical use. This is unfortunate because statistics is such a practical tool used in so many ways. For example, quality control in manufacturing, Netflix recommendations, and Google ads are all made possible by statistical analyses.

Having taught data analysis for years, I have noticed that classroom exercises using real estate data from the Zillow website are easy and effective tools for introducing core data analysis concepts to students. I have designed classroom activities using real estate data from the real estate website Zillow to illustrate core concepts of data analysis. Most of my graduate students have taken some form of statistics course at the undergraduate or high school level. However, they usually have retained very little and struggled to articulate the definition or use cases of basic concepts such as standard deviation. Therefore, I have assumed no prior prerequisite knowledge when designing these activities. These students have been introduced to probability concepts prior to taking my courses.

The first assignment in my MBA Data Analysis and MSBA Machine Learning courses is always for each student to go house shopping by choosing a real estate property in the local area. Students enter data points about their chosen properties into a survey. All of the student entries are compiled into a property dataset that we can then use for class discussions and exercises. Table 2 shows sample records from the most recent dataset¹. The survey questions include both categorical variables (e.g., ZIP code) and numeric variables (e.g. Price). Price, like most financial data, is always skewed, which gives us the opportunity to discuss transformation. Therefore, we can use the same dataset to run a wide range of analyses by combining different categorical and numeric variables. We would spend the next month or so calculating numbers and analyzing the dataset in different ways. In fact, we can get through an entire semester of the introductory Data Analysis course using the same dataset and never run out of ideas.

FName	ZIP code	Property Type	Year	Parking	Miles	Price	Zestimate	Zillow Days	Taxes	BED	BATH
Alexis	23185	House	1999	4	4	355000	394200	7	2608	3	3
Allan	23188	House	1995	2	5.6	1300000	1283200	11	5007	4	4
Andi Muhammad Farid	23188	House	2017	1	3.8	349500	355300	105	2126	4	4
Andres	23188	House	2002	3	6.2	874000	882500	75	7011	3	4
Andrew	23185	House	1964	2	5.1	398000	428200	2	3087	3	3
Andy	23188	House	2002	2	5.8	899000	887400	4	6715	5	5
Antoine	23185	House	2006	4	3.5	895300	883700	19	4763	4	4
Connor	23188	House	2005	2	6.1	999999	994900	47	7143	4	5
Brandon	23185	House	2015	3	6.8	2100000	2040200	115	12251	4	5
Table 2. Sample Data Records

Student names are always included in the dataset. Seeing their own names associated with data records in the class dataset allows students to make personal connections to data, and to have tangible references to otherwise very abstract concepts, such as deviations and Z scores.

During the first week, students are introduced to basic descriptive statistics of numeric variables. Figure 1 illustrates how calculations of central tendency are set up on a shared Google Sheets file. These calculations using formulas are complemented with an “arts and crafts” exercise. Students are given round-shaped sticky notes, and are instructed to write down their names, raw scores (e.g., property age), deviations, and Z scores (i.e., standard score.) I demonstrate how to compute these scores on the Google Sheets file, and then students can find their own scores and copy them onto their own “data dots.” Questions to ask the class include: Who has a Z score of 1.96? Who is above 3? Below -3? Who has zero scores? Who is inside the 95% confidence interval? Who is outside? The Google Sheets file is accessible to all students (in view only mode) so they can follow along, inspect the formulas and consider how their own records contribute to the class statistics and models.

I always invite the entire class to come up to a whiteboard and to construct a histogram together using their data dots (see Figure 2.) This exercise is dynamic, fun, and participatory. The physical experience of constructing a data dot individually, and then a histogram together as a class, reinforces basic but important statistical concepts discussed above. Students can clearly see where they are located on the distribution, how their positions on the histogram are reflected by their Z scores, and how deviations tell us about their relationships to the class mean. Together, the class can also compute variance, standard deviation, skewness, and other properties that belong to the entire class sample rather than individual records. This is particularly effective in helping students distinguish linguistically confusing terms, such as deviation (of a case) vs. standard deviation (of a sample), and standard score (of a case) vs. standard deviation (of a sample.) It also creates a deep appreciation of the relationship between a case, and a dataset. After the hands-on activity, I demonstrate how to create the histogram on the computer using a variety of software programs, such as Google Sheets, JASP, and Tableau. Students can then clearly see that the computer programs simply automated the manual process of putting together a histogram.

When discussing correlation, again I ask students to make “data dots,” but this time with their own data records for two, instead of one, variables (e.g., number of bedrooms, or “BED”, and number of bathrooms, or “BATH.”) The students would also write down their own Z scores for these two variables, and multiply them together. The class can then look at the Google Sheets to see how to add all of these scores up together, and divide the total by (sample size -1) to obtain Pearson’s Correlation r. Students are invited to put up their data dots on the whiteboard to construct a class scatterplot together. We can then use the scatterplot to discuss regression, residuals, model fit, and other important concepts of predictive modeling (see Figure 3). Again, I follow up with a demonstration of calculating correlation and creating scatterplots using Google Sheets. See Figure 4 for an example.

The real estate dataset is a good teaching tool for two key reasons. (1) The business domain knowledge is easily accessible to the general population. For example, most people intuitively understand that bigger houses (i.e., a larger square footage) will probably be more expensive, and adding a bathroom is probably going to increase the value of the property. Also, (2) relationships between the variables are usually very strong and predictable. For example, the number of bedrooms is always highly correlated with the number of bathrooms, and so we can always use them to discuss multicollinearity. Each additional bedroom is always going to add a significant chunk of value, and so we can always count on the beta coefficient of the regression model to be significant and positive. Therefore, instructors do not need to worry whether the statistical magic will work again in a new semester or not. The real estate dataset is also excellent for teaching prescriptive analytics. For example, students can determine which ZIP code has higher property values, by running T tests. Students can also determine which property type (house vs. condo) is older.

Most importantly, Zillow uses their dataset to produce Zestimate, a real-world application of predictive modeling, or machine learning predictions, for estimating property valuation (Schneider 2019). Conceptually, Zestimate is similar to the notion of home appraisals, which are intuitive for most people, including our students, to understand. However, few would immediately feel that they know how to create the estimates on their own. When the students realize that they can build their own Zestimate models after having learned predictive modeling, they often get a great sense of accomplishment. The class can build their own Zestimate equations, make their own predictions, and calculate individual students’ residuals (i.e., the amount of mispredictions. These models are very easy to interpret. For example, the beta coefficient shows how much the price would increase if the house had an additional bedroom, or how much the price would drop, if the house moves from one ZIP code to another. These simple, vivid, and intuitive exercises make the rather abstract concepts and Greek letters (e.g., ?) easier to digest, remember, and apply.

This manual process is not only critical for illustrating the inner works of predictive models, and the method of evaluating predictive model performance, it is also crucial for students to see how predictive models work in the real world, when we visit the Zillow Research site² together to study additional explanations of Zestimate. Again students feel a great sense of satisfaction when they realize that they can actually understand all the technical jargon that explains how Zestimate works!

To sum up, data analysis, commonly known as statistics, can be intimidating due to its numerous jargon, symbols, and obscure terminologies. It is easy and fun to make these abstract and technical concepts much more concrete and approachable by having students participate in the process of putting together a real estate dataset manually.

¹The dataset and the associated analyses are available upon request.

²https://www.zillow.com/research/

返回列表

編輯室推薦閱讀

用房地產學習數據分析

心之所向，吾之所往：三次轉行的權衡分析

美最高法院推翻墮胎權！搜尋紀錄等數據將成為法院定罪證據

熱門閱讀

主編的話

主編的話｜你的名字，在科技進步的每一頁

科技是工具，但真正賦予科技方向與價值的，始終是人。女性，正用自己的方式，定義科技的模樣。當我們回首來時路，值得驕傲的或許不是走得有多快、多遠，而是始終沒有忘記，我們為何出發。也沒有忘記在改變世界的路上，我們願意讓自己的堅持，成為他人勇敢前行的起點。

共鳴

謝謝那個勇敢堅持的自己：寫給科技路上的妳

在科技產業這個快速變遷且傳統上以男性為主的領域中，女性科技人正展現出無比的韌性與獨特影響力。

日常

科技的力量帶來醫學美容新篇章

從事皮膚專科醫師與醫學美容工作十四年，正好遇上醫美技術、產品、儀器都快速發展的時代，除了不斷吸收新知與精進技術，也希望把近年來醫美產業常見的各種範疇整理並介紹給大眾認識。

職涯

在瘋狂變化的矽谷，不變的是驅使自我成長的好奇心

前陣子高中同學聚會，收穫了許多大家分享多元且具有啟發性的生涯發展，不禁讓我思索自己一路走來是否做了什麼不一樣的選擇，而成就了今天的自己？還是我只是那『努力就會成功』迷思的具體體現？回首自從高中與電腦程式結緣，爾後一路求學求職都在科技領域，這將近三十載的日子裡，埋頭苦幹從來都不是我的座右銘。相反地，純粹抱持著努力工作的想法實在太過殘酷，那更像是一場苦刑，而非能夠長久堅持的常態。那什麼是長久的呢？

趨勢

數位時代父母如何協助孩子管理網路與3C產品使用

「兒童手機使用玩到失控，父母不給手機就跳樓」「沉溺手機降低智商，專注力變差」「網路交友陷阱，大學生裸照遭詐騙二十萬元」這讓您憂心嗎？

共鳴

在科學教育路上的小提醒

陳宜欣提醒科學教育不只是實驗設備、精彩設計或升學捷徑，而是一種常保好奇心、卻又虛懷若谷的人生態度。

《女科技人電子報》轉載規則

文章為共有版權，轉載前須取得《台灣女科技人電子報》與作者同意。
(可由《台灣女科技人電子報》協助聯繫作者)
《台灣女科技人電子報》有三天優先刊登權;若僅轉載三分之一以內之內容，於文章刊出後隔日即可轉載。
轉載需附上文章作者姓名。(若圖片需要轉載的話，也請附上攝影的姓名)
文末請註明【《台灣女科技人電子報》授權刊登】的字樣，並附上文章連結。
文章標題維持不變。
全文轉載、詳細內容使用與授權方式，請參考內容使用與著作權說明