 Projects About Data Analysis 

During my undergraduate and graduate studies, I took many courses and worked on many projects related to data analytics, which made my knowledge practically useful. Both my courses and my research projects showed me the world's need for data analytics and urban spatial planning, which provide important technical support for social progress and development. I was happy to find the relationships hidden in the chaotic process of data exploration and to apply them to realistic scenarios.

If you need my data or code for these projects, please send me an email describing your intended use and I will get back to you.

Analysis and Prediction of Wildfire Locations in California

This is a fire location prediction project for California wildfires that I conducted as part of my studies at the University of Pennsylvania. In this project, I collected wildfire data from previous years and integrated other key predictors (elevation, temperature, precipitation, land cover, and the distribution of urban and rural areas) to build a unique logic model. The model includes a cost-benefit analysis that is used to select the optimal threshold and minimize the predicted losses from wildfire events.
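As a rough illustration of the threshold selection step, here is a minimal sketch. It is not the project's code: the predicted probabilities, the observed outcomes, and the cost assigned to each outcome are all invented for the example, and it assumes the model outputs a fire probability for each location.

```python
def total_cost(probs, actual, threshold, costs):
    """Sum the cost of every prediction outcome at a given threshold."""
    total = 0.0
    for p, burned in zip(probs, actual):
        flagged = p >= threshold
        if flagged and burned:
            total += costs["tp"]   # fire correctly anticipated
        elif flagged and not burned:
            total += costs["fp"]   # resources spent where no fire occurred
        elif not flagged and burned:
            total += costs["fn"]   # fire missed, the most expensive case
        else:
            total += costs["tn"]   # nothing flagged, nothing burned
    return total


def best_threshold(probs, actual, costs, steps=100):
    """Try thresholds 0.00, 0.01, ..., 1.00 and keep the cheapest one."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda t: total_cost(probs, actual, t, costs))


# Hypothetical predictions and outcomes (1 = a fire occurred).
probs = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
actual = [0, 0, 1, 0, 1, 1]
# Hypothetical cost units per outcome.
costs = {"tp": 10.0, "fp": 10.0, "fn": 100.0, "tn": 0.0}

threshold = best_threshold(probs, actual, costs)
print(threshold, total_cost(probs, actual, threshold, costs))
```

Because a missed fire is assumed to cost far more than a false alarm, the cheapest threshold in this toy example sits well below 0.5.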

The model was evaluated with k-fold cross-validation and spatial cross-validation to assess the accuracy of its predictions, and the results were presented in a variety of visualizations.
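The difference between the two validation schemes can be sketched as follows. This is a simplified illustration under my own assumptions: the region labels are hypothetical, and spatial cross-validation is shown here as "hold out one region at a time", which is one common form of it.

```python
import random


def k_fold_splits(n_rows, k, seed=0):
    """Random k-fold: shuffle the row indices, then hold out one fold at a time."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        yield train, held_out


def spatial_splits(region_ids):
    """Spatial variant: hold out every row of one region at a time."""
    for region in sorted(set(region_ids)):
        test = [i for i, r in enumerate(region_ids) if r == region]
        train = [i for i, r in enumerate(region_ids) if r != region]
        yield train, test


# Hypothetical region label for each observation.
regions = ["north", "north", "coast", "coast", "valley", "valley"]

random_folds = list(k_fold_splits(len(regions), k=3))
spatial_folds = list(spatial_splits(regions))

for train, test in spatial_folds:
    print("held-out region:", regions[test[0]], "test rows:", test)
```

In the random version, neighbouring observations can land in both the training and the test set. The spatial version tests whether the model still works in a region it has never seen.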

For this prediction model, I conceived an application for California wildfires that can query past wildfire records and also show the likelihood of future fires in an area. There is a YouTube video introducing the final application and its policy uses, presented by one of my teammates.

Time-Space Prediction of Bike Share Demand in Washington, DC

One of the most difficult operational problems for urban bike share systems is the need to ‘re-balance’ bicycles across the network. Bike share is not useful if a dock has no bikes to pick up, nor if there are no open docking spaces to deposit a bike. Re-balancing is the practice of anticipating (or predicting) bike share demand for all docks at all times and manually redistributing bikes to ensure a bike or a docking place is available when needed.

In this project, I chose Washington, D.C. as my study area and February to March 2022 as the data collection period. I predicted bike usage at bike share stations in different areas over time using a combination of time-lagged and spatial information, with a mean absolute error (MAE) of approximately 0.4. Cross-validation indicated that the model generalizes well, which suggests that it may continue to provide reliable predictions for bike rebalancing.

The greatest value of this predictive model is its ability to improve the balance of supply and demand for bike share. In many cases, moving and placing bikes across stations requires manual redistribution, which must be supported by the policies of the bike share system and often consumes a lot of time and labor. To help achieve rebalancing, I made some suggestions based on this data analysis project.
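To show what time-lagged features and the MAE look like in practice, here is a minimal sketch. The trip counts are made up, the function names are my own for this example, and the "prediction" is only a naive baseline (next hour equals the previous hour), not the model used in the project.

```python
def add_time_lags(counts, lags):
    """Pair each observation with the values seen `lag` steps earlier."""
    rows = []
    max_lag = max(lags)
    for t in range(max_lag, len(counts)):
        features = [counts[t - lag] for lag in lags]
        rows.append((features, counts[t]))
    return rows


def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical hourly trip counts at one station.
hourly_trips = [2, 3, 5, 4, 6, 8, 7]

rows = add_time_lags(hourly_trips, lags=[1, 2])

# Naive baseline: predict that this hour equals the previous hour (lag 1).
actual = [target for _, target in rows]
baseline = [features[0] for features, _ in rows]
mae = mean_absolute_error(actual, baseline)
print(rows[0], mae)
```

A real model would use the lag columns as inputs together with spatial information, and its MAE would be compared against a baseline like this one.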

Targeting A Housing Subsidy

The Department of Housing and Community Development (HCD) is considering a more proactive approach for targeting homeowners who qualify for a home repair tax credit program. This tax credit program has been around for close to twenty years, and while HCD tries to proactively reach out to eligible homeowners every year, the uptake of the credit is woefully inadequate. Typically, not all eligible homeowners who are contacted and enter the program ultimately take the credit. In order to target this policy more precisely, we need to build analytical models based on household information and train them on data to get the best classifier.

The data include basic information such as age, employment status, marital status, and education. There are also more detailed data, such as whether there is a mortgage, whether the owner has a lien on the property, and the result of the last outreach campaign. Not every variable is used in the final model, and some need to be recategorized or dropped along the way. I think feature selection is very important: without good features in the regression, there is no way to make better predictions. In addition, feature engineering must be done carefully when restructuring and removing variables.
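The kind of recategorizing and dropping described above might look like the sketch below. The field names, category labels, and age bands are hypothetical and chosen only for illustration. They are not the variables of the actual dataset.

```python
def engineer_features(record):
    """Return a cleaned copy of one household record (hypothetical fields)."""
    cleaned = dict(record)

    # Drop an identifier that carries no predictive information.
    cleaned.pop("record_id", None)

    # Collapse detailed education levels into a single yes/no feature.
    degree_levels = {"bachelor", "master"}
    cleaned["has_degree"] = cleaned.pop("education") in degree_levels

    # Replace exact age with a coarse age band.
    age = cleaned.pop("age")
    if age < 35:
        cleaned["age_group"] = "under_35"
    elif age < 65:
        cleaned["age_group"] = "35_to_64"
    else:
        cleaned["age_group"] = "65_plus"

    return cleaned


household = {
    "record_id": 17,
    "age": 42,
    "education": "bachelor",
    "has_mortgage": True,
}
features = engineer_features(household)
print(features)
```

The original record is left unchanged, so different groupings can be tried and compared without reloading the data.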

Home Price Prediction Model of Mecklenburg County

Zillow has realized that its housing market predictions are not as accurate as they could be because they do not factor in enough local intelligence. As such, they have asked you and your partner (as well as several other teams) to build a better predictive model of home prices for Mecklenburg County, NC. Mecklenburg County is home to the metropolitan area of Charlotte, the most populous city in the State of North Carolina. Home to several vibrant hubs of recreation and culture, from an up-and-coming food scene to the NASCAR Hall of Fame, the area draws tourists and new residents from across the world.

The most important thing in this project is to find the best predictive "features" or variables and inject enough predictive power into the model to make good predictions without overfitting the training data. In addition to features of the house itself, such as house size, number of bedrooms, and wall material, we chose features related to the environment in which the house is located, such as median household income, crime density, educational background, and distance to the nearest medical institution. I visualized some geographically relevant characteristics to allow direct comparison with the geographic distribution of house prices.

After obtaining the predicted distribution of house prices, I also mapped the absolute errors and used them to assess the accuracy of the results. I then mapped the areas of high and low house prices, using an income level of $32,322 as the cut-off point.
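The error assessment can be sketched roughly as below. The home records are invented, and the sketch assumes the $32,322 cut-off is used to split areas into a lower-income and a higher-income group so that the average absolute error can be compared between them. That is one reading of the step above, not a record of the exact procedure.

```python
INCOME_CUTOFF = 32322  # cut-off point quoted in the text


def mean_error_by_income(homes, cutoff=INCOME_CUTOFF):
    """Average absolute price error for areas below and above the cut-off."""
    groups = {"low_income": [], "high_income": []}
    for home in homes:
        key = "high_income" if home["income"] >= cutoff else "low_income"
        groups[key].append(abs(home["observed"] - home["predicted"]))
    return {k: sum(v) / len(v) for k, v in groups.items() if v}


# Hypothetical homes: area income, observed sale price, predicted price.
homes = [
    {"income": 25000, "observed": 180000, "predicted": 200000},
    {"income": 30000, "observed": 210000, "predicted": 195000},
    {"income": 60000, "observed": 450000, "predicted": 420000},
    {"income": 80000, "observed": 520000, "predicted": 530000},
]

errors = mean_error_by_income(homes)
print(errors)
```

If the two averages differ a lot, the model predicts one kind of neighbourhood better than the other, which is worth checking before relying on it everywhere.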
