Day 3, Part 1: Data cleaning

파이썬정복
|2024. 1. 19. 18:41

Summary of the class material

 

Sberbank Data

In this notebook we will look at data from the Russian bank Sberbank on real estate sales. The data was published as part of a Kaggle Competition and contains 30471 entries and 292 features. Each row corresponds to a real estate transaction, enriched with macroeconomic key figures. The large number of columns makes data cleaning much more interesting, as we cannot simply check each column "manually", but have to proceed automatically.

 

 

 

Method 1 : Heatmap
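A minimal sketch of the heatmap idea, assuming a small stand-in DataFrame named df instead of the real Sberbank file (the values are invented for illustration). The boolean mask from isna() is what gets plotted. The helper name plot_missing_heatmap is hypothetical, and it uses matplotlib's imshow, which is not necessarily the plotting call used in class.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few Sberbank columns (values are invented).
df = pd.DataFrame({
    "full_sq": [43.0, 34.0, np.nan, 89.0],
    "life_sq": [27.0, np.nan, np.nan, 50.0],
    "floor": [4.0, 3.0, 2.0, np.nan],
})

# True where a value is missing, False otherwise.
missing_mask = df.isna()


def plot_missing_heatmap(mask):
    """Draw a rows x columns grid where missing cells get their own color."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots()
    # Cast to int so missing cells are 1 and present cells are 0.
    ax.imshow(mask.to_numpy().astype(int), aspect="auto", interpolation="nearest")
    ax.set_xticks(range(mask.shape[1]))
    ax.set_xticklabels(mask.columns, rotation=90)
    ax.set_ylabel("row")
    return fig
```

Calling plot_missing_heatmap(missing_mask.iloc[:, :30]) would limit the picture to the first 30 columns, which is one way to keep the plot readable on a wide data set.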

 

 

Method 2 : List with relative frequency

Because the data set has too many columns to inspect them all in a heatmap, we instead list which columns contain missing values, together with the relative frequency of missing values in each column:

 

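Loop over the columns, binding each one to a col variable, and fill in the result with loc.

Taking the NumPy mean of the missing-value mask gives the share of missing values in that column.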
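The loop described above might look roughly like this. It is a sketch on an invented toy DataFrame, and the variable name missing_share is an assumption, not the name used in class.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Sberbank data (values are invented).
df = pd.DataFrame({
    "full_sq": [43.0, 34.0, 77.0, 89.0],
    "life_sq": [27.0, np.nan, np.nan, 50.0],
    "floor": [4.0, 3.0, 2.0, np.nan],
})

# Percentage of missing values per column, filled in one entry at a time.
missing_share = pd.Series(dtype=float)
for col in df.columns:
    # The mean of a boolean mask is the fraction of True values.
    pct_missing = np.mean(df[col].isna())
    if pct_missing > 0:
        missing_share.loc[col] = pct_missing * 100

# Vectorized equivalent without a loop: df.isna().mean() * 100
```

Here missing_share only lists the columns that actually have gaps, which keeps the output short when there are 292 columns.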

 

Method 3 : Missing Data Histogram
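One way to build such a histogram is to count the missing values per row and then count how many rows share each count. The sketch below uses an invented toy DataFrame; the names missing_per_row and missing_hist are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Sberbank data (values are invented).
df = pd.DataFrame({
    "full_sq": [43.0, 34.0, np.nan, 89.0],
    "life_sq": [27.0, np.nan, np.nan, 50.0],
    "floor": [4.0, 3.0, np.nan, 5.0],
})

# Number of missing values in each row.
missing_per_row = df.isna().sum(axis=1)

# Histogram as a table: index = number of missing values, value = number of rows.
missing_hist = missing_per_row.value_counts().sort_index()
```

With matplotlib installed, missing_hist.plot.bar() would draw the table as a bar chart.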

 

 


 

Dealing with missing values

There are various methods for dealing with missing values, and it is often unclear which one is right. The choice usually has to be made case by case, in close coordination with the business requirements.

 

Method 1 : Remove values

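With dropna(thresh=2), thresh sets a threshold: the minimum number of non-NaN values a row must contain to be kept. Rows with fewer than 2 non-NaN values are dropped. Note that this is not the same as dropping rows that contain 2 or more NaN values.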

Source: https://enjoyiot.tistory.com/entry/02-Missing-Data [Shining...: Tistory]
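A small sketch of how the dropna variants behave, on an invented toy DataFrame. The result names are assumptions chosen for readability.

```python
import numpy as np
import pandas as pd

# Toy data: row 0 is complete, row 1 has one value, row 2 is empty, row 3 has two values.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 1.0],
    "b": [2.0, np.nan, np.nan, np.nan],
    "c": [3.0, 3.0, np.nan, 3.0],
})

rows_complete = df.dropna()            # keep only rows without any NaN
rows_not_empty = df.dropna(how="all")  # drop rows where every value is NaN
rows_thresh2 = df.dropna(thresh=2)     # keep rows with at least 2 non-NaN values
```

Row 3 contains one NaN and survives thresh=2 because it still has two valid values, while row 1 is dropped because it has only one.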

 

Method 2 : Imputation (replacement)

Example
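A minimal imputation sketch, assuming the median for a numeric column and the most frequent value for a categorical column. Both choices and the toy values are assumptions for illustration, not necessarily what the class example used.

```python
import numpy as np
import pandas as pd

# Toy data with one numeric and one categorical column (values are invented).
df = pd.DataFrame({
    "life_sq": [20.0, np.nan, 40.0, 30.0],
    "product_type": ["Investment", "OwnerOccupier", None, "Investment"],
})

imputed = df.copy()

# Numeric column: replace missing values with the median of the observed values.
imputed["life_sq"] = imputed["life_sq"].fillna(imputed["life_sq"].median())

# Categorical column: replace missing values with the most frequent category.
most_frequent = imputed["product_type"].mode()[0]
imputed["product_type"] = imputed["product_type"].fillna(most_frequent)
```

Working on a copy keeps the original df available, so the effect of the imputation can be compared afterwards.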


How should outliers be handled?

In statistics, outliers are data points that are far removed from other observations. These can be the result of a large variance or an error in the data itself. Depending on the type of analysis, it may make sense to find and treat these outliers.

 

Box plot vs. z-score
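The two rules can be compared on a toy series. This sketch assumes the conventional cutoffs (1.5 times the interquartile range for the box plot rule, an absolute z-score above 3), which are common defaults and not values taken from the class material.

```python
import pandas as pd

# Nine ordinary values and one extreme value (invented data).
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 13.0, 12.0, 100.0])

# Box plot rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# z-score rule: flag points more than 3 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 3]
```

On this sample the box plot rule flags 100, but the z-score rule flags nothing: the extreme value inflates the mean and standard deviation it is measured against, so its z-score stays below 3. Quartiles are much less affected by a single extreme point, which is why the two methods can disagree on small samples.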

 


Exercise code review

Question 1. state indicates the status of the apartment and should contain the values 1-4.
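One possible approach to Question 1, sketched on invented toy values. It assumes that anything outside 1-4 is invalid and should be treated as missing; correcting an obvious typo by hand would be an equally valid choice.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the state column; 33 is an invented out-of-range value.
df = pd.DataFrame({"state": [2.0, 3.0, 33.0, np.nan, 1.0, 4.0]})

# Inspect which values occur, including NaN.
state_counts = df["state"].value_counts(dropna=False)

# Assumption: values outside 1-4 are invalid and become NaN.
out_of_range = df["state"].notna() & ~df["state"].between(1, 4)
df.loc[out_of_range, "state"] = np.nan
```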

Question 2. What is the connection between floor and max_floor? Find and correct any inconsistencies.
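A sketch for Question 2, assuming the expected relationship is floor <= max_floor (an apartment cannot sit above the top floor of its building). Setting max_floor to NaN for inconsistent rows is one possible correction, chosen here as an assumption; the toy values are invented.

```python
import numpy as np
import pandas as pd

# Toy data: rows 1 and 3 violate floor <= max_floor.
df = pd.DataFrame({
    "floor": [3.0, 10.0, 5.0, 1.0],
    "max_floor": [9.0, 5.0, np.nan, 0.0],
})

# Comparisons with NaN are False, so rows with a missing max_floor are not flagged.
inconsistent = df["floor"] > df["max_floor"]
n_inconsistent = int(inconsistent.sum())

# Assumption: max_floor is the unreliable value, so mark it as missing.
df.loc[inconsistent, "max_floor"] = np.nan
```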

 

Question 3. Check build_year for incorrect values and correct them in a meaningful way.
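A sketch for the build_year question. The plausible range (MIN_YEAR, MAX_YEAR) and the decision to fill gaps with the median are assumptions made for illustration, and the toy values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for build_year with a few implausible entries.
df = pd.DataFrame({
    "build_year": [1985.0, 0.0, 1.0, 20052009.0, 2014.0, np.nan, 4965.0, 2000.0],
})

# Assumption: years outside this range cannot be real construction years.
MIN_YEAR, MAX_YEAR = 1600, 2020
implausible = df["build_year"].notna() & ~df["build_year"].between(MIN_YEAR, MAX_YEAR)

# Step 1: turn implausible years into missing values.
df.loc[implausible, "build_year"] = np.nan

# Step 2 (one option): fill the gaps with the median of the remaining years.
median_year = df["build_year"].median()
df["build_year_filled"] = df["build_year"].fillna(median_year)
```

Keeping the filled values in a separate column makes it easy to see later which years were observed and which were imputed.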

 

 

Question 4. Remove outliers from life_sq and perform two different types of imputation.
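A sketch for the last question, using the living-area column (named life_sq in the Sberbank data). It assumes the box plot rule (1.5 times the interquartile range) for detecting outliers and uses mean and median imputation as the two variants; these choices and the toy values are assumptions, not the class solution.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the living-area column; 7478 is an invented extreme value.
df = pd.DataFrame({"life_sq": [20.0, 30.0, 25.0, np.nan, 35.0, 40.0, 7478.0, 24.0]})

# Box plot rule: anything beyond 1.5 * IQR from the quartiles counts as an outlier.
q1, q3 = df["life_sq"].quantile(0.25), df["life_sq"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["life_sq"] < q1 - 1.5 * iqr) | (df["life_sq"] > q3 + 1.5 * iqr)

# Replace outliers with NaN so they are imputed like any other missing value.
cleaned = df["life_sq"].mask(is_outlier)

# Imputation variant 1: mean of the remaining values.
mean_imputed = cleaned.fillna(cleaned.mean())

# Imputation variant 2: median of the remaining values.
median_imputed = cleaned.fillna(cleaned.median())
```

Removing the outlier before imputing matters: with 7478 still in the column, the mean would be pulled far above every ordinary value.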

