[Artificial Intelligence] End-to-end ML project #AI



End-to-end ML project

ํ•ด๋‹น ์ž๋ฃŒ๋Š” ๊ฐ•์˜ ํ•™์Šต์ž๋ฃŒ์ž…๋‹ˆ๋‹ค. ๊ฐ•์˜ ์ด์™ธ์˜ ๋‚ด์šฉ์€ ๊ฒ€์ƒ‰ ๋ฐ ๋‹ค์–‘ํ•œ ์ž๋ฃŒ๋ฅผ ํ†ตํ•ด ๊ณต๋ถ€ํ•˜๋ฉฐ ์ •๋ฆฌํ•œ ๋‚ด์šฉ์˜ ํฌ์ŠคํŒ…์ž…๋‹ˆ๋‹ค.

#AI #์ธ๊ณต์ง€๋Šฅ #๊ธฐ๊ณ„ํ•™์Šต๊ณผ์ธ์‹ #chatgpt #python #study #0407


 

Main steps for an end-to-end ML project (steps 1–4 are about the data)

1. Look at the big picture.

2. Get the data.

3. Discover and visualize the data to gain insights.

4. Prepare the data for Machine Learning algorithms.

5. Select a model and train it.

6. Fine-tune your model.

7. Present your solution.

8. Launch, monitor, and maintain your system.

 

 


 

๊ธฐ๊ณ„ ํ•™์Šต์ด ์–ด๋–ค ๊ณผ์ •์„ ํ†ตํ•ด์„œ ์ด๋ฃจ์–ด์ง€๋Š”์ง€ ์•Œ์•„๋ณด์ž.

 

๊ธฐ๊ณ„ํ•™์Šต์˜ ๊ณผ์ •

 

1. Frame the Problem (supervised, unsupervised, reinforcement learning)

When you are given a problem, ask detailed questions about it. Define precisely what the business objective is and what the ultimate purpose of the model is.

 

์ง€๋„ํ•™์Šต? ๋น„์ง€๋„ํ•™์Šต? ๊ฐ•ํ™”ํ•™์Šต ์–ด๋Š ๊ฒƒ์„ ์‚ฌ์šฉํ•ด์•ผ ํ• ์ง€?

๋ถ„๋ฅ˜? ํšŒ๊ท€ํ•™์Šต์ธ์ง€? ๋ฌด์Šจ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•ด์•ผํ• ์ง€ ์ƒ๊ฐํ•ด๋ณธ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„๋Š” ์„ฑ๋Šฅ ์ธก์ •์„ ์œ„ํ•ด ์„ ํƒํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

 

 

2. Get the data.

์ด 4๊ฐ€์ง€์˜ ๋‹จ๊ณ„๋กœ ๊ตฌ์„ฑํ•  ์ˆ˜ ์žˆ๋‹ค.

2.1. Create Workspace

2.2. Download the data.

2.3. Take a Quick look at the data structure.

2.4. Create a Test set

 

Decide where the work will happen (set up the workspace), fetch the data, get a feel for its structure, and then create a test set.

 

Write a loading function that builds the csv_path and returns the data read from it.
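A minimal sketch of such a loader, assuming the data lives in a CSV file (the directory and file name below are hypothetical placeholders):

import os
import pandas as pd

def load_data(data_dir="datasets/housing", filename="housing.csv"):
    # build csv_path and return the file contents as a pandas DataFrame
    csv_path = os.path.join(data_dir, filename)
    return pd.read_csv(csv_path)

data = load_data()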

 

 

Take a quick look at the data structure with functions such as .head() and .describe().
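For example, continuing with the DataFrame loaded above:

data.head()      # first five rows of the data
data.describe()  # summary statistics (count, mean, std, min, quartiles, max) for numeric columns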

 

 

 

Write a function that splits the data into train and test sets at a fixed ratio. At this point, decide together with a domain expert which variables matter most for the data.
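A minimal sketch using scikit-learn's train_test_split; the 20% test ratio and random_state are arbitrary illustrative choices, and income_cat is a hypothetical column standing in for the important variable:

from sklearn.model_selection import train_test_split

# hold out 20% of the rows as the test set; a fixed random_state makes the split reproducible
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

# if a key categorical variable was identified with a domain expert, stratify on it so both
# sets preserve its distribution, e.g.:
# train_set, test_set = train_test_split(data, test_size=0.2, stratify=data["income_cat"], random_state=42)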

 

 

 

 

3. Discover and visualize the data to gain insights.

 

 

์ด๋•Œ ๋ฐ์ดํ„ฐ๋ฅผ ์ฝ์–ด๋‚ด๋Š” ๋Šฅ๋ ฅ, ๋ฐ์ดํ„ฐ ๋ฆฌํ„ฐ๋ฆฌ์‹œ๊ฐ€ ์ค‘์š”ํ•  ์ˆ˜ ์žˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด ๋Œ€๊ฐ์„ ์œผ๋กœ ์˜ฌ๋ผ๊ฐ€๊ฑฐ๋‚˜ ๋‚ด๋ ค๊ฐ€๋Š” ๋ฐฉํ–ฅ์ด ์žˆ๋‹ค๋ฉด ๋‘ ๋ณ€์ˆ˜๋Š” ์ƒ๊ด€๊ด€๊ณ„๊ฐ€ ์žˆ๋‹ค๊ณ  ํ•  ์ˆ˜ ์žˆ๋‹ค.

 


 

4. Prepare the data for ML algorithms.

 

Data Cleaning (data preprocessing)

 

Remove data that is not needed. Then convert categorical (text) attributes into numeric data such as integers or floating-point values; this converting is called encoding. In other words, convert and unify all the types up front so that no type-related errors occur later during training.
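A minimal sketch with scikit-learn (the column names are hypothetical placeholders):

from sklearn.preprocessing import OneHotEncoder

data = data.drop(columns=["unneeded_column"])   # drop data we do not need

# encode a text/categorical column into numeric 0/1 columns, one per category
cat_encoder = OneHotEncoder()
encoded = cat_encoder.fit_transform(data[["ocean_proximity"]])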

 

 

 

 

Feature Scaling (transformation)

๋ฌธ์ œ ๋ฐœ์ƒ: ๋ฐฉ์˜ ๊ฐœ์ˆ˜๊ฐ€ 6~39320 ์ค‘์•™๊ฐ’์˜ ๋ฒ”์œ„๊ฐ€ 0~15์ด๋‹ค.

์ด๋Ÿฌํ•œ ์†์„ฑ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•œ ๊ณตํ†ต์ ์ธ ๋ฐฉ๋ฒ•์ด ๋‘๊ฐ€์ง€๊ฐ€ ์žˆ๋‹ค.

 

- min-max scaling (normalization; values end up ranging from 0 to 1): rescale values so they all fall between 0 and 1

Every value is shifted and rescaled by subtracting the minimum and dividing by the range (max minus min).

- standardization (unit variance, not bounded)

Scale using the mean and standard deviation: subtract the mean and divide by the standard deviation, giving zero mean and unit variance. Unlike min-max scaling, the resulting values are not bounded to a fixed range, and standardization is much more robust to data that contains outliers.
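A minimal sketch of both methods with scikit-learn:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

num_data = data.select_dtypes(include="number")   # scale only the numeric columns

# min-max scaling: (x - min) / (max - min), every column ends up in [0, 1]
minmax_scaled = MinMaxScaler().fit_transform(num_data)

# standardization: (x - mean) / std, zero mean and unit variance, not bounded to a fixed range
std_scaled = StandardScaler().fit_transform(num_data)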

 

 

 

5. Select a model and train it

Linear Regression: linear regression is a machine learning algorithm that models a linear relationship between the input variables X and the output variable Y.
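In equation form, a linear model predicts the output as a weighted sum of the input features plus a bias term:

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

where x_1, ..., x_n are the feature values, the θ values are the parameters learned during training, and ŷ is the prediction.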

 

 

from some_library import SomeModel

my_model = SomeModel()             # create the model object

my_model.fit(X_train, y_train)     # train the model on the training features and labels

my_model.predict(X_test)           # predict outputs for new inputs

 

 

 

๋” ์ข‹์€ ๋ชจ๋ธ์„ ์ฐพ์•„๊ฐ€์•ผํ•œ๋‹ค.

 

 

6. Fine-tune your model

๋ฌด์Šจ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ• ์ง€? parameter๋Š” ๋ญ˜ ํ• ์ง€ ์ •ํ•˜๋Š” ๋‹จ๊ณ„์ด๋‹ค.

 

1. Grid Search

์žฅ์  : ์•Œ์•„์„œ ๋” ์ข‹์€ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ฐพ์•„์ค€๋‹ค.

๋‹จ์  : ์˜ค๋ž˜๊ฑธ๋ฆฐ๋‹ค.

 

 

2. Randomized Search

Use this when there are too many hyperparameters (or too many candidate values) to try every combination.
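A minimal sketch with RandomizedSearchCV, which samples a fixed number of random combinations instead of trying them all (the distributions are illustrative):

from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {"n_estimators": randint(10, 200), "max_features": randint(2, 9)}

# n_iter bounds how many random combinations are evaluated, keeping the search time under control
rnd_search = RandomizedSearchCV(RandomForestRegressor(random_state=42), param_distributions,
                                n_iter=10, cv=5, random_state=42)
rnd_search.fit(X_train, y_train)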

 

 

 

3. Ensemble methods (combining models)

Try to combine the models that perform best.

The group will often perform better than the best individual model.
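A minimal sketch of one way to combine models, a VotingRegressor that averages the predictions of its members (the member models are just examples):

from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# the ensemble averages the individual predictions; the group often beats any single member
ensemble = VotingRegressor([
    ("lin", LinearRegression()),
    ("rf", RandomForestRegressor(random_state=42)),
])
ensemble.fit(X_train, y_train)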

 

 

8. Launch, monitor, and maintain your system.

Deployment is not the end of the story: keep monitoring and improving the system.

 


 

To summarize, the process is as outlined above. Keep the order in mind and try applying it.