# AI - Collecting Data
:::info
[TOC]
:::

<br/>
## Preface
The performance of an AI model depends heavily on the **quality and quantity** of its data. When building an AI system, we need to know where to get data and which APIs or tools to use.
<br/>
### 1. Data sources
Data can come from multiple origins, depending on the use case:
1. **Self-Collected Data**
- Manually gathered or generated through surveys, experiments, sensors, or logs.
2. **Open Datasets**
- Public datasets provided by communities.
- e.g. Kaggle, UCI, ImageNet
3. **APIs**
- Access to structured and often real-time data from external services.
- e.g. Twitter API, Google Maps API
4. **Web Scraping**
- Collecting information from websites when APIs or datasets are not available.
> ++Each approach has trade-offs between cost, scalability, reliability, and legality.++
<br/>
### 2. Tools & APIs
Efficient data collection and preparation require the right tools:
- **Collection**:
- APIs (`requests`, `tweepy`, `googlemaps`)
- Web scraping (`BeautifulSoup`, `Scrapy`, `Selenium`, `Playwright`)
- **Storage**:
- Flat files (`CSV`, `JSON`)
- Databases (`SQLite`, `PostgreSQL`, `MongoDB`)
- Cloud storage (`AWS S3`, `Google Cloud Storage`)
- **Cleaning & Preprocessing**:
- `pandas`, `numpy` → For tabular data.
- `OpenRefine` → For large-scale data cleaning.
- `NLTK`, `spaCy` → For text preprocessing.
- `OpenCV`, `Pillow` → For image preprocessing.
> ++Choosing the right data sources and tools lays the **foundation of a reliable AI system**.++
<br/>
## A. Self-Collected Data
Data collected manually or in-house, suited to domain-specific projects where no existing dataset fits.
### 1. Methods
- **Surveys / User feedback**
- **Experimental data** (e.g., sensors, IoT devices; see the sketch after this list)
- **Business systems** (CRM, ERP, log files)
- **Manual annotation**: e.g., image classification, sentiment analysis
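A minimal sketch of logging experimental sensor data to a CSV file; the `read_temperature` function and the file name are placeholders for whatever device or logging setup is actually in use:
```python
import csv
import random
import time
from datetime import datetime

def read_temperature():
    """Placeholder for a real sensor read (e.g., a device SDK or serial port)."""
    return round(20 + random.random() * 5, 2)

# Append timestamped readings to a CSV log (hypothetical file name)
with open("sensor_log.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for _ in range(5):
        writer.writerow([datetime.now().isoformat(), read_temperature()])
        time.sleep(1)
```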
### 2. Pros
- Data is tailored to the task
- High quality (controlled annotation)
### 3. Cons
- High cost (time, labor, devices)
- Hard to scale quickly
<br/>
## B. Open Datasets
Publicly available datasets, often large-scale, used for training, evaluating, and testing models.
### 1. [Kaggle 🔗](https://www.kaggle.com/)
- Largest open data science platform.
- Covers image, text, finance, medical, etc.
- Download with `kaggle` CLI:
```bash
pip install kaggle
kaggle datasets download -d <dataset-owner>/<dataset-name>
```
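After downloading, the archive can be unpacked and loaded with `pandas`; a minimal sketch, assuming the dataset contains a single CSV (the file names below are placeholders):
```python
import zipfile
import pandas as pd

# Unpack the archive downloaded by the Kaggle CLI (hypothetical file names)
with zipfile.ZipFile("dataset-name.zip") as z:
    z.extractall("data/")

df = pd.read_csv("data/train.csv")  # adjust to the actual file inside the archive
print(df.shape)
```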
### 2. Academic Datasets
- **UCI Machine Learning Repository**
- **Google Dataset Search**
- **ImageNet** (image classification)
- **COCO Dataset** (image + annotation)
- **LibriSpeech** (speech recognition)
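Many classic academic tabular datasets can also be pulled programmatically, for example via scikit-learn's `fetch_openml`; a minimal sketch, assuming scikit-learn is installed:
```python
from sklearn.datasets import fetch_openml

# Fetch the classic Iris dataset from OpenML as a pandas DataFrame
iris = fetch_openml(name="iris", version=1, as_frame=True)
print(iris.frame.head())
```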
### 3. Government & Open Data
- [data.gov](https://www.data.gov/) (USA)
- [data.gov.tw](https://data.gov.tw/) (Taiwan)
- [EU Open Data Portal](https://data.europa.eu/) (European Union)
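Many open data portals expose datasets as direct CSV downloads, which `pandas` can read straight from a URL; the URL below is a placeholder, not a real endpoint:
```python
import pandas as pd

# Placeholder URL: replace with a real CSV export link from an open data portal
url = "https://example.org/open-data/some-dataset.csv"
df = pd.read_csv(url)
print(df.head())
```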
<br/>
## C. APIs
### Common sources
- **Social media** (Twitter/X API, Reddit API, YouTube Data API)
- **Finance** (Yahoo Finance API, Alpha Vantage; see the finance sketch below)
- **Maps & Geodata** (Google Maps API, OpenStreetMap API)
- **News** (NewsAPI, GDELT)
### Example - Python Twitter API
```python
import requests

bearer_token = "YOUR_BEARER_TOKEN"
headers = {"Authorization": f"Bearer {bearer_token}"}

# Search recent tweets mentioning "AI" in English, excluding retweets
params = {"query": "AI lang:en -is:retweet", "max_results": 10}
url = "https://api.twitter.com/2/tweets/search/recent"

response = requests.get(url, headers=headers, params=params)
response.raise_for_status()
tweets = response.json()
print(tweets)
```
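For finance data, a common shortcut is the community `yfinance` package, which wraps Yahoo Finance rather than calling an official API directly; a minimal sketch, assuming `pip install yfinance`:
```python
import yfinance as yf

# Download one month of daily prices for a single ticker
data = yf.download("AAPL", period="1mo", interval="1d")
print(data.tail())
```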
<br/>
## D. Web Scraping
If no API exists, data can be collected with scraping.
### Tools
- **BeautifulSoup** (Python)
- **Selenium / Playwright** (dynamic pages; see the sketch after the `BeautifulSoup` example)
- **Scrapy** (large-scale framework)
### Example - Python `BeautifulSoup`
```python
import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com/"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")

# Story title links use ".titleline > a" in the current Hacker News markup
titles = [a.get_text() for a in soup.select(".titleline > a")]
print(titles)
```
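For pages that render their content with JavaScript, `requests` alone is not enough; a minimal sketch using Playwright's sync API (assumes `pip install playwright` followed by `playwright install`):
```python
from playwright.sync_api import sync_playwright

url = "https://news.ycombinator.com/"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(url)
    # Extract story title links after the page has rendered
    titles = page.locator(".titleline > a").all_inner_texts()
    browser.close()

print(titles[:5])
```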
<br/>
## Data Storage & Cleaning
After collecting, data must be stored and cleaned:
- **Storage**: flat files (`CSV`, `JSON`), SQL databases (`SQLite`, `PostgreSQL`), NoSQL databases (`MongoDB`); see the SQLite sketch below
- **Cleaning**: handle missing values, duplicates, and outliers
- **Tools**: [Pandas](/AhWjfIcIReqHAQcfSe1Qcw), [Numpy](/ckWTq6gBQR6xJovnlF5ZVg), OpenRefine
### Example - Python `pandas`
```python
import pandas as pd
df = pd.read_csv("data.csv")
df = df.dropna() # remove missing values
df = df.drop_duplicates() # remove duplicates
print(df.head())
```
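Once cleaned, the DataFrame can be persisted, for example to a local SQLite database with pandas' built-in `to_sql`; the table and file names below are arbitrary:
```python
import sqlite3
import pandas as pd

df = pd.read_csv("data.csv").dropna().drop_duplicates()

# Write the cleaned table into a local SQLite database (hypothetical names)
with sqlite3.connect("collected_data.db") as conn:
    df.to_sql("cleaned_data", conn, if_exists="replace", index=False)
```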
<br/>
## Summary
1. **Self-Collected Data** → High quality but costly
2. **Open Datasets** → Fast and free, but may not fit perfectly
3. **APIs** → Real-time, great for dynamic data
4. **Web Scraping** → Flexible, but requires legal caution
5. **Data Cleaning & Storage** → Ensure reliability
:::warning
📌 Key point — Data quality is more important than quantity!
:::
<br/>
## Next Topic
:::info
⏭️ Next Topic: **[AI - Data Preprocessing](/PsKQKpehQz6QF1oku0qYFg)**
:::
<br/>
:::spoiler Relevant Resource
[Kaggle.com](https://www.kaggle.com/)
[Kaggle Note](/F4z_vYeVRNmFhQC8cyB5Nw)
:::