# AI - Collecting Data

:::info
[TOC]
:::

![1920_1280_](https://hackmd.io/_uploads/By-t9-Gjle.png)

<br/>

## Preface

The performance of an AI model highly depends on the **quality and quantity** of data. When building an AI system, we need to know where to get data and which APIs or tools to use.

<br/>

### 1. Data Sources

Data can come from multiple origins, depending on the use case:

1. **Self-Collected Data**
   - Manually gathered or generated through surveys, experiments, sensors, or logs.
2. **Open Datasets**
   - Public datasets provided by communities.
   - e.g. Kaggle, UCI, ImageNet
3. **APIs**
   - Access to structured and often real-time data from external services.
   - e.g. Twitter API, Google Maps API
4. **Web Scraping**
   - Collecting information from websites when APIs or datasets are not available.

> ++Each approach has trade-offs between cost, scalability, reliability, and legality.++

<br/>

### 2. Tools & APIs

Efficient data collection and preparation require the right tools:

- **Collection**:
  - APIs (`requests`, `tweepy`, `googlemaps`)
  - Web scraping (`BeautifulSoup`, `Scrapy`, `Selenium`, `Playwright`)
- **Storage**:
  - Flat files (`CSV`, `JSON`)
  - Databases (`SQLite`, `PostgreSQL`, `MongoDB`)
  - Cloud storage (`AWS S3`, `Google Cloud Storage`)
- **Cleaning & Preprocessing**:
  - `pandas`, `numpy` → For tabular data.
  - `OpenRefine` → For large-scale data cleaning.
  - `NLTK`, `spaCy` → For text preprocessing.
  - `OpenCV`, `Pillow` → For image preprocessing.

> ++The right data sources and tools ensure the **foundation of a reliable AI system**.++

<br/>

## A. Self-Collected Data

Data collected manually, suitable for specific domains or projects.

### 1. Methods

- **Surveys / User feedback**
- **Experimental data** (e.g., sensors, IoT devices)
- **Business systems** (CRM, ERP, log files)
- **Manual annotation**: e.g., image classification, sentiment analysis

### 2. Pros

- Data is tailored to the task
- High quality (controlled annotation)

### 3. Cons

- High cost (time, labor, devices)
- Hard to scale quickly

<br/>

## B. Open Datasets

Large-scale, publicly available datasets for training, evaluating, and testing models.

### 1. [Kaggle 🔗](https://www.kaggle.com/)

- The largest open data science platform.
- Covers image, text, finance, medical data, etc.
- Download with the `kaggle` CLI:

```bash
pip install kaggle
kaggle datasets download -d <dataset-owner>/<dataset-name>
```

### 2. Academic Datasets

- **UCI Machine Learning Repository**
- **Google Dataset Search**
- **ImageNet** (image classification)
- **COCO Dataset** (image + annotation)
- **LibriSpeech** (speech recognition)

### 3. Government & Open Data

- [data.gov](https://www.data.gov/) (USA)
- [data.gov.tw](https://data.gov.tw/) (Taiwan)
- [EU Open Data Portal](https://data.europa.eu/) (European Union)

<br/>

## C. APIs

### Common sources

- **Social media** (Twitter/X API, Reddit API, YouTube Data API)
- **Finance** (Yahoo Finance API, Alpha Vantage)
- **Maps & Geodata** (Google Maps API, OpenStreetMap API)
- **News** (NewsAPI, GDELT)

### Example - Python Twitter API

```python
import requests

bearer_token = "YOUR_BEARER_TOKEN"
headers = {"Authorization": f"Bearer {bearer_token}"}

# Pass the query as params so requests handles URL encoding
params = {"query": "AI lang:en -is:retweet", "max_results": 10}
url = "https://api.twitter.com/2/tweets/search/recent"

response = requests.get(url, headers=headers, params=params)
tweets = response.json()
print(tweets)
```
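### Example - Python Finance API (`yfinance`)

A minimal sketch of pulling finance data, one of the sources listed above. It assumes the third-party `yfinance` package (not covered elsewhere in this note) is installed via `pip install yfinance`; the ticker `AAPL` and the one-month period are placeholder choices.

```python
import yfinance as yf

# Assumption: `yfinance` is installed and network access is available.
# Download one month of daily price data for a sample ticker (placeholder values).
df = yf.download("AAPL", period="1mo", interval="1d")

print(df.head())
```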
<br/>

## D. Web Scraping

If no API exists, data can be collected with scraping.

### Tools

- **BeautifulSoup** (Python)
- **Selenium / Playwright** (dynamic pages)
- **Scrapy** (large-scale framework)

### Example - Python `BeautifulSoup`

```python
import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com/"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")

# The CSS selector must match the site's current markup
# (Hacker News now wraps story links in `span.titleline`; the old `.storylink` class is gone)
titles = [a.get_text() for a in soup.select(".titleline > a")]
print(titles)
```

<br/>

## Data Storage & Cleaning

After collection, data must be stored and cleaned:

- **Storage** — flat files (`CSV`, `JSON`), SQL databases (`SQLite`, `PostgreSQL`), NoSQL databases (`MongoDB`)
- **Cleaning** — handle missing values, duplicates, outliers
- **Tools** — [Pandas](/AhWjfIcIReqHAQcfSe1Qcw), [Numpy](/ckWTq6gBQR6xJovnlF5ZVg), OpenRefine

#### Pandas Example

```python
import pandas as pd

df = pd.read_csv("data.csv")
df = df.dropna()            # remove rows with missing values
df = df.drop_duplicates()   # remove duplicate rows
print(df.head())
```

<br/>

## Summary

1. **Self-Collected Data** → High quality but costly
2. **Open Datasets** → Fast and free, but may not fit perfectly
3. **APIs** → Real-time, great for dynamic data
4. **Web Scraping** → Flexible, but requires legal caution
5. **Data Cleaning & Storage** → Ensure reliability

:::warning
📌 Key point — Data quality is more important than quantity!
:::

<br/>

## Next Topic

:::info
⏭️ Next Topic: &nbsp; **[AI - Data Preprocessing](/PsKQKpehQz6QF1oku0qYFg)**
:::

<br/>

:::spoiler Relevant Resource
- [Kaggle.com](https://www.kaggle.com/)
- [Kaggle Note](/F4z_vYeVRNmFhQC8cyB5Nw)
:::