
Scraping data with Python

According to Wikipedia, web scraping is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

🚧Data wrangling overview 🚧

🚧Manage static files in python 🚧

Encoding

🚧Generator 🚧

Concept: Generators are a type of iterable, like lists or tuples, but they generate items on the fly and don't store them in memory. This makes them more memory-efficient for large datasets.

Code Example: None for this lesson, as it's theoretical.

Lesson 2: Creating Generators

Concept: Generators are created with a function that, instead of returning a single value once, can yield multiple values over time.

Code Example:

    def count_up_to(limit):
        """Yield the integers from 1 up to and including limit."""
        count = 1
        while count <= limit:
            yield count
            count += 1

    counter = count_up_to(5)
    for num in counter:
        print(num)  # Outputs 1, then 2, then 3, ..., up to 5

Lesson 3: Generator Expressions

Concept: Generator expressions are similar to list comprehensions, but they use parentheses instead of square brackets.

Code Example:

    nums = (num for num in range(5))  # Generator expression
    for num in nums:
        print(num)  # Outputs 0, 1, 2, 3, 4

Module 2: Working with Generators

Lesson 1: Iterating Over Generators

Concept: Generators are iterables and can be used in a loop or by calling next(). They remember their state between iterations.

Code Example:

    gen = count_up_to(3)
    print(next(gen))  # Outputs 1
    print(next(gen))  # Outputs 2

Lesson 2: Generator State and Efficiency

Concept: Generators maintain their state and only produce values as needed, which is why they are memory-efficient, especially for large datasets.

Code Example: No specific code example needed, as it's more of a conceptual understanding.

Lesson 3: Advanced Generator Features

Concept: Advanced generator methods such as send(), throw(), and close() allow more complex interactions.

Code Example:

    def simple_gen():
        yield "Hello"
        yield "World"

    gen = simple_gen()
    print(next(gen))  # Outputs 'Hello'
    gen.close()  # Stops the generator; a later next(gen) raises StopIteration

Module 3: Practical Applications of Generators

Lesson 1: Generators for Large Datasets

Concept: Generators are ideal for processing large datasets such as big files, because they only load data into memory as needed.

Code Example:

    def read_large_file(file_name):
        """Yield the lines of a text file one at a time."""
        with open(file_name, 'r') as file:
            for line in file:
                yield line

    for line in read_large_file('large_file.txt'):
        print(line, end='')  # Each line already ends with a newline

Lesson 2: Generators and Concurrency

Concept: Generators can be used in conjunction with concurrency for efficient data processing.

Code Example: This topic is more advanced and would typically involve integrating generators with Python's threading or asyncio modules, which might be beyond the scope of this course.

Lesson 3: Best Practices and Common Pitfalls

Concept: Understanding when and how to use generators effectively, and common mistakes to avoid.

Code Example: No specific code. Guidelines include not using generators for small datasets and being cautious with stateful generators.

Final Thoughts

This course aims to give you a solid understanding of Python generators, from their basic syntax and usage to more advanced applications. Generators are a powerful tool in Python, especially useful when working with large datasets or streams of data, due to their efficient use of memory. As you work through this course and experiment with the code examples, you'll gain a deeper understanding of how and when to use generators in your Python projects.
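To make the memory claim and one common pitfall concrete, here is a minimal sketch. The dataset size N and the variable names are assumptions chosen for illustration; the exact byte counts depend on the Python version and platform, so none are quoted here.

```python
import sys

# Assumption: 100,000 integers stands in for a "large dataset".
N = 100_000

squares_list = [n * n for n in range(N)]  # every value is stored up front
squares_gen = (n * n for n in range(N))   # values are produced one at a time

list_size = sys.getsizeof(squares_list)   # grows with N
gen_size = sys.getsizeof(squares_gen)     # small and independent of N
print(f"list: {list_size} bytes, generator: {gen_size} bytes")

# Pitfall: a generator can only be consumed once.
first_pass = sum(squares_gen)   # consumes every value
second_pass = sum(squares_gen)  # the generator is exhausted, so this is 0
print(first_pass, second_pass)
```

If the values are needed more than once, either recreate the generator or store the results in a list, accepting the memory cost.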

Web data

HTTP in a nutshell

HTTP stands for HyperText Transfer Protocol. It's the foundation of data communication on the World Wide Web. Essentially, it's a protocol used for transmitting data over a network. Most of the information that you receive through your web browser is delivered via HTTP.

An HTTP request is a message sent by a client (like a web browser or a mobile app) to a server to request a specific action. This action can be fetching a web page, submitting form data, downloading a file, etc.

As shown in the schema above, an HTTP request is made up of:

  • Method: Indicates what type of action you're requesting. Common methods include GET (retrieve data), POST (submit data), PUT (update data), and DELETE (remove data).
  • URL (Uniform Resource Locator): Specifies the location of the resource (like a web page or an image) on the server.
  • Headers: Provide additional information (like the type of browser making the request, types of response formats that are acceptable, etc.).
  • Body: Contains data sent to the server. This is typically used with POST and PUT requests.
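The four parts listed above can be seen by building a request object with Python's standard library, without sending anything over the network. This is only a sketch: the URL and the JSON payload are hypothetical values chosen for illustration.

```python
import json
from urllib.request import Request

# Hypothetical endpoint and payload, used only for illustration.
payload = json.dumps({"title": "Hello"}).encode("utf-8")

request = Request(
    url="https://example.com/api/articles",   # URL: where the resource lives
    data=payload,                              # Body: data sent to the server
    headers={                                  # Headers: extra information
        "Content-Type": "application/json",
        "Accept": "application/json",
    },
    method="POST",                             # Method: the requested action
)

# Creating the object does not contact the server; it only describes the request.
print(request.get_method())
print(request.full_url)
print(request.header_items())
print(request.data)
```

The request would only be sent if it were passed to urllib.request.urlopen().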

How does it work?

When you type a URL into your browser and press Enter, your browser sends an HTTP GET request to the server that hosts that URL. The server processes the request, and if everything goes well, it sends back a response. This response usually contains the HTML content of the web page you requested.

The response from the server includes a status code (such as 200 for a successful request or 404 for not found), headers (similar to request headers, but providing information from the server), and usually a body (which contains the requested data, if any). See the list of HTTP status codes on Wikipedia.
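The request/response cycle described above can be tried locally with the standard library alone. The sketch below starts a throwaway server on the local machine and sends it two GET requests; DemoHandler, the get() helper, and the page content are hypothetical names and values invented for this example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: one HTML page at "/", 404 for anything else."""

    def do_GET(self):
        if self.path == "/":
            body = b"<html><body><h1>Hello</h1></body></html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # silence the default request logging


def get(port, path):
    """Send a GET request and return (status, reason, content_type, body)."""
    connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        connection.request("GET", path)
        response = connection.getresponse()
        return (response.status, response.reason,
                response.getheader("Content-Type"), response.read())
    finally:
        connection.close()


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

try:
    found = get(server.server_port, "/")
    missing = get(server.server_port, "/missing")
finally:
    server.shutdown()
    server.server_close()

print(found[0], found[1], found[2])  # status code, reason, Content-Type header
print(missing[0], missing[1])        # status code and reason for an unknown path
```

Each call returns the three parts of a response: the status code (with its reason phrase), a header, and the body.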

HTTP is a stateless protocol, meaning each request-response pair is independent. Servers don't retain information about previous interactions. Techniques like cookies are used to "remember" state across requests.
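As a rough illustration of how cookies carry state, the sketch below parses a Set-Cookie value the way a client would and rebuilds the Cookie header that the client sends back on its next request. The cookie name and value are hypothetical.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value, as a server might send it.
set_cookie_header = "session_id=abc123; Path=/; HttpOnly"

jar = SimpleCookie()
jar.load(set_cookie_header)

session = jar["session_id"]
print(session.value)    # the value the server asked the client to remember
print(session["path"])  # the attribute limiting where the cookie is sent

# On the next request the client sends the stored name=value pairs back,
# which lets the server recognise it despite HTTP being stateless.
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
print(cookie_header)
```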

Secure HTTP - HTTPS

When security is a concern, HTTPS (HTTP Secure) is used. It encrypts the request and response, protecting the data from being read or tampered with by intermediaries. See more details about basic certificates in the Docker HTTPS section.
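On the client side, Python's standard ssl module provides a default context that most HTTPS clients build on. The short sketch below only inspects that context and does not open a connection; it is meant to show that certificate and hostname checks are enabled by default, not to describe a complete HTTPS setup.

```python
import ssl

# Default client-side settings recommended by the standard library.
context = ssl.create_default_context()

# The server must present a certificate that validates against trusted CAs...
print(context.verify_mode == ssl.CERT_REQUIRED)

# ...and that certificate must match the host name being contacted.
print(context.check_hostname)
```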

Request module

Simple relational databases

View a database online: https://inloop.github.io/sqlite-viewer/

Beautiful Soup