Scraping data with Python
According to Wikipedia, web scraping is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
🚧Data wrangling overview 🚧
🚧Manage static files in python 🚧
Encoding
🚧Generator 🚧
Concept: Generators are a type of iterable, like lists or tuples, but they produce items on the fly instead of storing them all in memory. This makes them more memory-efficient for large datasets.
Code Example: None for this lesson, as it is theoretical.
Lesson 2: Creating Generators
Concept: Generators are created using a function, but instead of returning a value once, the function can yield multiple values over time.
Code Example:
```python
def count_up_to(max_value):
    """Yield the integers from 1 up to and including max_value."""
    count = 1
    while count <= max_value:
        yield count
        count += 1


counter = count_up_to(5)
for num in counter:
    print(num)  # Outputs 1, then 2, then 3, ..., up to 5
```
Lesson 3: Generator Expressions
Concept: Generator expressions are similar to list comprehensions, but they use parentheses instead of square brackets and produce a generator.
Code Example:
```python
nums = (num for num in range(5))  # Generator expression
for num in nums:
    print(num)  # Outputs 0, 1, 2, 3, 4
```
Module 2: Working with Generators
Lesson 1: Iterating Over Generators
Concept: Generators are iterables and can be used in a loop or by calling next(). They remember their state between iterations.
Code Example:
```python
# count_up_to is the generator function from the "Creating Generators" lesson
gen = count_up_to(3)
print(next(gen))  # Outputs 1
print(next(gen))  # Outputs 2
```
Lesson 2: Generator State and Efficiency
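To make the memory argument concrete, one rough approach is to compare the size of a fully built list with the size of an equivalent generator object. The sketch below uses `sys.getsizeof`, which reports only the size of the container object itself (not of the integers it refers to), so the numbers are indicative rather than exact. The variable names are illustrative and not part of the course code.

```python
import sys

n = 100_000

squares_list = [i * i for i in range(n)]  # all values are computed and stored now
squares_gen = (i * i for i in range(n))   # nothing is computed until iteration

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)

print(f"list object:      {list_size} bytes")
print(f"generator object: {gen_size} bytes")

# Both produce the same values; the generator computes them one at a time.
total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)
print(total_from_list == total_from_gen)
```

The generator's size stays small regardless of `n`, because it stores only its current state rather than the values themselves.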
Concept: Generators maintain their state and only produce values as needed, which is why they are memory-efficient, especially for large datasets. This lesson is mainly about conceptual understanding.
Lesson 3: Advanced Generator Features
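As a hedged illustration of send(), the sketch below defines a hypothetical `running_average` generator (not part of the course code) that receives numbers through send() and yields the average of everything received so far. Note that a generator must first be advanced to its first yield with next() before send() can deliver a value.

```python
def running_average():
    """Receive numbers via send() and yield the running average."""
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average  # pauses here; send(x) resumes with value = x
        total += value
        count += 1
        average = total / count


avg = running_average()
next(avg)              # prime the generator: run up to the first yield
first = avg.send(10)   # average of [10]
second = avg.send(20)  # average of [10, 20]
third = avg.send(30)   # average of [10, 20, 30]
print(first, second, third)

avg.close()            # finish the generator; further send() calls raise StopIteration
```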
Concept: Advanced generator methods such as send(), throw(), and close() allow more complex interactions.
Code Example:
```python
def simple_gen():
    yield "Hello"
    yield "World"


gen = simple_gen()
print(next(gen))  # Outputs Hello
gen.close()       # Stops the generator; a later next(gen) raises StopIteration
```
Module 3: Practical Applications of Generators
Lesson 1: Generators for Large Datasets
Concept: Generators are ideal for processing large datasets such as big files, because they load data into memory only as needed.
Code Example:
```python
def read_large_file(file_name):
    """Yield the lines of a text file one at a time."""
    with open(file_name, 'r') as file:
        for line in file:
            yield line


for line in read_large_file('large_file.txt'):
    print(line, end='')  # each line already ends with its own newline
```
Lesson 2: Generators and Concurrency
Concept: Generators can be used in conjunction with concurrency for efficient data processing.
Code Example: This topic is more advanced and would typically involve integrating generators with Python's threading or asyncio modules, which might be beyond the scope of this course.
Lesson 3: Best Practices and Common Pitfalls
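One pitfall worth illustrating is that a generator can be consumed only once: after it is exhausted, iterating again silently produces nothing. The sketch below uses illustrative names (`squares`, `make_squares`) that are not part of the course code.

```python
squares = (n * n for n in range(4))

first_pass = list(squares)   # consumes every value
second_pass = list(squares)  # the generator is already exhausted

print(first_pass)   # [0, 1, 4, 9]
print(second_pass)  # []


def make_squares():
    """Return a fresh generator each time it is called."""
    return (n * n for n in range(4))


# Creating a new generator for each pass avoids the problem.
again = list(make_squares())
print(again)        # [0, 1, 4, 9]
```

If the values are needed several times and the dataset is small, storing them in a list is usually the simpler choice.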
Concept: Understanding when and how to use generators effectively, and which common mistakes to avoid. Guidelines to discuss include not using generators for small datasets, being cautious with stateful generators, etc.
Final Thoughts
This course aims to give you a solid understanding of Python generators, from their basic syntax and usage to more advanced applications. Generators are a powerful tool in Python, especially useful when working with large datasets or streams of data, due to their efficient use of memory. As you work through this course and experiment with the code examples, you'll gain a deeper understanding of how and when to use generators in your Python projects.
Web data
HTTP in a nutshell
HTTP stands for HyperText Transfer Protocol. It's the foundation of data communication on the World Wide Web. Essentially, it's a protocol used for transmitting data over a network. Most of the information that you receive through your web browser is delivered via HTTP.
An HTTP request is a message sent by a client (like a web browser or a mobile app) to a server to request a specific action. This action can be fetching a web page, submitting form data, downloading a file, etc.
As shown in the schema above, an HTTP request consists of:
- Method: Indicates what type of action you're requesting. Common methods include GET (retrieve data), POST (submit data), PUT (update data), and DELETE (remove data).
- URL (Uniform Resource Locator): Specifies the location of the resource (like a web page or an image) on the server.
- Headers: Provide additional information (like the type of browser making the request, types of response formats that are acceptable, etc.).
- Body: Contains data sent to the server. This is typically used with POST and PUT requests.
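The four parts listed above can be seen by building a request object with the standard library's `urllib.request.Request`. This is only a sketch: the URL and the JSON payload are made up for illustration, and nothing is sent over the network.

```python
import json
from urllib.request import Request

# Body: the data sent to the server, encoded as bytes (hypothetical payload)
payload = json.dumps({"name": "example"}).encode("utf-8")

request = Request(
    url="https://example.com/api/items",  # URL: hypothetical resource location
    data=payload,                         # Body
    headers={                             # Headers: extra information for the server
        "Content-Type": "application/json",
        "Accept": "application/json",
    },
    method="POST",                        # Method: the type of action requested
)

print(request.get_method())    # POST
print(request.full_url)        # https://example.com/api/items
print(request.header_items())  # list of (name, value) pairs
print(request.data)            # the encoded body
```

Passing this object to `urllib.request.urlopen` would actually send it; that step is left out here so the example runs offline.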
How does it work?
When you type a URL into your browser and press Enter, your browser sends an HTTP GET request to the server that hosts that URL. The server processes the request, and if everything goes well, it sends back a response. This response usually contains the HTML content of the web page you requested.
The response from the server includes a status code (like 200 for a successful request, 404 for not found, etc.), headers (similar to request headers but providing information from the server), and usually, a body (which contains the requested data, if any). See the list of HTTP status codes on Wikipedia.
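To observe a status code, headers, and a body without depending on an external website, the sketch below starts a tiny local server with the standard library's `http.server` and sends it two GET requests. The handler class `DemoHandler` and the paths are assumptions made for this example only.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import ProxyHandler, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: serves one page at "/" and 404 elsewhere."""

    def do_GET(self):
        if self.path == "/":
            body = b"<html><body>Hello</body></html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

opener = build_opener(ProxyHandler({}))  # bypass any configured proxy

with opener.open(f"http://127.0.0.1:{port}/") as response:
    ok_status = response.status
    ok_content_type = response.headers["Content-Type"]
    ok_body = response.read().decode("utf-8")

try:
    opener.open(f"http://127.0.0.1:{port}/missing")
    missing_status = None
except HTTPError as error:
    missing_status = error.code  # urllib raises HTTPError for 4xx and 5xx responses
    error.close()

server.shutdown()
server.server_close()

print(ok_status, ok_content_type)
print(ok_body)
print(missing_status)
```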
HTTP is a stateless protocol, meaning each request-response pair is independent. Servers don't retain information about previous interactions. Techniques like cookies are used to "remember" state across requests.
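As a small sketch of how cookies carry state, the example below parses a `Set-Cookie` header value with the standard library's `http.cookies.SimpleCookie` and builds the `Cookie` header a client would send back on its next request. The header value and the `session_id` name are invented for illustration.

```python
from http.cookies import SimpleCookie

# A header value a server might send in its response (hypothetical)
set_cookie_header = "session_id=abc123; Path=/; HttpOnly"

cookie = SimpleCookie()
cookie.load(set_cookie_header)

session_id = cookie["session_id"].value
cookie_path = cookie["session_id"]["path"]

# On the next request, the client sends the stored name=value pairs back,
# which lets the server recognise the same session.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in cookie.items()
)

print(session_id)     # abc123
print(cookie_path)    # /
print(cookie_header)  # session_id=abc123
```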
Secure HTTP - HTTPS
When security is a concern, HTTPS (HTTP Secure) is used. It encrypts the request and response, protecting the data from being read or tampered with by intermediaries. See the Docker HTTPS section for more details about basic certificates.
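On the client side, Python's standard `ssl` module exposes the settings that make an HTTPS connection trustworthy. The sketch below only inspects a default context and opens no connection; the variable names are illustrative.

```python
import ssl

# A context with the recommended defaults for client connections
context = ssl.create_default_context()

# The server must present a certificate that validates against trusted authorities
requires_certificate = context.verify_mode == ssl.CERT_REQUIRED

# The certificate must also match the host name being contacted
checks_hostname = context.check_hostname

print(requires_certificate)
print(checks_hostname)
```

A context like this can be passed to `urllib.request.urlopen` through its `context` parameter when making HTTPS requests.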
Request module
Simple relational databases
View a database online: https://inloop.github.io/sqlite-viewer/