Monday 12 August 2024

Amazon DocumentDB

 



Amazon DocumentDB is a fast, secure, scalable, and fully managed database service that is compatible with MongoDB. It allows you to store and query JSON data, as well as set up, operate, and scale MongoDB-compatible databases in the cloud. Amazon DocumentDB also supports the same application code, drivers, and tools as MongoDB.

Hevo uses DocumentDB Change Streams to ingest data from your Amazon DocumentDB database and replicate it into the Destination of your choice.


Prerequisites


Perform the following steps to configure your Amazon DocumentDB Source:

Step 1
Whitelist Hevo’s IP Address

You must whitelist Hevo’s IP address in your existing Amazon EC2 instance's security group so that Hevo can connect to it. Read Creating an Amazon EC2 instance if you have not created one already. Hevo uses this EC2 instance to create an SSH tunnel to your DocumentDB cluster and replicate data from it.

Perform the following steps to whitelist Hevo’s IP address in your existing EC2 instance:

  1. Log in to your Amazon EC2 console.

  2. In the left navigation pane, under Network & Security, click Security Groups.

    Navigation Pane

  3. Click on the security group linked to your EC2 instance.

    Select Security Group

  4. In the Inbound rules tab, click Edit inbound rules.

    Edit Inbound Rules

  5. In the Edit inbound rules page, in the Source column, select Custom from the drop-down, and enter Hevo’s IP address for your region.

    Save Rules

  6. Click Save rules.


Step 2
Create User and Set up Permissions to Read DocumentDB Databases

Perform the following steps to create a database user, and grant READ privileges to that user:

  1. Open your mongo shell.

    Note: Ensure that your mongo shell is connected to your DocumentDB cluster before executing any commands.

  2. Run the following command to create a user and grant READ permissions to that user.

    use admin
    db.createUser({
        user: "<username>",
        pwd: "<password>",
        roles: [ "readAnyDatabase" ]
    });
    
    

    Note: Replace the placeholder values in the command above with your own. For example, <username> with jacobs.
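For readers scripting this step instead of typing it in the shell, the command above maps onto a plain command document. A minimal sketch (the helper name and example credentials are illustrative; with pymongo installed, the document could be executed against the cluster as `client["admin"].command(...)`):

```python
def create_user_command(username, password):
    # Mirrors the shell snippet above: a user with the
    # readAnyDatabase role, created in the admin database.
    return {
        "createUser": username,
        "pwd": password,
        "roles": ["readAnyDatabase"],
    }

# e.g. client["admin"].command(create_user_command("jacobs", "s3cret"))
cmd = create_user_command("jacobs", "s3cret")
print(cmd)
```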


Step 3
Enable Streams

You need to enable Streams on the DocumentDB collections and databases whose data you want to replicate to the Destination through Hevo.

To do this:

  1. Open your mongo shell.

    Note: Ensure that your mongo shell is connected to your DocumentDB cluster before executing any commands.

  2. Depending on the collections and databases you want to sync, run one of the following commands:

    • To enable change streams for a specific collection in a specific database:

      db.adminCommand({
          modifyChangeStreams: 1,
          database: "<database_name>",
          collection: "<collection_name>",
          enable: true
      });
      
      

      Note: Replace the placeholder values in the command above with your own. For example, <database_name> with hevosalesdata.

    • To enable change streams for all collections in a specific database:

      db.adminCommand({
          modifyChangeStreams: 1,
          database: "<database_name>",
          collection: "",
          enable: true
      });
      
      

      Note: Replace the placeholder values in the command above with your own. For example, <database_name> with hevosalesdata.

    • To enable change streams for all collections in all databases:

      db.adminCommand({
          modifyChangeStreams: 1,
          database: "",
          collection: "",
          enable: true
      });
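The three variants above differ only in how the database and collection fields are filled: an empty string acts as a wildcard. A small Python sketch of the command document (the helper name and the "orders" collection are illustrative):

```python
def modify_change_streams(database="", collection="", enable=True):
    # Empty strings act as wildcards: "" for collection means every
    # collection in the database; "" for both means every database.
    return {
        "modifyChangeStreams": 1,
        "database": database,
        "collection": collection,
        "enable": enable,
    }

one = modify_change_streams("hevosalesdata", "orders")  # one collection
db_wide = modify_change_streams("hevosalesdata")        # whole database
everything = modify_change_streams()                    # all databases
```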
      
      

Step 4
Modify the Change Stream Log Retention Duration

The change stream retention duration is the period for which Events are held in the change stream logs. If an Event is not read within that period, then it is lost.

This may happen if:

  • The change stream log is full, and the database has started discarding the older Event entries to write the newer ones.

  • The timestamp of the Event is older than the change stream retention duration.

The change stream log retention duration directly impacts the change stream log size that you must maintain to hold the entries.

By default, Amazon DocumentDB retains the Events for three hours after recording them. You must maintain an adequate size or retention duration of the change stream log for Hevo to read the Events without losing them. Hevo recommends that you modify the retention duration to 72 hours (259200 seconds).

To extend the change stream log retention duration:

  1. Log in to your Amazon DocumentDB console.

  2. In the left navigation pane, click Parameter Groups.

    Parameter Groups

  3. Select the cluster parameter group associated with your cluster. Read Determining an Amazon DocumentDB Cluster’s Parameter Group for more information.

    Note: You cannot edit a default cluster parameter group. Hence, if your DocumentDB cluster is using the default parameter group, you must either create a new group or make a copy of the default group and assign it to the cluster.

    Cluster Parameter Groups

  4. In the Cluster parameters section, search and select change_stream_log_retention_duration, and then click Edit.

    Log duration

  5. Modify the Value to 259200 seconds.

    Note: The Value should be in seconds only.

    Change value

  6. Click Modify cluster parameter.
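The same parameter change can be scripted rather than made in the console. A hedged sketch of the payload (the parameter group name is a placeholder; the commented boto3 call assumes AWS credentials are configured):

```python
# Hevo's recommended retention of 72 hours, expressed in seconds.
RETENTION_SECONDS = 72 * 60 * 60  # 259200

parameters = [{
    "ParameterName": "change_stream_log_retention_duration",
    "ParameterValue": str(RETENTION_SECONDS),
    "ApplyMethod": "immediate",
}]

# With boto3 installed and credentials configured, this would apply it:
# import boto3
# boto3.client("docdb").modify_db_cluster_parameter_group(
#     DBClusterParameterGroupName="my-docdb-params",  # placeholder name
#     Parameters=parameters,
# )
print(parameters)
```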


Step 5
Configure Amazon DocumentDB Connection Settings

Perform the following steps to configure Amazon DocumentDB as the Source in your Pipeline:

  1. Click PIPELINES in the Navigation Bar.

  2. Click + CREATE in the Pipelines List View.

  3. On the Select Source Type page, select Amazon DocumentDB.

  4. On the Configure your Amazon DocumentDB Source page, specify the following:

    DocumentDB settings

    • Pipeline Name: A unique name for your Pipeline, not exceeding 255 characters.

    • Database Host: The IP address or DNS name of the primary instance of your DocumentDB cluster. You can identify the primary instance under the Role column in your AWS console.

    • Database Port: The port on which your Amazon DocumentDB server listens for connections. Default value: 27017.

    • Database User: The database user that you created. This authenticated user has permission to read collections in your database.

    • Database Password: The password of your database user.

    • Authentication Database Name: The database that stores the user’s information. The user name and password entered in the preceding steps are validated against this database. Default value: admin.

    • SSH IP: The IP address or DNS of the SSH server.

    • SSH Port: The port of the SSH server as seen from the public internet. Default port: 22.

    • SSH User: The username on the SSH server. For example, hevo.

      The SSH IP, port, and user credentials must be obtained from the AWS EC2 instance where you whitelisted Hevo’s IP address. Hevo connects to your DocumentDB cluster using these SSH credentials instead of directly connecting to your Amazon DocumentDB database instance. This method provides an additional level of security to your database by not exposing your Amazon DocumentDB setup to the public. Read Connecting Through SSH.

    • Use SSL: Enable this option if you have activated the TLS setting for your DocumentDB instance.

    • Advanced Settings:

      • Load All Databases: If enabled, Hevo fetches data from all the databases you have access to on the specified host. If disabled, provide the comma-separated list of the database names from which you want to fetch the data.

      • Merge Collections: If enabled, collections with the same name across different databases are merged into a single Destination table. If disabled, separate tables are created and prefixed with the respective database name.

      • Load Historical Data: If enabled, the entire table data is fetched during the first run of the Pipeline. If disabled, Hevo loads only the data that was written in your database after the time of the creation of the Pipeline.

      • Include New Tables in the Pipeline: If enabled, Hevo automatically ingests data from tables created after the Pipeline is built. If disabled, the new tables are listed in the Pipeline Detailed View in Skipped state, and you can manually include the ones you want and load their historical data.

        You can change this setting later.

  5. Click TEST & CONTINUE.

  6. Proceed to configuring the data ingestion and setting up the Destination.
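For reference, the Database Host, Port, User, Password, and Authentication Database Name fields correspond to a standard MongoDB-style connection URI. A sketch (illustrative helper and example values; Hevo assembles the real connection internally, and passwords containing special characters would need URL-escaping, e.g. with urllib.parse.quote_plus):

```python
def connection_uri(user, password, host, port=27017, auth_db="admin"):
    # Assemble the URI that the settings above describe.
    return f"mongodb://{user}:{password}@{host}:{port}/?authSource={auth_db}"

uri = connection_uri("jacobs", "s3cret", "docdb.example.com")
print(uri)
```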


Wednesday 7 August 2024

Complete Data Science Bootcamp : Step By Step Hands-On Labs



Data science involves the use of techniques from statistics, computer science, and domain-specific knowledge to analyze and interpret data in order to make predictions, discover patterns, and gain insights. It is used in a wide range of industries, including healthcare, finance, marketing, and manufacturing, to make data-driven decisions and improve business outcomes.

This blog post helps you with your self-paced learning as well as with your team learning. There are many Hands-On Labs in this course.
Here’s a quick sneak-peak of how to start learning Data Science For Beginners by doing Hands-on.

Learning Path: Python

Module 1: Python for Data Science

1) Environment Setup: Install Jupyter Notebooks

There are two ways to Install the Jupyter Notebook.

1. Using the pip command
We can use pip to install Jupyter Notebook using the following command:

$ pip install jupyter

2. Anaconda
We can also use Anaconda, which is a Python data science platform. Anaconda has its own installer named conda that we can use to install Jupyter Notebook.

2) Try Jupyter Notebook: Hello World!

We can print anything in a Python Jupyter notebook by using the print() function.

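The example originally shown as a screenshot, reconstructed as a runnable sketch:

```python
message = "Hello World!"
print(message)  # the notebook cell's output: Hello World!
```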

3) Working with Variables

Python has no command for declaring a variable. A variable is created the moment you first assign a value to it.

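A small sketch of the screenshot's example (variable names illustrative):

```python
x = 5           # created the moment it is first assigned
name = "John"   # no declaration keyword is needed
print(x)
print(name)
```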

Python supports the usual logical conditions from mathematics:

  • Equals: a == b
  • Not Equals: a != b
  • Less than: a < b
  • Less than or equal to: a <= b
  • Greater than: a > b
  • Greater than or equal to: a >= b

These conditions can be used in several ways, most commonly in “if statements” and loops.

4) Understand the if-loop statement

An “if statement” is written by using the if keyword.

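The example the original screenshot showed, as code:

```python
a = 33
b = 200
if b > a:              # the condition uses the comparison operators above
    print("b is greater than a")
```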

In this example, we use two variables, a and b, which are used as part of the if statement to test whether b is greater than a. As a is 33, and b is 200, we know that 200 is greater than 33, and so we print to screen that “b is greater than a“.

5) Understand For loop statement

A for loop is used for iterating over a sequence (a list, a tuple, a dictionary, a set, or a string). This is less like the for keyword in other programming languages and works more like an iterator method, as found in other object-oriented programming languages. With the for loop we can execute a set of statements once for each item in a list, tuple, set, etc.
Print each fruit in a fruit list:

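A sketch of the fruit-list example from the original screenshot:

```python
fruits = ["apple", "banana", "cherry"]
for fruit in fruits:   # no indexing variable needs to be set beforehand
    print(fruit)
```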

The for loop does not require an indexing variable to set beforehand.

6) Understand While loop statement

With the while loop we can execute a set of statements as long as a condition is true.

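The while-loop example the screenshot showed, as code:

```python
i = 1
while i < 6:
    print(i)
    i += 1  # remember to increment i, or the loop continues forever
```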

Note: remember to increment i, or else the loop will continue forever.
The while loop requires relevant variables to be ready, in this example, we need to define an indexing variable, i, which we set to 1.

Module 2: Operators and Keywords

1) Create & Work with Lists

Lists are one of 4 built-in data types in Python used to store collections of data. Lists are used to store multiple items in a single variable.
Lists are created using square[] brackets:

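A runnable sketch of the list example (item values illustrative):

```python
thislist = ["apple", "banana", "cherry", "apple"]  # duplicates allowed
print(thislist[0])   # first item has index 0
print(thislist[1])   # second item has index 1
```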

List items are ordered, changeable, and allow duplicate values. List items are indexed, the first item has index [0], the second item has index [1] etc.

2) Working with Tuples

Tuples are used to store multiple items in a single variable. A tuple is a collection that is ordered and unchangeable.
Tuples are written with round() brackets.

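A sketch of the tuple example (item values illustrative):

```python
thistuple = ("apple", "banana", "cherry", "apple")  # duplicates allowed
print(thistuple)
# thistuple[0] = "kiwi" would raise TypeError: tuples are unchangeable
```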

Tuple items allow duplicate values.

3) Sets & Exercises

Sets are used to store multiple items in a single variable. A set is a collection that is both unordered and unindexed.
Sets are written with curly{} brackets.

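A sketch of the set example (item values illustrative):

```python
thisset = {"apple", "banana", "cherry", "apple"}  # duplicate is dropped
print(len(thisset))  # 3: sets do not allow duplicate values
```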

Set items are unordered, unchangeable, and do not allow duplicate values.

4) Create & Understand Dictionaries

Dictionaries are used to store data values in key: value pairs. A dictionary is a collection which is ordered, changeable and does not allow duplicates.
Dictionaries are written with curly brackets, and have keys and values:

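A sketch of the dictionary example (keys and values illustrative):

```python
thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964,
}
print(thisdict["model"])  # items are referred to by key name
```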

Dictionary items are presented in key: value pairs, and can be referred to by using the key name.

Module 3: NumPy & Pandas

1) Create & work with NumPy Arrays

NumPy is a Python library used for working with arrays. The array object in NumPy is called ndarray, and we can create one by using the array() function.

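A minimal sketch, assuming NumPy is installed:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])  # ndarray built from a Python list
print(arr)
print(type(arr))  # <class 'numpy.ndarray'>
```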

2) Create Pandas Dataframe

A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array, or a table with rows and columns.
Create a simple Pandas DataFrame:

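A minimal sketch, assuming Pandas is installed (column names and values illustrative):

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data)  # a 2-dimensional, labeled table
print(df)
```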

3) Pandas Dataframe: load csv files

A simple way to store big data sets is to use CSV (comma-separated values) files. CSV files contain plain text and are a well-known format that can be read by everyone, including Pandas. In our examples, we will be using a CSV file called ‘data.csv’.

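A self-contained sketch: the same pd.read_csv call works on an in-memory buffer, which stands in here for open("data.csv") so the example runs on its own (the column names are illustrative):

```python
import io
import pandas as pd

csv_text = "name,score\nAda,90\nGrace,95\n"
df = pd.read_csv(io.StringIO(csv_text))  # same call as pd.read_csv("data.csv")
print(df.to_string())  # to_string() prints the entire DataFrame
```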

Tip: use to_string() to print the entire DataFrame.

Module 4: Function, Classes & Oops

1) Working with User-defined Methods

A function is a block of code that only runs when it is called. You can pass data, known as parameters, into a function. A function can return data as a result.
In Python a function is defined using the def keyword:

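A sketch of the example from the screenshot (function name illustrative):

```python
def greet(name):
    # defined with the def keyword; runs only when called
    return "Hello, " + name

print(greet("World"))  # pass data in as a parameter, get data back
```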

2) Working with Inbuilt Methods

Inbuilt functions are functions that are already pre-defined: you just call them, without having to create them yourself. Python has many pre-defined functions; here we pick one or two to understand them clearly.

  • abs(): Returns the absolute value of the given number; for a complex number, it returns its magnitude.


  • chr(): This built-in function returns the character for a given ASCII (Unicode) code point.


and there are many more built-in functions.
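A quick sketch of both built-ins described above:

```python
print(abs(-7.25))   # absolute value of a number: 7.25
print(abs(3 + 4j))  # magnitude of a complex number: 5.0
print(chr(65))      # character for code point 65: 'A'
```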

3) Implementing User-defined Functions (Create, Call)

User-defined functions are functions that you write yourself to organize your code. Once you define a function, you can call it in the same way as the built-in functions.

 

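A sketch of defining and calling a user-defined function (names and values illustrative):

```python
def add(a, b=10):
    # once defined, call it the same way as a built-in function
    return a + b

print(add(5))      # the default is used for b
print(add(5, 20))
```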

To call a function, use the function name followed by parentheses.

4) Implementing Inbuilt Functions

Here we will look at some important inbuilt functions that are used frequently.

The min() function returns the item with the lowest value among its arguments, or the item with the lowest value in an iterable. If the values are strings, an alphabetical comparison is done.
Return the item in a tuple with the lowest value:

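A sketch of the min() examples (values illustrative):

```python
numbers = (25, 35, 12)
print(min(numbers))             # lowest item in the tuple
print(min("apple", "banana"))   # strings compare alphabetically
```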

5) Create Classes & Objects in Python

A Class is like an object constructor or a “blueprint” for creating objects. To create a class, use the keyword class.
Create a class named MyClass, with a property named x:


Now we can use the class named MyClass to create objects.
Create an object named p1, and print the value of x:

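Both steps above, as a runnable sketch:

```python
class MyClass:
    x = 5  # a property on the class

p1 = MyClass()  # create an object from the class
print(p1.x)
```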

6) Understand the Inheritance Concept

Inheritance allows us to define a class that inherits all the methods and properties from another class. The parent class is the class being inherited from, also called base class. A child class is a class that inherits from another class, also called a derived class. Any class can be a parent class, so the syntax is the same as creating any other class.
Create a class named Person, with first name and last name properties, and a print name method:


To create a class that inherits the functionality from another class, send the parent class as a parameter when creating the child class.
Create a class named Student, which will inherit the properties and methods from the Person class:


Note: Use the pass keyword when you do not want to add any other properties or methods to the class.

Now the Student class has the same properties and methods as the Person class.
Use the Student class to create an object, and then execute the print name method:

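The whole inheritance example above, as one runnable sketch (the example name "Mike Olsen" is illustrative; printname returns the name here so the result is easy to check):

```python
class Person:
    def __init__(self, fname, lname):
        self.firstname = fname
        self.lastname = lname

    def printname(self):
        return self.firstname + " " + self.lastname

class Student(Person):
    pass  # pass: no extra properties or methods are added

student = Student("Mike", "Olsen")
print(student.printname())  # inherited from Person
```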

Module 5: Data Science essential Libraries

There are many libraries available for data science, but some of the most essential libraries include:

  1. NumPy: This library provides support for large, multi-dimensional arrays and matrices of numerical data, as well as a large collection of mathematical functions to operate on these arrays.
  2. Pandas: This library provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It is a must-have for data cleaning, transformation, and manipulation.
  3. Matplotlib: This library is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK.
  4. Scikit-learn: This library is a machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means, etc.
  5. TensorFlow or PyTorch: These are open-source machine learning libraries that allow you to build and train neural networks. TensorFlow is developed by Google and PyTorch is developed by Facebook.
  6. Seaborn: This library is a data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
  7. Plotly: This library is a charting library that comes with over 40 chart types, 3D charts, statistical graphs, and SVG maps. Its chart types include:

    • Scatter Plots
    • Line Graphs
    • Linear Graphs
    • Multiple Lines
    • Bar Charts
    • Horizontal Bar Charts
    • Pie Charts
    • Donut Charts
    • Plotting Equations

These libraries are widely used in the data science community and provide a solid foundation for many data science tasks. However, depending on the specific problem or task, other libraries may also be useful, such as NLTK for natural language processing or StatsModels for statistical modeling.