PostgreSQL FAQs

According to StackOverflow’s 2020 Annual Developer Survey, PostgreSQL is the second most popular database management system available, and this is not without good reason. Since its initial release in 1996, PostgreSQL, or Postgres, has improved considerably, adding several useful features, including user-defined types, table inheritance, multi-version concurrency control, and more.

PostgreSQL is also very lightweight, easy to set up, and can be installed on several platforms, such as containers, VMs, or physical systems. Besides its default GUI, pgAdmin, Postgres also supports over 50 other IDEs, a third of which are free to use. This article will cover some of the most frequently asked questions (FAQs) about PostgreSQL.

Is PostgreSQL Free?

PostgreSQL is a free product that was released under the OSI-approved PostgreSQL license. This means that there is no fee required to use PostgreSQL, even for commercial purposes, though there are some third-party extensions and services that require a subscription or one-time fee.

Is PostgreSQL Open-Source?

Yes, PostgreSQL is open-source. PostgreSQL started out as a project at the University of California, Berkeley, in 1986 and was released to the public on July 8, 1996, as a free and open-source relational database management system.

Is PostgreSQL Case-Sensitive?

PostgreSQL is case-sensitive by default, but in certain situations, it can be made case-insensitive. For example, when creating a table in PostgreSQL, unquoted column and table names are automatically folded to lowercase, which effectively makes them case-insensitive. The same is done for queries; this way, they match the already-converted column and table names.

Note that when you use quotes for the column or table name, such as “Amount,” the conversion does not occur. You will have to use quotes in your queries as well to prevent PostgreSQL from folding the identifier to lowercase. You can also make column values case-insensitive by using CITEXT, a case-insensitive text type provided by PostgreSQL’s citext extension, when creating columns. This type also allows a column declared as UNIQUE or PRIMARY KEY to be compared case-insensitively.
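
Here is a minimal sketch that illustrates both behaviours; the accounts table and its columns are made up for this example, and the citext extension must be available in your installation:

CREATE EXTENSION IF NOT EXISTS citext;

CREATE TABLE accounts (
    "Amount" numeric,       -- quoted: must always be referenced as "Amount"
    email    CITEXT UNIQUE  -- case-insensitive comparisons and uniqueness
);

SELECT "Amount" FROM accounts WHERE email = 'User@Example.com';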

Is PostgreSQL Relational?

PostgreSQL was originally designed to be a relational database management system. It has since grown far beyond its original design, as PostgreSQL now supports some NoSQL capabilities, such as storing and retrieving data in JSON (JSONB), and key-value pairs (HSTORE). Unlike many NoSQL-only databases, the NoSQL capabilities of PostgreSQL are ACID-compliant and can be interfaced with SQL, like any other data type supported by PostgreSQL.
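
As a quick illustration, here is a minimal sketch of storing and querying JSONB with ordinary SQL; the events table and its fields are made up for this example:

CREATE TABLE events (
    id      serial PRIMARY KEY,
    payload jsonb
);

INSERT INTO events (payload)
VALUES ('{"type": "login", "user": "alice"}');

-- The ->> operator extracts a JSON field as text, so it can be used like any SQL value
SELECT payload ->> 'user' AS username
FROM events
WHERE payload ->> 'type' = 'login';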

Why Should I Use PostgreSQL?

You must understand the needs of your product before choosing a database management system for that product. Usually, this choice comes down to whether to use a relational DBMS or a NoSQL database. If you are dealing with structured and predictable data with a static number of users or applications accessing the system, consider going for a relational database, such as PostgreSQL.

Besides choosing PostgreSQL because it is an RDBMS, there are several other features of this database management system that make it one of the most popular systems available today. Some of these features include the following:

  • Support for various data types, such as JSON/JSONB, XML, key-value pairs (HSTORE), point, line, circle, and polygon. You can also create custom data types.
  • Foreign data wrappers that allow connection to other databases or streams, such as Neo4j, CouchDB, Cassandra, Oracle, and more, with a standard SQL interface.
  • Ability to build out custom functions.
  • Procedural languages, such as PL/pgSQL, PL/Perl, PL/Python, and more.
  • Access to many extensions that provide additional functionality, such as PostGIS.
  • Multi-version Concurrency Control.
  • Multi-factor authentication with certificates and an additional method.

And so much more. You can see a full list of the features offered by PostgreSQL on the official PostgreSQL website.

PostgreSQL vs MySQL: Is PostgreSQL Better Than MySQL?

MySQL is the most popular database management system available today. It is light, easy to understand and set up, and very fast, particularly when dealing with highly concurrent, read-only workloads. The ease of use of MySQL also makes it easier to find database administrators for it.

Having said that, MySQL lacks several of the features that come with PostgreSQL databases. To start with, PostgreSQL is not just a relational database management system, it is also an object-relational database management system. This means that PostgreSQL supports unique features, such as table inheritance and function overloading.
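
For instance, here is a minimal sketch of table inheritance; the cities and capitals tables are made up for this example:

CREATE TABLE cities (
    name       text,
    population integer
);

-- capitals inherits the name and population columns from cities
CREATE TABLE capitals (
    country text
) INHERITS (cities);

-- querying the parent table also returns the rows stored in capitals
SELECT name, population FROM cities;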

PostgreSQL also performs better when dealing with complex queries under heavy load. It does, however, tend to be slower than MySQL for simple, read-heavy workloads.

PostgreSQL also has a wider range of data types available, and it allows you to create custom data types for your database. Perhaps its greatest advantage over MySQL is PostgreSQL’s extensibility. You can create PostgreSQL extensions to suit your use case.

For the most part, PostgreSQL is a better DBMS than MySQL. But in the end, it all comes down to your use case. If you are making a simple website or web application and you only need to store data, you are better off using MySQL. But if you are dealing with more complex, high-volume operations, consider going with PostgreSQL.

PostgreSQL vs MongoDB: Is PostgreSQL Better Than MongoDB?

A comparison between PostgreSQL and MongoDB is really a comparison between relational database management systems and NoSQL databases, and the answer to which is better boils down to your use case: how you want to use and structure your data. Each DBMS has characteristics that are useful in different situations.

If you are building an application with an unpredictable and dynamic data structure, you will want to go for a NoSQL database like MongoDB. NoSQL database management systems are known for their schema-less databases, meaning that the database structure does not have to be defined on creation. This makes NoSQL databases very flexible and easily scalable.

PostgreSQL is a better fit if you are working with data with a fixed, static structure that changes infrequently. PostgreSQL also has the advantage of SQL, a powerful and well-established query language. Relational database management systems are more appropriate for applications that require referential integrity, such as Fintech applications.

In recent years, both DBMS types have been adopting key features from the other. For example, as explained above, PostgreSQL supports key-value pairs and JSON data types, key features of NoSQL database management systems. MongoDB now claims to be ACID-compliant, a key feature of relational database management systems (RDBMS).

However, neither feature works quite as it does in the DBMS type that originated it. For example, some analyses have pointed out remaining issues with MongoDB’s ACID compliance. Also, while PostgreSQL supports JSON data types and key-value pairs, it is not schema-less: you are still required to declare the table structure upon creation.

PostgreSQL: How to Connect to A Database Server

Before connecting to a database, make sure that you have downloaded and installed PostgreSQL on your operating system. Next, launch the psql application. This opens a dedicated command-line interface program for interfacing with the PostgreSQL database server.

Once psql launches, you will be asked to fill in the following fields sequentially: server, database, port, username, and password. You can keep the default values that were set while installing PostgreSQL by hitting Enter at each prompt.

When you get to the password input field, enter the password you set during installation for the “postgres” user. Once that is done and your identity has been validated successfully, you will be connected to the database server.
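
If you prefer to skip the prompts, you can pass the same values as flags in a single command; the host, port, user, and database shown here are simply the usual installation defaults:

psql -h localhost -p 5432 -U postgres -d postgres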

Another way to connect to a database is by using pgAdmin. pgAdmin is PostgreSQL’s GUI for interfacing with its database servers. To use pgAdmin, launch the application. This should open a web application on your browser. Right-click Servers in the top-left corner of the web app, then hover over Create and select Server… from the menu that pops up.

You can also click Add New Server under Quick Links. Whichever option you choose, you should now see a dialog box requesting some information.

Enter a name for the server, then navigate to the Connection tab. Under the Connection tab, input “localhost” as your Host name/address, then type in the postgres user’s password that was set up during the installation. Click Save to save the server. The dialog box will close, and you will be connected to the database server automatically.

Where Are PostgreSQL Databases Stored?

By default, PostgreSQL databases are stored in a data folder, but the location of this folder varies with the OS. On Windows, you will usually find it in either of the following locations: C:\Program Files (x86)\PostgreSQL\<version number>\data or C:\Program Files\PostgreSQL\<version number>\data.

On a Mac, if you installed PostgreSQL via homebrew, you will find it in /usr/local/var/postgres/data. Otherwise, it will be located in /Library/PostgreSQL/<version number>/data.

For Linux, the location varies with the Linux flavor. Sometimes, it is found in /usr/local/pgsql/data or /var/lib/postgresql/[version]/data.

To determine the location of the databases more accurately, enter the following command in psql:

SHOW data_directory;

PostgreSQL: How to Start the Database Server

Starting a PostgreSQL server is slightly different for each operating system. To start the server on Windows, first locate the data directory of the database. This is usually something like “C:\Program Files\PostgreSQL\13\data”. Copy the directory path, as you will need it in a moment. Then, launch Command Prompt and run the following command.

pg_ctl -D "C:\Program Files\PostgreSQL\13\data" start

The path should be the database directory path you copied. To stop the server, simply replace “start” with “stop” in the above command. You can also restart it by replacing “start” with “restart”.

When you attempt to run this command, you may get the following error: “pg_ctl is not recognized as an internal or external command.” To resolve this issue, add “C:\Program Files\PostgreSQL\13\bin” and “C:\Program Files\PostgreSQL\13\lib” to your system’s PATH environment variable.

For macOS, if you installed PostgreSQL with homebrew, use the following commands:

To start the database server manually, run the following command:

pg_ctl -D /usr/local/var/postgres start

Make sure that the directory path is that of your database.

To start the database server now and relaunch at login, run the following command:

brew services start postgresql

To stop the server for both scenarios, simply replace “start” with “stop.”

On Linux, it is a good idea to first set a password for the postgres user, since no password is set by default on installation. You can set the password with the following command:

sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'postgres';"

Of course, your password can be anything you choose it to be. Once the password is set, to start the server, enter the following command in the terminal:

sudo service postgresql start

To stop the server, replace “start” with “stop” in the command, just like with Windows and macOS.

PostgreSQL: How to Create A Database

To create a database, make sure that you are already connected to a database server. Follow the instructions above to do so. If you connected to the server via psql, enter the following command to create a database:

CREATE DATABASE new_database;

If you want to connect to your recently-created database, enter the following command:

\c new_database

You should now be connected to it.
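
Alternatively, you can create a database from your system shell with the createdb utility that ships with PostgreSQL; it simply wraps the same CREATE DATABASE statement:

createdb -U postgres new_database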

If you connected to the server via pgAdmin, on the web app, right-click on Databases, hover over Create, and select Database…

You should see a dialog box appear requesting certain details to create the database. You will need to input at least the name of the database to create the database. Enter a name in the Database field and click Save. You should now be able to see your recently-created database under Databases.

Where Are PostgreSQL Logs?

By default, PostgreSQL logs are stored in the log folder under the data folder, the default location for PostgreSQL databases. To confirm this, run the following command in psql:

SHOW log_directory;

Note that this command will only display a relative path; the path is relative to the data directory described above.
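
If you want the absolute path in a single query, you can combine the two settings; this assumes log_directory is a relative path, which is the default:

SELECT current_setting('data_directory') || '/' || current_setting('log_directory') AS log_path;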

Does PostgreSQL Have Stored Procedures?

Although PostgreSQL has always supported user-defined functions, it was not until its v11.0 release that it included support for Stored Procedures. To create a stored procedure in PostgreSQL, use the CREATE PROCEDURE statement. To execute a stored procedure, use the CALL statement.
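
Here is a minimal sketch of both statements, using a hypothetical accounts table; it requires PostgreSQL 11 or later:

CREATE PROCEDURE transfer(sender integer, receiver integer, amount numeric)
LANGUAGE plpgsql
AS $$
BEGIN
    UPDATE accounts SET balance = balance - amount WHERE id = sender;
    UPDATE accounts SET balance = balance + amount WHERE id = receiver;
END;
$$;

CALL transfer(1, 2, 100.00);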

Conclusion

PostgreSQL has seen active development for more than 30 years, having been created in the 1980s. During this time, PostgreSQL has matured significantly, and it is currently the second most popular database management system in the world, according to StackOverflow’s 2020 Annual Developer Survey.

Two major reasons for the popularity of PostgreSQL are its extensibility and the myriad of useful features available to its users. If you are selecting a DBMS for your project, and you have decided that you prefer an RDBMS over a NoSQL database, PostgreSQL would be an excellent choice for your application.

A Beginner’s Guide To Docker Compose

Docker Compose is one of the most useful tools for Software Developers and System Administrators. Many jobs require someone with knowledge of this technology, so Docker and Docker Compose are hot in the DevOps space. Without doubt, knowing how to use these technologies will benefit your IT career.

If you are a beginner to Docker Compose, but have some knowledge of Docker, this article is for you. You’ll get to learn about:

  • What is Docker Compose?
  • Popular Comparisons
  • Docker Compose vs Kubernetes
  • Docker Compose vs Docker Swarm
  • Installing Docker Compose
  • The Docker-Compose.yml File
  • Docker-Compose Commands

Before diving into the juicy parts of this article, a little background on the technology will be helpful.

Containerization has become a key part of software infrastructure, whether for large, medium, or small-scale projects. While containers are not new, Docker has made them popular. With containers, dependency issues become a thing of the past. Containers also play a huge role in making the micro-services architecture very effective: applications are broken into smaller services, each of which runs in its own container and communicates with the others.

The issue with doing this is that there end up being many containers running, and managing them all becomes complex. This creates the need for a tool that helps run and manage multiple containers, which is exactly what Docker Compose does. By the end of the article, you’ll understand the basic Docker Compose concepts and be able to use it yourself.

What is Docker Compose?

Without all the complexity, Docker Compose is a tool that lets you manage multiple Docker containers. Remember micro-services? The concept of splitting a web application into different services? Well, those services will run in individual containers which need to be managed.

Imagine a web application has some of these services:

  • Sign up
  • Sign in
  • Reset password
  • History
  • Chart

Following a microservice-like architecture, these services will be split and run in separate containers. Docker Compose makes it easy to manage all these containers, instead of managing them individually. It is important to note that Docker Compose doesn’t explicitly build Docker images. The job of building images is done by Docker through the Dockerfile.

Popular Comparisons

It is common to have many solutions to one problem, and Docker Compose solves the problem of managing multiple containers. As a result, it is often compared with other tools. Most of these comparisons are not quite valid, but it is still worth learning about them, as doing so helps you understand Docker Compose better.

The two comparisons to be discussed are:

  • Docker Compose vs Kubernetes
  • Docker Compose vs Docker Swarm

Docker Compose vs Kubernetes

Kubernetes is often compared to Docker Compose, but the similarities between the two tools are small and the differences are large. These technologies do not operate at the same level or scale, so comparing them directly is misleading.

Kubernetes, popularly known as K8s, is an open-source tool used to automate the deployment and management of containers (not restricted to Docker). With K8s, you can deploy and administer containers, ensuring they scale under different loads. Kubernetes keeps containers fault-tolerant and working optimally by having them self-heal, which you won’t get from Docker Compose.

Kubernetes is a more powerful tool. It is more suitable for administering containers for large-scale applications in production.

Docker Compose vs Docker Swarm

Docker Compose is also often compared to Docker Swarm, and that comparison is just as misplaced as the Kubernetes one. Instead, Docker Swarm is the tool that should be compared to Kubernetes.

Docker Swarm is an open-source tool that lets you perform container orchestration just as you would Kubernetes. Both have their pros and cons, but that is not the topic of discussion. You’ll do fine knowing that both are similar and neither is an alternative to Docker Compose.

Installing Docker Compose

Docker Compose is an official Docker tool, but it doesn’t come with the Docker installation. So, you need to install it as a separate package. The installation process of Docker Compose for Windows and Mac is available on the official site.

To install Docker Compose on Ubuntu, you can use the following command:

sudo apt-get install docker-compose

To install Docker Compose on other Linux distros, you can use curl. Simply run the following commands:

sudo curl -L https://github.com/docker/compose/releases/download/1.18.0/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose

Then:

sudo chmod +x /usr/local/bin/docker-compose

The first command downloads the specified Docker Compose release (1.18.0 in this example) to /usr/local/bin, the directory dedicated to locally installed executables. The second one sets the file permissions, making it executable.
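
To confirm that the installation worked, you can print the installed version:

docker-compose --version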

The Docker-Compose.yml File

It won’t be awfully wrong to say that a Docker Compose file is to Docker Compose what a Dockerfile is to Docker. Inside the Docker Compose file lie all the instructions that Docker Compose follows when managing the containers. Here, you define the services, which end up being containers. You also define the networks and volumes that the services depend on.

The Docker Compose file uses the YAML syntax, and you have to save it as docker-compose.yml. In a web app, you may have services for the backend, frontend, database, and message queues, and these services will need specific dependencies, such as networks, ports, and storage, for optimal operation. Everything needed for the entire application is defined in the Docker Compose file.

You need a basic understanding of the YAML syntax to write your compose file. If you aren’t familiar with that, it should take less than an hour to grasp. There’ll be a lot of key-value pairings or directives in your file. The top-level ones are:

  • Version
  • Services
  • Network
  • Volumes

However, only the version and services will be discussed, as you can define the other two in the services directive.
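
Before looking at each key, here is a minimal skeleton of a compose file, with a single hypothetical service named app that is built from a Dockerfile in the current directory:

version: "3"

services:
  app:
    build: .
    ports:
      - "8000:8000"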

Version

When writing your file, you’ll define the version first. As at the time of writing, the Docker Compose file format only has versions 1, 2, and 3. Version 3 is the most recent and the recommended one to use, as it has certain differences from the older versions.

You can specify the version to use for Docker Compose in the file as seen below:

  • version: “3”
  • version: “2.4”
  • version: “1.0”

Services

The services key is arguably the most important key in a Docker Compose file. Here, you specify the containers you want to create. There are a lot of options and tons of combinations for configuring containers in this section of the file. These are some options you can define under the services key:

  • Image
  • Container_name
  • Restart
  • Depends_on
  • Environment
  • Ports
  • Volumes
  • Networks
  • Entrypoint

In the rest of this section, you’ll learn how each of these options affect the containers.

Image

This option defines what image a service uses. It uses the same convention as when pulling an image from Docker Hub in a Dockerfile. Here’s an example:

image: postgres:latest

However, you are not restricted to Docker Hub images alone. You can also build images on your machine through your Docker Compose file, using a Dockerfile. You can use the “build”, “context”, and “dockerfile” directives to do this.

Here’s an example:

build:
    context: .
    dockerfile: Dockerfile

“Context” should contain the path to the directory with the Dockerfile. Then “dockerfile” contains the name of the Dockerfile to be used. It is conventional to always name your Dockerfiles as “Dockerfile”, but this gives an opportunity to use something different. You should note that this is not the only way to use an image through a Dockerfile.

Container_name

Docker assigns random names to containers, but you may want to have customized names for them. With the “container_name” key, you can give specific names to containers, instead of Docker’s randomly generated names.

Here’s an example:

container_name: linuxhint-app

However, there’s one thing you should be careful about: do not give the same container name to multiple services. Container names have to be unique, and reusing one will cause the services to fail.

Restart

Software infrastructure is doomed to fail eventually, and with that knowledge, it is easier to plan for recovering from failure. There are many reasons for a container to fail, so the restart key tells Docker Compose whether and when to bring the container back up. The available options are no, always, on-failure, and unless-stopped: the container will never restart, will always restart, will restart only on failure, or will always restart unless it is explicitly stopped.

Here’s an example:

restart: always

Depends_on

Services run in isolation, but practically, a service can’t do much on its own. It often needs to depend on other services; for example, the backend service of a web app will depend on databases, caching services, and so on. With the “depends_on” key, you can add these dependencies.

Here’s an example:

 depends_on:
    - db

Doing this means that Docker Compose will start those services before the current one. However, it doesn’t ensure that those services are ready for use. The only guarantee is that the containers will start.

Environment

Applications depend on certain variables. For security and ease of use, you extract them from the code and set them up as environment variables. Examples of such variables are API keys, passwords, and so on. These are common in web applications. Note that the environment key sets variables in the running container, at runtime; it does not make them available while an image is being built.

Look at this:

environment:
    API-KEY: 'the-api-key'
    CONFIG: 'development'
    SESSION_SECRET: 'the-secret'

If you need the variables to be available at build time instead, when using the “build” directive, you’ll need to define them in an “args” directive. The “args” directive is a sub-directive of “build”.

Here’s an example:

build:
    context: .
    args:
        api-key: 'the-api-key'
        config: 'development'
        session_secret: 'the-secret'

Ports

Although each container runs separately from the others, no container is useful in complete isolation. To provide a link for communicating with the “outside world”, you need to map ports: you map the Docker container’s port to a port on the host. From Docker, you may have come across the “-p” argument that is used to map ports. The ports directive works similarly to the “-p” argument.

ports:
    - "5000:8000"

Volumes

Docker containers have no built-in means of storing data persistently, so they lose their data when they are removed or recreated. With volumes, you can work around this. Volumes make it possible to create persistent data storage by mounting a directory from the Docker host into a directory inside the Docker container. You can also set up volumes under a top-level volumes key.

Here’s an example:

volumes:
    - host-dir:/test/directory

There are many other options available when configuring volumes; you can check them out in the official documentation.

Networks

Networks can also be created in services. With the networks key, you can set up the networking for individual services. Here, you can set the driver the network uses, whether it allows IPv6, and so on. You can define networks at the top level too, just like volumes.

Here’s an example:

networks:
    - default

There are many other options when configuring networks; you can check them out in the official documentation.

Entrypoint

When you start a container, you often must run certain commands. For example, if the service is a web application, you must start the server. The entrypoint key lets you do this. Entrypoint works like ENTRYPOINT in a Dockerfile. The only difference in this case is that whatever you define here overrides the ENTRYPOINT configuration in the Dockerfile.

Here’s an example:

entrypoint: flask run

Docker Compose Commands

After creating a Docker-Compose file, you need to run certain commands to get Compose to work. In this section, you’ll learn about some major Docker Compose commands. They are:

  • Docker-compose up
  • Docker-compose down
  • Docker-compose start
  • Docker-compose stop
  • Docker-compose pause
  • Docker-compose unpause
  • Docker-compose ps

Docker-compose up

This Docker Compose command builds the images (if needed), then creates and starts the Docker containers. The containers are from the services specified in the compose file. If the containers are already running and their configuration has changed, running docker-compose up again recreates them. The command is:

docker-compose up

Docker-compose start

This Docker-compose command starts Docker containers, but it doesn’t build images or create containers. So it only starts containers if they have been created before.

Docker-compose stop

You’ll often need to stop the containers after creating and starting them up. Here’s where the docker-compose stop command comes in handy. This command stops the running containers, but the containers and networks that were created remain intact; they are stopped, not removed.
The command is:

docker-compose stop

Docker-compose down

The docker-compose down command also stops Docker containers like the stop command does, but it goes the extra mile: docker-compose down doesn’t just stop the containers, it also removes them. The networks, volumes, and even the Docker images themselves can also be removed if you use certain arguments. The command is:

docker-compose down

If you intend to remove volumes, you specify by adding --volumes. For example:

docker-compose down --volumes

If you intend to remove images, you specify by adding --rmi all or --rmi local. For example:

docker-compose down --rmi all
docker-compose down --rmi local

Where all causes Docker Compose to remove all images, and local causes Docker Compose to remove only images without a custom tag set by the ‘image’ field.

Docker-compose pause

There are scenarios where you have to suspend a container, without killing or deleting it. You can achieve this with the Docker-compose pause command. It pauses the activities of that container, so you can resume them when you want to. The command is:

docker-compose pause

Docker-compose unpause

The docker-compose unpause command is the opposite of the docker-compose pause command. You can use it to resume processes that were suspended with docker-compose pause. The command is:

docker-compose unpause

Docker-compose ps

Docker-compose ps lists all the containers created from the services in the Docker-Compose file. It is similar to docker ps which lists all containers running on the docker host. However, docker-compose ps is specific to the containers from the Docker Compose file. The command is:

docker-compose ps

Bringing It All Together

Now that you have seen some of the key concepts behind a Docker Compose file, let’s bring it all together. Below is a sample Docker-Compose file for a Python Django web application. You’ll see a breakdown of every line in this file and see what they do.

version: '3'

services:
  db:
    image: postgres
  web:
    build: .
    command: python manage.py runserver 0.0.0.0:8000
    volumes:
      - .:/code
    ports:
      - "8000:8000"
    depends_on:
      - db

The short story is that with this Docker Compose file, a PostgreSQL database is created and a Django development server is started.

The long story is:

  1. This file uses the version 3 of Docker-Compose.
  2. It creates two services. The db and web services.
  3. The db service uses the official docker postgres image.
  4. The web service builds its own image from the current directory. Since it does not define the context and Dockerfile keys, Dockerfile is expected to be named “Dockerfile” by convention.
  5. The command that will run after the container starts is defined.
  6. The volume and ports are defined. Both use the convention of host:container mapping.
  7. For the volume, the current directory “.” is mapped to the “/code” directory inside the container. This keeps the code inside the container in sync with the host, so data in that directory is not lost every time the container is recreated.
  8. For port, the host’s port 8000 is mapped to the container’s port 8000. Note that the web app runs on the port 8000. Hence, the web app can be accessed on the host through that port.
  9. Finally, the web service depends on the db service. Hence, the web service will only start when the db container has started.
  10. More on the Dockerfile for the Django application and the Docker Compose file can be found in the official documentation.

Conclusion

You do not need to be an expert with Docker to use Docker Compose; as a beginner not intending to master the whole ecosystem, it is fine to learn only what you need. In this article, you’ve learnt the basics of Docker Compose. Now, you understand why Docker Compose is needed, the common but misleading comparisons, how to set up a Docker Compose config file, and the major commands. It’s exciting to know these things, but the real joy comes from putting them into practice. It’s time to get to work.

How to Parse XML Files Using Python’s BeautifulSoup

Data is literally everywhere, in all kinds of documents, but not all of it is useful, hence the need to parse documents to get the parts that are needed. XML documents are one such kind of document that holds data. They are very similar to HTML files, as they have almost the same kind of structure. Hence, you’ll need to parse them to get vital information, just as you would when working with HTML.

There are two major aspects to parsing XML files. They are:

  • Finding Tags
  • Extracting from Tags

You’ll need to find the tag that holds the information you want, then extract that information. You’ll learn how to do both when working with XML files before the end of this article.

Installation

BeautifulSoup is one of the most used libraries when it comes to web scraping with Python. Since XML files are similar to HTML files, it is also capable of parsing them. To parse XML files using BeautifulSoup though, it’s best that you make use of Python’s lxml parser.

You can install both libraries using the pip installation tool, through the command below:

pip install bs4 lxml

To confirm that both libraries are successfully installed, you can activate the interactive shell and try importing both. If no error pops up, then you are ready to go with the rest of the article.

Here’s an example:

$python
Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20)
[MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import bs4
>>> import lxml
>>>

Before moving on, you should create an XML file from the code snippet below. It’s quite simple, and should suit the use cases you’ll learn about in the rest of the article. Simply copy, paste in your editor and save; a name like sample.xml should suffice.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root testAttr="testValue">
The Tree
<children>
<child name="Jack">First</child>
<child name="Rose">Second</child>
<child name="Blue Ivy">
Third
<grandchildren>
<data>One</data>
<data>Two</data>
<unique>Twins</unique>
</grandchildren>
</child>
<child name="Jane">Fourth</child>
</children>
</root>

Now, in your Python script, you’ll need to read the XML file like a normal file and then pass it into BeautifulSoup. The remainder of this article will make use of the bs_content variable, so it’s important that you take this step.

# Import BeautifulSoup
from bs4 import BeautifulSoup as bs
content = []
# Read the XML file
with open("sample.xml", "r") as file:
    # Read each line in the file, readlines() returns a list of lines
    content = file.readlines()
    # Combine the lines in the list into a string
    content = "".join(content)
    bs_content = bs(content, "lxml")

The code sample above imports BeautifulSoup, then it reads the XML file like a regular file. After that, it passes the content into the imported BeautifulSoup library as well as the parser of choice.

You’ll notice that the code doesn’t import lxml. It doesn’t have to as BeautifulSoup will choose the lxml parser as a result of passing “lxml” into the object.
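
As an optional aside (the rest of this article sticks with “lxml”), lxml also provides an XML-specific parser that preserves tag case; you can select it by passing “xml” instead:

bs_content_xml = bs(content, "xml")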

Now, you can proceed with the rest of the article.

Finding Tags

One of the most important stages of parsing XML files is searching for tags. There are various ways to go about this when using BeautifulSoup; so you need to know about a handful of them to have the best tools for the appropriate situation.

You can find tags in XML documents by:

  • Names
  • Relationships

Finding Tags By Names

There are two BeautifulSoup methods you can use when finding tags by names. However, the use cases differ; let’s take a look at them.

find

From personal experience, you’ll use the find method more often than the other methods for finding tags in this article. The find method receives the name of the tag you want to get, and returns a BeautifulSoup object of the tag if it finds one; else, it returns None.

Here’s an example:

>>> result = bs_content.find("data")
>>> print(result)
<data>One</data>
>>> result = bs_content.find("unique")
>>> print(result)
<unique>Twins</unique>
>>> result = bs_content.find("father")
>>> print(result)
None
>>> result = bs_content.find("mother")
>>> print(result)
None

If you take a look at the example, you’ll see that the find method returns a tag if it matches the name, else it returns None. However, if you take a closer look at it, you’ll see it only returns a single tag.

For example, when find(“data”) was called, it only returned the first data tag, but didn’t return the other ones.

GOTCHA: The find method will only return the first tag that matches its query.

So how do you get to find other tags too? That leads us to the next method.

find_all

The find_all method is quite similar to the find method. The only difference is that it returns a list of tags that match its query. When it doesn’t find any tag, it simply returns an empty list. Hence, find_all will always return a list.

Here’s an example:

>>> result = bs_content.find_all("data")
>>> print(result)
[<data>One</data>, <data>Two</data>]
>>> result = bs_content.find_all("child")
>>> print(result)
[<child>First</child>, <child>Second</child>, <child>
Third
<grandchildren>
<data>One</data>
<data>Two</data>
<unique>Twins</unique>
</grandchildren>
</child>, <child>Fourth</child>]
>>> result = bs_content.find_all("father")
>>> print(result)
[]
>>> result = bs_content.find_all("mother")
>>> print(result)
[]

Now that you know how to use the find and find_all methods, you can search for tags anywhere in the XML document. However, you can make your searches more powerful.

Here’s how:

Some tags may have the same name, but different attributes. For example, the child tags have a name attribute and different values. You can make specific searches based on those.

Have a look at this:

>>> result = bs_content.find("child", {"name": "Rose"})
>>> print(result)
<child name="Rose">Second</child>
>>> result = bs_content.find_all("child", {"name": "Rose"})
>>> print(result)
[<child name="Rose">Second</child>]
>>> result = bs_content.find("child", {"name": "Jack"})
>>> print(result)
<child name="Jack">First</child>
>>> result = bs_content.find_all("child", {"name": "Jack"})
>>> print(result)
[<child name="Jack">First</child>]

You’ll see that there is something different about the use of the find and find_all methods here: they both have a second parameter.

When you pass in a dictionary as a second parameter, the find and find_all methods further their search to get tags that have attributes and values that fit the provided key:value pair.

For example, despite using the find method in the first example, it returned the second child tag (instead of the first child tag), because that’s the first tag that matches the query. The find_all tag follows the same principle, except that it returns all the tags that match the query, not just the first.

Finding Tags By Relationships

While less popular than searching by tag names, you can also search for tags by relationships. In the real sense though, it’s more of navigating than searching.

There are three key relationships in XML documents:

  • Parent: The tag in which the reference tag exists.
  • Children: The tags that exist in the reference tag.
  • Siblings: The tags that exist on the same level as the reference tag.

From the explanation above, you may infer that the reference tag is the most important factor in searching for tags by relationships. Hence, let’s look for the reference tag, and continue the article.

Take a look at this:

>>> third_child = bs_content.find("child", {"name": "Blue Ivy"})
>>> print(third_child)
<child name="Blue Ivy">
Third
<grandchildren>
<data>One</data>
<data>Two</data>
<unique>Twins</unique>
</grandchildren>
</child>

From the code sample above, the reference tag for the rest of this section will be the third child tag, stored in a third_child variable. In the subsections below, you’ll see how to search for tags based on their parent, sibling, and children relationship with the reference tag.

Finding Parents

To find the parent tag of a reference tag, you’ll make use of the parent attribute. Doing this returns the parent tag, as well as the tags under it. This behaviour is quite understandable, since the children tags are part of the parent tag.

Here’s an example:

>>> result = third_child.parent
>>> print(result)

<children>
<child name="Jack">First</child>
<child name="Rose">Second</child>
<child name="Blue Ivy">
Third
<grandchildren>
<data>One</data>
<data>Two</data>
<unique>Twins</unique>
</grandchildren>
</child>

<child name="Jane">Fourth</child>
</children>

Finding Children

To find the children tags of a reference tag, you’ll make use of the children attribute. Doing this returns the children tags, as well as the sub-tags under each one of them. This behaviour is also understandable, as the children tags often have their own children tags too.

One thing you should note is that the children attribute returns the children tags as a generator. So if you need a list of the children tags, you’ll have to convert the generator to a list.

Here’s an example:

>>> result = list(third_child.children)
>>> print(result)

['\n        Third\n     ', <grandchildren>
<data>One</data>
<data>Two</data>
<unique>Twins</unique>
</grandchildren>, '\n']

If you take a closer look at the example above, you’ll notice that some values in the list are not tags. That’s something you need to watch out for.

GOTCHA: The children attribute doesn’t only return the children tags, it also returns the text in the reference tag.

Finding Siblings

The last in this section is finding tags that are siblings to the reference tag. For every reference tag, there may be sibling tags before and after it. The previous_siblings attribute will return the sibling tags before the reference tag, and the next_siblings attribute will return the sibling tags after it.

Just like the children attribute, the previous_siblings and next_siblings attributes will return generators. So you need to convert to a list if you need a list of siblings.

Take a look at this:

>>> previous_siblings = list(third_child.previous_siblings)
>>> print(previous_siblings)

['\n', <child name="Rose">Second</child>, '\n',
<child name="Jack">First</child>, '\n']

>>> next_siblings = list(third_child.next_siblings)
>>> print(next_siblings)

['\n', <child name="Jane">Fourth</child>]

>>> print(previous_siblings + next_siblings)

['\n', <child name="Rose">Second</child>, '\n', <child name="Jack">First</child>,
 '\n', '\n', <child name="Jane">Fourth</child>, '\n']

The first example shows the previous siblings, the second shows the next siblings; then both results are combined to generate a list of all the siblings for the reference tag.

Extracting From Tags

When parsing XML documents, a lot of the work lies in finding the right tags. However, when you find them, you may also want to extract certain information from those tags, and that’s what this section will teach you.

You’ll see how to extract the following:

  • Tag Attribute Values
  • Tag Text
  • Tag Content

Extracting Tag Attribute Values

Sometimes, you may have a reason to extract the values for attributes in a tag. In the following attribute-value pairing for example: name=”Rose”, you may want to extract “Rose.”

To do this, you can make use of the get method, or access the attribute’s name using [] like an index, just as you would when working with a dictionary.

Here’s an example:

>>> result = third_child.get("name")
>>> print(result)

Blue Ivy

>>> result = third_child["name"]
>>> print(result)

Blue Ivy

Extracting Tag Text

When you want to access the text values of a tag, you can use the text or strings attribute. Both will return the text in a tag, and even the children tags. However, the text attribute will return them as a single string, concatenated; while the strings attribute will return them as a generator which you can convert to a list.

Here’s an example:

>>> result = third_child.text
>>> print(result)

'\n    Third\n      \nOne\nTwo\nTwins\n\n'

>>> result = list(third_child.strings)
>>> print(result)

['\n  Third\n      ', '\n', 'One', '\n', 'Two', '\n', 'Twins', '\n', '\n']

Extracting Tag Content

Aside from extracting the attribute values and tag text, you can also extract all of a tag’s content. To do this, you can use the contents attribute; it is a bit similar to the children attribute and will yield the same results. However, while the children attribute returns a generator, the contents attribute returns a list.

Here’s an example:

>>> result = third_child.contents
>>> print(result)

['\n        Third\n     ', <grandchildren>
<data>One</data>
<data>Two</data>
<unique>Twins</unique>
</grandchildren>, '\n']

Printing Beautiful

So far, you’ve seen some important methods and attributes that are useful when parsing XML documents using BeautifulSoup. But you may have noticed that when you print the tags to the screen, they have some kind of clustered look. While appearance may not have a direct impact on your productivity, a cleaner printout can help you parse more effectively and makes the work less tedious.

Here’s an example of printing the normal way:

>>> print(third_child)

<child name="Blue Ivy">
Third
<grandchildren>
<data>One</data>
<data>Two</data>
<unique>Twins</unique>
</grandchildren>
</child>

However, you can improve its appearance by using the prettify method. Simply call the prettify method on the tag while printing, and you’ll get something visually pleasing.

Take a look at this:
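
Here is a minimal sketch using the third_child tag from earlier; the exact whitespace in your output may differ slightly:

>>> print(third_child.prettify())
<child name="Blue Ivy">
 Third
 <grandchildren>
  <data>
   One
  </data>
  <data>
   Two
  </data>
  <unique>
   Twins
  </unique>
 </grandchildren>
</child>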

Conclusion

Parsing documents is an important aspect of sourcing data. XML documents are pretty popular, and hopefully you are now better equipped to take them on and extract the data you want.

From this article, you are now able to:

  • search for tags either by names, or relationships
  • extract data from tags

If you feel quite lost, and are pretty new to the BeautifulSoup library, you can check out the BeautifulSoup tutorial for beginners.

OpenCV Crash Course for Python Developers

Computer Vision and Image Processing can be applied in a lot of areas, and to carry out such tasks a powerful library like OpenCV will always come in handy.

The Open Source Computer Vision Library, known as OpenCV for short, is very popular among Machine Learning engineers and Data Scientists. There are many reasons for this, but the major one is that OpenCV makes it easy to get started with challenging Computer Vision tasks.

As a Python developer, this crash course will equip you with enough knowledge to get started. You will learn how to:

  • Install OpenCV
  • Work with Images & Windows in OpenCV
  • Edit Images with OpenCV
  • Work with Videos in OpenCV

At the end of the article, you’ll be skilled enough to work with images and videos, take on image processing and computer vision tasks, and even build a basic Photoshop-like tool of your own by combining OpenCV with a GUI library!

Installing OpenCV

Python, Java, and C++ are some of the languages with an OpenCV library, but this article will look into Python’s OpenCV.

OpenCV is cross platform, but you’ll need to have Python installed on your computer to get started. For Linux and Mac OS users, Python comes with the OS by default, so you do not have to bother about getting it installed. For Windows users, you’ll need to download and install the executable from the official Python Site.

Tip: Do not forget to tick the “Add Python to PATH” option when installing Python to make it easier to access it from the Command Prompt.

Open the terminal or command prompt and type in:

python

The command above will activate the interactive shell, which indicates a successful installation process.

Next step is to install the OpenCV and Numpy libraries; the Numpy library will come in handy at some point in this crash course.

The pip command below can help with installing both libraries:

pip install opencv-python numpy

OpenCV may have installation issues, but the command above should do the magic and install both libraries. You can import OpenCV and Numpy in the interactive shell to confirm a successful installation process.

Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0] on linux

Type "help", "copyright", "credits" or "license" for more information.

>>> import cv2
>>> import numpy

You can move on with the rest of this crash course if you do not face any error, the show is about to get started.

Working with Images & Windows in OpenCV

Windows are the fundamentals of OpenCV as a lot of tasks depend on creating windows. In this section, you’ll learn how to create, display and destroy windows. You’ll also see how to work with images too.

Here are the things to be looked at in this section

  • Creating Windows
  • Displaying Windows
  • Destroying Windows
  • Resizing Windows
  • Reading Images
  • Displaying Images
  • Saving Images

The code samples and images used in this section can be found on the Github repository.

Creating Windows

You’ll create windows almost every time you work with OpenCV; one of the reasons is to display images. As you’ll come to see, to display an image with OpenCV, you’ll need to create a window first, then display the image through that window.

When creating a window, you’ll use OpenCV’s namedWindow method. The namedWindow method requires you to pass in a window name of your choice and a flag; the flag determines the nature of the window you want to create.

The second flag can be one of the following:

  • WINDOW_NORMAL: The WINDOW_NORMAL flag creates a window that can be manually adjustable or resizeable.
  • WINDOW_AUTOSIZE: The WINDOW_AUTOSIZE flag creates a window that can’t be manually adjustable or resizeable. OpenCV automatically sets the size of the window in this case and prevents you from changing it.

There are three flags you can use for the OpenCV window, but the two above remain the most popular, and you’d often not find a use for the third.

Here’s how you call the namedWindow method:

cv2.namedWindow(name, flag)

Here’s an example:

cv2.namedWindow('Normal', cv2.WINDOW_NORMAL)
cv2.namedWindow('Autosize', cv2.WINDOW_AUTOSIZE)

The example above will create a resizable window with the name “Normal,” and an unresizable window with the name “Autosize.” However, you won’t get to see any window displaying; this is because simply creating a window doesn’t get it to display automatically, you’ll see how to display a window in the next section.

Displaying Windows

Just as there’s no point creating a variable if you won’t be using it, there’s no point creating a window if you won’t be displaying it. To display the window, you’ll need OpenCV’s waitKey method. The waitKey method requires you to pass in the duration for displaying the window, which is in milliseconds.

In essence, the waitKey method displays the window for a certain duration waiting for a key to be pressed, after which it closes the window.

Here’s how you call the waitKey method:

cv2.waitKey(milliseconds)

Here’s an example:

cv2.namedWindow('Normal', cv2.WINDOW_NORMAL)
cv2.waitKey(5000)
cv2.namedWindow('Normal II', cv2.WINDOW_NORMAL)
cv2.waitKey(0)

When you run the code sample above, you’ll see that it creates a window called “Normal”, which deactivates after five seconds; then it creates a window called “Normal II” and something strange happens.

The “Normal II” window refuses to close. This behavior is due to the use of the argument value 0 which causes the window to stay up “forever” until a key is pressed. Pressing a key causes the waitKey method to immediately return the integer which represents the Unicode code point of the character pressed, so it doesn’t have to wait till the specified time.
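
Here is a minimal sketch that uses the returned value to report which key closed the window; it assumes an ordinary character key is pressed:

cv2.namedWindow('Normal', cv2.WINDOW_NORMAL)
key = cv2.waitKey(0)        # blocks until a key is pressed
print(chr(key & 0xFF))      # print the character corresponding to the pressed key
cv2.destroyAllWindows()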

Gotcha: When the waitKey method times out or returns a value, the window becomes inactive, but it doesn’t get destroyed; so you’ll still see it on your screen. In the next section, you’ll see how to close a window after it becomes inactive.

Destroying Windows

To completely close a window, you’ll need to destroy it, and OpenCV provides the destroyWindow and destroyAllWindows methods which can help with this, though with different use cases.

You’ll use the destroyWindow to close a specific window as the method requires you to pass in the name of the window you intend destroying as a string argument. On the other hand, you’ll use the destroyAllWindows method to close all windows, and the method doesn’t take in any argument as it destroys all open windows.

Here’s how you call both methods:

cv2.destroyWindow(window_name)
cv2.destroyAllWindows()

Here’s an example:

cv2.namedWindow('Sample One', cv2.WINDOW_NORMAL)
cv2.waitKey(5000)
cv2.destroyWindow('Sample One')
cv2.namedWindow('Sample Two', cv2.WINDOW_AUTOSIZE)
cv2.namedWindow('Sample Three', cv2.WINDOW_NORMAL)
cv2.waitKey(5000)
cv2.destroyAllWindows()

When you run the code sample above, it will create and display a window named “Sample One” which will be active for 5 seconds before the destroyWindow method destroys it.

After that, OpenCV will create two new windows: “Sample Two” and “Sample Three.” Both windows are active for 5 seconds before the destroyAllWindows method destroys both of them.

To mention it again, you can also get to close the window by pressing any button; this deactivates the window in display and calls the next destroy method to close it.

Tip: When you have multiple windows open and want to destroy all of them, the destroyAllWindows method will be a better option than the destroyWindow method.

Resizing Windows

While you can pass in the WINDOW_NORMAL attribute as a flag when creating a window so that you can resize it using the mouse, you can also set the size of the window to a specific dimension through code.

When resizing a window, you’ll use OpenCV’s resizeWindow method. The resizeWindow method requires you to pass in the name of the window to be resized, and the x and y dimensions of the window.

Here’s how you call the resizeWindow method:

cv2.resizeWindow(name, x, y)

Here’s an example:

cv2.namedWindow('image', cv2.WINDOW_AUTOSIZE)
cv2.resizeWindow('image', 600, 300)
cv2.waitKey(5000)
cv2.destroyAllWindows()

The example will create a window with the name “image,” which is automatically sized by OpenCV due to the WINDOW_AUTOSIZE attribute. The resizeWindow method then resizes the window to a 600-by-300 dimension before the window closes five seconds after.

Reading Images

One key reason you’ll find people using the OpenCV library is to work on images and videos. So in this section, you’ll begin to see how to do that and the first step will be reading images.

When reading images, you’ll use OpenCV’s imread method. The imread method requires you to pass in the path to the image file as a string; it then returns the pixel values that make up the image as a 2D or 3D Numpy array.

Here’s how you call the imread method:

cv2.imread(image_path)

Here’s an example:

image = cv2.imread("./images/testimage.jpg")
print(image)

The code above will read the “testimage.jpg” file from the “images” directory, then print out the Numpy array that makes up the image. In this case, the image is a 3D array. It’s a 3D array because OpenCV reads images in three channels (Blue, Green, Red) by default.

The Numpy array gotten from the image takes a format similar to this:

[[[255 204   0]
[255 204   0]
[255 204   0]
...,
[255 204   0]
[255 204   0]
[255 204   0]]
...

Gotcha: Always ensure to pass the right file path into the imread method. OpenCV doesn’t raise errors when you pass in the wrong file path, instead it returns a None data type.

While the imread method works fine with only one argument, which is the name of the file, you can also pass in a second argument. The second argument will determine the color mode OpenCV reads the image in.

To read the image as Grayscale instead of BGR, you’ll pass in the value 0. Fortunately, OpenCV provides an IMREAD_GRAYSCALE attribute that you can use instead.

Here’s an example:

image = cv2.imread("./images/testimage.jpg", cv2.IMREAD_GRAYSCALE)
print(image)

The code above will read the “testimage.jpg” file in Grayscale mode, and print the Numpy array that makes up the image.
The result will take a format similar to this:

[[149 149 149 ..., 149 149 149]
[149 149 149 ..., 149 149 149]
[149 149 149 ..., 149 149 149]
...,
[149 149 149 ..., 148 148 149]
[149 149 149 ..., 148 148 149]
[149 149 149 ..., 148 148 149]]

The Numpy array you’ll get from reading an image in Grayscale mode is a 2D array; this is because Grayscale images have only one channel compared to three channels from BGR images.
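
A quick way to see the difference is to compare the array shapes; this is a minimal sketch using the same test image as above:

color = cv2.imread("./images/testimage.jpg")
gray = cv2.imread("./images/testimage.jpg", cv2.IMREAD_GRAYSCALE)
print(color.shape)   # (height, width, 3) -> three BGR channels
print(gray.shape)    # (height, width)    -> a single channel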

Displaying Images

All this while, you’ve created windows without images in them; now that you can read an image using OpenCV, it’s time to display images through the windows you create.

When displaying images, you’ll use OpenCV’s imshow method. The imshow method requires the name of the window for displaying the image, and the Numpy array for the image.

Here’s how you call the imshow method:

cv2.imshow(window_name, image)

Here’s an example:

image = cv2.imread('./images/testimage.jpg')
cv2.namedWindow('Cars', cv2.WINDOW_NORMAL)
cv2.imshow('Cars', image)
cv2.waitKey(5000)
image = cv2.imread('./images/testimage.jpg', cv2.IMREAD_GRAYSCALE)
cv2.imshow('Cars', image)
cv2.waitKey(5000)
cv2.destroyWindow('Cars')

The code sample above will read the image, create a window named “Cars” and display the image through the window for five seconds using the imshow method. When the 5-second limit elapses, OpenCV will read the image again but this time in Grayscale mode; the same window displays the Grayscale image for five seconds then closes.

Image of Cars

Saving Images

In the latter part of this crash course, you’ll modify images, add watermarks, and draw shapes on them. So you’ll need to save your images so you don’t lose those changes.

When saving images, you’ll use OpenCV’s imwrite method. The imwrite method requires you to pass in the path where you intend to save the image file, and the Numpy array that makes up the image you want to save.

Here’s how you call the imwrite method:

cv2.imwrite(path, image)

Here’s an example:

gray_image = cv2.imread("./images/testimage.jpg", cv2.IMREAD_GRAYSCALE)
cv2.imwrite("./images/grayimage.jpg", gray_image)

The code above will read the “testimage.jpg” image in Grayscale mode, then save the Grayscale image as “grayimage.jpg” to the “images” directory. Now, you’ll have copies of the original and Grayscale image saved in storage.
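Two things worth knowing about imwrite: it picks the output format from the file extension, and it returns True or False rather than raising an error when the write fails. Here’s a small sketch illustrating both points; the output file name is just an example:

import cv2

image = cv2.imread("./images/testimage.jpg")

# the .png extension makes OpenCV encode the same pixel data as a PNG file
saved = cv2.imwrite("./images/testimage_copy.png", image)

# imwrite returns True on success and False otherwise (for example, when the
# target directory doesn't exist), so the return value is worth checking
print("Saved:", saved)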

Editing Images with OpenCV

It’s about time to go a bit deeper into the world of image processing with OpenCV. You’ll find the knowledge of creating windows, reading images, and displaying images from the previous section useful here; you also need to be comfortable working with Numpy arrays.

Here are the topics covered in this section:

  • Switching Color Modes
  • Editing Pixel Values
  • Joining Images
  • Accessing Color Channels
  • Cropping Images
  • Drawing on Images
  • Blurring Images

The code samples and images used in this section can be found on the Github repository.

Switching Color Modes

When processing images for tasks such as medical image processing, computer vision, and so on, you’ll often find reasons to switch between various color modes.

You’ll use OpenCV’s cvtColor method when converting between color modes. The cvtColor method requires you to pass in the Numpy array of the image, followed by a flag that indicates what color mode you want to convert the image to.

Here’s how you call the cvtColor method:

cv2.cvtColor(image, flag)

Here’s an example:

image = cv2.imread('./images/testimage.jpg')
image_mode = cv2.cvtColor(image, 36)
cv2.imshow('Cars', image_mode)
cv2.waitKey(5000)
cv2.destroyAllWindows()

The code sample above will convert the image from the BGR to YCrCb color mode; this is because of the use of the integer value 36 which represents the flag for BGR to YCrCb conversions.

Here’s what you’ll get:

A YCrCb Image of Cars

OpenCV provides attributes that you can use to access the integer value that corresponds to the conversion you want to make; this makes it easier to convert between different modes without memorizing the integer values.

Here are some of them:

  • COLOR_RGB2GRAY: The COLOR_RGB2GRAY attribute is used to convert from the RGB color mode to Grayscale color mode.
  • COLOR_RGB2BGR: The COLOR_RGB2BGR attribute is used to convert from the RGB color mode to BGR color mode.
  • COLOR_RGB2HSV: The COLOR_RGB2HSV attribute is used to convert from the RGB color mode to HSV color mode.
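Each of these attributes is just a named integer constant, so the attribute form and the raw integer used in the earlier example are interchangeable. A tiny sketch to confirm this:

import cv2

# prints 36, the same value passed directly to cvtColor in the earlier example
print(cv2.COLOR_BGR2YCrCb)

# prints the integer behind the BGR-to-Grayscale conversion flag
print(cv2.COLOR_BGR2GRAY)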

Here’s an example that converts an image from the default BGR color mode to Grayscale:

image = cv2.imread('./images/testimage.jpg')
image_gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
cv2.imshow('Cars', image_gray)
cv2.waitKey(5000)
cv2.destroyAllWindows()

The code sample above will read the image using the imread method, then convert it from the default BGR mode to Grayscale before displaying the image for five seconds.

Here’s the result:

A Grayscale Image of Cars

Editing Pixel Values

Images are made up of picture elements known as pixels, and every pixel has a value that gives it color, based on the color mode or channel. To make edits to an image, you need to alter its pixel values.

There is no specific method for editing pixel values in OpenCV; however, since OpenCV reads the images as Numpy arrays, you can replace the pixel values at different positions in the array to get the desired effect.

To do this, you need to know the image’s dimensions and number of channels; you can get both from the shape attribute.

Here’s an example:

image = cv2.imread("./images/testimage.jpg")
print(image.shape)

The code sample above will yield the result:

(720, 1280, 3)

From the result, you can see that the image has a 720 (height) by 1280 (width) dimension and three channels. Don’t forget that OpenCV reads images by default with BGR (Blue, Green, and Red) channel ordering.

Here’s a second example:

image_gray = cv2.imread("./images/testimage.jpg", cv2.IMREAD_GRAYSCALE)
print(image_gray.shape)

The code sample above will yield the result:

(720, 1280)

From the result, you can see that the image has a 720 (height) by 1280 (width) dimension and a single channel. The image has only one channel because the code reads it in Grayscale mode, and Grayscale images have only one channel.

Now that you have an idea of the image’s properties by dimension and channels, you can alter the pixels.
Here’s a code sample:

image = cv2.imread('./images/testimage.jpg', cv2.IMREAD_GRAYSCALE)
edited_image = image.copy()
# set every row in the first 640 columns to 0 (black)
edited_image[:, :640] = 0
cv2.namedWindow('Cars',cv2.WINDOW_NORMAL)
cv2.imshow('Cars', edited_image)
cv2.waitKey(5000)
cv2.destroyWindow('Cars')

The code sample above makes the left half of the image black. In Grayscale mode, the value 0 represents black and 255 represents white, with the values in between being different shades of gray.

Here’s the result:

Left Side of Image Filled With Black

Since the image has a 720-by-1280 dimension, the code sets every pixel in the first 640 columns (column index 0 up to, but not including, 640) to zero, which turns the left half of the image black.

Gotcha: The Numpy array returned by imread is indexed as rows (height) first, then columns (width), so image[:, :640] selects every row and the first 640 columns. Keep this ordering in mind when slicing images.

The copy method is used to ensure that OpenCV copies the image array into a separate variable. Copying is important because changes made directly to the original array cannot be undone; working on a copy keeps the original pixel values intact.

In summary, the concept of editing pixel values involves assigning new values to the pixels to achieve the desired effect.
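The same idea carries over to three-channel images; the only difference is that each pixel holds a (Blue, Green, Red) triple instead of a single value. Here’s a short sketch that paints a blue patch over the top-left corner of the image; the region and color are arbitrary choices for illustration:

import cv2

image = cv2.imread('./images/testimage.jpg')
edited_image = image.copy()

# rows 0-99 and columns 0-199, across all three channels, are set to
# (255, 0, 0), which is pure blue in OpenCV's BGR ordering
edited_image[:100, :200] = (255, 0, 0)

cv2.imshow('Cars', edited_image)
cv2.waitKey(5000)
cv2.destroyAllWindows()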

Joining Images

Have you ever seen an image collage, with different images placed side by side? If you have, then you already understand the need to join images.

OpenCV doesn’t provide methods that you can use to join images. However, the Numpy library will come in handy in this scenario.

Numpy provides the hstack and vstack functions, which you can use to stack arrays side by side horizontally or vertically. Note that hstack requires the images to have the same height (and number of channels), while vstack requires the same width.

Here’s how you call both methods:

np.hstack((image1, image2, ..., imagen))
np.vstack((image1, image2, ..., imagen))

Here’s an example of both in action:

image = cv2.imread("./images/logo.jpg")
hcombine = np.hstack((image, image, image))
cv2.imshow("Cars Combined", hcombine)
cv2.waitKey(5000)
vcombine = np.vstack((image, image, image))
cv2.imshow("Cars Combined", vcombine)
cv2.waitKey(5000)
cv2.destroyAllWindows()

The code sample above will read the image, stack three copies of the resulting Numpy array horizontally, then display the result for five seconds. The second part of the code stacks three copies of the same array vertically and displays that result for five seconds as well.

Here’s the result:

Horizontal Stack of Three Images

 

Accessing Color Channels

In the last two sections, you looked at joining images and editing pixel values (for Grayscale images). However, things get a bit more involved when the image has three channels instead of one.

When it comes to images with three channels, you can access the pixel values of individual colour channels. While OpenCV doesn’t provide a method to do this, you’ll find it to be an easy task with an understanding of Numpy arrays.

When you read an image with three channels, the resulting Numpy array is a 3D array. So one way to view an individual channel is to set the other channels to zero.

So you can view the following channels by:

  • Red channel: Setting the Blue and Green channels to zero.
  • Blue channel: Setting the Red and Green channels to zero.
  • Green channel: Setting the Red and Blue Channels to zero.

Here’s an example:

image = cv2.imread("./images/logo.jpg")
image_r = image.copy()
image_r[:, :, 0] = 0
image_r[:, :, 1] = 0
cv2.imshow("Red Channel", image_r)
cv2.waitKey(5000)
cv2.destroyAllWindows()

The code sample above will copy the image’s Numpy array, set the Blue and Green channel to zero, then display an image with only one active channel (the Red channel).

Here’s a code sample that displays all three channels side by side in the same window:

image = cv2.imread("./images/logo.jpg")
image_b = image.copy()
image_b[:, :, 1] = 0
image_b[:, :, 2] = 0
image_g = image.copy()
image_g[:, :, 0] = 0
image_g[:, :, 2] = 0
image_r = image.copy()
image_r[:, :, 0] = 0
image_r[:, :, 1] = 0
numpy_horizontal = np.hstack((image_b, image_g, image_r))
cv2.namedWindow('image',cv2.WINDOW_NORMAL)
cv2.resizeWindow('image', 800, 800)
cv2.imshow("image", numpy_horizontal)
cv2.waitKey(5000)
cv2.destroyAllWindows()

The code sample above reads the image, extracts the corresponding color channels, then stacks the results horizontally before displaying them on the screen.

Horizontal Stack of an Image’s Blue, Green and Red Channels
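As an alternative to zeroing channels by hand, OpenCV’s split and merge methods achieve the same effect: split breaks the image into its individual planes, and merge rebuilds a three-channel image from them. Here’s a sketch of that approach:

import cv2
import numpy as np

image = cv2.imread("./images/logo.jpg")

# split returns the Blue, Green and Red planes as separate 2D arrays
b, g, r = cv2.split(image)
zeros = np.zeros_like(b)

# rebuild three-channel images with the other two planes zeroed out
image_b = cv2.merge((b, zeros, zeros))
image_g = cv2.merge((zeros, g, zeros))
image_r = cv2.merge((zeros, zeros, r))

cv2.imshow("image", np.hstack((image_b, image_g, image_r)))
cv2.waitKey(5000)
cv2.destroyAllWindows()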

Cropping Images

There are many reasons for which you may want to crop an image, but the end goal is to extract the desired aspect of the image from the complete picture. Image cropping is popular, and it’s a feature you’ll find on almost every image editing tool. The good news is that you can pull it off using OpenCV too.

To crop an image using OpenCV, the Numpy library will be needed; so an understanding of Numpy arrays will also come in handy.

The idea behind cropping an image is to figure out the corners of the region you intend to crop. With Numpy, you only need the top-left and bottom-right corners; you then extract the region using index slicing.

Going by the explanation above, you’ll need four values:

  • X1 and Y1: the X and Y coordinates of the top-left corner
  • X2 and Y2: the X and Y coordinates of the bottom-right corner

The crop itself is the slice image[Y1:Y2, X1:X2]; note that the row (Y) range comes first.

Below is a code sample to show the concept of cropping images:

image = cv2.imread('./images/testimage.jpg')
cv2.namedWindow('Cars',cv2.WINDOW_NORMAL)
edited_image = image.copy()
# keep rows 30 to 190 (Y1:Y2) and columns 205 to 560 (X1:X2)
edited_image = edited_image[30:190, 205:560]
cv2.imshow('Cars', edited_image)
cv2.waitKey(5000)
cv2.destroyWindow('Cars')

Here’s the result:

Drawing on Images

OpenCV allows you to alter images by drawing on them, for example by adding text and drawing lines, circles, rectangles, ellipses, and polygons. You’ll learn how to do this in the rest of this section, as OpenCV provides specific functions for drawing each of these on images.

You’ll see how to add the following to images in this section:

  • Text
  • Lines
  • Circles

Text

OpenCV provides the putText method for adding text to images. The putText method requires you to pass in the image’s Numpy array, the text, the positioning coordinates as a tuple, the desired font, the font scale, the color, and the thickness.

Here’s how you call the putText method:

cv2.putText(image, text, (x, y), font, font_scale, color, thickness)

For the fonts, OpenCV provides some attributes that you can use for selecting fonts instead of memorizing the integer values.

Here are some of them:

  • FONT_HERSHEY_COMPLEX
  • FONT_HERSHEY_DUPLEX
  • FONT_HERSHEY_PLAIN
  • FONT_ITALIC
  • QT_FONT_BOLD
  • QT_FONT_NORMAL

You can experiment with the different font types to find the one that best suits your purpose.

Here’s a code example that adds text to an image:

image = cv2.imread('./images/croppedimage.jpg')
font = cv2.FONT_HERSHEY_COMPLEX
cv2.putText(image,'LinuxHint',(85,32), font, 0.8,(0, 0, 0),1)
cv2.namedWindow('Car',cv2.WINDOW_NORMAL)
cv2.imshow('Car', image)
cv2.waitKey(5000)
cv2.destroyWindow('Car')

The code above reads in the image (the cropped image from the previous section). It then selects the flag for the font of choice before adding the text to the image and displaying it.

Here’s the result:

“LinuxHint” on a Vehicle

Lines

OpenCV provides the line method for drawing lines on images. The line method requires you to pass in the image’s Numpy array, positioning coordinates for the start of the line as a tuple, positioning coordinates for the end of the line as a tuple, the line’s color and thickness.

Here’s how you call the line method:

cv2.line(image, (x1, y1), (x2, y2), color, thickness)

Here’s a code sample that draws a line on an image:

image = cv2.imread('./images/testimage.jpg')
cv2.line(image,(0,380),(1280,380),(0,255,0),10)
cv2.namedWindow('Car',cv2.WINDOW_NORMAL)
cv2.imshow('Car', image)
cv2.waitKey(5000)
cv2.destroyWindow('Car')

The code sample above will read the image, then draw a green line on it. In the code sample’s second line, you’ll see the coordinates for the start and end of the line passed in as different tuples; you’ll also see the color and thickness.

Here’s the result:

A Green Line Drawn at The Middle of the Image

Drawing Circles

OpenCV provides the circle method for drawing circles on images. The circle method requires you to pass in the image’s Numpy array, center coordinates (as a tuple), the circle’s radius, color, and thickness.

Here’s how you call the circle method:

cv2.circle(image,(x, y), radius, color, thickness)

Tip: To draw a circle with the thinnest possible outline, pass in a thickness of 1; passing in -1, on the other hand, fills the circle completely, so watch out for that.

Here’s a code sample to show the drawing of a circle on an image:

image = cv2.imread('./images/testimage.jpg')
cv2.circle(image,(110,125), 100, (0,0,255), -1)
cv2.circle(image,(1180,490), 80, (0,0,0), 1)
cv2.namedWindow('Car',cv2.WINDOW_NORMAL)
cv2.imshow('Car', image)
cv2.waitKey(5000)
cv2.destroyWindow('Car')

The code sample above draws two circles on the image. The first circle has a thickness value of -1, so it is completely filled. The second has a thickness value of 1, so it has the thinnest possible outline.

Here’s the result:

Two Circles Drawn on an Image

You can also draw other objects such as rectangles, ellipses, or polygons using OpenCV, but they all follow the same principles.
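For instance, here’s a quick sketch using the rectangle method; the corner coordinates below are arbitrary and only meant to illustrate the call:

import cv2

image = cv2.imread('./images/testimage.jpg')

# rectangle takes the top-left and bottom-right corners as tuples, then the
# color and thickness; a thickness of -1 would fill the rectangle instead
cv2.rectangle(image, (200, 100), (600, 400), (255, 0, 0), 3)

cv2.namedWindow('Car', cv2.WINDOW_NORMAL)
cv2.imshow('Car', image)
cv2.waitKey(5000)
cv2.destroyWindow('Car')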

Blurring Images

So far, you’ve seen OpenCV perform, at a basic level, some of the tasks you’d find in a powerful photo-editing tool such as Photoshop. That’s not all; you can also blur images using OpenCV.

OpenCV provides the GaussianBlur method, which you can use for blurring images using Gaussian Filters. To use the GaussianBlur method, you’ll need to pass in the image’s Numpy array, kernel size, and sigma value.

You don’t have to worry too much about the concepts of kernel size and sigma value for now. However, note that kernel dimensions must be odd numbers, such as 3×3, 5×5, or 7×7, and the larger the kernel size, the greater the blurring effect.

The sigma value, on the other hand, is the Gaussian standard deviation; a value of 0 works fine, because OpenCV then computes the standard deviation from the kernel size. You may decide to learn more about sigma values and kernels for image filters later.

Here’s how you call the GaussianBlur method:

cv2.GaussianBlur(image, kernel_size, sigma)

Here’s a code sample that performs the blurring of an image:

image = cv2.imread('./images/testimage.jpg')
blurred = cv2.GaussianBlur(image, (5,5), 0)
cv2.namedWindow('Cars', cv2.WINDOW_NORMAL)
cv2.imshow('Cars', blurred)
cv2.waitKey(5000)
cv2.destroyWindow('Cars')

The code sample above uses a kernel size of 5×5 and here’s the result:

A Little Blurring on the Image

Tip: The larger the kernel size, the greater the blur effect on the image.

Here’s an example:

image = cv2.imread('./images/testimage.jpg')
blurred = cv2.GaussianBlur(image, (25,25), 0)
cv2.namedWindow('Cars', cv2.WINDOW_NORMAL)
cv2.imshow('Cars', blurred)
cv2.waitKey(5000)
cv2.destroyWindow('Cars')

As you’ll see with the result, the image experiences more blur using a kernel size of 25×25. Here it is:

Increased Blurring on an Image

Working with Videos in OpenCV

So far, you’ve seen how powerful OpenCV can be with working with images. But, that’s just the tip of the iceberg as this is a crash course.

Moving forward, you’ll learn how to make use of OpenCV when working with videos.

Here are the things to be looked at in this section:

  • Loading Videos
  • Displaying Videos
  • Accessing the WebCam
  • Recording Videos

Just as a specific image was used for the sections on working with images, you’ll find the video for this tutorial in the “videos” directory of the GitHub repository, with the name “testvideo.mp4.” However, you can make use of any video of your choice.

If you take a closer look at videos, you’ll realize they are really sequences of images displayed over time, so most of the principles that apply to images also apply to videos.

Loading Videos

Just as with images, loading a video doesn’t mean displaying the video. However, you’ll need to load (read) the video file before you can go ahead to display it.

OpenCV provides the VideoCapture method for loading videos. The VideoCapture method requires you to pass in the path to the video file, and it returns a VideoCapture object.

Here’s how you call the VideoCapture method:

cv2.VideoCapture(file_path)

Here’s a code sample that shows how you load a video:

video = cv2.VideoCapture('./videos/testvideo.mp4')

Gotcha: The same pitfall as with loading images applies here. Always make sure you pass in the right file path; OpenCV won’t raise an error when the path is wrong. Instead, the returned VideoCapture object simply won’t open: its isOpened method will return False, and read will return (False, None).

The code sample above should correctly load the video. After the video loads successfully, you’ll still need to do some work to get it to display, and the concept is very similar to what you’ll do when trying to display images.

Displaying Videos

Playing videos with OpenCV is almost the same as displaying images, except that you read and display frames in a loop, and the waitKey method becomes essential to the entire process.

On successfully loading a video file, you can go ahead to display it. Videos are like images, but a video is made up of a lot of images that display over time. Hence, a loop will come in handy.

The VideoCapture method returns a VideoCapture object when you use it to load a video file. The VideoCapture object has an isOpened method that returns the status of the object, so you’ll know if it’s ready to use or not.

If the isOpened method returns a True value, you can proceed to read the contents of the file using the read method.

OpenCV doesn’t have a displayVideo method or anything along those lines for playing videos, but you can work your way around that using a combination of the available methods.

Here’s a code sample:

video = cv2.VideoCapture('./videos/testvideo.mp4')
while(video.isOpened()):
    ret, image = video.read()
    if image is None:
        break
    cv2.imshow('Video Frame', image)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

video.release()
cv2.destroyAllWindows()

The code sample loads the video file using the VideoCapture method, then checks if the object is ready for use with the isOpened method and creates a loop for reading the images.

The read method in the code works like the read method for reading files; it reads the frame at the current position and then advances to the next one, waiting to be called again.

In this case, the read method returns two values: the first shows the status of the attempt to read the frame (True or False), and the second is the image’s Numpy array.

Going by the explanation above, when the read method gets to a point where there’s no image frame to read, it simply returns (False, None) and the break keyword gets activated. If that’s not the case, the next line of code displays the image that the read method returns.

Remember the waitKey method?

The waitKey method displays each frame for at least the number of milliseconds passed into it. In the code sample above, the value is 1, so each image frame only displays for about one millisecond. The next code sample below uses the value 40, so each image frame displays for about forty milliseconds, and the slower playback becomes visible.

The code section with 0xFF == ord(‘q’) checks whether the “q” key is pressed on the keyboard while the waitKey method displays the frame; if it is, the loop breaks.

The rest of the code has the release method which closes the VideoCapture object, and the destroyAllWindows method closes the windows used in displaying the images.

Here’s the code sample with the argument value of 40 passed into the waitKey method:

video = cv2.VideoCapture('./videos/testvideo.mp4')
while(video.isOpened()):
    ret, image = video.read()
    if image is None:
        print(ret)
        break
    cv2.imshow('Video Frame', image)
    if cv2.waitKey(40) & 0xFF == ord('q'):
        break
video.release()
cv2.destroyAllWindows()
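If you’d rather not hard-code the delay, you can derive it from the frame rate stored in the video file itself. Here’s a sketch assuming the same “testvideo.mp4”; some files report 0 for the FPS property, hence the fallback value:

import cv2

video = cv2.VideoCapture('./videos/testvideo.mp4')

# CAP_PROP_FPS reports the frame rate stored in the file; fall back to a
# 40 ms delay when the property isn't available and get() returns 0
fps = video.get(cv2.CAP_PROP_FPS)
delay = int(1000 / fps) if fps > 0 else 40

while video.isOpened():
    ret, image = video.read()
    if image is None:
        break
    cv2.imshow('Video Frame', image)
    if cv2.waitKey(delay) & 0xFF == ord('q'):
        break

video.release()
cv2.destroyAllWindows()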

Accessing the WebCam

So far, you’ve seen how to load a video file from your computer. However, such a video won’t display in real-time. With the webcam, you can display real-time videos from your computer’s camera.

Activating the webcam requires the VideoCapture method, which was used to load video files in the previous section. However, in this case, you will be passing the index value of the webcam into the VideoCapture method instead of a video file path.

The first webcam on your computer has the index value 0, and if you have a second one, it will have the value 1.

Here’s a code sample below that shows how you can activate and display the contents of your computer’s webcam:

video = cv2.VideoCapture(0)
while(video.isOpened()):
    ret, image = video.read()
    cv2.imshow('Live Cam', image)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
video.release()
cv2.destroyAllWindows()

The value 1 is used for the waitKey method because a real-time video display needs the waitKey method to have the smallest possible wait-time. Once again, to make the video display lag, increase the value passed into the waitKey method.

Recording Videos

Being able to activate your computer’s webcam allows you to make recordings, and you’ll see how to do just that in this section.

OpenCV provides the VideoWriter and VideoWriter_fourcc methods. You’ll use the VideoWriter method to write the video frames to a file, and VideoWriter_fourcc to determine the codec for compressing the frames; the codec is identified by a four-character code (FourCC).

Here’s how you call the VideoWriter_fourcc method:

cv2.VideoWriter_fourcc(codes)

Here are some examples you’ll find:

cv2.VideoWriter_fourcc('H','2','6','4')
cv2.VideoWriter_fourcc('X','V','I','D')

The VideoWriter method, on the other hand, receives the file name you wish to save the video with, the fourcc object from the VideoWriter_fourcc method, the video’s FPS (frames per second) value, and the frame size.

Here’s how you call the VideoWriter method:

cv2.VideoWriter(filename, fourcc, fps, frame_size)

Below is a code sample that records video using the webcam and saves it as “out.avi”:

video = cv2.VideoCapture(0)
fourcc = cv2.VideoWriter_fourcc('X','V','I','D')
writer = cv2.VideoWriter('out.avi',fourcc, 15.0, (640,480))
while(video.isOpened()):
    ret, image = video.read()
    writer.write(image)
    cv2.imshow('frame',image)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

video.release()
writer.release()
cv2.destroyAllWindows()

The code sample above activates the computer’s webcam and sets up the fourcc to use the XVID codec. After that, it calls the VideoWriter method by passing in the desired arguments such as the fourcc, 15.0 for FPS and (640, 480) for the frame size.

The value 15.0 is used as FPS because it provides a realistic speed for the video recording. But you should experiment with higher or lower values to get a desirable result.
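One thing to keep in mind: if the frame size passed to VideoWriter doesn’t match the frames the webcam actually produces, the output file can turn out empty or unplayable. A way to avoid that is to query the capture object for its own properties; here’s a sketch, with 15.0 kept as a fallback FPS when the webcam doesn’t report one:

import cv2

video = cv2.VideoCapture(0)

# query the capture so the writer's frame size matches the webcam's output
width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = video.get(cv2.CAP_PROP_FPS) or 15.0

fourcc = cv2.VideoWriter_fourcc('X', 'V', 'I', 'D')
writer = cv2.VideoWriter('out.avi', fourcc, fps, (width, height))

while video.isOpened():
    ret, image = video.read()
    if not ret:
        break
    writer.write(image)
    cv2.imshow('frame', image)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

video.release()
writer.release()
cv2.destroyAllWindows()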

Conclusion

Congratulations on getting to the end of this crash course. You can visit the Github repository for the code, for reference purposes. You now know how to use OpenCV to display images and videos, crop and edit images, create a photo collage by combining images, and switch between color modes for computer vision and image processing tasks, among other newly gained skills.

In this OpenCV crash course, you’ve seen how to:

  • Set up the library
  • Work with Images & Windows
  • Edit Images
  • Work with Videos

Now you can go ahead to take on advanced OpenCV tasks such as face recognition, creating a GUI application for editing images, or checking out Sentdex’s OpenCV series on YouTube. ]]> How to Set up IRC on Ubuntu https://linuxhint.com/irc_ubuntu/ Mon, 27 May 2019 17:42:01 +0000 https://linuxhint.com/?p=41023 Internet Relay Chat (IRC) is a tool for communicating in plain text, without the use of other media such as images or videos. IRC uses a Client-Server model; you can connect to a server by making use of an IRC Client, and this is what you’ll be learning in this article.

There are a lot of IRC Clients out there; some are available through a browser, but the issue with those is that you lose your chat history when you close the page and try to return to it. By using a dedicated IRC Client, you can prevent this and have a better experience when using IRC for chats.

You won’t only see how to install these IRC Clients; you’ll also see how to set them up and join “chat rooms” or “channels” to begin communicating.

IRC Clients

To make use of IRC on your computer, you need to make use of an IRC Client. In this article, you’ll learn how to set up your IRC Client using two Ubuntu applications.

The IRC Clients you’ll learn to set up are:

  • Polari
  • Pidgin

Both IRC Clients serve the same purpose, but Pidgin can work for other purposes besides being an IRC Client, while Polari is mainly an IRC Client.

Polari

Polari is a powerful IRC Client that lets users easily connect to IRC servers and rooms; it has a good user interface and feels like a modern messenger app.

You can install Polari through three methods:

  • The App Store
  • The Package Management Tool
  • Flatpak

The App Store

The easiest means of installing Polari is through the App Store: simply search for “Polari” and click the install button to have the tool installed on your computer.

The Package Management Tool

The package management tool is another method of installing Polari; you can do this through either the apt or apt-get utility.

If you prefer using apt, you can install Polari with the command below:

sudo apt install polari

You can also install Polari through the apt-get tool; this can be done with the command below:

sudo apt-get install polari

Using either tool will produce the same result; it’s simply a matter of preference.

Flatpak

Flatpak is another alternative for installing Polari; to use it, you need to have Flatpak installed on your machine.

The command below is used to install Polari through Flatpak:

flatpak install flathub org.gnome.Polari

If you have Flatpak installed, it is likely that you’ll find different versions of Polari on the App Store; however you’ll be able to differentiate between the Polari app provided by the App Store and the one provided by Flatpak.

Connecting to a Server

After installing Polari on your computer, you can easily connect to an IRC Server as the app is easy to use.

You can do that in the following steps:

Click on the Add (+) button at the top left corner of the application.

Select the network of choice; you’ll find different networks available to be selected such as Freenode, EFnet, GNOME.

By default, Polari makes selecting a chatroom easy as it loads popular chat rooms under that network.
Click on the chat room of interest to join.

It’s as easy as that; note, however, that some chat rooms will require you to be registered and will provide steps to guide you through the registration process.

That aside, Polari has done its job.

Pidgin

Unlike Polari, Pidgin serves other purposes aside from being an IRC Client. Pidgin is a tool that also works with other instant-messaging protocols such as AIM, ICQ, and Gadu-Gadu.

You can install Pidgin through three methods:

  • The App Store
  • The Package Management Tool
  • Flatpak

The App Store

The App Store serves as a quick method of installing Pidgin. Search for “Pidgin” on the store; its icon is purple, with a bird that looks like a pigeon. When you find it, click the install button to install Pidgin.

The Package Management Tool

You can also install Pidgin using the package management tools, apt or apt-get. Using any of these tools will install the same Pidgin app on your Ubuntu machine, so your choice is simply a matter of preference.

You can install Pidgin through apt with the command below:

sudo apt install pidgin

You can also install Pidgin through apt-get with the command below:

sudo apt-get install pidgin

Flatpak

As with Polari, you can also install Pidgin through the Flatpak software utility tool. You need to have Flatpak installed on your Ubuntu machine to install apps through it.

The command below is used to install Pidgin through Flatpak:

flatpak install flathub im.pidgin.Pidgin

Applications available through Flatpak will often show up in the App Store, so you may find multiple versions of Pidgin when you search for it on the App Store.

Connecting to Server

When Pidgin is installed, you can set it up to use as an IRC Client or any other Client by following the steps below:

Click on the “Accounts” menu from the “Buddy List” page.


Click on “Add” to add a server to the IRC Client.

Choose the protocol, username, and password; in this case, choose “IRC” since the goal is to connect to an IRC Server, then click on “Add.”

You’ll get to see a popup after a while, then you can join chat rooms by clicking on “Conversation” and “Join a Chat.”

If you have a channel name in mind, you can type it in and click “Join.”
You can also click “Room List” and Pidgin will provide a list of Chat Rooms for that server.

Conclusion

IRC is popular among software developers, especially those involved in open-source projects; however, you may also find it being used by other groups of people. While some people think IRC is dead, it is not; it performs well in low-bandwidth environments and is fault-tolerant.

In this article, you’ve seen how to install two powerful IRC Clients on your Ubuntu machine and the step-by-step process in setting up these IRC Clients.

]]>
How to Install GNU Octave and External Packages https://linuxhint.com/install_gnu_octave_packages/ Mon, 29 Apr 2019 07:41:20 +0000 https://linuxhint.com/?p=39333 Numerical computations are essential in a lot of industries. Today, machine learning and deep learning are the driving force of different technologies, and mathematical computations help in data processing, before running machine learning or deep learning models on available data.

MATLAB is one of the most popular tools for numerical computations. MATLAB stands for Matrix Laboratory and is used primarily for numerical computations and symbolic computing.

The downside to MATLAB is that it’s proprietary software and not a free tool; this discourages a lot of people from using it or forces them to turn to general-purpose programming languages for their processing.

What is GNU Octave?

GNU Octave is a tool for performing numerical computations just like MATLAB. GNU means “GNU’s Not Unix!”, and GNU software is free of charge.

While there are other software inspired by MATLAB, GNU Octave’s syntax is very similar to that of MATLAB; hence you can use it as a direct replacement for MATLAB.

You should note that Octave includes features that go beyond MATLAB, so it has certain syntax that won’t work in MATLAB. If you can pay for MATLAB, you should go ahead, but if you can’t, you’d do just fine with GNU Octave. Just ensure you stick to MATLAB syntax instead of making use of GNU Octave-only syntax if you intend to import the code into a MATLAB environment.

Installation Methods

There are different methods you can use to install GNU Octave. All methods are relatively easy, as none requires you to fiddle with configuration files before installation. Choose whichever suits you; they should all work properly.

In this section, you’ll see how you can install GNU Octave through the following methods:

  • FlatPak
  • Ubuntu Software Manager
  • Apt Install

FlatPak

Just like Snaps, FlatPak can be used to quickly install Linux packages. FlatPak is used for software deployment, package management and provides a sandbox for running applications.

Steps for installing GNU Octave through FlatPak:

  1. Ensure you have FlatPak installed. You can check whether FlatPak is installed by running the command flatpak --version on the command line. An error message indicates that FlatPak is not installed yet. Move to step two to install FlatPak, or to step three if it is already installed.
  2. To install FlatPak, you can make use of the apt-get tool. You can install FlatPak with the command sudo apt-get install flatpak.
  3. Since FlatPak is now installed, you need to add the Flathub repository. Flathub is the app store for Linux apps, and you’ll be installing GNU Octave from it. The command flatpak remote-add --if-not-exists flathub https://flathub.org/repo/flathub.flatpakrepo is used to add the Flathub repository.
  4. Now that the Flathub repository has been added, you can now install GNU Octave. The command flatpak install flathub org.octave.Octave will be used to install GNU Octave. Note that if the Flathub repository has not been added to the repository list, FlatPak will not find GNU Octave.

Ubuntu Software Manager

The Ubuntu Software Manager can be considered to be the official Appstore for the Ubuntu OS. Installing GNU Octave with the Ubuntu Software Manager is arguably the simplest method on this list.

Steps for installing GNU Octave through the Ubuntu Software Manager:

  1. Launch the Ubuntu Software Manager
  2. Search for GNU Octave
  3. Select the GNU Octave icon in the results
  4. Select “Install”

As you can see, the steps required to install GNU Octave through the Ubuntu Software Manager are very minimal, so you may decide to go with this option.

Apt Install

Aside from the options discussed earlier in the article, Octave can also be installed using apt with the command below:

sudo apt-get install octave

While you should be able to launch Octave by typing octave into the command line, it may not launch the Graphical User Interface in all cases, so you can force it to launch the GUI by adding the --force-gui flag.

This can be seen below:

octave --force-gui

Octave Packages

GNU Octave does come with a lot of built-in features, but these features can be extended using external packages.

In this section, you’ll learn how to install and remove Octave packages. Some of these packages provide extensions for Arduino Microcontrollers, Databases, Fuzzy Logic Toolkit, Image Processing functions, etc.

Before diving into the process of installing Octave packages, you’ll need to install a package on your Debian/Ubuntu machine.

GNU Octave depends on the liboctave-dev package to install external packages.

You can install liboctave-dev with the command below:

sudo apt install liboctave-dev

Installing the Package

To use external packages to extend the functionality of GNU Octave, you need to download the package’s file from the package list.

After downloading, you can run the command below in GNU Octave’s command window to install it:

pkg install package-name.tar.gz

For example, after downloading the Image Processing package; it can be installed with the command:

pkg install image-2.10.0.tar.gz

The message displayed after running the command is:

>> pkg install image-2.10.0.tar.gz

For information about changes from previous versions of the image package, run ‘news image’

Loading the Package

After installing your package, you can’t immediately have access to the functions that the package provides; hence you need to load it first.

To load a package, you have to make use of the “load” keyword with the pkg command.

pkg load package-name

You do not have to include the version of the package to use it.

For example, to load the image processing package installed earlier, the command below is used:

pkg load image

The image package should be loaded, and you can access the functions provided by the image package.

Uninstalling the Package

You can uninstall packages just as you installed them; the difference here is that the argument is “uninstall” to remove a package instead of “install” for installing a package.

pkg uninstall package-name

For example, to remove the image processing package you can run:

pkg uninstall image

Conclusion

The installation process of GNU Octave and its packages isn’t complicated. It’s as simple as typing in the commands discussed in this article, and you’re ready to go.

There’s a lot more you can do with GNU Octave packages aside from installing, loading, and removing them, but these simple tasks should be sufficient when working with the tool.

]]>
How to Setup Raspberry Pi in Headless Mode on Ubuntu https://linuxhint.com/raspberry_pi_headless_mode_ubuntu/ Tue, 09 Apr 2019 13:38:43 +0000 https://linuxhint.com/?p=38633 Different people have different reasons for getting the Raspberry Pi; but for a large percentage, it’s for carrying out amazing projects. Setting up the Raspberry Pi is the first step in this direction, and you’re going to get that done in a couple of minutes.

In this article, you’ll see how you can make use of your Raspberry Pi in headless mode using Ubuntu. Headless mode means that the Raspberry Pi is running without a monitor, keyboard, or mouse.

You’ll be making use of a WiFi connection, so you should have one set up, as you’ll need it in the later parts of this article.

Using the Raspberry Pi

Since the Raspberry Pi is a microcomputer, it can actually work like a computer does despite doing so with limited resources. One way to use it is to connect it to a monitor, keyboard and mouse.

Not everybody has access to those accessories, so alternatives are sought. Running the Raspberry Pi headless is a well-known alternative, as you can make use of the Pi through another computer where a monitor, keyboard, and mouse are available.

Getting an Operating System

Hardware is nothing without software. Your Raspberry Pi is no more useful than a piece of paper without software on it. You need software working on it, in this case an operating system, to get anything done.

You can download the Raspbian OS and write the image to the SD Card you intend to use for the Raspberry Pi. Doing this is beyond the scope of this article, but you can use a USB SD Card reader and follow the same steps you would take to create a bootable USB.

Enabling SSH on the Pi

SSH should be active on the Raspberry Pi since that’s the method you intend using to run the device in headless mode. Unfortunately, this utility doesn’t come enabled by default on the Pi so you’ll need to enable it yourself.

After writing the image to the SD Card, you need to create an empty file in the boot directory of the SD Card. The file should be named ssh, without any extension. You can do this in the terminal by using the touch command in that partition.

touch ssh

When you boot up the Raspberry Pi, it checks for this file. If it finds it, it enables SSH and deletes the file.

Setting Up the WiFi

To use the Raspberry Pi in headless mode, you can make use of an Ethernet connection. But in this case, you’ll see how to set it up using a WiFi connection.

For this to work, your computer has to be connected to a WiFi network; the WiFi source doesn’t need to have internet access.

Just like you added an ssh file to the boot directory of the SD Card, you’ll add a file called wpa_supplicant.conf to the same boot directory.

Simply copy the following content into the file:

ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
country=«your_ISO-3166-1_two-letter_country_code»
 
network={
ssid="«your_SSID»"
psk="«your_PSK»"
key_mgmt=WPA-PSK
}

You’ll replace «your_SSID» with the name of the WiFi network being used and «your_PSK» with the WiFi password. The «your_ISO-3166-1_two-letter_country_code» should be replaced with a suitable code from here.

Fetching the IP Address

To SSH into the Raspberry Pi, you need to know its IP address. In this section, you’ll see how to do that.

You need to have the nmap tool installed on your machine to be able to follow up with this part of the tutorial. If you do not already have it installed, you can quickly do that using:

sudo apt-get install nmap

Nmap is a security network scanner that will help you scan for the Raspberry Pi’s IP address. After installing Nmap, find your computer’s IP address by using the hostname command.

Simply type into the terminal:

hostname -I

This command lists all the IP addresses assigned to the host. You’ll get a result similar to 192.168.x.x, which is your machine’s IP address on the WiFi network.

To find the specific IP address for the Raspberry Pi, type in the following command:

nmap -sP 192.168.x.0/24

Replace x with the actual value you can see on your machine after running the hostname -I command.

You’ll see a couple of IP addresses showing up. The addresses should have hostnames attached to them, so it’s easy to figure out which one belongs to the Raspberry Pi.

Accessing the Pi

Now that you know the IP address, you can use it to access the Raspberry Pi through SSH. The default username for the Raspberry Pi is pi, and the default password is raspberry.

Now that this is known, you can ssh into it with the command below:

ssh pi@piaddress

Replace the piaddress with the actual IP address of the Raspberry Pi and you should be in the Raspberry Pi.

Remember that the default password is a generic one, so you need to change it. The Raspbian OS is a Debian-based Linux distro, so you can perform similar tasks on it as you would on a regular Linux distro such as Ubuntu.

As an example, you can change the default password by simply typing in the passwd command into the terminal.

You’ll receive a prompt requesting you to type in your current password and then the new password.

Accessing the Pi’s Graphical User Interface

You are accessing the Pi through SSH so you are only able to use it through the terminal. But you can’t do much with it from just the terminal, so you’ll need to gain access to the GUI for a more fulfilling experience.

To do this, you need to use a tool called RealVNC. Note that you’ll be installing RealVNC inside the Raspberry Pi and not your Ubuntu machine.

Therefore, ensure that you have pi@raspberrypi showing on your terminal prompt by SSH-ing into the Pi.

Before installing, update the package repository list using:

sudo apt-get update

Then you can install with the command below:

sudo apt-get install -y realvnc-vnc-server realvnc-vnc-viewer

Now that RealVNC has been installed on Raspbian, you can fetch the address that your Ubuntu machine will use to connect to it.

To do that, type in the command below:

vncserver

You’ll get an IP address which should be noted or copied somewhere as you’ll need it soon.

Since RealVNC on the Raspberry Pi needs to work with the Ubuntu machine through an IP address, you need to also have it installed on your Ubuntu machine.

You can download RealVNC for Ubuntu here and it’s a deb file. You can install it through the terminal or using the “Software Install” package on Ubuntu.

After the install, you’ll find the icon of the VNC Viewer on your application list. Click on the “File” menu and the “New Connection” option.

You’ll find a box to input the IP address you copied a while ago, and you can also give a name to it. Click “OK” to save.

Now you can right click on the newly created option and connect to it. There you have it, the GUI of the Raspberry Pi should display on your screen.

 

Conclusion

After following the various points of this article, your Raspberry Pi should be accessible on your computer, through the terminal and the Graphical User Interface (GUI).

In this article, a step-by-step approach has been taken to setting up the Raspberry Pi to run headless from an Ubuntu machine. Tools such as Nmap and RealVNC are important in this process, and the Pi can now be used by SSH-ing into it over a WiFi connection.

You shouldn’t have any hiccups while going this route, but if you do, kindly ask questions. It’s time to get started with amazing projects.

]]>
Building A Web Crawler Using Octoparse https://linuxhint.com/octoparse_web_crawler/ Fri, 08 Mar 2019 19:53:45 +0000 https://linuxhint.com/?p=37187 Welcome friends, remember the write up on the top twenty web scraping tools? Octoparse made the list as one of the most powerful tools.

Recently, I picked up the tool, and I was impressed with how much Octoparse allows its users to do. In this article, you’ll see what Octoparse is about, get an introduction to its built-in scraper, and also see how you can build your own scraper from scratch.

Octoparse is a tool used in scraping data from websites. It is an easy to use web crawler application to fetch data without having to write any additional line of code.

Octoparse is not complicated to use, and in just three steps, you can do great stuff with this powerful web crawling tool. All you require is the URL you need to extract data from and a couple of clicks.

It does not have any limitation as to what kind of website it can scrape data from. Also, exporting data is made easy, whether in the form of a CSV file or through an API.

You can take advantage of Octoparse features. Some of them are:

  • It lets you build web crawlers fast without writing a line of code
  • It provides a cloud service for scheduled data extraction and IP rotation
  • It offers unlimited storage
  • It allows you to hire professional data scraping experts from Octoparse to do the job for you

With this, you have a solid concept as to what Octoparse is, its purpose and how to get started with it.

Getting Started With Octoparse

Before building our first web crawler, let’s set up our environment for development. We start by downloading Octoparse from their official website. I recommend you download the Octoparse 7.1 version.

Why Octoparse 7.1?

Octoparse 7.1 comes with features you won’t find in older versions of the tool:

  • Task templates which aid with predefined templates when scraping data from websites such as Amazon or eBay.
  • The dashboard has a structured new look which provides more information to the user.
  • Ability to scrape data from multiple URLs by importing them from an excel sheet, CSV or text file.
  • An anti-blocking feature to bypass protections that prevent users from scraping data from a website.

You can download the Octoparse version 7.1 executable. It only works on Windows operating systems, so you’ll need VirtualBox to run it on your Linux machine. Octoparse provides a guide on using the tool for users of Linux machines.

Introduction To Task Template

Task template is a feature introduced into the latest version of Octoparse, designed to make web scraping easier for everybody regardless of technical knowledge.

How To Use Task Template

To save you time: there is really no lengthy process involved in using task templates. However, some inputs are required, including the target URL, keywords to search for, and any other parameters needed to extract the data of your choice from the website.

Octoparse already has built-in templates for sites you’re likely to scrape data from, including Google, Amazon, eBay, and Walmart, among others. Let’s try one of the built-in task templates.

You start off by selecting a template of your choice; in this case, let’s use the eBay task template. After selecting the template, you will be prompted to input your parameters based on the needed data. These parameters are the target URL or a keyword to search for.

Within the parameter box, input “Nike shoes” as the keyword. With this, Octoparse does the rest of the task by fetching all data based on your parameters, in this case, all Nike shoes. This data is ready to be used for whatever purpose you have in mind.

For further analysis on your scraped data, navigate to the data field tab of your task template to view extra information on all contents on the web page, which includes Nike shoe images, the seller name, the price and number of inventory.

You can also navigate to the sample output tab to view information about the data such as product name, product URL and many more data virtually related to all Nike shoes on eBay.

You’ve seen how easy it is to scrape data with task template. Play around with the task template and scrape data from eBay. Try out other built-in task templates such as Walmart or Google with Octoparse.

Building A Web Crawler With Octoparse

You’ve come this far; now it’s time to build a web crawler with Octoparse. You have the foundational knowledge of scraping data from a website with a task template. However, you can also build a web crawler yourself.

In building a web crawler with Octoparse, there are two approaches. They are:

  • Wizard Mode
  • Advanced Mode

Building A Web Crawler With Octoparse Wizard Mode

The Wizard Mode approach is actually an easier and faster way to scrape data from a website. With a smooth step by step interface, you can have your web crawler up and running in no time. However, you are advised to use Advanced Mode for more complex data scraping.

With Wizard Mode, you can scrape data from tables, links or items in pages. Limited to the scope of this tutorial, you’ll learn to build a web crawler for a single web page.

To begin with, launch your Octoparse application and create a new task from the Wizard Mode and enter the URL you would like to scrape data from. You can rename the Group input field to anything that seems cool to you and click the next button.

You will be navigated to a new page to select the extraction type, and since you are working on scraping data from a single web page, you’ll select the single page option. With your extraction data type defined, you can now define your fields.

To define your fields, select the target data on the single web page; once you do, Octoparse auto-fills the data into the fields. You can then edit the field properties into whatever you like, and you can add more data by clicking the add more fields button.

By following these steps, you will be able to extract data from a single web page in less than five minutes.

Building A Web Crawler With Octoparse Advanced Mode

The Wizard Mode can be used in scraping simple websites with easy structure, but websites designed with more complex structures will be a tougher task. The Advanced Mode is the tool you’ll use to scrape such websites.

Go ahead and launch your Octoparse application. Under the Advanced Mode, create a new task, enter the URL you’d like to scrape data from, and hit the save button. This navigates you to the task configuration workflow.

The task configuration workflow interface gives you more flexibility towards how you would want to extract data. The predefining workflow feature is turned off by default, so turn it on to get started with it.

In Advanced Mode, when you select data on the webpage, you are provided with action tips to perform for the selected data.

From the webpage you want to crawl data from, when you click on an item, you’ll see the action tips at the bottom right of the page. The action tips allow you select what you want to do such as extracting data.

With Advanced Mode, you can spend most of your time creating your workflow on how to extract data and once you are past this stage, your task workflow will be ready for use. Simply click on the start extraction button for Octoparse to work according to your workflow.

Working with Advanced Mode might seem a bit difficult to comprehend for first timers, but you’ll become more comfortable with it over time.

Conclusion

You can scrape websites by writing code for web scrapers, but this can be time consuming. Octoparse gives you great results, without you writing code or spending time working on the scraper logic.

In this article, you’ve seen what Octoparse is about, how it saves you time and effort. You’ve also seen how you can make use of the built-in task templates to scrape data from certain websites, and also build your own powerful web scrapers.

Octoparse is currently available only as a Windows executable, so you’ll need the VirtualBox to use it on your Linux machine.

You can visit the Octoparse official website to know more about the Advanced Mode and Wizard Mode so you can web scrape a lot of websites. ]]> Understanding The Dockerfile https://linuxhint.com/understand_dockerfile/ Mon, 18 Feb 2019 10:39:40 +0000 https://linuxhint.com/?p=36628 You’ll agree with me that the impact Docker is having on the world of technology is massive. It is saving software developers and system administrators alike a lot of headache.

In this article, you’ll be learning about a very crucial part of the whole Docker setup, the Dockerfile. The Dockerfile uses a simple structure. While this simplicity is a good thing, it gives room for individuals to just hack commands together, without fully understanding the impact.

At the end of this article, you’ll have a better understanding of the Dockerfile. So, you’ll be able to write Dockerfiles that you understand.

Inside The Dockerfile

The Dockerfile is basically a text file. But unlike regular text files, it doesn’t have a .txt file extension; the Dockerfile is a file that you’ll save simply as Dockerfile, with no file extension.

In this Dockerfile exists all the commands used to assemble a Docker image. While you can pass these commands into the Docker CLI when building an image, you’ll agree that it is better practice to have a file for it, so things can be better organized.

The commands in the Dockerfile are vital to building a Docker image.

Here’s why:

Every line of command in the Dockerfile creates the layers that make up the Docker image. Provided the Dockerfile remains the same, every time you build an image off it, it’s certain you’d get the same results. However, when you add a new line of command, Docker simply builds that layer and adds it to the existing layers.

Just like the compiler or interpreter does to programming languages, Docker reads the Dockerfile from top to bottom. Hence, the placement of the commands matter a lot.

Unlike most programming languages, the commands in the Dockerfile are not case sensitive. But, you’ll see from sample Dockerfiles that the commands are written in UPPERCASE. This is nothing but a convention, which you should follow too.

Like programming languages, you can write comments in your Dockerfiles. Comments in Dockerfiles are denoted by using the hash or pound symbol # at the beginning of the line. You should note that it only supports one-line comments, hence to write multi-line comments, you’ll use the hash symbol on each line.

Careful though, not all hash symbols you see in a Dockerfile are comments. Hash symbols could also indicate parser directives. Parser directives are commands in the Dockerfile that indicate the way the Dockerfile should be read.

Only two parser directives are available in Docker as of the time of writing this article: escape and syntax. The syntax directive is only available when Docker is running on a BuildKit backend.

The escape directive, on the other hand, works everywhere. The escape directive allows you to decide what symbol Docker uses as an escape character.

You can have in your Dockerfile, a line similar to the one below:

COPY index.html C:\\Documents

You shouldn’t bother about what the command does yet; focus on the file path. Using the command above in a Windows-based Docker image is valid. But you’ll recall that the Dockerfile follows Linux conventions and uses the backslash \ as its default escape character. Therefore, when Docker reads through the Dockerfile, it treats the backslash as an escape character instead of reading it as part of the file path.

To change this behaviour, you’ll use the escape parser directive as seen below:

# escape=`

This directive causes Docker to use the backtick as an escape character, instead of the backslash. To use the parser directive, you’ll have to put it at the top of the Dockerfile, else it’ll only count as a comment—you have to place it even above comments, if you have the comments at the top of the file.

Dockerfile Instructions

Docker relies on each line of command in the Dockerfile and executes them, building a layer for each line in the process.

You’ll need an understanding of the commands to write Dockerfiles. A point of caution though: a lot of the Dockerfile commands do similar stuff. You don’t have to worry; you’ll get to understand those commands too.

Here’s a list of the commands you’ll learn about:

  • FROM
  • LABEL
  • ENV
  • EXPOSE
  • RUN
  • COPY
  • WORKDIR
  • CMD

FROM

Remember that the main aim of Docker is to virtualize things at the Operating System (OS) level, by creating containers. Therefore, whatever image Docker builds from your Dockerfile needs to be based on an existing OS image—unless you are building a base image yourself.

The FROM command is used to state which image you intend to use as the base image. If you are building on a base image, the FROM command must be the first command in the Dockerfile—aside from parser directives and comments.
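
For example, a Dockerfile meant to build on top of Ubuntu could start with a line like the one below (the 18.04 tag is just an illustrative choice):

FROM ubuntu:18.04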

LABEL

Docker images often carry metadata, and the LABEL command is what you'd use to add it. After building an image and running a container off it, you can use the docker inspect command to view those labels along with other information about the container.

ENV

Environment variables. Familiar words? Well, the ENV command is used to set environment variables while building the Docker image. You'll also find that those environment variables remain accessible after the container is launched.

The Dockerfile has a similar command to ENV, known as ARG. However, whatever variable is set using ARG is only available while the image is being built, not after the container is launched.
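
Here is a small sketch of the difference; the variable names and values are made up for illustration:

# ENV values persist in the image and inside running containers
ENV APP_HOME /usr/src/app
# ARG values exist only while the image is being built
ARG APP_VERSION=1.0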

EXPOSE

Just as your Docker host—your local machine, in this case—has ports for communication such as 8080, 5000, and so on, Docker containers have ports too.

You’ll use the EXPOSE command to choose what ports should be available to communicate with a container.

When running Docker containers, you can pass in the -p argument known as publish, which is similar to the EXPOSE command.

Here’s the subtle difference: you use the EXPOSE command to open ports to other Docker containers, while the -p argument is used to open ports to the external environment i.e. outside the Docker container.

If you do not make use of EXPOSE or -p at all, then the Docker container won’t be accessible through any ports from outside the container or other Docker containers.
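
As a sketch, with placeholder port numbers and image name, the two pieces fit together like this:

# in the Dockerfile: the application inside the container listens on port 5000
EXPOSE 5000
# then, when running the container, publish it to the host (my_image is a placeholder name):
# docker run -p 8080:5000 my_image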

RUN

While building a Docker image, you may need to run commands for reasons such as installing applications and packages to be part of the image.

Using the RUN command, you can do all of that. But remember: commands are run only when you’re building the Docker image.
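
For instance, assuming a Debian- or Ubuntu-based base image (an assumption for this sketch), installing a package at build time looks like this:

# runs at build time and bakes curl into the resulting image
RUN apt-get update && apt-get install -y curl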

COPY

There are different reasons to copy files from your Docker host to your Docker image. Some files you may like to copy could be configuration files, or the source code—if you’d be running it in your Docker container.

To copy files from your Docker host to a Docker image, you can use the COPY command.

There is also the ADD command, which is similar to COPY but a bit different: while COPY can only copy files from your Docker host into the Docker image, ADD can also copy files from a URL and extract compressed files into the Docker image.

Why use COPY instead of ADD? Well, downloading files from a URL is a task you can handle with curl through the RUN command, and you can extract archives inside the Docker image using RUN as well.

However, there is nothing wrong with using ADD to directly extract compressed files into the Docker image.
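
A quick sketch of both instructions (the file names here are placeholders):

# copy a file from the build context into the image
COPY config.json /app/config.json
# ADD can also fetch from a URL or unpack a local archive into the image
ADD site-content.tar.gz /var/www/html/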

WORKDIR

Remember the RUN command? You can use the RUN command to execute commands in your Docker image. However, sometimes you’ll have a reason to run a command in certain directories. As an example, to unzip a file, you have to be in the directory of the zip file or point to it.

That’s where WORKDIR comes in handy. WORKDIR allows you change directory while Docker builds the image, and the new directory remains the current directory for the rest of the build instructions.
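
For example (the directory here is a placeholder):

# subsequent RUN, COPY and CMD instructions execute from this directory
WORKDIR /usr/src/app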

CMD

Your Docker container is usually set up to run one process. But how does it know what process to run? It’s through the CMD command. The CMD command is used to execute commands as Docker launches the Docker container from the image.

While you can specify the command to be run when launching from the command-line, the commands stated at the CMD instruction remain the default.

Docker can run only one CMD command. Therefore, if you insert two or more CMD instructions, Docker would only run the last one i.e. the most recent one.

ENTRYPOINT is similar to CMD; however, arguments you pass when launching the container are appended to the ENTRYPOINT command instead of overriding it, as they would with CMD.
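
A small sketch of the contrast, using placeholder image and script names; the two instructions below would live in two alternative Dockerfiles:

# with CMD, "docker run my_image other_script.py" replaces the default command entirely
CMD ["python", "app.py"]

# with ENTRYPOINT, "docker run my_image --debug" runs "python app.py --debug" instead
ENTRYPOINT ["python", "app.py"]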

Example

In this example, you’ll see an implementation of almost all the commands discussed above. You’ll see how a Flask application would be run in a Docker container. If you don’t know what Flask is, Flask is a web framework written in Python for building web applications.

It’s quite simple, so you don’t need to have any knowledge of the language to run the example.

To start with, you’ll need to install Git on your machine. After installing Git, you’ll clone the source code from the GitHub repository here.

First, create a new directory to hold the source code and the Dockerfile. You can create the directory—call it docker-sample—and the Dockerfile using the commands below:

mkdir docker-sample && cd docker-sample
touch Dockerfile

Remember the Dockerfile is just a plain text file? You also remember that it shouldn’t have the .txt extension? You’ll find that discussion at the beginning of the “Inside The Dockerfile” section, if you missed it.

Next, you’ll download the source code from GitHub using the git clone command as seen below:

git clone https://github.com/craigkerstiens/flask-helloworld.git

You can check the contents of the flask-helloworld directory:

ls flask-helloworld

You’ll see the following files:

  • Markdown.rst: It contains details about the project, but it is not important to this example. You shouldn't be worried about it.
  • Procfile: It contains commands for running the project on a server. You shouldn't be worried about it either.
  • app.py: It contains the code you'll run in the Docker container.
  • requirements.txt: It contains the dependencies that app.py needs to run successfully.

Writing The Dockerfile

This Dockerfile has all of the Docker instructions discussed above. It also has comments in it, to help you understand what each line does.

# FROM instruction chooses the parent image for Docker.
# This example uses Alpine.
# Alpine is a minimal Docker image very small in size
FROM alpine:3.3
# LABEL instruction creates labels.
# The first label is maintainer with the value Linux Hint.
# The second label is appname with the value Flask Hello World.
# You can have as many key-to-value pairs as you want.
# You can also choose any name for the keys.
# The choice of maintainer and appname in this example
# is a personal choice.
LABEL "maintainer"="Linux Hint" "appname"="Flask Hello World"
# ENV instruction assigns environment variables.
# The /usr/src directory holds downloaded programs,
# be it source or binary before installing them.
ENV applocation /usr/src
# COPY instruction copies files or directories,
# from the Docker host to the Docker image.
# You'll copy the source code to the Docker image.
# The command below uses the set environment variable.
COPY flask-helloworld $applocation/flask-helloworld
# Using the ENV instruction again.
ENV flaskapp $applocation/flask-helloworld
# WORKDIR instruction changes the current directory in Docker image.
# The command below changes directory to /usr/src/flask-helloworld.
# The target directory uses the environment variable.
WORKDIR $flaskapp/
# RUN instruction runs commands,
# just like you do on the terminal,
# but in the Docker image.
# The command below installs Python, pip and the app dependencies.
# The dependencies are in the requirements.txt file.
RUN apk add --update python py-pip
RUN pip install --upgrade pip
RUN pip install -r requirements.txt
# EXPOSE instruction opens the port for communicating with the Docker container.
# Flask app uses the port 5000, so you'll expose port 5000.
EXPOSE 5000
# CMD instruction runs commands like RUN,
# but the commands run when the Docker container launches.
# Only one CMD instruction can be used.
CMD ["python", "app.py"]

Building the Docker image

After writing the Dockerfile, you can build the Docker image with the command below:

sudo docker build -t sample_image .

Here, sample_image is the name of the Docker image. You can give it another name. The dot (.) at the end of the command indicates that the files you’re working with are in the current directory.

Running the Docker container

To run the Docker container, you can use the docker run command below:

sudo docker run -ip 5000:5000 sample_image:latest

The -i parameter ensures the Docker container runs in interactive mode and the -p parameter binds the Docker host’s port to the Docker container’s port. Think of it as: docker-host:docker-container.

After launching the Docker container, you can visit localhost:5000 in your browser to see the results of the Flask application.

Conclusion

The Dockerfile is the blueprint for a Docker image. Understanding how Dockerfiles work, and being able to write them comfortably would make your Docker experience an enjoyable one.

Working through this article, you've seen how Dockerfiles work. Hopefully, you now also understand what the major Dockerfile instructions mean and are able to use them to build your own Docker images.

Any question you have relating to Dockerfiles would be welcome. Thanks for reading.

]]>
Apt Package Management Tool https://linuxhint.com/primer_apt_package_management_tool/ Thu, 20 Dec 2018 13:36:06 +0000 https://linuxhint.com/?p=34065 Your Linux machine is only as good as you make it. To make it into a powerful machine, you need to install the right packages and use the right configurations, among a host of other things. Speaking of packages, this article is a primer on the APT package management tool. Similar to YUM for RHEL (Red Hat Enterprise Linux) based Linux distributions—which was discussed here—APT (Advanced Packaging Tool) manages packages on Debian and Ubuntu based Linux distributions. This article isn't meant to discuss all the powers of the APT package management tool; instead, it is intended to give you a quick look into the tool and how you can use it. It should serve well for reference purposes and for understanding how the tool works. Without much ado, let's get started.

Location

Like many Linux tools, apt keeps its configuration in the /etc directory—the directory that contains the configuration files for the programs that run on Linux systems—and you can view it by navigating there.

Apt also has a configuration file which can be found in the /etc/apt directory with the file name apt.conf.

You will be doing a lot of package installations with apt, so it helps to know that package sources are stored in a sources.list file. Basically, apt checks this file for package sources and attempts to install from the list of packages they provide—let's call it a repository index.

The sources.list file is stored in the /etc/apt directory, alongside a similarly named entry, sources.list.d. The latter isn't actually a file but a directory that keeps additional source list files, so extra repositories can be added in their own files instead of being appended to the main sources.list.
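
As an illustration (the mirror URL and release name below are just examples), each line in sources.list names a repository in this pattern:

# deb <repository-url> <release> <component(s)>
deb http://archive.ubuntu.com/ubuntu bionic main restricted universe multiverse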

The confusion: APT vs APT-GET

Yes, a lot of people actually mistake apt to be the same as apt-get. Here’s a shocker: they are not the same.

In truth, apt and apt-get work similarly however the tools are different. Let’s consider apt to be an upgrade on apt-get.

Apt-get has been in existence longer than apt. However, apt-get doesn't exist in isolation; it works together with other tools such as apt-cache and apt-config. Combined, these tools are used to manage Linux packages, and each has its own set of commands. They are also not the easiest to use, as they work at a low level that the average Linux user couldn't care less about.

For this reason, apt was introduced. The version 1.0.1 of APT has the following on the man page, “The apt command is meant to be pleasant for end users and does not need to be backward compatible like apt-get.”

Apt works in isolation and doesn’t need to be combined with other tools for proper Linux administration, plus it is easy to use.

The Commands

For an average Linux user, the commands are all that matter. Through the commands, tasks are executed and actual work can be done. Let’s take a look at the major apt commands.

Get Help

The most important of all the commands to be discussed in this article is the command used to get help. It makes the tool easy to use and ensures you do not have to memorize the commands.

The help provides enough information to carry out simple tasks and can be accessed with the command below:

apt --help

The result lists the various commands and options that you can combine with apt.

If you desire, you could check out the apt man pages for more information. Here’s the command to access the man pages:

man apt

Search for package

For a lot of operations, you need to know the exact name of a package. This, among other uses, is why the search command exists.

This command checks all the packages in the repository index, searches the keyword in the package descriptions and provides a list of all packages with the keyword.

apt search <keyword>

Check package dependencies

Linux packages have dependencies. These dependencies ensure the packages function properly, and a package breaks when its dependencies break.

To view a package’s dependencies, you use the depends command.

apt depends <package name>

Display package information

A package's dependencies are just one useful piece of information; there are other details you can get, such as the package's version, download size and so on, and it would be unproductive to memorize a separate command for each of them.

You can get all of a package’s information in one attempt using the apt command as seen below:

apt show <package name>

Install package

One of Linux's strongest points is the availability of lots of powerful packages. You can install packages in two ways: either through the package name or through a deb file—deb files are Debian software package files.

To install packages using the package name, the command below is used:

apt install <package name>

As stated earlier, you need to know the package name before using it. For example, to install Nginx the command would be apt install nginx.

The other means of installing packages is through a deb file, if one is available. When installing a package from its deb file, apt fetches and downloads the package's dependencies itself, so you do not have to worry about them.

You can install deb files using the absolute path to the files with the command below:

apt install </path/to/file/file_name.deb>

Download package

If for some reason, you need to download a package without having it installed, you can do so using the download command.

This would download the package’s deb file into the directory where the command was run. You can download packages using the command below:

apt download <package name>

If you later want to install the downloaded .deb file, you can do so using the install command.

Update repository index

Remember we talked about sources.list earlier? Well, when a new version of a package is released, your Linux machine cannot install it yet because the local repository index doesn't know about it. To make it show up, the index built from the repositories listed in sources.list has to be refreshed, and that is done with the update command.

apt update

This command refreshes the repository index and keeps it up-to-date with the latest changes to the listed packages.

Remove packages

Packages break. Packages become obsolete. Packages need to be removed.

Apt makes it easy to remove packages. There are two ways to go about it: removing the binary files while keeping the configuration files, or removing both the binary files and the configuration files.

To remove the binary files alone, the remove command is used.

apt remove <package name>

More than one package can be removed, so you can have apt remove nginx top to remove the Nginx and top packages at the same time.

To remove the configuration files, the purge command is used.

apt purge <package name>

If you wish to do both at once, the commands can be combined as seen below:

apt remove --purge <package name>

Before proceeding, it should be known that when packages are removed, their dependencies remain i.e. they are not removed too. To remove the dependencies while uninstalling, the autoremove command is used as seen below:

apt autoremove <package name>

List packages

Yes, you can have the packages on your Linux machine listed. You can have a list of all packages in the repository index, installed packages and upgradeable packages.

Regardless of what you intend to do, the list command is what you'll use.

apt list

The command above is used to list all the packages available in the repository index.

apt list --installed

The command above is used to list the packages installed on your Linux machine.

apt list --upgradeable

The command above is used to list the packages installed on your machine that have upgrades available.

Updating packages

When it comes to packages, it’s not all about installing and removing packages; they need to be updated too.

You can decide to upgrade a single package or all packages at once. To update a single package, the install command is used. Surprising, right? The difference is that we add the --only-upgrade parameter.

apt install --only-upgrade <package name>

This works when you intend to upgrade just one package. However, if you want to upgrade all the packages, you need to use the upgrade command.

The following command would be used to make such an upgrade:

apt upgrade

It should be noted that the upgrade command doesn't remove dependencies, even if the upgraded packages no longer need them, i.e. they have become obsolete.

System upgrade

Unlike the regular upgrade, the full-upgrade command to be discussed here performs a complete system upgrade.

With the full-upgrade command, obsolete packages and dependencies are removed and all packages (including system packages) are upgraded to their latest versions.

The command for doing this, is full-upgrade as seen below:

apt full-upgrade

Conclusion

Apt is a powerful tool that makes the use of Debian and Ubuntu based Linux distributions a wonderful experience. Most of the apt commands listed here require root permissions, so you may need to add sudo to the start of the commands.

These commands are just the tip of the iceberg of what the apt tool can do, but they are enough to get you comfortable with managing packages on your Linux machine.

]]>
Logging Into Websites With Python https://linuxhint.com/logging_into_websites_python/ Mon, 26 Nov 2018 19:10:00 +0000 https://linuxhint-com.zk153f8d-liquidwebsites.com/?p=33057 The login feature is an important piece of functionality in today's web applications. It helps keep special content away from non-users of a site and is also used to identify premium users. Therefore, if you intend to scrape a website, you could run into the login feature when the content is only available to registered users.

Web scraping tutorials have been covered in the past, so this tutorial only covers gaining access to websites by logging in with code instead of doing it manually through the browser.

To understand this tutorial and be able to write scripts for logging into websites, you would need some understanding of HTML. Maybe not enough to build awesome websites, but enough to understand the structure of a basic web page.

Installation

This will be done with the Requests and BeautifulSoup Python libraries. Aside from those libraries, you need a good browser such as Google Chrome or Mozilla Firefox, as it will be important for the initial analysis before writing any code.

The Requests and BeautifulSoup libraries can be installed with the pip command from the terminal as seen below:

pip install requests
pip install BeautifulSoup4

To confirm the success of the installation, activate Python’s interactive shell which is done by typing python into the terminal.

Then import both libraries:

import requests
from bs4 import BeautifulSoup

The import is successful if there are no errors.

The process

Logging into a website with scripts requires knowledge of HTML and an idea of how the web works. Let’s briefly look into how the web works.

Websites are made of two main parts, the client-side and the server-side. The client-side is the part of a website that the user interacts with, while the server-side is the part of the website where business logic and other server operations such as accessing the database are executed.

When you try opening a website through its link, you are making a request to the server-side to fetch you the HTML files and other static files such as CSS and JavaScript. This request is known as the GET request. However when you are filling a form, uploading a media file or a document, creating a post and clicking let’s say a submit button, you are sending information to the server side. This request is known as the POST request.

An understanding of those two concepts will be important when writing our script.
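
As a quick sketch of the two request types with the Requests library (the URL and form fields below are purely illustrative):

import requests

# GET request: ask the server for a page and its static files
page = requests.get("https://example.com/login")
print(page.status_code)

# POST request: send form data, such as login details, to the server
response = requests.post("https://example.com/login", data={"username": "admin", "password": "12345"})
print(response.status_code)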

Inspecting the website

To practice the concepts of this article, we would be using the Quotes To Scrape website.

Logging into websites requires information such as the username and a password.

However since this website is just used as a proof of concept, anything goes. Therefore we would be using admin as the username and 12345 as the password.

Firstly, it is important to view the page source as this would give an overview of the structure of the web page. This can be done by right clicking on the web page and clicking on “View page source”. Next, you inspect the login form. You do this by right clicking on one of the login boxes and clicking inspect element. On inspecting element, you should see input tags and then a parent form tag somewhere above it. This shows that logins are basically forms being POSTed to the server-side of the website.

Now, note the name attribute of the input tags for the username and password boxes, they would be needed when writing the code. For this website, the name attribute for the username and the password are username and password respectively.

Next, we have to know if there are other parameters which would be important for the login. Let's quickly explain this. To increase the security of websites, tokens are usually generated to prevent Cross-Site Request Forgery (CSRF) attacks.

Therefore, if those tokens are not added to the POST request then the login would fail. So how do we know about such parameters?

We would need to use the Network tab. To get this tab on Google Chrome or Mozilla Firefox, open up the Developer Tools and click on the Network tab.

Once you are in the network tab, try refreshing the current page and you would notice requests coming in. You should try to watch out for POST requests being sent in when we try logging in.

Here's what to do next, with the Network tab open: put in the login details and try logging in; the first request you see should be the POST request.


Click on the POST request and view the form parameters. You would notice the website has a csrf_token parameter with a value. That value is a dynamic value, therefore we would need to capture such values using the GET request first before using the POST request.

For other websites you would be working on, you probably may not see the csrf_token but there may be other tokens that are dynamically generated. Over time, you would get better at knowing the parameters that truly matter in making a login attempt.

The Code

Firstly, we need to use Requests and BeautifulSoup to get access to the page content of the login page.

from requests import Session
from bs4 import BeautifulSoup as bs
 
with Session() as s:
    site = s.get("http://quotes.toscrape.com/login")
    print(site.content)


This prints out the content of the login page before we log in. If you search the output for the “Login” keyword, you will find it in the page content, showing that we are yet to log in.

Next, we search for the csrf_token keyword, which we found as one of the form parameters when using the Network tab earlier. If the keyword matches an input tag, its value can be extracted with BeautifulSoup every time the script runs.

from requests import Session
from bs4 import BeautifulSoup as bs
 
with Session() as s:
    site = s.get("http://quotes.toscrape.com/login")
    bs_content = bs(site.content, "html.parser")
    token = bs_content.find("input", {"name":"csrf_token"})["value"]
    login_data = {"username":"admin","password":"12345", "csrf_token":token}
    s.post("http://quotes.toscrape.com/login",login_data)
    home_page = s.get("http://quotes.toscrape.com")
    print(home_page.content)

This prints the page's content after logging in. If you search the output for the “Logout” keyword, you will find it in the page content, showing that we were able to log in successfully.

Let’s take a look at each line of code.

from requests import Session
from bs4 import BeautifulSoup as bs

The lines of code above are used to import the Session object from the requests library and the BeautifulSoup object from the bs4 library using an alias of bs.

with Session() as s:

A Requests Session is used when you intend to keep the context across requests, so the cookies and other information of that session can be stored and reused.

bs_content = bs(site.content, "html.parser")
token = bs_content.find("input", {"name":"csrf_token"})["value"]

This code here utilizes the BeautifulSoup library so the csrf_token can be extracted from the  web page and then assigned to the token variable. You can learn about extracting data from nodes using BeautifulSoup.

login_data = {"username":"admin","password":"12345", "csrf_token":token}
s.post("http://quotes.toscrape.com/login", login_data)

The code here creates a dictionary of the parameters to be used for the login. The keys of the dictionary are the name attributes of the input tags, and the values are the value attributes of the input tags.

The post method is used to send a post request with the parameters and log us in.

home_page = s.get("http://quotes.toscrape.com")
print(home_page.content)

After a login, these lines of code above simply extract the information from the page to show that the login was successful.

Conclusion

The process of logging into websites using Python is quite easy; however, websites are not all set up the same way, so some sites will prove more difficult to log into than others. There is more that can be done to overcome whatever login challenges you face.

The most important thing in all of this is the knowledge of HTML, Requests, BeautifulSoup and the ability to understand the information gotten from the Network tab of your web browser’s Developer tools.

]]>
Primer on Yum Package Management Tool https://linuxhint.com/yum_package_management_tool/ Sun, 18 Nov 2018 03:54:05 +0000 https://linuxhint-com.zk153f8d-liquidwebsites.com/?p=32516 The Yum package management tool is crucial to the management of Linux systems, whether you are a Linux systems admin or a power user. Different package management tools are available across different Linux distros, and the YUM package management tool is the one available on RedHat and CentOS based Linux distros. In the background, YUM (Yellowdog Updater Modified) depends on RPM (Red Hat Package Manager) and was created to enable the management of packages as parts of a larger system of software repositories instead of as individual packages.

How YUM Works

The configuration file for Yum is stored in the /etc/ directory, a file named yum.conf. This file can be configured and tweaked to suit certain needs of the system. Below is a sample of the contents of the yum.conf file:

[main]
cachedir=/var/cache/yum/$basearch/$releasever
keepcache=0
debuglevel=2
logfile=/var/log/yum.log
exactarch=1
obsoletes=1
gpgcheck=1
plugins=1
installonly_limit=5

This configuration file could differ from what you have on your machine, but the configuration syntax follows the same rules. The repositories of packages that can be installed with Yum are usually defined in the /etc/yum.repos.d/ directory, with each *.repo file in that directory describing one or more repositories of installable packages.

Each *.repo file defines repositories such as the CentOS base repository.
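
As a sketch (the exact mirror URL and GPG key path vary between CentOS releases), a base repository entry in one of those files looks something like this:

[base]
name=CentOS-$releasever - Base
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
#baseurl=http://mirror.centos.org/centos/$releasever/os/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7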

YUM works in a pattern similar to all Linux commands, using the structure below:

yum [options] COMMAND

With the command above, you can carry out all necessary tasks with YUM. You can get help on how to use YUM with the --help option:

yum --help

You should get a list of the commands and options that can be used with YUM.

For the rest of this article, we would be completing a couple of tasks with Yum. We would query, install, update and remove packages.

Querying packages with YUM

Let’s say you just got a job as a Linux system administrator at a company, and your first task is to install a couple of packages to help make your tasks easier such as nmap, top etc.

To proceed with this, you need to know about the packages and how well they will fit the computer’s needs.

Task 1: Getting information on a package

To get information on a package such as the package’s version, size, description etc, you need to use the info command.

yum info package-name

As an example, the command below would give information on the httpd package:

yum info httpd

Below is a snippet of the result from the command:

Name : httpd
Arch : x86_64
Version : 2.4.6
Release : 80.el7.centos.1

Task 2: Searching for existing packages

You will not always know the exact name of a package. Sometimes, all you know is a keyword associated with it. In those scenarios, you can easily search for packages with that keyword in their name or description using the search command.

yum search keyword

The command below would give a list of packages that have the keyword “nginx” in it.

yum search nginx

Below is a snippet of the result from the command:

collectd-nginx.x86_64 :Nginx plugin for collectd
munin-nginx.noarch : NGINX support for Munin resource monitoring
nextcloud-nginx.noarch : Nginx integration for NextCloud
nginx-all-modules.noarch : A meta package that installs all available Nginx module

Task 3: Querying a list of packages

There are lots of packages that are either installed or available for installation on the computer. In some cases, you would like to see a list of those packages to know what is available.

There are three options for listing packages which would be stated below:

yum list installed: lists the packages that are installed on the machine.

yum list available: lists all packages available to be installed from the enabled repositories.

yum list all: lists all of the packages both installed and available.

Task 4: Getting package dependencies

Packages are rarely installed as standalone tools; they have dependencies that are essential to their functionality. With Yum, you can get a list of a package's dependencies with the deplist command.

yum deplist package-name

As an example, the command below fetches a list of httpd’s dependencies:

yum deplist httpd

Below is a snippet of the result:

package: httpd.x86_64 2.4.6-80.el7.centos.1
dependency: /bin/sh
provider: bash.x86_64 4.2.46-30.el7
dependency: /etc/mime.types
provider: mailcap.noarch 2.1.41-2.el7
dependency: /usr/sbin/groupadd
provider: shadow-utils.x86_64 2:4.1.5.1-24.el7

Task 6: Getting information on package groups

Through this article, we have been looking at packages. At this point, package groups would be introduced.

Package groups are collections of packages that serve a common purpose. If you want to set up your machine's system tools, for example, you do not have to install the packages separately; you can install them all at once as a package group.

You can get information on a package group using the groupinfo command and putting the group name in quotes.

yum groupinfo “group-name”

The command below would fetch information on the “Emacs” package group.

yum groupinfo "Emacs"

Here is the information:

Group: Emacs
Group-Id: emacs
Description: The GNU Emacs extensible, customizable, text editor.
Mandatory Packages:
=emacs
Optional Packages:
ctags-etags
emacs-auctex
emacs-gnuplot
emacs-nox
emacs-php-mode

Task 7: Listing the available package groups

In the task above, we got information on the "Emacs" package group. With the grouplist command, however, you can get a list of the package groups available for installation.

yum grouplist

The command above lists the available package groups. However, some groups are not displayed due to their hidden status. To get a list of all package groups, including the hidden ones, you add the hidden argument as seen below:

yum grouplist hidden

Installing packages with YUM

We have looked at how packages can be queried with Yum. As a Linux system administrator you would do more than query packages, you would install them.

Task 8: Installing packages

Once you have the name of the package you like to install, you can install it with the install command.

yum install package-name

Example:

yum install nginx

Task 9: Installing packages from .rpm files

While you will install most packages from the repositories, in some cases you will be provided with *.rpm files to install. This can be done using the localinstall command, which can install *.rpm files whether they are available on the machine or in some external location accessible through a link.

yum localinstall file-name.rpm

Task 10: Reinstalling packages

While working with configuration files, errors can occur that leave packages and their config files messed up. The install command could correct the mess; however, if there is a newer version of the package in the repository, that is the version that would be installed, which isn't what we want.

With the reinstall command, we can reinstall the currently installed version of a package regardless of the latest version available in the repository.

yum reinstall package-name

Task 11: Installing package groups

Earlier, we looked into package groups and how to query them. Now we would see how to install them. Package groups can be installed using the groupinstall command and the name of the package group in quotes.

yum groupinstall “group-name”

Updating packages with YUM

Keeping your packages updated is important. Newer versions of packages often contain security patches, new features, removed features and so on, so it is key to keep your computer as up to date as possible.

Task 12: Getting information on package updates

As a Linux system administrator, updates would be very crucial to maintaining the system. Therefore, there is a need to constantly check for package updates. You can check for updates with the updateinfo command.

yum updateinfo

There are lots of possible command combinations that can be used with updateinfo. However we would use only the list installed command.

yum updateinfo list installed

A snippet of the result can be seen below:

FEDORA-EPEL-2017-6667e7ab29  bugfix     epel-release-7-11.noarch

FEDORA-EPEL-2016-0cc27c9cac  bugfix     lz4-1.7.3-1.el7.x86_64

FEDORA-EPEL-2015-0977       None/Sec.    novnc-0.5.1-2.el7.noarch

Task 13: Updating all packages

Updating packages is as easy as using the update command. Using the update command alone would update all packages, but adding the package name would update only the indicated package.

yum update : to update all packages in the operating system

yum update httpd : to update the httpd package alone.

While the update command will update to the latest version of the package, it would leave obsolete files which the new version doesn’t need anymore.

To remove the obsolete packages, we use the upgrade command.

yum upgrade : to update all packages in the operating system and delete obsolete packages.

The upgrade command is dangerous though, as it would remove obsolete packages even if you use them for other purposes.

Task 14: Downgrading packages

While it is important to keep up with the latest package updates, updates can be buggy. In a case where an update is buggy, the package can be downgraded to the previous, stable version. Downgrades are done with the downgrade command.

yum downgrade package-name

Removing packages with YUM

As a Linux system administrator, resources have to be managed. So while packages are installed for certain purposes, they should be removed when they are not needed anymore.

Task 15: Removing packages

The remove command is used to remove packages. Simply add the name of the package to be removed, and it would be uninstalled.

yum remove package-name

While the command above would remove packages, it would leave the dependencies. To remove the dependencies too, the autoremove command is used. This would remove the dependencies, configuration files etc.

yum autoremove package-name

Task 16: Removing package groups

Earlier we talked about installing package groups. It would be tiring to begin removing the packages individually when not needed anymore. Therefore we remove the package group with the groupremove command.

yum groupremove “group-name”

Conclusion

The commands discussed in this article are just a small glimpse of the power of Yum. There are lots of other tasks that can be done with YUM, which you can check out on the official RHEL web page. However, the commands discussed in this article should be enough to get anybody started with regular Linux system administration tasks.

]]>
Using Google Search API With Python https://linuxhint.com/google_search_api_python/ Wed, 24 Oct 2018 11:49:05 +0000 https://linuxhint-com.zk153f8d-liquidwebsites.com/?p=31393 It is no news that Google is the largest search engine in the world. Lots of people will go the extra mile to have their content rank highly on Google before any other search engine. As a result of this, Google has lots of quality results for every search, and with great ranking algorithms you can expect to get the best search results on Google.

This has an implication: there is a lot of useful data on Google, and that calls for scraping this golden data. The scraped data can be used for quality data analysis and the discovery of wonderful insights. It can also be important for gathering great research information in one attempt.

Talking about scraping, this can be done with third party tools. It can also be done with a Python library known as Scrapy. Scrapy is rated to be one of the best scraping tools, and can be used to scrape almost any web page. You can find out more on the Scrapy library.

However, regardless of the strengths of this wonderful library, scraping data from Google can be a difficult task. Google comes down hard on web scraping attempts, ensuring that scraping scripts do not even make as many as 10 scrape requests in an hour before having their IP address banned. This renders third-party and personal web scraping scripts useless.

Google does give the opportunity to scrape information. However, whatever scraping that would be done has to be through an Application Programming Interface (API).

Just in case you do not already know what an Application Programming Interface is, there's nothing to worry about, as I'll provide a brief explanation. By definition, an API is a set of functions and procedures that allow the creation of applications which access the features or data of an operating system, application, or other service. Basically, an API allows you to gain access to the end result of processes without having to be involved in those processes. For example, a temperature API would provide you with the Celsius/Fahrenheit values of a place without you having to go there with a thermometer to make the measurements yourself.

Bringing this into the scope of scraping information from Google, the API we would be using allows us access to the needed information without having to write any script to scrape the results page of a Google search. Through the API, we can simply have access to the end result (after Google does the “scraping” at their end) without writing any code to scrape web pages.

While Google has lots of APIs for different purposes, we are going to be using the Custom Search JSON API for the purpose of this article. More information on this API can be found here.

This API allows us to make 100 search queries per day for free, with pricing plans available for making more queries if necessary.

Creating A Custom Search Engine

In order to be able to use the Custom Search JSON API, we would be needing a Custom Search Engine ID. However, we would have to create a Custom Search Engine first which can be done here.

When you visit the Custom Search Engine page, click on the “Add” button to create a new search engine.

In the “sites to search” box, simply put in “www.linuxhint.com” and in the “Name of the search engine” box, put in any descriptive name of your choice (Google would be preferable).

Now click “Create” to create the custom search engine and click the “control panel” button from the page to confirm the success of creation.

You would see a “Search Engine ID” section and an ID under it, that is the ID we would be needing for the API and we would refer to it later in this tutorial. The Search Engine ID should be kept private.

Before we leave, remember we put in “www.linuxhint.com” earlier. With that setting, we would only get results from that site alone. If you want the normal results from a full web search, click “Setup” from the menu on the left and then click the “Basics” tab. Go to the “Search the Entire Web” section and toggle this feature on.

Creating An API Key

After creating a Custom Search Engine and getting its ID, next would be to create an API key. The API key allows access to the API service, and it should be kept safe after creation just like the Search Engine ID.

To create an API key, visit the site and click on the “Get A Key” button.

Create a new project, and give it a descriptive name. On clicking “next”, you would have the API key generated.

On the next page, we would have different setup options which aren’t necessary for this tutorial, so you just click the “save” button and we are ready to go.

Accessing The API

We have done well getting the Custom Search ID and the API Key. Next we are going to make use of the API.

While you can access the API with other programming languages, we are going to be doing so with Python.

To be able to access the API with Python, you need to install the Google API Client for Python. This can be installed using pip with the command below:

pip install google-api-python-client

After a successful installation, you can import the library in your code.

Most of the work will be done through the function below:

from googleapiclient.discovery import build

my_api_key = "Your API Key"
my_cse_id = "Your CSE ID"

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res

In the function above, the my_api_key and my_cse_id variables should be replaced by the API Key and the Search Engine ID respectively as string values.

All that needs to be done now is to call the function passing in the search term, the api key and the cse id.

result = google_search("Coffee", my_api_key, my_cse_id)
print(result)

The function call above would search for the keyword “Coffee” and assign the returned value to the result variable, which is then printed. A JSON object is returned by the Custom Search API, therefore any further parsing of the resulting object would require a little knowledge of JSON.
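
As a sketch of that parsing (assuming the response contains an "items" list of results, as the Custom Search JSON API documents), you could pull the title and link out of each result like this:

# loop over the returned results, if any, and print each title and link
for item in result.get("items", []):
    print(item["title"], "->", item["link"])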

A sample of the result shows that the JSON object returned is very similar to what you would get from a regular Google search.

Summary

Scraping Google for information isn’t really worth the stress. The Custom Search API makes life easy for everyone, as the only difficulty is in parsing the JSON object for the needed information. As a reminder, always remember to keep your Custom Search Engine ID and API Key values private.

]]>
Puppeteer VS Selenium https://linuxhint.com/puppeteer_vs_selenium/ Mon, 08 Oct 2018 10:57:26 +0000 https://linuxhint-com.zk153f8d-liquidwebsites.com/?p=31079 Today when it comes to automated web testing, Puppeteer and Selenium are the two names that come up. One of the main reasons why they are well-known is their ability to execute headless browsers. Therefore before we proceed with the article, let’s have a quick look at what headless browsers are and their advantages.

In basic terms, headless browsers are browsers that can be used for testing usability of web pages and executing browser interactions just like you would with your regular browser. The only difference here is that there is no Graphical User Interface (GUI) and they are usually executed from the terminal.


Headless browsers:

  • help reduce resource usage greatly
  • they are faster
  • they are ideal for web scraping purposes
  • they can be used to monitor network application performance

Now that we have looked at a major factor behind both tools, we can proceed.

Puppeteer

Puppeteer is a Node library from Google that provides a simple API to control headless Chrome. Through Puppeteer, common tasks such as typing in inputs, clicking on buttons, testing usability of web pages and even web scraping can be carried out easily.

Puppeteer is an official project from the Chrome team, and it uses the Chrome DevTools Protocol, just as the Chrome DevTools themselves do. The library supports the modern JavaScript syntax available in Google Chrome.

Setup

Installing and getting started with Puppeteer is very easy. Since Puppeteer is a Node library, it can be installed using the npm tool.

Installation can be done with the command below:

npm i puppeteer

Running the command above installs Puppeteer. It is expected to also download a recent version of Chromium that would work with the API.

The size of Chromium varies according to the operating system:

  • ~170MB for Mac
  • ~282MB for Linux
  • ~280MB for Windows

After installing Puppeteer, you can find out more information on how to get started, and you can also check out more code examples.

Features

While Puppeteer's ability to launch a headless browser is one feature that has gained it some fame, that is not the only feature that makes it awesome. Puppeteer also has a couple of other features that make it useful; let's take a quick look at some of them.

Easy Automation:

While there are other tools that can be used for web automation, Puppeteer comes out on top. This is because it targets a single browser, headless Chrome, so it can carry out web automation tasks in the most efficient way possible. Puppeteer also works fine with popular unit testing libraries such as Mocha and Jasmine.

Screenshot Testing:

This is a vital feature for any automated web testing task. Screenshots are important and help keep track of the results of interactions with elements on a web page. Libraries such as puppeteer-screenshot-tester exist around Puppeteer to provide the capability of comparing screenshots generated while testing. Aside from generating screenshots of tests, Puppeteer can also generate PDFs from the tested web pages.

Performance Testing:

Chrome provides DevTools that allow the recording of the Performance Timeline of web pages, and Puppeteer takes advantage of this too. With Puppeteer, timeline traces of websites can be captured to examine performance issues. Due to Puppeteer's high-level API control over the Chrome DevTools Protocol, it gives users the ability to control service workers and test the caching of websites.

Web Scraping:

A talk about features would not be completed without acknowledging the ability of Puppeteer to be used for web scraping purposes. Learning to use Puppeteer as a web scraper is quite easy, take a look at the API documentation.

Pros

  1. Works fine for visual testing.
  2. Great for end to end testing.
  3. Fast when compared to Selenium.
  4. Can take screenshots of webpages.
  5. More control over tests through Chrome.
  6. Can test offline mode.

Cons

  1. Supports only JavaScript (Node)
  2. Supports only Chrome

Selenium

Selenium is a powerful web testing framework that has the capability of automating web applications for testing purposes. Selenium is also known for its ability to automate web-based administration tasks.

Selenium comes in two parts: the Selenium WebDriver, for creating powerful, browser-based automation suites and tests, and the Selenium IDE, for creating quick bug-reproduction scripts.

Not forgetting that Selenium also supports headless browsers as seen with Puppeteer.

Setup

Unlike with Puppeteer, setting up Selenium is not straightforward. Selenium supports many languages and different browsers, therefore those possible conditions need to be taken care of.

The official Selenium documentation provides tutorials on how to set up the Selenium bindings for the different supported languages.

Aside from supporting different languages, Selenium also supports multiple browsers. Unlike Puppeteer, which installs Chromium during its own installation, with Selenium you may have to install a web driver for the browser of your choice.

Web drivers are available for Mozilla Firefox (geckodriver) and Google Chrome (chromedriver).

If you wish to use the Selenium IDE too, it exists for multiple browsers, with versions available for both Mozilla Firefox and Google Chrome.

Features

Its ability to work with headless browsers has made it unarguably the most popular web automation tool, but there are other features that make it powerful.

Multi-Language Support:

This is one very important Selenium feature. With its multi-language support, more developers can use the tool for their web automation testing tasks. While one may think multi-language support would make it slow, Selenium still runs at a good speed, as starting up a separate server is not required by WebDriver.

Multi-Platform Support:

Just as Selenium is not restricted by language barriers, it is also not restricted by platform barriers. It is no news that web applications behave differently across platforms. Selenium gives testers the ability to test across the major web browsers in order to provide a smooth experience for users of each of them. Aside from browsers, Selenium can also be used to test mobile apps on platforms such as Android, iOS, Windows, and BlackBerry.

Recording Tool:

With the Selenium IDE, it is easy to record web automation tests. The Selenium IDE allows testers to make use of the recording capability as well as autocomplete support and the ability to navigate commands. The recording tool stopped working on Firefox 55 and later versions; however, other Firefox plugins serve the same purpose. Therefore, the ability to record tests remains a major Selenium feature.

Web Scraping:

While Selenium is used for testing web applications, it also scales well as a web scraper. Selenium can be used to scrape AJAX websites and the most difficult websites to scrape, provided you can understand the HTML structure. You can check out this tutorial on using Selenium for web scraping with Python.
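
As a minimal sketch of driving a headless browser from Python with Selenium (this assumes chromedriver is installed and available on your PATH):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# run Chrome with no GUI (headless mode)
options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.get("http://quotes.toscrape.com")
print(driver.title)  # title of the loaded page
driver.quit()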

Pros

  1. Multi-platform support.
  2. Multi-language support.
  3. Ability to record tests.
  4. Can take screenshots too.
  5. Huge community of users.

Cons

  1. Slow when compared to Puppeteer.
  2. Limited control over tests when compared to Puppeteer.

Conclusion

If you are not bothered about testing web pages on browsers other than Chrome, then you are fine working with Puppeteer, provided you are able to work with JavaScript (Node). However, if you are concerned about multiple platforms, then using Selenium is a no-brainer. As for their web scraping abilities, the two tools even out. It should be noted, though, that Puppeteer can be faster than Selenium.

Any tool you choose at the end of the day should be fine, just enjoy writing your automation scripts.

]]>
How to Install Android Studio in Ubuntu 18.04 https://linuxhint.com/install_android_studio_ubuntu1804/ Wed, 12 Sep 2018 05:52:57 +0000 https://linuxhint-com.zk153f8d-liquidwebsites.com/?p=30485 Installing software on Ubuntu is not always straightforward and can be quite frustrating. For this reason, this tutorial takes a step-by-step approach to installing the software on our Ubuntu-based machines. While the steps discussed here are specifically for Ubuntu 18.04, they can also be tried on earlier versions of Ubuntu. We will be treating three methods of installation. The first two methods may not work out fine, but the third is sure to help install the software successfully. However, the first two methods are way easier, so I would advise that you try them first.

Android Studio demands a lot of system resources, therefore your machine needs to meet up to a couple of requirements for it to run smoothly.

Here are some important specifications your machine needs to meet:

  • 64-bit distribution capable of running 32-bit applications
  • GNU C Library (glibc) 2.19 or later
  • 3 GB RAM minimum, 8 GB RAM recommended; plus 1 GB for the Android Emulator
  • 2 GB of available disk space minimum, 4 GB recommended (500 MB for IDE + 1.5 GB for Android SDK and emulator system image)
  • 1280 x 800 minimum screen resolution

Now that we are done checking the necessary details, we can proceed with the installation.

Method 1 (The Ubuntu Software Centre)

The Ubuntu Software Centre remains the easiest place to install Ubuntu software from. However, this is only the case when the desired software exists in the software store.

To install Android Studio from the Ubuntu Software Centre, simply search for Android Studio in the search box and you should get a couple of results.

If you are able to find the software, its installation is as easy as clicking the install button. You will get a password prompt to confirm the installation. After a successful installation, you should have the Android Studio icon available in your application tray.

If it installed successfully, you can skip the remaining methods and check out the final setup section.

Method 2 (The Snap Tool)

The Snap tool can come in very handy for installation of software packages, especially when available.

Snaps are containerized software packages that make installation of software easy for users. You do not have to modify any files or type in any scary commands.

However, you need to have Snap installed on your machine in the first place.

To install Snap, use the command below:

sudo apt-get install snapd

After installing Snap successfully, you can proceed to install Android Studio with the command below:

sudo snap install android-studio

This would take some time, therefore you have to wait for some minutes—go get a cup of coffee. It is expected to install successfully, but if for some reason installation fails due to an error like the one below:

error: This revision of snap “android-studio” was published using classic confinement and thus may perform arbitrary system changes outside of the security sandbox that snaps are usually confined to, which may put your system at risk.

You would have to add the --classic parameter to the command as seen below:

sudo snap install android-studio --classic

If it installed successfully, you can skip the remaining method and check out the final setup section.

Method 3 (The Zip File)

This is one trusted method of installing Android Studio. However, it may take some time as well as patience typing in the commands.

First: We would have to install the Java Development Kit from Oracle.

Installing the Java Development Kit requires some prerequisites which can be installed with the commands below:

sudo apt update
sudo apt install libc6:i386 libncurses5:i386 libstdc++6:i386 lib32z1 libbz2-1.0:i386 wget

Now we can proceed with installing the JDK with the command below:

sudo add-apt-repository ppa:webupd8team/java
sudo apt update
sudo apt install oracle-java8-installer

This would take a while; however, you should stay close by. An Oracle license agreement prompt would come up, asking you to confirm that you agree to their terms.

The prompt is usually about four lines long, with an option for you to choose “Yes” or “No”. Choose “Yes” and then proceed.

After a successful installation, you can check Java version with:

java -version

Also you can check the Java compiler’s version with:

javac -version

Next, we change directories to the Downloads directory and download the Android Studio zip file there.  This can be done with the commands below:

cd Downloads/
wget https://dl.google.com/dl/android/studio/ide-zips/3.1.3.0/android-studio-ide-173.4819257-linux.zip

Just like our previous downloads, this could take some time. After downloading, unzip the file into the /opt directory, where such software is usually kept, with the following command:

sudo unzip android-studio-ide-*-linux.zip -d /opt/

You should now have your android-studio directory unzipped in the /opt directory.

To run Android Studio, go to the bin directory in the unzipped android-studio directory and run the studio.sh file:

cd /opt/android-studio/bin
./studio.sh

It should run fine; however, close the launched application and do not proceed with the setup just yet. You can symlink the studio.sh file to the /bin directory, so you can simply run Android Studio from any directory on the command line.

You can do that with the command below:

sudo ln -sf /opt/android-studio/bin/studio.sh /bin/android-studio

However, you won't be able to access Android Studio from your list of applications just yet; we would cover this in the final setup.

Final setup

After finishing the installation, launch Android Studio again (if you used method three, type android-studio in the terminal) and proceed with the Android Studio Setup Wizard.

Running the setup wizard would take some time as the application is expected to make some other downloads.

After those downloads are complete, you should download the SDKs needed to develop software for your target Android versions. This usually comes up by default, but if it doesn't, you can get them through the following steps:

Click on "File", then "Settings", then "Android SDK". You would see the SDKs for the different versions of Android you can build for; choose the ones you wish to download.

For those who installed using the third method, you can now add the desktop icon to your app tray by clicking "Tools" and then "Create Desktop Entry".

There you have it, Android Studio installed on your Ubuntu 18.04.

]]>
Finding Children Nodes With Beautiful Soup https://linuxhint.com/find_children_nodes_beautiful_soup/ Sun, 15 Jul 2018 06:04:36 +0000 https://linuxhint-com.zk153f8d-liquidwebsites.com/?p=28348 Web scraping is a task that requires an understanding of how web pages are structured. To get the needed information from web pages, one needs to analyze the tags that hold that information and then the attributes of those tags.

For beginners in web scraping with BeautifulSoup, an article discussing the concepts of web scraping with this powerful library can be found here.

This article is for programmers, data analysts, scientists or engineers who already have the skillset of extracting content from web pages using BeautifulSoup. If you do not have any knowledge of this library,  I advise you to go through the BeautifulSoup tutorial for beginners.

Now we can proceed; I want to believe that you already have this library installed. If not, you can install it using the command below:

pip install BeautifulSoup4

Since we are working with extracting data from HTML, we need a basic HTML page to practice these concepts on. For this article, we would use the HTML snippet below for practice, assigning it to a variable using triple quotes in Python.

sample_content = """<html>
<head>
<title>LinuxHint</title>
</head>
<body>
<p>

To make an unordered list, the ul tag is used:
 
<ul>
Here's an unordered list
 
<li>First option</li>
<li>Second option</li>
</ul>
</p>
<p>

To make an ordered list, the ol tag is used:
 
<ol>

Here's an ordered list

<li>Number One</li>
<li>Number Two</li>
</ol>
</p>
<p>Linux Hint, 2018</p>
</body>
</html>"""

Now that we have sorted that, let’s move right into working with the BeautifulSoup library.

We are going to be making use of a couple of methods and attributes which we would be calling on our BeautifulSoup object. However, we would need to parse our string using BeautifulSoup and then assign the result to an "our_soup" variable.

from bs4 import BeautifulSoup as bso
our_soup = bso(sample_content, "lxml")

Henceforth, we would be working with the “our_soup” variable and calling all of our attributes or methods on it.

On a quick note, if you do not already know what a child node is, it is basically a node (tag) that exists inside another node. In our HTML snippet for example, the li tags are children nodes of both the “ul” and the “ol” tags.

Here are the methods we would be taking a look at:

  • findChild
  • findChildren
  • contents
  • children
  • descendants

findChild():

The findChild method is used to find the first child node of HTML elements. For example, when we take a look at our "ol" or "ul" tags, we would find two children tags in each of them. However, when we use the findChild method, it only returns the first node as the child node.

This method could prove very useful when we want to get only the first child node of an HTML element, as it returns the required result right away.

The returned object is of the type bs4.element.Tag. We can extract the text from it by calling the text attribute on it.

Here’s an example:

first_child = our_soup.find("body").find("ol")
print(first_child.findChild())

 The code above would return the following:

<li>Number One</li>

To get the text from the tag, we call the text attribute on it.

Like:

print(first_child.findChild().text)

To get the following result:

'Number One'

findChildren():

We have taken a look at the findChild method and seen how it works. The findChildren method works in a similar way; however, as the name implies, it doesn't find only one child node, it gets all of the children nodes in a tag.

When you need to get all the children nodes in a tag, the findChildren method is the way to go. This method returns all of the children nodes in a list; you can access the tag of your choice using its index number.

Here’s an example:

first_child = our_soup.find("body").find("ol")
print(first_child.findChildren())

This would return the children nodes in a list:

[<li>Number One</li>, <li>Number Two</li>]

To get the second child node in the list, the following code would do the job:

print(first_child.findChildren()[1])

To get the following result:

<li>Number Two</li>

That’s all BeautifulSoup provides when it comes to methods. However, it doesn’t end there. Attributes can also be called on our BeautifulSoup objects to get the child/children/descendant node from an HTML element.

contents:

While the findChildren method did the straightforward job of extracting the children nodes, the contents attribute does something a bit different.

The contents attribute returns a list of all the content in an HTML element, including the children nodes. So when you call the contents attribute on a BeautifulSoup object, it would return the text as strings and the nodes in the tags as a bs4.element.Tag object.

Here’s an example:

first_child = our_soup.find("body").find("ol")
print(first_child.contents)

This returns the following:

["\n   Here's an ordered list\n   ", <li>Number One</li>,
'\n', <li>Number Two</li>, '\n']

As you can see, the list contains the text that comes before a child node, the child node and the text that comes after the child node.

To access the second child node, all we need to do is to make use of its index number as shown below:

print(first_child.contents[3])

This would return the following:

<li>Number Two</li>

children:

Here is one attribute that does almost the same thing as the contents attribute. However, it has one small difference that could make a huge impact (for those that take code optimization seriously).

The children attribute also returns the text that comes before a child node, the child node itself and the text that comes after the child node. The difference here is that it returns them as a generator instead of a list.

Let’s take a look at the following example:

first_child = our_soup.find("body").find("ol")
print(first_child.children)

The code above gives the following results (the address on your machine doesn’t have to tally with the one below):

<list_iterator object at 0x7f9c14b99908>

As you can see, it only returns a generator object, not the items themselves. We could convert this generator into a list.

We can see this in the example below:

first_child = our_soup.find("body").find("ol")
print(list(first_child.children))

This gives the following result:

["\n        Here's an ordered list\n        ", <li>Number One</li>,
'\n', <li>Number Two</li>, '\n']
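
Note that converting to a list is optional; because children gives you a generator, you can also loop over it directly, which avoids building the whole list in memory. A small sketch using the same our_soup object:

first_child = our_soup.find("body").find("ol")
for item in first_child.children:
    print(item)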

descendants:

While the children attribute gets only the content directly inside a tag, i.e., the text and nodes on the first level, the descendants attribute goes deeper and does more.

The descendants attribute gets all of the text and nodes that exist in children nodes. So it doesn’t return only children nodes, it returns grandchildren nodes as well.

Aside from returning the tags themselves, it also returns the text content inside each tag as separate strings.

Just like the children attribute, descendants returns its results as a generator.

We can see this below:

first_child = our_soup.find("body").find("ol")
print(first_child.descendants)

This gives the following result:

<generator object descendants at 0x7f9c14b6d8e0>

As seen earlier, we can then convert this generator object into a list:

first_child = our_soup.find("body").find("ol")
print(list(first_child.descendants))

We would get the list below:

["\n   Here's an ordered list\n   ", <li>Number One</li>,
'Number One', '\n', <li>Number Two</li>, 'Number Two', '\n']

Conclusion

There you have it: five different ways to access children nodes of HTML elements. There could be more ways; however, with the methods and attributes discussed in this article, one should be able to access the children nodes of any HTML element.

]]>
Top 20 Best Webscraping Tools https://linuxhint.com/top_20_webscraping_tools/ Mon, 04 Jun 2018 02:13:08 +0000 https://linuxhint-com.zk153f8d-liquidwebsites.com/?p=27040 Data lives more on the web than in any other place. With the rise in social media activity and the development of more web applications and solutions, the web generates a lot more data than you and I can envisage.

Wouldn’t it be a waste of resources if we couldn’t extract this data and make something out of it?

There is no doubt that it would be great to extract this data, and this is where web scraping steps in.

With web scraping tools, we can get the desired data from the web without having to do it manually (which is probably impossible in this day and age).

In this article, we would take a look at the top twenty web scraping tools available for use. These tools are not arranged in any specific order, but all of them stated here are very powerful tools in the hands of their user.

While some require coding skills, some are command-line-based tools, and others are graphical, point-and-click web scraping tools.

Let’s get into the thick of things.

Import.io:

This is one of the most brilliant web scraping tools out there. Using machine learning, Import.io ensures all the user needs to do is to insert the website URL and it does the remaining work of bringing orderliness into the unstructured web data.

Dexi.io:

A strong alternative to Import.io, Dexi.io allows you to extract and transform data from websites into any file type of your choice. Aside from providing web scraping functionality, it also provides web analytics tools.

Dexi doesn’t just work with websites, it can be used to scrape data from social media sites as well.

80 legs:

A Web Crawler as a Service (WCaaS), 80 legs provides users with the ability to perform crawls in the cloud without placing the user's machine under a lot of stress. With 80 legs, you only pay for what you crawl; it also provides easy-to-work-with APIs to help make the lives of developers easier.

Octoparse:

While other web scraping tools may struggle with JavaScript heavy websites, Octoparse is not to be stopped. Octoparse works great with AJAX dependent websites, and is user friendly too.

However, it is only available for Windows machines, which could be a bit of a limitation especially for Mac and Unix users. One great thing about Octoparse though, is that it can be used to scrape data from an unlimited number of websites. No limits!

Mozenda:

Mozenda is a feature filled web scraping service. While Mozenda is more about paid services than free ones, it is worth the pay when considering how well the tool handles very disorganized websites.

Since it always makes use of anonymous proxies, you barely need to be concerned about being locked out of a site during a web scraping operation.

Data Scraping Studio:

Data scraping studio is one of the fastest web scraping tools out there. However just like Mozenda, it is not free.

Using CSS and Regular Expressions (Regex), Data Scraping Studio comes in two parts:

  • a Google Chrome extension.
  • a Windows desktop agent for launching web scraping processes.

Crawl Monster:

Not your regular web crawler, Crawl Monster is a free website crawler tool that is used to gather data and then generate reports based on the gathered information as it affects Search Engine Optimization (SEO).

This tool provides features such as real-time site monitoring, analysis of website vulnerabilities, and analysis of SEO performance.

 Scrapy:

Scrapy is one of the most powerful web scraping tools, and it requires coding skills. Built on the Twisted library, it is a Python library able to scrape multiple web pages at the same time.

Scrapy supports data extraction using Xpath and CSS expressions, making it easy to use. Aside from being easy to learn and work with, Scrapy is cross-platform and very fast, making it perform efficiently.

Selenium:

Just like Scrapy, Selenium is another free web scraping tool that requires coding skills. Selenium is available in a lot of languages, such as PHP, Java, JavaScript, Python, etc., and is available for multiple operating systems.

Selenium isn't only used for web scraping; it can also be used for web testing and automation. It could be slow, but it does the job.

Beautifulsoup:

Yet another beautiful web scraping tool. BeautifulSoup is a Python library used to parse HTML and XML files and is very useful for extracting needed information from web pages.

This tool is easy to use and should be the one to call upon for any developer needing to do some simple and quick web scraping.

Parsehub:

One of the most efficient web scraping tools remains Parsehub. It is easy to use and works very well with all kinds of web applications from single-page apps to multi-page apps and even progressive web apps.

Parsehub can also be used for web automation. It has a free plan to scrape 200 pages in 40 minutes; however, more advanced premium plans exist for more complex web scraping needs.

Diffbot:

One of the best commercial web scraping tools out there is Diffbot. Through the implementation of machine learning and natural language processing, Diffbot is able to scrape important data from pages after understanding the page structure of the website. Custom APIs can also be created to help scrape data from web pages as it suits the user.

However, it can be quite expensive.

Webscraper.io:

This doesn't mean it is any less effective though, as it uses different types of selectors to navigate web pages and extract the needed data.

There also exists a cloud web scraper option, however that is not free.

Content grabber:

Content grabber is a Windows based web scraper powered by Sequentum, and is one of the fastest web scraping solutions out there.

It is easy to use, and barely requires technical skills like programming. It also provides an API that can be integrated into desktop and web applications. It is very much on the same level as the likes of Octoparse and Parsehub.

Fminer:

Another easy to use tool on this list. Fminer does well with executing form inputs during web scraping, works well with Web 2.0 AJAX heavy sites and has multi-browser crawling capability.

Fminer is available for both Windows and Mac systems, making it a popular choice for startups and developers. However, it is a paid tool with a basic plan of $168.

Webharvy:

Webharvy is a very smart web scraping tool. With its simple point-and-click mode of operation, the user can browse and select the data to be scraped.

This tool is easy to configure, and web scraping can be done through the use of keywords.

Webharvy goes for a single license fee of $99, and has a very good support system.

Apify:

Apify (formerly Apifier) converts websites into APIs quickly. It is a great tool for developers, as it improves productivity by reducing development time.

More renowned for its automation feature, Apify is very powerful for web scraping purposes as well.

It has a large user community, plus other developers have built libraries for scraping certain websites with Apify which can be used immediately.

Common Crawl:

Unlike the remaining tools on this list, Common Crawl has a corpus of extracted data from a lot of websites available. All the user needs to do is to access it.

Using Apache Spark and Python, the dataset can be accessed and analysed to suit one's needs.

Common Crawl is non-profit based, so if you like the service after using it, do not forget to donate to this great project.

Grabby io:

Here is a task specific web scraping tool. Grabby is used to scrape emails from websites, no matter how complex the technology used in development is.

All Grabby needs is the website URL and it would get all the email addresses available on the website. It is a commercial tool though with a $19.99 per week per project price tag.

Scrapinghub:

Scrapinghub is a Web Crawler as a Service (WCaaS) tool, and is made specially for developers.

It provides options such as Scrapy Cloud for managing Scrapy spiders, Crawlera for getting proxies that won't get banned during web scraping, and Portia, which is a point-and-click tool for building spiders.

ProWebScraper:

ProWebScraper is a no-code web scraping tool: you can build scrapers simply by pointing and clicking on the data points of interest, and ProWebScraper will scrape all of them within a few seconds. This tool helps you extract millions of data points from any website with robust functionalities like automatic IP rotation, extracting data after login, extracting data from JavaScript-rendered websites, a scheduler, and many more. It provides scraping of 1000 pages for free, with access to all features.

Conclusion:

There you have it, the top 20 web scraping tools out there. However, there are other tools that could do a good job too.

Is there any tool you use for web scraping that didn’t make this list? Share with us.

]]>
Scrapy with XPath Selectors https://linuxhint.com/scrapy-with-xpath-selectors/ Wed, 11 Apr 2018 18:38:30 +0000 https://linuxhint-com.zk153f8d-liquidwebsites.com/?p=24987 HTML is the language of web pages, and there is a lot of information hanging between every web page's opening and closing html tags. There are lots of ways to access this information; in this article, we would be doing so using Xpath selectors through Python's Scrapy library.

The Scrapy library is a very powerful web scraping library, easy to use as well. If you are new to this, you can follow the available tutorial on using the Scrapy library.

This tutorial covers the use of Xpath selectors. Xpath uses a path-like syntax to navigate the nodes of XML documents. Xpath expressions are also useful in navigating HTML tags.

Unlike in the Scrapy tutorial, we are going to be doing all of our operations here on the terminal for simplicity's sake. This doesn't mean that Xpath can't be used in a proper Scrapy program; the selectors can be called on the response parameter in the parse function.

We are going to be working with the example.webscraping.com site, as it is very simple and would help understand the concepts.

To use scrapy in our terminal, type in the command below:

$ scrapy shell http://example.webscraping.com

It would visit the site and get the needed information, then leave us with an interactive shell to work with. You should see a prompt like:

In [1]:

From the interactive session, we are going to be working with the response object.

Here's what our syntax would look like for the majority of this article:

In [1]: response.xpath('xpathsyntax').extract()

The command above is used to extract all of the matched tags according to the Xpath syntax and return them in a list.

In [2]: response.xpath('xpathsyntax').extract_first()

The command above is used to extract only the first matched tag, returning it on its own rather than in a list.

We can now start working on the Xpath syntax.

NAVIGATING TAGS

Navigating tags in Xpath is very easy; all that is needed is the forward slash "/" followed by the name of the tag.

In [3]: response.xpath('/html').extract()

The command above would return the html tag and everything it contains as a single item in a list.

If we want to get the body of the web page, we would use the following:

In [4]: response.xpath('/html/body').extract()

Xpath also allows the wildcard character "*", which matches everything at the level at which it is used.

In [5]: response.xpath('/*').extract()

The code above would match everything in the document. The same thing happens when we use '/html'.

In [6]: response.xpath('/html/*').extract()

Aside from navigating tags, we can get all the descendant tags of a particular tag by using "//".

In [7]: response.xpath('/html//a').extract()

The code above would return all the anchor tags under the html tag, i.e., it would return a list of all the descendant anchor tags.

TAGS BY ATTRIBUTES AND THEIR VALUES

Sometimes, navigating html tags to get to the required tag can be troublesome. This trouble can be averted by simply finding the needed tag by its attribute.

In [8]: response.xpath('/html//div[@id = "pagination"]').extract()

The code above returns all the div tags under the html tag that have the id attribute with a value of pagination.

In [9]: response.xpath('/html//div[@class = "span12"]').extract()

The code above would return a list of all the div tags under the html tag, only if they have the class attribute with a value of span12.

What if you do not know the value of the attribute, and all you want is to get tags with a particular attribute, with no concern about its value? Doing this is simple as well; all you need to do is use the @ symbol and the attribute name.

In [10]: response.xpath('/html//div[@class]').extract()

This code would return a list of all the div tags that contain the class attribute regardless of what value that class attribute holds.

How about if you know only a couple of characters contained in the value of an attribute? It's also possible to get those types of tags.

In [11]: response.xpath('/html//div[contains(@id, "ion")]').extract()

The code above would return all the div tags under the html tag that have an id attribute; we do not know what value the attribute holds, except that we know it contains "ion".

The page we are parsing has only one tag in this category, and the value is “pagination” so it would be returned.

Cool right?

TAGS BY THEIR TEXT

Remember we matched tags by their attributes earlier. We can also match tags by their text.

In [12]: response.xpath('/html//a[.=" Algeria"]').extract()

The code above would help us get all the anchor tags that have the “ Algeria” text in them. NB: It must be tags with exactly that text content.

Wonderful.

What if we do not know the exact text content, and we only know part of it? We can do that as well.

In [13]: response.xpath('/html//a[contains (text(),"A")]').extract()

The code above would get the tags that have the letter “A” in their text content.

EXTRACTING TAG CONTENT

All along, we have been talking about finding the right tags. It's time to extract the content of the tag when we find it.

It's pretty simple. All we need to do is to add "/text()" to the syntax, and the contents of the tag would be extracted.

In [14]: response.xpath('/html//a/text()').extract()

The code above would get all the anchor tags in the html document, and then extract the text content.

EXTRACTING THE LINKS

Now that we know how to extract the text in tags, then we should know how to extract the values of attributes. Most times, the values of attributes that are of utmost importance to us are links.

Doing this is almost the same as extracting the text values; however, instead of using "/text()", we would use "/@" followed by the name of the attribute.

In [15]: response.xpath('/html//a/@href').extract()

The code above would extract all of the links in the anchor tags; the links are the values of the href attributes.

NAVIGATING SIBLING TAGS

If you noticed, we have been navigating tags all this while. However, there’s one situation we haven’t tackled.

How do we select a particular tag when tags with the same name are on the same level?

<tr>
    <td><div>
<a href="/places/default/view/Afghanistan-1">
<img src="/places/static/images/flags/af.png"> Afghanistan</a>
</div></td>

    <td><div>
<a href="/places/default/view/Aland-Islands-2">
<img src="/places/static/images/flags/ax.png"> Aland Islands</a>
</div></td>
</tr>

In a case like the one we have above, if we are to look at it, we might say we’d use extract_first() to get the first match.

However, what if we want to match the second one? What if there are more than ten options and we want the fifth one? We are going to answer that right now.

Here is the solution: when we write our Xpath syntax, we put the position of the tag we want in square brackets, just like indexing, except that the index starts at 1.

Looking at the html of the web page we are dealing with, you'd notice that there are a lot of <tr> tags on the same level. To get the third <tr> tag, we'd use the following code:

In [16]: response.xpath('/html//tr[3]').extract()

You'd also notice that the <td> tags come in twos; if we want only the second <td> tag from the <tr> rows, we'd do the following:

In [17]: response.xpath('/html//td[2]').extract()
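
These position filters can be combined with everything covered earlier. As a sketch (assuming the table structure shown above), the following expression would grab the text of the anchor tag inside the second cell of the third row:

In [18]: response.xpath('/html//tr[3]/td[2]//a/text()').extract_first()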

CONCLUSION:

Xpath is a very powerful way to parse html files, and could help minimize the use of regular expressions in parsing them considering it has the contains function in its syntax.

There are other libraries that allow parsing with Xpath such as Selenium for web automation. Xpath gives us a lot of options while parsing html, but what has been treated in this article should be able to carry you through common html parsing operations.

]]>
Selenium Web Automation with Python https://linuxhint.com/selenium-web-automation-python/ Mon, 02 Apr 2018 10:59:17 +0000 https://linuxhint-com.zk153f8d-liquidwebsites.com/?p=24509 Everyone uses the web at one point or the other, so it is a big responsibility for developers to ensure their web applications function as intended. In order to do this, web automation can be very helpful.

For any commercial software to be successful, it has to undergo a couple of tests. Automation could be useful for user tests, simulating the use of software just like a user would. It is also useful for penetration tests, such as trying to crack passwords, perform SQL injections etc.

Aside from testing, web automation can be very handy for scraping JavaScript-heavy websites.

Selenium is one of the most efficient tools for web automation. It is very popular across different languages too, being available in languages such as Java and JavaScript.

Installation

Selenium can be installed in python using the pip module as shown in the command below:

pip install selenium

It would install the library and the needed dependencies; the installation can be confirmed by importing it in an interactive session.

$ python
Python 3.5.2 (default, Sep 14 2017, 22:51:06)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import selenium

Since no error occurred, it means our installation was successful. However, it doesn't end there; this is because Selenium works hand in hand with browsers such as Chrome and Firefox, and it needs a driver from the browser to be able to proceed with its duties.

We are going to be taking a look at how to get the drivers installed. For Mozilla Firefox, you can download its driver, known as geckodriver, from the GitHub page. If you are a Chrome user, you can download its driver, known as chromedriver, from the official site.

After downloading, you then add the driver to the path. Personally, I like to keep such a file in my /usr/local/bin directory, and I'd advise you to do the same.

If you'd like to do the same, the commands below should move the drivers from your current directory to the bin directory.

$ sudo mv geckodriver /usr/local/bin
$ sudo mv chromedriver /usr/local/bin

If /usr/local/bin is not already on your path, add that directory (not the driver file itself, since PATH entries must be directories) by running the following command.

$ export PATH=$PATH:/usr/local/bin
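
Alternatively, if you would rather not touch your PATH, the Selenium releases that were current when this article was written also accept an explicit executable_path argument when creating the driver. A minimal sketch, assuming the driver binaries were moved to /usr/local/bin as above:

from selenium import webdriver

# point Selenium directly at the driver binary instead of relying on PATH
driver = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
# or, for Firefox:
# driver = webdriver.Firefox(executable_path="/usr/local/bin/geckodriver")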

After adding the driver for your desired browser to the path, you can confirm if everything works fine by running the following from an interactive session.

For Firefox:

$ python
Python 3.5.2 (default, Sep 14 2017, 22:51:06)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from selenium import webdriver
>>> webdriver.Firefox()

For Chrome:

$ python
Python 3.5.2 (default, Sep 14 2017, 22:51:06)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from selenium import webdriver
>>> driver = webdriver.Chrome()

After running that, if a browser comes up then everything is working fine. Now we can proceed to do cool stuff with Selenium.

Most of the code for the rest of this article would be done in the interactive session; however, you can write it in a file just like your usual Python script.

Also, we would be working with the driver variable from the code above.

Visiting web pages

After the browser is open, you can visit any web page by calling the get method on the driver. The opened browser then loads the address passed in, just like it would when you type it in yourself.

Do not forget to use http:// or https://, else you‘d have to deal with unpleasant errors.

>>> driver.get("http://google.com")

This would load the Google homepage.

Getting source code

Now that we have learnt to visit web pages, we can scrape data from the visited web page.

From the driver object, we can get the source code by calling the page_source attribute; you can then do whatever you want with the html using the BeautifulSoup library.

>>> driver.page_source
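
For example, here is a minimal sketch of handing that source code to BeautifulSoup (this pairing is not part of the original walkthrough, and it assumes the bs4 and lxml packages are installed):

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(driver.page_source, "lxml")
>>> soup.title.text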

Filling text boxes

If, for example, we have loaded Google's homepage and we want to type some information into the search box, it can easily be done.

To do this, we use the browser's inspect element tool to check the source code and see the tag information of the search box: simply right-click on the search box and select inspect element.

On my machine, I got the following:

<input class="gsfi" id="lst-ib" maxlength="2048" name="q" autocomplete="off" title="Search"
value="" aria-label="Search" aria-haspopup="false" role="combobox" aria-autocomplete="list"
style="border: medium none; padding: 0px; margin: 0px; height: auto; width: 100%;
background: transparent
url(&quot;data:image/gif;base64,R0lGODlhAQABAID/AMDAwAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEA
Ow%3D%3D&quot;) repeat scroll 0% 0%; position: absolute; z-index: 6; left: 0px; outline:
medium none;"
dir="ltr" spellcheck="false" type="text">

With Selenium, we can select elements by tag name, id, class name, and so on.

They can be implemented with the following methods:

.find_element_by_id
.find_element_by_tag_name
.find_element_by_class_name
.find_element_by_name

From the Google web page, the search box has an id of lst-ib, so we would find the element by id.

>>> search_box = driver.find_element_by_id("lst-ib")

Now that we have found the element and saved it in a search_box variable, we can get to perform some operations on the search box.

>>> search_box.send_keys("Planet Earth")

This would input the text "Planet Earth" in the box.

>>> search_box.clear()

This would clear the entered text from the search box. You should use the send_keys method again; in the next section, we would be clicking the search button so we have something to search for.

Clicking the right buttons

Now that we have filled the search box with some information, we can go ahead and search.

We are going to find the search button the same way we found the search box.

On my machine, I got the following:

<input value="Google Search" aria-label="Google Search" name="btnK" jsaction="sf.chk"
type="submit">

Looking at this, we can make use of the name attribute. We can get the button by using the code below:

>>> search_button = driver.find_element_by_name("btnK")

After finding the desired tag, we can then click on the button using the click method.

>>> search_button.click()

Be careful though: due to Google's auto-suggestions, you may end up searching for something else.

To bypass this, you need to make the keyboard hit the enter key immediately. Keys are beyond the scope of this article, but here's the code anyway.

>>> from selenium.webdriver.common.keys import Keys
>>> search_box = driver.find_element_by_id("lst-ib")
>>> search_box.send_keys("Planet Earth")
>>> search_box.send_keys(Keys.RETURN)

With the code above, we do not have to click the search button. It works just like it would when we hit the enter key after typing in the search values.

This method of clicking doesn't only work with buttons; it also works with links.

Taking screenshots

You read that right! You can take screenshots using Selenium, and it's as easy as the previous sections.

What we'll do is call the save_screenshot method on the driver object, passing in the name of the image, and the screenshot would be taken.

>>> driver.save_screenshot("Planet-earth.png")

Ensure that the image name has a .png extension, else you might end up with a corrupted image.

When you are done with the operations, you can close the browser by running the following code:

>>> driver.close()
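
Putting the pieces together, here is a minimal sketch of a complete script that performs the search and saves a screenshot. It only reuses the calls shown above; the element id and the find_element_by_id method reflect the versions of Google's page and of Selenium that were current when this was written, so they may need adjusting:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()  # or webdriver.Chrome()
driver.get("http://google.com")

# find the search box, type a query and submit it with the enter key
search_box = driver.find_element_by_id("lst-ib")
search_box.send_keys("Planet Earth")
search_box.send_keys(Keys.RETURN)

# save a screenshot of the results page, then close the browser
driver.save_screenshot("Planet-earth.png")
driver.close()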

Conclusion

Selenium is known as a very powerful tool, and being able to use it is considered a vital skill for automation testers. Selenium can do much more than what is discussed in this article; keyboard movements, for instance, can be replicated as shown with Keys.RETURN. If you wish to learn more about Selenium, you can check out its documentation; it's quite clear and easy to use.

]]>
Web Scraping with Python Scrapy Module https://linuxhint.com/web-scraping-python-scrapy/ Sun, 25 Mar 2018 19:26:47 +0000 https://linuxhint-com.zk153f8d-liquidwebsites.com/?p=24160 The skill of web scraping has become golden today, so let's learn how we can get needed data from web pages. In this article, we would be talking about the Scrapy Python library, what it can do and how to use it. Let's get started.

Why Scrapy?

Scrapy is a robust web scraping library that provides the ability to download web pages, images, and any data you could think of, at lightning speed. Speed is of great importance in computation, and Scrapy works on this by visiting websites asynchronously and doing a lot of background work, making the whole task look easy.

It should be said that Python has other libraries that can be used to scrape data from websites, but none is comparable to Scrapy when it comes to efficiency.

Installation

Let‘s have a quick look at how this powerful library can be installed on your machine.

As with the majority of Python libraries, you can install Scrapy using the pip module:

pip install Scrapy

You can check if the installation was successful by importing scrapy in Python‘s interactive shell.

$ python
Python 3.5.2 (default, Sep 14 2017, 22:51:06)
[GCC 5.4.0 20160609] on linux

Type "help", "copyright", "credits" or "license" for more information.

>>> import scrapy

Now that we are done with the installation, let‘s get into the thick of things.

Creating a Web Scraping Project

During installation, the scrapy keyword was added to the path, so we can use the keyword directly from the command line. We would be taking advantage of this throughout our use of the library.

From the directory of your choice run the following command:

scrapy startproject webscraper

This would create a directory called webscraper in the current directory, along with a scrapy.cfg file. The webscraper directory would contain __init__.py, items.py, middlewares.py, pipelines.py and settings.py files, and a directory called spiders.

Our spider files, i.e., the scripts that do the web scraping for us, would be stored in the spiders directory.

Writing Our Spider

Before we go ahead to write our spider, it is expected that we already know what website we want to scrape. For the purpose of this article, we are scraping a sample webscraping website: http://example.webscraping.com.

This website just has country names and their flags, spread across different pages, and we are going to be scraping three of those pages. The three pages we would be working on are:

http://example.webscraping.com/places/default/index/0
http://example.webscraping.com/places/default/index/1
http://example.webscraping.com/places/default/index/2

Back to our spider, we are going to create a sample_spider.py in the spiders directory. From the terminal, a simple touch sample_spider.py command would help create a new file.

After creating the file, we would populate it with the following lines of code:

import scrapy
 
class SampleSpider(scrapy.Spider):
  name = "sample"
  start_urls = [
      "http://example.webscraping.com/places/default/index/0",
      "http://example.webscraping.com/places/default/index/1",
      "http://example.webscraping.com/places/default/index/2"
  ]
 
  def parse(self, response):
      page_number = response.url.split('/')[-1]
      file_name = "page{}.html".format(page_number)
      with open(file_name, 'wb') as file:
       file.write(response.body)

From the top level of the project‘s directory, run the following command:

scrapy crawl sample

Recall that we gave our SampleSpider class a name attribute sample.

After running that command, you would notice that three files named page0.html, page1.html, page2.html are saved to the directory.

Let‘s take a look at what happens with the code:

import scrapy

First we import the library into our namespace.

class SampleSpider(scrapy.Spider):
  name = "sample"

Then we create a spider class which we call SampleSpider. Our spider inherits from scrapy.Spider. All our spiders have to inherit from scrapy.Spider. After creating the class, we give our spider a name attribute, this name attribute is used to summon the spider from the terminal. If you recall, we ran the scrapy crawl sample command to run our code.

start_urls = [
 
   "http://example.webscraping.com/places/default/index/0",
   "http://example.webscraping.com/places/default/index/1",
   "http://example.webscraping.com/places/default/index/2"
]

We also have a list of URLs for the spider to visit. The list must be called start_urls; if you want to give the list a different name, you would have to define a start_requests method, which gives us some more capabilities, as sketched below. To learn more, you can check out the Scrapy documentation.
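
As a quick, hedged sketch of that alternative (the attribute name pages below is illustrative, not from the original article), a start_requests method yields scrapy.Request objects instead of relying on start_urls:

import scrapy

class SampleSpider(scrapy.Spider):
    name = "sample"
    # any attribute name works here, because start_requests builds the requests itself
    pages = [
        "http://example.webscraping.com/places/default/index/0",
        "http://example.webscraping.com/places/default/index/1",
        "http://example.webscraping.com/places/default/index/2"
    ]

    def start_requests(self):
        for url in self.pages:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        pass  # same parsing logic as in the spider above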

Regardless, do not forget to include the http:// or https:// for your links; else, you would have to deal with a missing scheme error.

def parse(self, response):

We then go ahead to declare a parse function and give it a response parameter. When the code is run, the parse function is invoked, and the response object, which contains all the information of the visited web page, is passed in.

page_number = response.url.split('/')[-1]
file_name = "page{}.html".format(page_number)

What we have done with this code is split the string containing the address and save the page number alone in a page_number variable. Then we create a file_name variable by inserting the page_number into the string that would be the filename of the file we would be creating.

with open(file_name, 'wb') as file:
  file.write(response.body)

We have now created the file, and we are writing the contents of the web page into the file using the body attribute of the response object.

We can do more than just saving the web page. The BeautifulSoup library can be used to parse response.body. You can check out this BeautifulSoup tutorial if you are not familiar with the library.

From the page to be scraped, here is an excerpt of the html containing the data we need:

<div id="results">
<table>
<tr><td><div><a href="/places/default/view/Afghanistan-1">
<img src="/places/static/images/flags/af.png" /> Afghanistan</a></div></td>
<td><div><a href="/places/default/view/Aland-Islands-2">
<img src="/places/static/images/flags/ax.png" /> Aland Islands</a></div></td>
</tr>
...

</table>
</div>

You'd notice that all of the needed data is enclosed in div tags, so we are going to rewrite the code to parse the html.

Here's our new script:

import scrapy
from bs4 import BeautifulSoup
 
class SampleSpider(scrapy.Spider):
    name = "sample"
 
    start_urls = [
     "http://example.webscraping.com/places/default/index/0",
     "http://example.webscraping.com/places/default/index/1",
     "http://example.webscraping.com/places/default/index/2"
     ]
 
    def parse(self, response):
      page_number = response.url.split('/')[-1]
      file_name = "page{}.txt".format(page_number)
      with open(file_name, 'w') as file:
        html_content = BeautifulSoup(response.body, "lxml")
        div_tags = html_content.find("div", {"id": "results"})
        country_tags = div_tags.find_all("div")
        country_name_position = zip(range(len(country_tags)), country_tags)
        for position, country_name in country_name_position:
          file.write("country number {} : {}\n".format(position + 1, country_name.text))

The code is pretty much the same as the initial one; however, I have added BeautifulSoup to our namespace and changed the logic in the parse function.

Let‘s have a quick look at the logic.

def parse(self, response):

Here we have defined the parse function, and given it a response parameter.

page_number = response.url.split('/')[-1]
file_name = "page{}.txt".format(page_number)
with open(file_name, 'w') as file:

This does the same thing as discussed in the initial code; the only difference is that we are working with a text file instead of an html file. We would be saving the scraped data in the text file, and not the whole web content in html as done previously.

html_content = BeautifulSoup(response.body, "lxml")

What we've done in this line of code is send in the response.body as an argument to the BeautifulSoup library and assign the result to the html_content variable.

div_tags = html_content.find("div", {"id": "results"})

Taking the html content, we are parsing it here by searching for a div tag that also has an id attribute with results as its value; then we save it in a div_tags variable.

country_tags = div_tags.find_all("div")

Remember that the countries existed in div tags as well; now we are simply getting all of those div tags and saving them as a list in the country_tags variable.

country_name_position = zip(range(len(country_tags)), country_tags)
 
for position, country_name in country_name_position:
  file.write("country number {} : {}\n".format(position + 1, country_name.text))

Here, we are iterating through the countries and their positions among all the country tags, then saving the content to a text file.

So in your text file, you would have something like:

country number 1 :  Afghanistan
country number 2 :  Aland Islands
country number 3 :  Albania
……..

Conclusion

Scrapy is undoubtedly one of the most powerful libraries out there; it is very fast and basically downloads the web page. It then gives you the freedom to do whatever you wish with the web content.

We should note that Scrapy can do much more than we have checked out here. You can parse data with Scrapy CSS or Xpath selectors if you wish. You can read up on the documentation if you need to do something more complex.

]]>
Python BeautifulSoup Tutorial For Beginners https://linuxhint.com/python-beautifulsoup-tutorial-for-beginners/ Wed, 14 Mar 2018 13:05:14 +0000 https://linuxhint-com.zk153f8d-liquidwebsites.com/?p=23712 Web scraping is of great importance in today's world. Everybody needs data from different sources, including web pages. In this article, we will look at how to parse html with the BeautifulSoup library. Extracting needed data out of a bunch of alphabets and symbols, thanks to this great library, has become a lot easier. BeautifulSoup, written in Python, can easily be installed on your machine using Python's pip installation tool. The following command would help get the library installed:

pip install BeautifulSoup4

To check if the installation was successful, activate the Python interactive shell and import BeautifulSoup. If no error shows up, it means everything went fine.  If you do not know how to go about that, type the following commands in your terminal.

$ python
Python 3.5.2 (default, Sep 14 2017, 22:51:06)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bs4

To work with the BeautifulSoup library, you have to pass in html. When working with real websites, you can get the html of a web page using the requests library. The installation and use of the requests library are beyond the scope of this article; however, you could find your way around its documentation, as it's pretty easy to use. For this article, we are simply going to be using html in a Python string, which we would be calling html.
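
For completeness, here is a minimal, hedged sketch of how a live page could be fetched with requests and handed to BeautifulSoup (the URL is just an example, and the requests and lxml packages are assumed to be installed):

import requests
from bs4 import BeautifulSoup

# download a page and parse its html
response = requests.get("https://linuxhint.com")
soup = BeautifulSoup(response.text, "lxml")
print(soup.title.text)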

html = """<html>
<head>
<title>Employee Profile</title>
<meta charset="utf-8"/>
</head>
<body>
<div class="name"><b>Name:</b>Dr Peter Parker</div>
<div class="job"><b>Job:</b>Machine Learning Engineer</div>
<div class="telephone"><b>Telephone:</b>+12345678910</div>
<div class="email"><b>Email:</b><a href="mailto:peteparker@svalley.com">
peteparker@svalley.com</a></div>
<div class="website"><b>Website:</b><a href="http://pparkerworks.com">
pparkerworks.com</a></div>
</body>
</html>
"""

To use beautifulsoup, we import it into the code using the code below:

from bs4 import BeautifulSoup

This would introduce BeautifulSoup into our namespace and we can get to use it in parsing our string.

soup = BeautifulSoup(html, "lxml")

Now, soup is a BeautifulSoup object of type bs4.BeautifulSoup, and we can perform all the BeautifulSoup operations on the soup variable.

Let‘s take a look at some things we can do with BeautifulSoup now.

MAKING THE UGLY, BEAUTIFUL

When BeautifulSoup parses html, the result is not usually in the best of formats. The spacing is pretty horrible and the tags are difficult to find, as you would see if you printed the soup directly.

However, there is a solution to this. The solution gives the html perfect spacing, making things look good. This solution is deservedly called "prettify".

Admittedly, you may not get to use this feature most of the time; however there are times when you may not have access to the inspect element tool of a web browser. In those times of limited resources, you would find the prettify method very useful.

Here is how you use it:

soup.prettify()

The markup would then look properly spaced and indented.

When you apply the prettify method to the soup, the result is no longer of type bs4.BeautifulSoup; it is now a plain text string (type 'unicode' in Python 2, str in Python 3). This means you cannot apply other BeautifulSoup methods to it; however, the soup itself is not affected, so we are safe.

FINDING OUR FAVORITE TAGS

HTML is made up of tags. It stores all of its data in them, and in the midst of all that clutter lies the data we need. Basically, this means when we find the right tags, we can get what we need.

So how do we find the right tags? We make use of BeautifulSoup‘s find and find_all methods.

Here‘s how they work:

The find method searches for the first tag with the needed name and returns an object of type bs4.element.Tag.

The find_all method, on the other hand, searches for all tags with the needed tag name and returns them as a list of type bs4.element.ResultSet. All the items in the list are of type bs4.element.Tag, so we can carry out indexing on the list and continue our BeautifulSoup exploration.

Let's see some code, using the div tags as an example:

soup.find("div")

We would get the following result:

<div class="name"><b>Name:</b>Dr Peter Parker</div>

Checking the html variable, you would notice that this is the first div tag.

soup.find_all("div")

We would get the following result:

[
<div class="name"><b>Name:</b>Dr Peter Parker</div>,
<div class="job"><b>Job:</b>Machine Learning Engineer</div>,
<div class="telephone"><b>Telephone:</b>+12345678910</div>,
<div class="email"><b>Email:</b><a href="mailto:peteparker@svalley.com">
peteparker@svalley.com</a></div>,
<div class="website"><b>Website:</b><a href="http://pparkerworks.com">
pparkerworks.com</a></div>]

It returns a list. If, for example, you want the third div tag, you run the following code:

soup.find_all("div")[2]

It would return the following:

<div class="telephone"><b>Telephone:</b>+12345678910</div>

FINDING THE ATTRIBUTES OF OUR FAVORITE TAGS

Now that we have seen how to get our favorite tags, how about getting their attributes?

You may be thinking at this point: "What do we need attributes for?" Well, a lot of the time, the data we need are going to be email addresses and websites. This sort of data is usually hyperlinked in web pages, with the links in the "href" attribute.

When we have extracted the needed tag, using the find or find_all methods, we can get attributes by applying attrs. This would return a dictionary of the attribute and its value.

To get the email attribute, for example, we get the <a> tag which surrounds the needed info and do the following.

soup.find_all("a")[0].attrs

Which would return the following result:

{'href': 'mailto:peteparker@svalley.com'}

Same thing for the website attribute.

soup.find_all("a")[1].attrs

Which would return the following result:

{'href': 'http://pparkerworks.com'}

The returned values are dictionaries and normal dictionary syntax can be applied to get the keys and values.
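
For instance, a short example of pulling the link out of that dictionary with ordinary dictionary indexing (using the same soup object as above):

website_link = soup.find_all("a")[1].attrs["href"]
print(website_link)  # prints the value of the href attribute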

LET'S SEE THE PARENT AND CHILDREN

There are tags everywhere. Sometimes, we want to know what the children tags are and what the parent tag is.

If you don‘t already know what a parent and child tag is, this brief explanation should suffice: a parent tag is the immediate outer tag and a child is the immediate inner tag of the tag in question.

Taking a look at our html, the body tag is the parent tag of all the div tags. Also, the bold tag and the anchor tag are the children of the div tags, where applicable, as not all div tags possess anchor tags.

So we can access the parent tag by calling the findParent method.

soup.find("div").findParent()

This would return the whole of the body tag:

<body>
<div class="name"><b>Name:</b>Dr Peter Parker</div>
<div class="job"><b>Job:</b>Machine Learning Engineer</div>
<div class="telephone"><b>Telephone:</b>+12345678910</div>
<div class="email"><b>Email:</b><a href="mailto:peteparker@svalley.com">
peteparker@svalley.com</a></div>
<div class="website"><b>Website:</b><a href="http://pparkerworks.com">
pparkerworks.com</a></div>
</body>

To get the children tags of the div tag at index 4 (the fifth div tag), we call the findChildren method:

soup.find_all("div")[4].findChildren()

It returns the following:

[<b>Website:</b>, <a href="http://pparkerworks.com">pparkerworks.com</a>]

WHAT'S IN IT FOR US?

When browsing web pages, we do not see tags everywhere on the screen. All we see is the content of the different tags. What if we want the content of a tag, without all of the angular brackets making life uncomfortable? That's not difficult: all we'd do is call the get_text method on the tag of choice, and we get the text in the tag; if the tag has other tags in it, it also gets their text values.

Here‘s an example:

soup.find("body").get_text()

This returns all of the text values in the body tag:

Name:Dr Peter Parker
Job:Machine Learning Engineer
Telephone:+12345678910
Email:peteparker@svalley.com
Website:pparkerworks.com

CONCLUSION

That's what we've got for this article. However, there are still other interesting things that can be done with BeautifulSoup. You can either check out the documentation or use dir(BeautifulSoup) on the interactive shell to see the list of operations that can be carried out on a BeautifulSoup object. That's all from me today, till I write again.

]]>
Kivy Python Tutorial https://linuxhint.com/kivy-python-tutorial/ Wed, 28 Feb 2018 18:22:49 +0000 https://linuxhint-com.zk153f8d-liquidwebsites.com/?p=23203 The importance of mobile software in our world today can never be overemphasized; everyone moves about with their devices regardless of the operating system, and for these devices to be useful, there is a need for software that helps carry out our daily tasks.

The Android operating system is arguably one of the most used operating systems on mobile devices today, and it is very efficient as well, thanks to its affiliations with the Linux operating system. In this article, we are going to discuss how to build a sample Android app with Python.

So why Python?

We know that languages like Java and Kotlin, and frameworks like Xamarin and React Native, are very efficient for building apps, but more often than not, system admins are more conversant with scripting languages such as Python for their tasks.

With Kivy, they can build minimal Android apps for simple tasks on their Android devices without having to experience a change in syntax. Yes, we all know Python is not so fast when used in app development, but who cares if it does the needed job?

With this, you can quickly write a web scraping script, for example, compile it into an Android app, and run it on the move; that's pretty cool.

To do this, we are going to be making use of a Python library called Kivy. Kivy is used to build cross-platform mobile apps, so it's not for Android devices only; it also supports building iOS and Windows software.


Installation of Kivy

Kivy is very easy to install, but things could go a bit haywire if the installed dependencies begin to clash.

To install Kivy, we can use the "pip" command for installing Python libraries, and we can use "apt-get" as well. Kivy has a lot of dependencies, especially when you are trying to make use of features such as the camera (i.e., OpenCV) or another library such as Pillow.

However, you can get to do a simple installation of Kivy.

You can install Kivy for Python 2 with the command below:

sudo apt-get install python-kivy

Then Kivy for Python 3 can be installed with the command below:

sudo apt-get install python3-kivy

If you intend installing with the “pip” command, the command below will do the job:

pip install kivy

Then one very popular dependency, pygame, can be installed:

pip install pygame

If you intend to install the dependencies at this point, you can go ahead and install.

 For Ubuntu 16.04:

sudo apt-get install python-setuptools python-pygame python-opengl \
python-gst0.10 python-enchant gstreamer0.10-plugins-good python-dev \
build-essential python-pip libgl1-mesa-dev libgles2-mesa-dev zlib1g-dev

If you intend installing for other versions of Ubuntu, you can follow the steps from the Github documentation.

Before we proceed, you can confirm if the installation of Kivy is successful by importing the module from the interactive shell.

>>> import kivy
[INFO  ] [Logger  ] Record log in /data/user/0/ru.iiec.pydroid3/app_HOME/.kivy/
logs/kivy_18-02-26_0.txt
[INFO  ] [Kivy  ] v1.9.2-dev0
[INFO  ] [Python  ] v3.6.2 (default, Oct 15 2017, 09:18:13)
[GCC 7.2.0]
>>>

All you need is output in this format; the version numbers are not expected to match exactly.


Writing the code

We are going to be creating a simple app that displays some text on the screen.

Create a Python file, which we will name "main.py". This file will have the following content:

from kivy.app import App

class HelloApp(App):
    pass

if __name__ == "__main__":
    HelloApp().run()

On the surface, it looks like it does nothing, but we will go through what each line of code does.

from kivy.app import App

This imports the App class from the kivy library, which generates the application interface itself; aside from that, it has a lot of other properties to support building an app.

class HelloApp(App):
  pass

This creates a HelloApp class which inherits from the App class we imported earlier; we are not doing much here, as all we have done is use the "pass" keyword.

So, without typing any extra code, it already has all the methods of the App class.

if __name__ == "__main__":
  HelloApp().run()

Then we check whether the Python script is being run directly or imported. If it is run directly, it executes the run() method inherited from the App class; otherwise, nothing happens.

We are almost done; just one more file is needed. This is a kv file, which we will use for our markup.

The kv file is written in the kv language, whose syntax has some similarity to Python.

Create a new file (we will settle on its name in a moment) and input the following lines of code.

Label:
    text: "Welcome To Linux Hint"

Looking at the main.py file, recall that we created a HelloApp class which inherited from App, and that was the only class.

In the kv file, the Label is automatically linked to the class created in the Python file. "Label" is the widget used to display our text.

The question is: how does our Python file know that this file holds the markup? It does this through the file name.

Since our HelloApp class name is made up of two words distinguished by capital letters, the kv file is expected to be named after the first word, all in lowercase, so our file would be named hello.kv.

If our class is called LinuxApp or GameApp, our kv file would be named linux.kv and game.kv respectively.

Now, you can run your python file:

python main.py

You should get an output saying “Welcome To Linux Hint”.
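For reference, here is a minimal sketch of the same app written without a kv file, by overriding the build() method and returning a Label widget directly; this is not required for the tutorial, just an alternative that keeps everything in Python:

from kivy.app import App
from kivy.uix.label import Label

class HelloApp(App):
    # Without a hello.kv file, build() decides what the root widget is.
    def build(self):
        return Label(text="Welcome To Linux Hint")

if __name__ == "__main__":
    HelloApp().run()

Running this with python main.py should display the same text on screen.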

This is just the tip of the iceberg of what you can do with the Kivy library; you can go through the full documentation here, and you can check out other examples as well.


Installing and using Buildozer

If you have followed this article from the beginning, you will recall that when installing Kivy we had to consider a lot of dependencies. Buildozer, the tool that packages a Kivy project into an Android APK, is not as complicated to install.

All we need to do is clone the files from the GitHub repository, install, and then use it.

git clone https://github.com/kivy/buildozer.git
cd buildozer
sudo python2.7 setup.py install

Here, python2.7 should be the version of Python installed on your system; for example, if you have Python 3.5 installed, you use python3.5. Some people report issues using Buildozer with Python 3, so you can give it a try and, if it fails, switch to Python 2.
 
After installation, run the command below. Just like in the first case, python2.7 can be changed to any version of Python, though it is reasonable to use the same version that was used to install Buildozer.

python2.7 -m buildozer init

This creates a buildozer.spec file, which contains the configuration settings for our app. While you can proceed without changing any of them, you can check out the file and change things such as the application name, package name, etc.
 
The file should be in this format:

[app]
 
# (str) Title of your application
title = app
 
# (str) Package name
package.name = myapp
 
# (str) Package domain (needed for android/ios packaging)
package.domain = org.test

….
….

After this, you can compile your Android application; just like in the first two instances, you can change python2.7 to the version of Python you have installed on your machine.

python2.7 -m buildozer android debug deploy run

If you are doing this for the first time, the needed Android SDK, Android NDK, and Apache Ant files will be downloaded, so you can grab a cup of coffee; it may take some time depending on how fast your internet connection is.
 
When Buildozer is done compiling the application, it saves the APK in the bin directory.
 
That's all for this tutorial; now you can create simple Android applications and run some scripts on your Android device.

]]>
How To Deploy Docker Container On AWS Using Elastic Beanstalk https://linuxhint.com/docker-aws-elastic-beanstalk/ Sun, 03 Dec 2017 07:12:01 +0000 https://linuxhint-com.zk153f8d-liquidwebsites.com/?p=20473 How To Deploy Docker Containers On AWS

Cloud computing has become the go-to approach for hosting web services today. It is cost friendly, more secure, and more dependable than the hosting services that were common some years back. With Amazon Web Services, the already great idea of cloud computing has gotten better and easier to use. Amazon is a reliable company, so anybody would feel at ease having them handle the hosting of their web applications. Since you are reading this article, I will assume that you have an idea of what cloud computing is, what Amazon Web Services (AWS) does, and also what Docker is. Just in case you do not have much of an idea about what they are, let's go through a quick introduction.

Firstly, Cloud computing.

Cloud computing is simply the delivery of on-demand computing resources. This covers everything from applications to data and other IT resources over the internet, with pay-as-you-go pricing. So, with cloud computing, you do not have to pay for resources you do not use.

Secondly, Amazon Web Services (AWS).

"Amazon Web Services is a secure cloud services platform, offering compute power, database storage, content delivery and other functionality to help businesses scale and grow." That simple explanation is quoted from the official Amazon website. Basically, AWS helps improve the flexibility, scalability, and reliability of web applications.

Thirdly, Docker Container.

Docker is an open-source platform that packages an application, along with its dependencies, into a container, making it easily movable and portable to any Linux operating system. That's all for the quick summary of what cloud computing is, what AWS does, and what a Docker container is; a full explanation of those concepts is beyond the scope of this article.

So, we are going to work with Amazon Elastic Beanstalk, the AWS service we will use to run Docker applications. It is an easy-to-use service for deploying and scaling web applications and services. We are going to take things step by step, as we may have to refer back to a previous step for some explanation. Let's get into the thick of things.

Step 1

Visit the official website of Amazon Elastic Beanstalk. First, go to the Amazon Web Services website and ensure that you are logged in, then navigate to the Beanstalk section from the list of services. If you have difficulty finding that section, you can get to it quickly by visiting this link.

Step 2

This loads the Beanstalk section, where you will be able to create a new application. However, before we do that, ensure that Beanstalk is indicating the right geographical region, which you can find at the top right-hand corner of the webpage.

Once you have confirmed your region, you can click on "Create New Application", which is directly below the part of the page where you changed the region.

Step 3

A new webpage then loads, where you input details before creating a new application. You should see a form with two sections:

  • Application name
  • Description

Let's give our application the name "ca-web-server". You can give it any name you want; however, you have to keep it consistent throughout this article. It is advisable to simply follow along now and do things your own way thereafter.

The description can be left empty, as it is optional. That is exactly what we are going to do; we will leave it empty.

Then you click on "Next".

Step 4

A new page comes up, and you can see your application's name at the top left-hand corner of the webpage.

On this page, we have to set up the environment type.

We have a form with three sections:

  • Environment Tier
  • Predefined Configuration
  • Environment Type

We simply want the application to be a web server, so we click on "Environment tier" and, in the drop-down menu, select "Web Server".

We then click on "Predefined Configuration" and, in the drop-down menu, select "Docker".

We click on "Environment type" and, in the drop-down menu, select "Single Instance".

Then you click “Next”.

Step 5

You are then directed to the Application Version page. Select the "Upload Your Own" option if you already have a Dockerfile. Once it has uploaded, click on "Next".

Step 6

Then we get a webpage showing environment information.

Here, both the "Environment name" and the "Environment URL" are prefilled. Then you click on "Check Availability". This checks the availability of the chosen URL, that is, the name chosen earlier merged with elasticbeanstalk.com.

If the "Environment URL" turns green, we are ready to proceed.

Then you click “Next”.

Step 7

You then get a page asking you to select "Additional Resources". We do not need these, so we can skip this step. However, over time you will get to know how useful the additional resources are and will be able to pick according to your requirements.

So, click next.

Step 8

A configuration page comes up. You can leave the "Instance type" at the default selection, which should be "t1.micro". This provisions the application as an EC2 instance.

Then you can select the "EC2 key pair" from the drop-down selection, choosing an available key pair associated with your Amazon Web Services account.

You can then type your email in the email address section, or leave it empty if you wish. Amazon will send information about important events associated with the environment to this email address.

The instance profile should be left at its default selection.

Then you click “Next”.

Step 9

This comes up with a section called "Environment Tags", which lets you attach metadata to the environment.

Each tag has a key and a value, and both can contain any characters; tags are mainly useful for identifying and grouping your AWS resources.

However, that is not needed right now, so you click “Next”.

Step 10

This brings up a review of the service information and the configuration settings. It is time to get the environment out there, so you click "Launch".

The environment launches, and a window comes up showing the steps being taken as the container is processed. This will take some time.

That’s it, your Docker container has been deployed to the AWS cloud.

Step 11

Return to the dashboard where you can get access to all of Amazon Web Services. Then click on “EC2”.

Click on "Instances"; here you will see the instance with a "running" status if everything went well. Tick the instance, and further information about it will be displayed.

You will see a field called "Public DNS"; copy the information there, as we will need it to access the instance from the terminal.

You can then access the instance using:

ssh -i <path-ssh-key> docker@<ssh-host>

Where:

<path-ssh-key> is the path to the private key file of the key pair we chose earlier, for example "mykey.pem".

docker is the user name used to log in to the EC2 instance.

<ssh-host> is the Public DNS value copied earlier.

Hit the Enter key, type "y" for yes, hit Enter once again, and we are in.

Conclusion

These steps will take your custom-built Docker container and have it launched and running on AWS using the Elastic Beanstalk service. Docker and AWS have come together to make it easier than ever to deploy a Docker container on Amazon's EC2 infrastructure.
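If you would rather script these steps than click through the console, the same application and environment can be created from Python with the boto3 library. The sketch below is only a hedged outline; it assumes boto3 is installed, your AWS credentials are configured, and a Docker solution stack is available in your region, and the application and environment names are placeholders:

import boto3

# Hypothetical names; adjust them to match your own application.
APP_NAME = "ca-web-server"
ENV_NAME = "ca-web-server-env"

eb = boto3.client("elasticbeanstalk")

# Create the Elastic Beanstalk application.
eb.create_application(ApplicationName=APP_NAME, Description="Docker demo")

# Solution stack names vary by region and over time,
# so list what is available and pick one that mentions Docker.
stacks = eb.list_available_solution_stacks()["SolutionStacks"]
docker_stack = next(s for s in stacks if "Docker" in s)

# Create an environment running that stack.
eb.create_environment(
    ApplicationName=APP_NAME,
    EnvironmentName=ENV_NAME,
    SolutionStackName=docker_stack,
)

You can then watch the environment come up in the Elastic Beanstalk console, just as in the walkthrough above.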

]]>
Syslog Tutorial https://linuxhint.com/syslog-tutorial/ Fri, 24 Nov 2017 20:43:34 +0000 https://linuxhint-com.zk153f8d-liquidwebsites.com/?p=20316 The main reason for networking is communication. On a network, crucial messages have to be passed between devices to keep track of events as they occur. As a system administrator or DevOps engineer, keeping track of the activities going on across a network is vital and very useful for solving problems whenever they surface.

Logging is often considered time consuming or stressful, but in the end the effort is usually worth it. With syslog, much of that stress is removed, as you can automate the logging process. All you have to do is go over the logs whenever a problem comes up and tackle it as the logs indicate.

Syslog is a well-known standard for message logging. Often, the system that stores the logs and the software that generates them interfere with each other. Syslog separates the software that generates the logs from the system that stores them, making logging less complicated and stressful.

In other words, syslog is an open system, designed to help monitor network devices or systems and send events to a logging server. It ensures that messages are distinguished based on the priority of the messages and the sort of network device that is sending the message.

Apart from helping with the generating and storing of logs, it can also be used for security auditing as well as general analysis and debugging of system messages.

The syslog standard is used across different network devices such as routers, switches, load balancers, and intrusion protection systems, typically using the User Datagram Protocol (UDP) on port 514 to communicate messages to the logging servers.
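As a quick illustration of that transport, Python's standard logging module can send a message to a syslog server over UDP port 514; this minimal sketch assumes a syslog daemon is listening on localhost, so adjust the address to point at your own logging server:

import logging
import logging.handlers

# Forward log records to a syslog server over UDP port 514.
logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.SysLogHandler(address=("localhost", 514)))

logger.info("Test message from Python")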

A syslog message follows either the legacy-syslog or BSD-syslog protocol and takes the following format:

  • PRI message section
  • HEADER message section
  • MESSAGE section

A syslog message cannot ever go past 1024 bytes.


PRI message section

PRI is the Priority Value part of the syslog message. Recall that I mentioned earlier that syslog sends log messages according to their priority level and the type of network device or facility; this is where all that information is carried. This part represents the facility and severity of the syslog message.

The priority value is obtained by multiplying the facility number (the part of the system sending the message) by 8 and then adding the numerical value of the severity (the level of importance of the message according to the system).

Priority value = (Facility number * 8) + Severity
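As a small worked example, a message from facility 16 (local0) with severity 5 (notice) gets a priority value of 16 * 8 + 5 = 133, which is exactly the <133> in the sample message shown later in this article. A few lines of Python sketch the calculation and its reverse:

def priority(facility, severity):
    # PRI = facility * 8 + severity
    return facility * 8 + severity

def decode(pri):
    # Reverse the calculation: the facility is the quotient, the severity the remainder.
    return divmod(pri, 8)

print(priority(16, 5))  # 133 (local0, notice)
print(decode(133))      # (16, 5)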

HEADER message section

While the PRI part was more about the system, the header part is more about the information that comes with the syslog event.

It contains the message timestamp and the hostname or IP address of the system. The format of the timestamp field is:

MM dd hh:mm:ss

Where:

MM is an abbreviation of the month in which the message was sent. This means the month comes in the form Jan, Feb, Mar, Apr, etc.

dd is the day of the month in which the message was sent. When the day is a single digit, the value is padded with a space rather than a 0; this means " 7" is used to represent the 7th instead of "07".

hh is the hour of the day when the message was sent, using the 24-hour format, with values from 00 to 23 inclusive.

mm is the minute of the hour when the message was sent, with values from 00 to 59 inclusive.

ss is the second of the minute when the message was sent, with values from 00 to 59 inclusive.

An example of the above is:

Mar  8 22:30:15
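If you need that timestamp as an actual datetime object in Python, strptime can read this format; note that the BSD-syslog timestamp carries no year, so this small sketch simply substitutes the current one:

from datetime import datetime

raw = "Mar  8 22:30:15"
# The single-digit day is space-padded, which strptime accepts.
stamp = datetime.strptime(raw, "%b %d %H:%M:%S").replace(year=datetime.now().year)
print(stamp)  # e.g. 2018-03-08 22:30:15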


MESSAGE section

This is most often where all the needed information lies. It contains the name of the program or process that generated the message and the text of the message itself.

The message part is usually in the format: program[pid]: message_text.

Example:

The following is a sample syslog message: <133>Feb 25 14:09:07 webserver syslogd: restart. The message corresponds to the following format: <priority>timestamp hostname application: message.

In the end, after the message has been generated, parsing it is a different ball game. You can parse syslog data with a programming language such as Python, using regular expressions, an XML parser, or JSON. A log parser like syslog-ng works perfectly with Python; it allows you to write your own parser in Python, giving you much more control over what gets parsed.
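As a concrete sketch of the regular-expression approach, the sample message shown above can be split into its parts with a single pattern; this is a simplified pattern for the BSD-syslog layout used in this article, not a complete RFC-grade parser:

import re

sample = "<133>Feb 25 14:09:07 webserver syslogd: restart"

# <PRI>timestamp hostname program[pid]: message (the pid part is optional).
pattern = re.compile(
    r"<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<hostname>\S+) "
    r"(?P<program>[^:\[]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)"
)

match = pattern.match(sample)
if match:
    print(match.groupdict())
    # {'pri': '133', 'timestamp': 'Feb 25 14:09:07', 'hostname': 'webserver',
    #  'program': 'syslogd', 'pid': None, 'message': 'restart'}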

Python is very popular for scraping data, so you can easily find modules for extracting the needed data from syslog, which makes it easier to process messages, query databases, etc. If you intend to use syslog-ng, you can include your Python code in the syslog-ng OSE configuration file.

However, you should ensure that the PYTHONPATH environment variable includes the path to the Python file, and then export the PYTHONPATH environment variable.

For example:

export PYTHONPATH=/opt/syslog-ng/etc

The Python object is instantiated only once, when syslog-ng OSE is started or reloaded. That means it keeps the state of its internal variables while syslog-ng OSE is running. Python parsers consist of two parts: the first is a syslog-ng OSE parser object that you use in your syslog-ng OSE configuration, for example, in the log path.

This parser references a Python class, which is the second part of the Python parsers. The Python class processes the log messages it receives, and can do virtually anything that you can code in Python.

parser <name_of_the_python_parser>{
  python(
    class("<name_of_the_python_class_executed_by_the_parser>")
  );
};

python {
import re
class MyParser(object):
    def init(self, options):
        '''Optional. This method is executed when syslog-ng is started or reloaded.'''
        return True
    def deinit(self):
        '''Optional. This method is executed when syslog-ng is stopped or reloaded.'''
        return True
    def parse(self, msg):
        '''Required. This method receives and processes the log message.'''
        return True
};

When you finally get to parse your syslog file, you can then act on the issues that have been causing problems.

Most times, the logs will point you to the directories where the problem lies, so you can easily navigate to them using the "cd" command.

With syslog, you are able to save more time and improve efficiency.

]]>