Saturday, August 21, 2021

Face Parsing Using PyTorch

So far, I have mainly dealt with human body keypoint detection and face detection, and I have also covered object segmentation. Today I will look at how to do face parsing, a face segmentation task that can be viewed as one kind of object segmentation.

<body keypoint detection, face detection, object segmentation>


Face parsing

Face parsing labels the face and the area around it by dividing it into 19 parts.

<face parsing>

In this model, the detected regions are classified into the following 19 classes.

atts = ['background', 'skin', 'l_brow', 'r_brow', 'l_eye', 'r_eye', 'eye_g', 'l_ear', 'r_ear', 'ear_r',
        'nose', 'mouth', 'u_lip', 'l_lip', 'neck', 'neck_l', 'cloth', 'hair', 'hat']
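
The position of each name in this list is the class index that the model outputs, so the list can be used to translate an index back into a name. For example:

print(atts[17])             # 'hair'
print(atts.index('skin'))   # 1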

I will clone zllrunning's GitHub repository and use it.

zllrunning's GitHub code uses PyTorch. The original code is written to use CUDA, but I modified it so that it can also be tested on a Raspberry Pi, which does not support CUDA, or on a PC without an NVIDIA GPU.

If you prefer TensorFlow over PyTorch, see MaybeShewill-CV/bisenetv2-tensorflow. That repository uses the same BiSeNet model as zllrunning's GitHub.

Face parsing with the BiSeNet model is quite lightweight, so it runs without problems even on devices without CUDA: non-Jetson boards such as the Raspberry Pi and PCs without a GPU can also be used for testing. However, install OpenCV, PyTorch, and torchvision according to the device you are using.


Prerequisites

Face parsing uses PyTorch and OpenCV, so you must install those packages first. If you use a Jetson-series board, you can use the preinstalled OpenCV and skip the OpenCV installation. NVIDIA provides a PyTorch package optimized for the Jetson series; download and install it from the NVIDIA site, not from the PyTorch website.


You also need to install Pillow for image processing.

pip3 install pillow 
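
If you are using an ordinary PC without an NVIDIA GPU, the remaining packages can usually be installed from PyPI as well. The commands below are only a rough sketch; the exact package names and versions that work depend on your platform and Python version.

pip3 install opencv-python
pip3 install torch torchvision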


Installation

Once OpenCV, PyTorch, and torchvision are installed, download the source code from GitHub.

cd /usr/local/src
git clone https://github.com/zllrunning/face-parsing.PyTorch.git

Then download the PyTorch model. You can download it from https://drive.google.com/open?id=154JgKpzCPW82qINcVieuPH3fZ2e0P812; this URL is introduced in the Train section of the https://github.com/zllrunning/face-parsing.PyTorch page. Copy the downloaded model to the face-parsing.PyTorch/res/cp directory. There is no problem if you save it to a different path.

cd /usr/local/src/face-parsing.PyTorch
mkdir -p res/cp
cp "79999_iter.pth " /usr/local/src/face-parsing.PyTorch/res/cp

This model was trained using the CelebAMask-HQ dataset. CelebAMask-HQ is a 30,000-image dataset in which 512 x 512 face images, including accessories (glasses, hats, clothes), were manually annotated into 19 segment classes. The CelebAMask-HQ dataset can be downloaded from http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html.


Run a sample program 

Let's run a sample program to test whether face parsing is working properly. As I said earlier, I have modified the source code so that it can run with PyTorch in CPU mode without CUDA.

#!/usr/bin/python
# -*- encoding: utf-8 -*-

import argparse
from model import BiSeNet
import torch
import os
import os.path as osp
import numpy as np
from PIL import Image
import torchvision.transforms as transforms
import cv2

def vis_parsing_maps(im, parsing_anno, stride, save_im=False, save_path='vis_results/parsing_map_on_im.jpg'):
    # Colors for all 20 parts
    part_colors = [[255, 0, 0], [255, 85, 0], [255, 170, 0],
                   [255, 0, 85], [255, 0, 170],
                   [0, 255, 0], [85, 255, 0], [170, 255, 0],
                   [0, 255, 85], [0, 255, 170],
                   [0, 0, 255], [85, 0, 255], [170, 0, 255],
                   [0, 85, 255], [0, 170, 255],
                   [255, 255, 0], [255, 255, 85], [255, 255, 170],
                   [255, 0, 255], [255, 85, 255], [255, 170, 255],
                   [0, 255, 255], [85, 255, 255], [170, 255, 255]]

    im = np.array(im)
    vis_im = im.copy().astype(np.uint8)
    vis_parsing_anno = parsing_anno.copy().astype(np.uint8)
    vis_parsing_anno = cv2.resize(vis_parsing_anno, None, fx=stride, fy=stride, interpolation=cv2.INTER_NEAREST)
    vis_parsing_anno_color = np.zeros((vis_parsing_anno.shape[0], vis_parsing_anno.shape[1], 3)) + 255

    num_of_class = np.max(vis_parsing_anno)

    for pi in range(1, num_of_class + 1):
        index = np.where(vis_parsing_anno == pi)
        vis_parsing_anno_color[index[0], index[1], :] = part_colors[pi]

    vis_parsing_anno_color = vis_parsing_anno_color.astype(np.uint8)
    print(vis_parsing_anno_color.shape, vis_im.shape)
    vis_im = cv2.addWeighted(cv2.cvtColor(vis_im, cv2.COLOR_RGB2BGR), 0.4, vis_parsing_anno_color, 0.6, 0)

    # Save result or not
    if save_im:
        cv2.imwrite(save_path[:-4] +'.png', vis_parsing_anno)
        cv2.imwrite(save_path[:-4] +'.jpg', vis_im, [int(cv2.IMWRITE_JPEG_QUALITY), 100])

    # return vis_im

def evaluate(respth='./res/test_res', dspth='./data', cp='model_final_diss.pth'):

    if not os.path.exists(respth):
        os.makedirs(respth)

    n_classes = 19
    net = BiSeNet(n_classes=n_classes)
    save_pth = osp.join('./res/cp', cp)

    if CUDA_SUPPORT:
        net.cuda()
        net.load_state_dict(torch.load(save_pth))
    else:
        device = torch.device('cpu')
        net.load_state_dict(torch.load(save_pth, map_location=device))

    net.eval()

    to_tensor = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ])
    with torch.no_grad():
        for image_path in os.listdir(dspth):
            img = Image.open(osp.join(dspth, image_path))
            image = img.resize((512, 512), Image.BILINEAR)
            img = to_tensor(image)
            img = torch.unsqueeze(img, 0)
            if CUDA_SUPPORT:
                img = img.cuda()

            out = net(img)[0]
            parsing = out.squeeze(0).cpu().numpy().argmax(0)
            # print(parsing)
            # print(np.unique(parsing))
            vis_parsing_maps(image, parsing, stride=1, save_im=True, save_path=osp.join(respth, image_path))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Face Parsing')
    parser.add_argument("--image", default="./data", help="test working directory where the image files exist")
    parser.add_argument("--model", default="79999_iter.pth", help="face parsing model")
    args = parser.parse_args()    

    CUDA_SUPPORT = torch.cuda.is_available()
    evaluate(dspth= args.image, cp=args.model)


Run the code.

/usr/local/src/face-parsing.PyTorch# python3 test.py --image makeup
(512, 512, 3) (512, 512, 3)
(512, 512, 3) (512, 512, 3)
(512, 512, 3) (512, 512, 3)
(512, 512, 3) (512, 512, 3)

If no error occurs, face parsing will be performed on the four images stored in the makeup directory, and the results will be saved in the face-parsing.PyTorch/res/test_res directory like this.



Good result. Now, let's go a little deeper and see how we can utilize the parsed data.


Under the Hood

return value of the model

Let's modify the part that reads the return value of the model a little and take a look at it.

            ret = net(img)[0]
            print('model out shape:', ret.shape )
            out = ret.squeeze(0).cpu().numpy()
            print('squeezed shape:', out.shape )
            parsing = out.argmax(0)
            print('argmax shape:', parsing.shape )

The output of this part is as follows.

model out shape: torch.Size([1, 19, 512, 512])
squeezed shape: (19, 512, 512)
argmax shape: (512, 512)

The squeeze function of a PyTorch tensor removes dimensions of size 1: it converts a tensor of shape [1, 19, 512, 512] into (19, 512, 512). The squeezed output can be understood as in the following figure.


<squeezed output>


The NumPy function argmax(0) plays the most important role. For each of the 512 x 512 pixel positions, it returns the index of the layer (out of the 19 layers) that holds the largest value.
For example, the largest value at pixel [0, 0] is probably in the background layer, because the top-left corner of the sample image corresponds to the background. Therefore, the value at [0, 0] in the returned [512, 512] array will be 0.
If all points are filled in this way, the return value of argmax(0) holds values between 0 and 18. In fact, only 0 to 17 appear, because there is no hat in the sample image.

The argmax function applied to a 3D array can be difficult to understand. The pseudo code below shows what it effectively does; however, only the vectorized argmax call gives fast processing, so the explicit loops are for illustration only.

# conceptually equivalent to parsing = out.argmax(0), where out has shape (19, 512, 512)
parsing = np.zeros((512, 512), dtype=np.uint8)
for x in range(512):
    for y in range(512):
        pixel = out[:, x, y]        # the 19 class scores for this pixel
        val = pixel.argmax(0)       # index of the best-scoring class
        parsing[x, y] = val

<pseudo code for argmax(0) >

Now each element of the parsing array holds a value between 0 and 18, which tells which segment that pixel belongs to. You can do whatever you want with the parsing array.
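
For example, a binary mask for a single segment can be built directly from the parsing array. The snippet below is only a sketch: the index 17 for hair follows the atts list above, and the output file name is arbitrary.

# parsing: the (512, 512) array of class indices returned by argmax(0)
hair_index = 17                                   # atts[17] == 'hair'
hair_mask = (parsing == hair_index).astype(np.uint8) * 255
cv2.imwrite('./res/test_res/hair_mask.png', hair_mask)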

The (channel, height, width) format may be unfamiliar, because OpenCV provides NumPy arrays in (height, width, channel) format. If you open an image file using OpenCV and check its shape attribute, it looks like this.

c = cv2.imread('C:\\lsh\\study\\image\\jangnara1.jpg', cv2.IMREAD_COLOR )
c.shape

(788, 550, 3)
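
For reference, PyTorch expects the (channel, height, width) layout; converting an OpenCV-style array into it is a single permute call. A small sketch using the c array above (this is not part of the sample code):

t = torch.from_numpy(c).permute(2, 0, 1)    # (788, 550, 3) -> (3, 788, 550)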


Although it is not a necessary step, if you want to test this by converting to the OpenCV-style layout we are familiar with, you can do it as follows. In the function below, the model output is converted into cvmat with shape (h=512, w=512, c=19).

Since the channel dimension of size 19 now comes last, the axis parameter of the argmax function changes to 2.

def compare(out):
    print('out format', out.shape)    
    parsing = out.argmax(0)
    length = out.shape[0]
    cvmat = out[0, :,:]
    for i in range(1, length):
        cvmat = np.dstack((cvmat , out[i, :,:]))
    print('cv format', cvmat.shape)    
    parsing2 = cvmat.argmax(2)
    print(np.array_equal(parsing,parsing2))

If you call this function, you can see that parsing and parsing2 are the same array.

out format (19, 512, 512)
cv format (512, 512, 19)
True
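
Stacking the layers one by one with np.dstack works, but the same reordering can be done in a single call with np.transpose; an equivalent sketch:

cvmat = np.transpose(out, (1, 2, 0))    # (19, 512, 512) -> (512, 512, 19)
parsing2 = cvmat.argmax(2)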


If you want to save each of the 18 non-background segments as a separate image, it can be implemented as follows: modify the loop in the vis_parsing_maps function of the example code above like this (note that the atts list from the beginning of this post must be defined in the script).

    #We don't need to paint background, so index starts at 1
    for pi in range(1, num_of_class + 1):
        canvas = np.zeros((vis_parsing_anno.shape[0], vis_parsing_anno.shape[1], 3)) + 255
        index = np.where(vis_parsing_anno == pi)
        vis_parsing_anno_color[index[0], index[1], :] = part_colors[pi]
        canvas[index[0], index[1], :] = part_colors[pi]
        name = osp.join('./res/test_res', "%d_%s"%(pi, atts[pi]) + '.jpg')
        cv2.imwrite(name, canvas)


Now the following files will be created in the res/test_res directory.

<output images>


drawing contour of segments

This time, instead of coloring the face segments, let's draw the outline of each area. To find the outlines, you can use cv2.findContours. This function does not work on color images with 3 or 4 channels; a single-channel black-and-white image should be used.

I made a new "contour_parsing_maps" function as follows.

def contour_parsing_maps(im, parsing_anno, stride, save_im=False, save_path='vis_results/parsing_map_on_im.jpg'):
    part_colors = [[255, 0, 0], [255, 85, 0], [255, 170, 0],
                   [255, 0, 85], [255, 0, 170],
                   [0, 255, 0], [85, 255, 0], [170, 255, 0],
                   [0, 255, 85], [0, 255, 170],
                   [0, 0, 255], [0, 255, 255], [170, 0, 255],
                   [0, 85, 255], [0, 170, 255],
                   [255, 255, 0], [128, 128, 128], [255, 255, 170],
                   [255, 0, 255], [255, 85, 255], [255, 170, 255],
                   [0, 255, 255], [85, 255, 255], [170, 255, 255]]    
    im = np.array(im)
    vis_im = im.copy().astype(np.uint8)
    #parsing_anno original dtype:int64 so convert to uint8(0 ~ 255)
    vis_parsing_anno = parsing_anno.copy().astype(np.uint8)

    # comparison = vis_parsing_anno == vis_parsing_anno
    # print(comparison.all())

    print(vis_parsing_anno.shape)

    num_of_class = np.max(vis_parsing_anno)

    #We don't need to paint background, so index starts at 1
    for pi in range(1, num_of_class + 1):
        canvas = np.zeros((vis_parsing_anno.shape[0], vis_parsing_anno.shape[1]), dtype=np.uint8)
        index = np.where(vis_parsing_anno == pi)
        canvas[index[0], index[1]] = 255
        name = osp.join('./res/test_res', "%d_%s"%(pi, atts[pi]) + '.jpg')
        cv2.imwrite(name, canvas)
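        # Note: cv2.findContours returns (contours, hierarchy) in OpenCV 4.x;
        # OpenCV 3.x returns three values, so adjust the unpacking if needed.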
        contours, hierarchy = cv2.findContours(image=canvas, mode=cv2.RETR_TREE, method=cv2.CHAIN_APPROX_NONE)
        cv2.drawContours(image=vis_im, contours=contours, contourIdx=-1, color=part_colors[pi], thickness=2, lineType=cv2.LINE_AA)
    cv2.imwrite(save_path[:-4] +'_contour.png', vis_im)

If the contour_parsing_maps function is called instead of the vis_parsing_maps function, the following result is obtained. Some incorrect segmentation results can be seen around the eyes and brows.

<contour drawing image>

Wrapping up

Face detection, which we have dealt with many times, extracts only the face area as a bounding box or keypoints, not the entire head including the hair.

<face detection>

However, sometimes it is necessary to extract the entire head from the image. I am currently making a robot that draws portraits. When this robot receives an image in which the face is too small, it crops the portrait so that only the head is enlarged. In this case, if part of the hair is cut off, the result will not look good.

What I want is a function that automatically crops the image to an appropriate size, including the whole head, as shown in the following picture.

Oh my God, the ears are attached to the arm. It looks like the model isn't perfect yet. However, this nonsensical error can be corrected by adding a simple conditional check.

<original input image and face parcing, cropped image for portrait drawing>

The advantage of face parsing is that it finds not only the hair but also hats, so it is very useful for cropping portrait images.
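
As a rough illustration of the idea (a sketch under my own assumptions, not the robot's actual code), a crop box can be derived from the head-related class indices in the parsing array; the class list and margin below are choices you would tune.

# head-related classes taken from the atts list: skin, brows, eyes, glasses,
# ears, nose, mouth, lips, hair, hat
head_classes = [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 17, 18]
mask = np.isin(parsing, head_classes)
ys, xs = np.where(mask)
if ys.size > 0:
    margin = 20                                    # arbitrary padding in pixels
    top    = max(ys.min() - margin, 0)
    bottom = min(ys.max() + margin, mask.shape[0])
    left   = max(xs.min() - margin, 0)
    right  = min(xs.max() + margin, mask.shape[1])
    head_crop = vis_im[top:bottom, left:right]     # vis_im: the 512 x 512 image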

Perhaps your reasons for wanting face parsing are different from mine. Anyway, I hope this article is helpful for those who want to use face parsing.

The input size of this model is 512 x 512, so it is best to keep the input image as close to square as possible. In the example, the output image is shown at 512 x 512 as it is; if the original image is not square, the result needs to be restored to the original size. The GitHub code of zllrunning omits this process, and I also haven't finished this part yet. Please keep that in mind.
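
If you do want to map the result back onto an image of the original size, one straightforward approach (a sketch, not something in the original code) is to resize the parsing map with nearest-neighbor interpolation so that the class indices are not blended:

# img_orig: the original image loaded with cv2.imread
# parsing:  the (512, 512) class-index map produced above
h, w = img_orig.shape[:2]
parsing_full = cv2.resize(parsing.astype(np.uint8), (w, h),
                          interpolation=cv2.INTER_NEAREST)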

You can download the code from my GitHub.

