Different from traditional convolutional neural networks (CNNs), this model has intra-layer recurrent connections in its convolutional layers, so that each convolutional layer becomes a two-dimensional recurrent neural network (RNN). The units receive constant feed-forward inputs from the previous layer and recurrent inputs from their neighborhoods. As the recurrent iterations proceed, the region of context captured by each unit expands, and feature extraction and context modulation are seamlessly integrated, unlike typical methods that entail separate modules for the two steps. A multi-scale recurrent convolutional neural network (RCNN), originally proposed for object recognition, is adapted here for segmentation. Scene labeling (or scene parsing) is an important step towards high-level image interpretation: it aims at fully parsing the input image by labeling the semantic category of each pixel. Compared with image classification, scene labeling is more challenging because it solves segmentation and recognition simultaneously. The model is evaluated over two benchmark datasets. The Stanford Background dataset has 715 images of rural and urban scenes composed of 8 classes, each approximately 320 × 240 pixels; a 5-fold cross-validation is performed, with the dataset randomly split into 572 training images and 143 test images in each fold. The SIFT Flow dataset is larger, composed of 2688 images of 256 × 256 pixels with 33 semantic labels. On these benchmarks the model outperforms many state-of-the-art models in accuracy and efficiency.
Key Words – Convolutional Neural Networks, Recurrent Neural Networks, Recurrent Convolutional Neural Network, Segmentation, Multi-scale RCNN, SIFT Flow Dataset
1. INTRODUCTION
Scene parsing has drawn increasing research interest due to its wide applications [3] in many attractive areas such as autonomous vehicles, robot navigation and virtual reality [1,4]. It remains a challenging problem since it requires solving segmentation, classification and detection simultaneously. An RNN gains a strong discriminative capability [7]: it is superior because its architecture explicitly incorporates context information into the training process of multiple hidden layers and learns concepts of different abstractness. The RNN effectively fuses the output features across different time steps for classification [5] or for a more concrete parsing purpose. To verify the effectiveness of the RNN, extensive experiments have been conducted over popular and challenging scene parsing datasets, including SIFT Flow; the RNN is capable of greatly enhancing the discriminative power of per-pixel feature representations. We address the scene parsing problem and propose a novel recurrent neural network (RNN) for parsing scene images [2]. The RNN enhances the capability of modeling long-range context information at multiple levels and better distinguishes pixels that are easy to confuse. Recurrent neural networks have been employed to model long-range context in images; for instance, [16] built a recurrent connection from the output to the input layer and introduced layer-wise self-recurrent connections.
Compared with those methods, the proposed RNN models the context by allowing multiple forms of recurrent connections. Recurrent neural networks are suitable for these tasks because long-range context information can be captured by a fixed number of recurrent weights. Treating scene labeling as a two-dimensional variant of sequence learning, RNNs can also be applied, but such studies are relatively scarce.
This type of RNN has been proposed before, but there it was used for object recognition [5]; it is unknown whether it is useful for scene labeling, a more challenging task. This motivates the present work. Multiscale recurrent neural networks have been considered a promising approach to this issue, yet there has been a lack of empirical evidence that such models can actually capture temporal dependencies by discovering the latent hierarchical structure of the sequence [7]. One multiscale approach, the hierarchical multiscale recurrent neural network, captures the latent hierarchical structure in a sequence by encoding temporal dependencies with different timescales using a novel update mechanism; this multiscale RNN can learn the hierarchical multiscale structure from temporal data without explicit boundary information.

This model, called the hierarchical multiscale recurrent neural network (HM-RNN) [16], does not assign fixed update rates but adaptively determines proper update times corresponding to the different abstraction levels of its layers. It tends to learn fine timescales for low-level layers and coarse timescales for high-level layers. To this end, a binary boundary detector is introduced at each layer; the boundary detector is turned on only at the time steps where a segment of the corresponding abstraction level has been completely processed.
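As a rough illustration of this update mechanism, the following Python sketch gates a layer's state update with a hard binary boundary signal. All module names are hypothetical, and the sketch omits much of the actual HM-RNN of [16], which defines dedicated UPDATE, COPY and FLUSH operations and trains the binary detector with a straight-through estimator.

    import torch
    import torch.nn as nn

    class BoundaryGatedCell(nn.Module):
        # One layer of a hierarchical RNN whose state is rewritten only when a
        # learned binary boundary detector fires; otherwise the state is copied.
        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.cell = nn.GRUCell(input_size, hidden_size)
            self.boundary = nn.Linear(hidden_size, 1)   # scalar boundary score

        def forward(self, u_t, h_prev):
            h_new = self.cell(u_t, h_prev)
            # hard binary gate; in practice trained with a straight-through estimator
            z = (torch.sigmoid(self.boundary(h_prev)) > 0.5).float()
            return z * h_new + (1.0 - z) * h_prev       # update only at boundaries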
2. RELATED WORKS
The scene parsing problem has been approached with a wide variety of methods in recent years. Many methods rely on MRFs, CRFs, or other types of graphical models to ensure the consistency of the labeling and to account for context [19], [15], [26]. Most methods rely on a pre-segmentation into superpixels or other segment candidates, and extract features and categories from individual segments and from various combinations of neighboring segments. The graphical-model inference pulls out the most consistent set of segments that covers the image. One line of work proposed to aggregate segments in a greedy fashion using a trained scoring function; the originality of the approach is that the feature vector of the combination of two segments is computed from the feature vectors of the individual segments through a trainable function. Although deep learning methods are used to train the feature extractor, it operates on hand-engineered features. One of the main questions in scene parsing is how to take a wide context into account when making a local decision. [32] proposed to use the histogram of labels extracted at a coarse scale as input to the labeler that looks at finer scales. Our approach is somewhat simpler: the feature extractor is applied densely to an image pyramid, and the coarse feature maps thereby generated are upsampled to match the resolution of the finest scale.
With three scales, each feature vector has multiple fields that encode multiple regions of increasing size and decreasing resolution, centered on the same pixel location. The first end-to-end neural network model for scene labeling is the deep CNN proposed in [11], trained by a supervised greedy learning strategy. In another end-to-end model, top-down recurrent connections are incorporated into a CNN to capture context information. In the first recurrent iteration the CNN receives a raw patch and outputs a predicted label map, downsampled due to pooling; in subsequent iterations the CNN receives both a downsampled patch and the label map predicted in the previous iteration, and outputs a new predicted label map. This approach is simple and elegant, but its performance is not the best on some benchmark datasets. Note that both models in [14] and [7] are called RCNN; for convenience, in what follows, RCNN refers to the model with intra-layer recurrent connections unless otherwise specified.
As noted above, RNNs [19] have been employed to model long-range context in images; the proposed RNN instead allows multiple forms of recurrent connections and, in addition, combines the output features at multiple time steps for pixel classification. [21] utilized a parallel multi-dimensional long short-term memory for fast volumetric segmentation, though its performance was relatively inferior. Based on similar motivations, [9] used RNNs to refine the features learned by a CNN by modeling contextual dependencies along multiple spatial directions; the RNN thereby incorporates context information into the feature learning process of the CNN.
3. PROBLEM AND DATASET DESCRIPTION
3.1 Problem Description
More formally, this report addresses the problem of obtaining the maximum likelihood of a patch from an image belonging to a certain class [22]. This is defined statistically as

class∗ = argmax_class P(observation | class)

where observation refers to the pixels that belong to a specific patch and class refers to the single label given to the whole patch. This is done for each patch in a given input image and results in an approximate separation of the various segments in the image [30]. This differs from approaches that also take into account neighboring patches or other patches in the same image. We also avoid using a prior of the form P(class) in the final classification.

This completely isolates the likelihood function so that it can be optimized directly. Given an optimal likelihood function, a full method built on it in subsequent work is more likely to obtain better classification performance [34].
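For instance, a minimal sketch of this decision rule in Python (the likelihood values are hypothetical placeholders):

    import numpy as np

    # log P(observation | class) for one patch, one entry per candidate class
    log_likelihoods = np.array([-12.4, -9.1, -15.0])
    predicted_class = int(np.argmax(log_likelihoods))  # the class maximizing the likelihood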
3.2 DATASET USED
Experiments are performed over two benchmark datasets for scene labeling: SIFT Flow [16] and Stanford Background [28]. The SIFT Flow dataset contains 2688 color images, all of size 256 × 256 pixels. Among them, 2488 images are training data and the remaining 200 images are testing data. There are 33 semantic categories, and the class frequency is highly unbalanced. The Stanford Background dataset contains 715 images of rural and urban scenes composed of 8 classes, each approximately 320 × 240 pixels; a 5-fold cross-validation is performed, with the dataset randomly split into 572 training images and 143 test images in each fold. For the SIFT Flow dataset, the hyper-parameters are determined on a separate validation set.
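The cross-validation protocol for the Stanford Background dataset can be sketched as follows (the fold construction is illustrative and may differ from the exact random splits used in [28]):

    import numpy as np

    rng = np.random.default_rng(0)
    order = rng.permutation(715)          # 715 Stanford Background images
    folds = np.array_split(order, 5)      # five folds of 143 images each
    for k in range(5):
        test_idx = folds[k]               # 143 test images
        train_idx = np.concatenate([folds[i] for i in range(5) if i != k])  # 572 training images
        # ... train on train_idx, evaluate on test_idx ...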
4. METHODOLOGY
The proposed RCNN [31] was first tested on several benchmark object recognition datasets, where, with fewer parameters, it achieved better results than state-of-the-art CNNs [25] over all of these datasets; this validates the advantage of RCNN over CNN [14]. The recurrent neural network (RNN) has a long history in the artificial neural network community, but most successful applications involve the modeling of sequential data [11]. A hierarchical RNN called the Neural Abstraction Pyramid (NAP) has been proposed for image processing; NAP is a biology-inspired architecture with both vertical and lateral recurrent connectivity, through which the image interpretation is gradually refined to resolve visual ambiguities.
4.1 RCNN
The key module of the RCNN is the recurrent convolutional layer (RCL). A generic RNN [20,35] with feed-forward input u(t), internal state x(t) and parameters θ can be described by

x(t) = F(u(t), x(t − 1), θ)     (1)

where F is the function describing the dynamic behavior of the RNN. The RCL introduces recurrent connections into a convolutional layer, so it can be regarded as a special two-dimensional RNN in which the feed-forward and recurrent computations both take the form of convolution:

x_ijk(t) = σ((w_k^f)^⊤ u^(i,j)(t) + (w_k^r)^⊤ x^(i,j)(t − 1) + b_k)     (2)

where u^(i,j)(t) and x^(i,j)(t − 1) are vectorized square patches centered at (i, j) of the feature maps of the previous layer and the current layer, w_k^f and w_k^r are the feed-forward and recurrent weights for the kth feature map, and b_k is the kth element of the bias. The function σ is the composition of two functions, σ(z_ijk) = h(g(z_ijk)), where g is the widely used rectified linear function g(z_ijk) = max(z_ijk, 0) and h is local response normalization (LRN):
h(g_ijk) = g_ijk / (1 + (α/L) ∑_{k′=max(0, k−L/2)}^{min(K, k+L/2)} (g_ijk′)²)^β     (3)
where K is the number of feature maps, and α and β are constants controlling the amplitude of normalization. The LRN forces units in the same location to compete for high activities, mimicking lateral inhibition in the cortex. In our experiments, LRN is found to consistently improve the accuracy, though slightly. Following [11], α and β are set to 0.001 and 0.75, respectively, and L is set to K/8 + 1. During the training or testing phase, an RCL is unfolded for T time steps into a multi-layer subnetwork, where T is a predetermined hyper-parameter; we use T = 3. A minimal implementation sketch is given below.
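The following PyTorch sketch implements an RCL as described by Eqs. (2) and (3). The module name and padding choice are ours, and the LRN hyper-parameters follow the values above; this is a sketch under those assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RCL(nn.Module):
        # Recurrent convolutional layer: feed-forward and recurrent computations
        # both take the form of convolution, followed by ReLU and LRN (Eqs. 2-3).
        def __init__(self, in_channels, out_channels, T=3):
            super().__init__()
            self.T = T
            self.conv_ff = nn.Conv2d(in_channels, out_channels, 3, padding=1)
            self.conv_rec = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
            # LRN with alpha = 0.001, beta = 0.75 and window L = K/8 + 1 as above
            self.lrn = nn.LocalResponseNorm(out_channels // 8 + 1, alpha=0.001, beta=0.75)

        def forward(self, u):
            ff = self.conv_ff(u)              # feed-forward input, computed once
            x = self.lrn(F.relu(ff))          # state at t = 0
            for _ in range(self.T):           # unfold for T recurrent iterations
                x = self.lrn(F.relu(ff + self.conv_rec(x)))
            return x

Note that weight sharing across iterations is automatic here: the same conv_rec is applied at every time step, so the depth grows with T while the parameter count stays constant.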
The receptive field (RF) of each unit expands with larger T, so that more context information is captured; the depth of the subnetwork also increases with larger T, while the number of parameters is kept constant due to weight sharing. Let u0 denote the static input. The input to the RCL, denoted by u(t), can take this constant u0 for all t, but here a more general form is adopted:

u(t) = γ^t u0     (4)

where γ is a discount factor. When γ = 0, the input is fed to the RCL only at the first time step and there is only one path from input to output; when γ > 0, the network is a typical RNN [25] with multiple paths from input to output.
An RCNN is composed of a stack of RCLs, with only feed-forward connections between neighboring RCLs; max pooling layers are optionally interleaved between RCLs. The total number of recurrent iterations is set to T for all N RCLs. There are two approaches to unfold an RCNN [15]. In the first, the RCLs are unfolded one by one: each RCL is unfolded for T time steps before feeding the next RCL. This approach multiplicatively increases the depth of the network; the largest depth is proportional to NT. In the second approach, at each time step the states of all RCLs are updated successively, so the unfolded network has a two-dimensional structure whose x axis is the time step and whose y axis is the layer level. This approach additively increases the depth; the largest depth is proportional to N + T. We adopt the first unfolding approach for two reasons. First, it leads to a larger effective RF and depth, which are important for the performance of the model. Second, the second approach is more computationally intensive because the feed-forward inputs need to be updated at each time step, whereas in the first approach the feed-forward input of each RCL is computed only once. A sketch of such a stack is shown below.
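Reusing the RCL module above, the first unfolding approach can be sketched as follows. The layer sizes match the architecture described in Section 5 (32, 64 and 128 feature maps; 7 × 7 and 3 × 3 filters), while the use of same-padding convolutions is a simplification, since valid convolutions are used for patch-wise training.

    class RCNN(nn.Module):
        # Stack of RCLs unfolded one by one: each RCL is fully unfolded for T
        # time steps before feeding the next, so each feed-forward input is
        # computed only once (the first unfolding approach).
        def __init__(self, T=3):
            super().__init__()
            self.conv1 = nn.Conv2d(3, 32, 7, padding=3)   # 7 x 7 filters, 32 maps
            self.pool = nn.MaxPool2d(2)                   # 2 x 2 non-overlapping pooling
            self.rcl1 = RCL(32, 64, T=T)
            self.rcl2 = RCL(64, 128, T=T)

        def forward(self, img):
            x = self.pool(F.relu(self.conv1(img)))
            x = self.pool(self.rcl1(x))
            return self.rcl2(x)                           # 128 maps at 1/4 resolution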
4.2 Multi-scale RCNN
In natural scenes, objects appear at various sizes; to capture this variability, the model should be scale invariant [24]. A multi-scale CNN [3] was proposed to extract features for scene labeling, in which several CNNs [18] with shared weights process images of different scales. This approach is adopted here to construct the multi-scale RCNN [21]. The original image corresponds to the finest scale, and images of coarser scales are obtained simply by max pooling the original image. The outputs of all RCNNs [12] are concatenated to form the final representation. For pixel p, its probability of falling into the cth semantic category is given by a softmax layer:
y_cp = exp(w_c^⊤ f_p) / ∑_{c′} exp(w_{c′}^⊤ f_p)     (5)
where f_p denotes the concatenated feature vector of pixel p and w_c denotes the weight for the cth category. The loss function is the cross entropy between the predicted probability y_cp and the true hard label ŷ_cp:
L = − ∑_p ∑_c ŷ_cp log y_cp     (6)
where ŷ_cp = 1 if pixel p is labeled as c and ŷ_cp = 0 otherwise. The model is trained by backpropagation through time (BPTT), that is, by unfolding all the RCNNs into feed-forward networks and applying the standard backpropagation algorithm [35].
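Putting Eqs. (5) and (6) together with the shared-weight multi-scale design, a sketch built on the RCNN module above follows. F.cross_entropy combines the softmax of Eq. (5) with the loss of Eq. (6); the module name and the bilinear re-alignment of the coarse feature maps are our assumptions.

    class MultiScaleRCNN(nn.Module):
        # One RCNN processes the image at three scales with shared weights; the
        # concatenated per-pixel features feed a softmax classifier (Eq. 5).
        def __init__(self, num_classes=33, T=3):
            super().__init__()
            self.rcnn = RCNN(T=T)                        # shared across all scales
            self.classifier = nn.Conv2d(3 * 128, num_classes, 1)  # the weights w_c

        def forward(self, img):
            feats, target = [], None
            for s in (1, 2, 4):          # neighboring scales differ by a factor of 2
                x = F.max_pool2d(img, s) if s > 1 else img
                f = self.rcnn(x)
                if target is None:
                    target = f.shape[2:]                 # finest-scale feature size
                feats.append(F.interpolate(f, size=target, mode='bilinear',
                                           align_corners=False))
            f_p = torch.cat(feats, dim=1)                # concatenated feature f_p
            return self.classifier(f_p)                  # per-pixel class logits

    # training step: cross entropy of Eq. (6) averaged over all pixels
    # logits = MultiScaleRCNN()(images)        # shape (N, C, H/4, W/4)
    # loss = F.cross_entropy(logits, labels)   # labels: (N, H/4, W/4) hard labels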
Figure 1: Training and testing processes of the multi-scale RCNN for segmentation. Solid lines denote feed-forward connections and dotted lines denote recurrent connections.
4.3 Patch-wise Training and Image-wise Testing
Most neural network models for scene labeling are trained by the patch-wise approach [6]: the training samples are randomly cropped image patches whose labels correspond to the categories of their center pixels. Valid convolutions [15] are used in both the feed-forward and recurrent computation, and the patch is set to a proper size so that the last feature map has exactly the required size. In image-wise training, a whole image is input to the model and the output has exactly the same size as the image; the loss is the average of the losses of all pixels. We conducted experiments with both training methods and found that image-wise training suffered severely from over-fitting, possibly because the pixels in an image have overly strong correlations [31]. Patch-wise training is therefore used in all our experiments; a sampling sketch is given below.
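A sketch of patch-wise sample generation (the function name and patch handling are illustrative):

    import numpy as np

    def sample_patch(image, label_map, patch_size, rng):
        # Randomly crop a patch; its training label is the category of the
        # center pixel, as in the patch-wise approach described above.
        H, W = label_map.shape
        i = rng.integers(0, H - patch_size + 1)
        j = rng.integers(0, W - patch_size + 1)
        patch = image[i:i + patch_size, j:j + patch_size]
        center_label = label_map[i + patch_size // 2, j + patch_size // 2]
        return patch, center_label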
It has been suggested that image-wise and patch-wise training are equally effective and that the former converges faster, but that model was obtained by fine-tuning a model pretrained on ImageNet, so the conclusion may not hold for models trained from scratch. In the testing phase, the patch-wise approach is time consuming because the patches corresponding to all pixels need to be processed, so image-wise testing is used instead. There are two image-wise testing approaches for obtaining dense label maps. The first is the shift-and-stitch approach [27]: if the predicted label map is downsampled by a factor of s, the original image is shifted and processed s² times [33]. Each time, the image is shifted by (x, y) pixels to the right and down, and the outputs for all shifted images are interleaved so that each pixel has a corresponding prediction.
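A sketch of shift-and-stitch for a downsampling factor s, assuming a predict function that maps an (H, W, C) image to an (H//s, W//s) label map (H and W divisible by s):

    import numpy as np

    def shift_and_stitch(image, predict, s):
        # Process the image s**2 times with shifts (y, x) in [0, s) and
        # interleave the outputs so that every pixel gets a prediction.
        H, W = image.shape[:2]
        out = np.zeros((H, W), dtype=np.int64)
        for y in range(s):
            for x in range(s):
                # offset the stride-s sampling grid by (y, x); np.roll wraps at
                # the borders, where a real implementation would pad instead
                shifted = np.roll(image, shift=(-y, -x), axis=(0, 1))
                out[y::s, x::s] = predict(shifted)   # interleave the outputs
        return out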
The shift-and-stitch approach needs to process the image s² times, although it produces exactly the same predictions as patch-wise testing [10]. The second approach inputs the entire image to the network, obtains a downsampled label map, and then simply upsamples the map to the resolution of the input image using bilinear or other interpolation methods. This approach may suffer some loss of accuracy but is very efficient. The deconvolutional layer, the backpropagation counterpart of the convolutional layer, is adopted for upsampling, with the deconvolutional weights set to simulate bilinear interpolation. Both image-wise testing methods are used in our experiments; a sketch of the bilinear deconvolution follows.
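A sketch of such a deconvolutional upsampler (the helper name is ours; the weights follow the standard separable bilinear kernel):

    import torch
    import torch.nn as nn

    def bilinear_upsampler(channels, factor):
        # Transposed convolution whose fixed weights perform bilinear interpolation.
        size = 2 * factor - factor % 2
        center = (size - 1) / 2 if size % 2 == 1 else factor - 0.5
        og = torch.arange(size, dtype=torch.float32)
        filt = 1 - torch.abs(og - center) / factor       # 1-D bilinear profile
        kernel = filt[:, None] * filt[None, :]           # separable 2-D kernel
        deconv = nn.ConvTranspose2d(channels, channels, size, stride=factor,
                                    padding=(size - factor) // 2,
                                    groups=channels, bias=False)
        with torch.no_grad():
            deconv.weight.zero_()
            deconv.weight[:, 0] = kernel                 # same kernel for each channel
        deconv.weight.requires_grad_(False)              # weights stay fixed
        return deconv

For example, bilinear_upsampler(33, 4)(logits) upsamples a 33-class score map by a factor of 4.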
5. RESULTS
The RCNN has three parameterized layers (Figure 1). The first parameterized layer is a convolutional layer followed by a 2 × 2 non-overlapping max pooling layer, which reduces the size of the feature maps and thus saves computing cost and memory. The other two parameterized layers are RCLs, with another 2 × 2 max pooling layer placed between them. The numbers of feature maps in these layers are 32, 64 and 128. The filter size in the first convolutional layer is 7 × 7, and the feed-forward and recurrent filters in the RCLs are all 3 × 3. Three scales of images are used, and neighboring scales differ by a factor of 2 on each side of the image.
Figure 2: Segmented results.
Model                                    PA (%)    CA (%)
Multi-scale CNN + cover [7]              78.5      29.6
Multi-scale CNN + rCPN [3]               79.6      33.6
FCNN (fine-tuned from VGG model [15])    85.1      51.7
Segment RNN                              85.23     45.6

Table 1: Segmentation accuracy on the SIFT Flow dataset (PA: per-pixel accuracy; CA: average per-class accuracy).
6. CONCLUSION
A feed-forward convolutional network, trained end-to-end in a supervised manner [8,18] and fed with raw pixels from large patches over multiple scales, can produce state-of-the-art performance on standard scene parsing datasets. The model does not rely on engineered features; it uses purely supervised training from fully labeled images to learn appropriate low-level and mid-level features. We also conclude that training a useful autoencoder on our data would prove challenging. First, our data is very unbalanced, and as such an autoencoder model rapidly learns to detect easy textures and patterns, which discourages its ability to detect rarer and more complex ones.

Water and farmland have much simpler texture patterns and as such lie on a small region of the texture manifold, whereas city textures lie on a much larger manifold: there are many more complex structures associated with them, so each of their training points on the manifold is much rarer and in that sense underrepresented [19]. This is a problem we did not approach but certainly could have, as it is a recurring one in any real-world application.
7. REFERENCES
P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik (2011). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 898-916.
A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012). ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097-1105.
A. Sharma, O. Tuzel, and M.-Y. Liu (2014). Recursive context propagation network for semantic scene labeling. In NIPS, pages 2447-2455.
Y. Boykov and M. P. Jolly (2001). Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In Proceedings of the International Conference on Computer Vision (ICCV), volume 1.
B. Fulkerson, A. Vedaldi, and S. Soatto (2009). Class segmentation and object localization with superpixel neighborhoods. In ICCV, pages 670-677.
J. Carreira and C. Sminchisescu (2012). CPMC: Automatic object segmentation using constrained parametric min-cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1312-1328.
C. Farabet, C. Couprie, L. Najman, and Y. LeCun (2013). Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1915-1929.
C. Farabet, B. Martini, P. Akselrod, S. Talay, Y. LeCun, and E. Culurciello (2010). Hardware accelerated convolutional neural networks for synthetic vision systems. In International Symposium on Circuits and Systems (ISCAS'10), Paris.
D. Grangier, L. Bottou, and R. Collobert (2009). Deep convolutional networks for scene parsing. In ICML 2009 Deep Learning Workshop.
C. Farabet, C. Couprie, L. Najman, and Y. LeCun (2012). Scene parsing with multiscale feature learning, purity trees and optimal covers. In Proceedings of the International Conference on Machine Learning (ICML), pp. 575-582.
H. Schulz and S. Behnke (2012). Learning object-class segmentation with convolutional neural networks. In 11th European Symposium on Artificial Neural Networks (ESANN).
J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2015). Gated feedback recurrent neural networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML).
K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun (2009). What is the best multi-stage architecture for object recognition? In Proceedings of the International Conference on Computer Vision (ICCV'09).
K. He, X. Zhang, S. Ren, and J. Sun (2014). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
K. Simonyan and A. Zisserman (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.
L. Najman and M. Schmitt (1996). Geodesic saliency of watershed contours and hierarchical segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(12):1163-1173.
Y. LeCun, L. Bottou, G. Orr, and K. Muller (1998). Efficient backprop. In Neural Networks: Tricks of the Trade, Springer.
L. Kong, C. Dyer, and N. A. Smith (2015). Segmental recurrent neural networks. arXiv preprint arXiv:1511.06018.
M. Liang and X. Hu (2015). Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
M. Liang, X. Hu, and B. Zhang (2015). Convolutional neural networks with intra-layer recurrent connections for scene labeling. In Advances in Neural Information Processing Systems (NIPS).
M. Liang, X. Hu, and B. Zhang (2009). Multiscale recurrent neural networks. In Proceedings of the International Conference on Computer Vision (ICCV), volume 1.
P. H. Pinheiro and R. Collobert (2014). Recurrent convolutional neural networks for scene parsing. In Proceedings of the International Conference on Machine Learning (ICML).
B. Russell, A. Torralba, C. Liu, R. Fergus, and W. Freeman (2007). Object recognition by scene alignment. In Advances in Neural Information Processing Systems (NIPS).
R. Socher, C. C. Lin, A. Y. Ng, and C. D. Manning (2011). Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML).
R. Pascanu, T. Mikolov, and Y. Bengio (2012). On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063.
S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller (2008). Multi-class segmentation with relative location prior. International Journal of Computer Vision, 80(3):300-316.
S. Gould, R. Fulton, and D. Koller (2009). Decomposing a scene into geometric and semantically consistent regions. In IEEE International Conference on Computer Vision, pages 1-8.
S. Turaga, K. Briggman, M. Helmstaedter, W. Denk, and H. Seung (2009). Maximin affinity learning of image segmentation. In NIPS.
S. Ren, K. He, R. Girshick, and J. Sun (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS).
S. El Hihi and Y. Bengio (1995). Hierarchical recurrent neural networks for long-term dependencies. In Advances in Neural Information Processing Systems, pp. 493-499.
T. Mikolov, M. Karafiát, L. Burget, J. Cernocky, and S. Khudanpur (2010). Recurrent neural network based language model. In Proceedings of INTERSPEECH.
T. Lin, B. G. Horne, P. Tino, and C. L. Giles (1996). Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7(6):1329-1338.
V. Lempitsky, A. Vedaldi, and A. Zisserman (2011). A pylon model for semantic segmentation. In Advances in Neural Information Processing Systems.
Y. Wu, S. Zhang, Y. Zhang, Y. Bengio, and R. Salakhutdinov (2016). On multiplicative integration with recurrent neural networks. arXiv preprint arXiv:1606.06630.