Wei Li ([email protected]), Fudan University

Mengyao Zhu ([email protected]), Shanghai University

Bing-Kun Bao ([email protected]), Nanjing University of Posts and Telecommunications, Nanjing, China

Min Xu ([email protected]), University of Technology Sydney, Australia

Xi Shao ([email protected]), Nanjing University of Posts and Telecommunications, Nanjing, China


Time-sequenced media, such as video, audio, and music, exists everywhere in our daily lives. As an important portion of multimedia, it is more vivid and comprehensive than static media. However, it is still difficult to process and analyze this kind of media, especially due to its unstructured nature and vast scale. Recently, deep learning has brought breakthroughs in many fields that were previously difficult to tackle. One field where deep learning has the potential to help is time-sequenced multimedia processing. In addition, deep reinforcement learning methods have been used for human-machine interaction. This workshop seeks innovative papers that exploit novel technologies and solutions from both industry and academia for highly effective and efficient time-sequenced multimedia computing and processing.

Scope and Topics

This workshop of ICME 2019 invites original, high-quality workshop papers relating to all topics of time-sequenced multimedia computing and processing. The list of possible topics includes, but is not limited to:

  • Time-sequenced multimedia retrieval
  • Time-sequenced multimedia fusion methods
  • Time-sequenced multimedia content analysis
  • Time-sequenced multimedia applications
  • Time-sequenced multimedia representation and processing
  • Time-sequenced multimedia mining
  • Time-sequenced multimedia indexing
  • Time-sequenced multimedia annotation and recommendation
  • Time-sequenced multimedia abstraction and summarization


Yun Fu ([email protected]), Northeastern University

Joseph P Robinson ([email protected]), Northeastern University

Ming Shao ([email protected]), University of Massachusetts, Dartmouth

Siyu Xia ([email protected]), Southeast University


We have witnessed remarkable advances in facial recognition technologies over the past few years due to the rapid development of deep learning and large-scale, labeled face collections. There is a need for ever more challenging image and video collections to solve emerging problems in the fields of faces and multimedia, along with more advanced techniques to model and explore complex, real-world data.

In parallel to conventional face recognition, researchers have continued to show an increasing interest in the topic of face synthesis and morphing. Work in this area has been done using imagery, videos, and various other modalities (e.g., hand sketches, 3-dimensional models, view-points, etc.). Some works focus more on the individual or individuals (e.g., with/without makeup, age varying, predicting a child's appearance from parents, face swapping, etc.), while others leverage generative modeling for semi-supervised learning of recognition or detection systems. Besides, and in some cases alongside, generative modeling are methodologies to automatically interpret and analyze faces for a better understanding from visual context (e.g., relationships of persons in a photo, whether blood relatives or affiliated, age estimation, occupation recognition, etc.). Certainly, it is an age where many creative approaches and views are proposed for face synthesis. In addition, various advances are being made in other technologies involving automatic face understanding: face tracking (e.g., landmark detection, facial expression analysis, face detection), face characterization (e.g., behavioral understanding, emotion recognition), facial characteristic analysis (e.g., gait, age, gender and ethnicity recognition), group understanding via social cues (e.g., kinship, non-blood relationships, personality), and visual sentiment analysis (e.g., temperament, arrangement). The ability to create effective models for visual certainty has significant value in both the scientific communities and the commercial market, with applications that span topics of human-computer interaction, social media analytics, video indexing, visual surveillance, and Internet vision.

Scope and Topics

The 2nd workshop on faces in multimedia serves as a forum for researchers to review the recent progress in automatic face understanding and synthesis in multimedia. Special interest will be given to generative-based modeling. The workshop will include up to two keynotes, along with peer-reviewed papers (oral and poster). Original high-quality contributions are solicited on the following topics:

  • Face synthesis and morphing; works on generative modeling;
  • Soft biometrics and profiling of faces: age, gender, ethnicity, personality, kinship, occupation, and beauty ranking;
  • Deep learning practice for social face problems with ambiguity including kinship verification, family recognition and retrieval;
  • Understanding of familial features from vast amounts of social media data;
  • Discovery of the social groups from faces and the context;
  • Mining social face relations through metadata as well as visual information;
  • Tracking, extraction, and analysis of face models captured by mobile devices;
  • Face recognition in low-quality or low-resolution video or images;
  • Novel mathematical models and algorithms, sensors and modalities for face & body gesture and action representation;
  • Analysis and recognition for cross-domain social media;
  • Novel social applications involving detection, tracking & recognition of faces;
  • Face analysis for sentiment analysis in social media;
  • Other applications involving face analysis in social media content.


Prof. Ran He ([email protected]), Chinese Academy of Sciences, China

Prof. Xiaotong Yuan ([email protected]), Nanjing University, China

Prof. Jitao Sang ([email protected]), Beijing Jiaotong University, China


With the advent of the information era, as well as the wide adoption of media technologies in people's daily lives, there is high demand for efficiently processing and organizing the multimedia information rapidly emerging from the Internet, wider surveillance networks, mobile devices, smart cameras, etc. Due to the importance of multimedia information (images, sounds, videos) and its promising applications, multimedia computing has attracted strong interest from researchers. It is critical to find good information theoretic metrics to develop robust machine learning methods for multimedia computing. Information theory also provides new means and solutions for multimedia computing.

The emergence of Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs) in recent years has greatly promoted the development of multimedia computing. However, the current progress is still far from its promise. Since there is no explicit expression for the data distribution learned by GANs and VAEs, both generative models lack interpretability. Besides, due to the complexity of high-dimensional multimedia information, the visual data generated by GANs and VAEs may not be photo-realistic. To advance the progress further, one often adopts information theory and optimization strategies from traditional deep learning to find new solutions for multimedia computing.

Scope and Topics

The goal of this workshop is to provide a forum for recent research advances in the area of information theory and multimedia computing. The workshop seeks original high-quality submissions from leading researchers and practitioners in academia as well as industry, dealing with theories, applications and databases of visual events. Topics of interest include, but are not limited to:

  • Information theoretic learning for multimedia computing
  • Generative adversarial networks for multimedia computing
  • Variational auto-encoders for multimedia computing
  • Probabilistic graph models for multimedia computing
  • Graph Convolutional Networks for multimedia computing
  • Domain adaptation for multimedia computing
  • Reinforcement learning for multimedia computing
  • Adversarial samples for multimedia computing
  • Machine learning for multimedia computing
  • Entropy, mutual information, correntropy, divergence, Wasserstein distance, KL distance, Maximum mean discrepancy for multimedia computing
  • New information theoretic measures and optimization methods
  • Cross-modal/cross-domain learning
  • Multimodal data understanding
  • Multi-spectrum data fusion
  • Evaluation methodologies for multimedia computing
  • Information theory-based applications (security, sports, news, etc.)


Wei Zhang ([email protected]), JD AI Research, China

Ting Yao ([email protected]), JD AI Research, China

Wen-Huang Cheng ([email protected]), National Chiao Tung University


Artificial intelligence (AI) emphasizes the creation of intelligent machines that work and react like humans, and it has been substantially changing almost every aspect of our daily life. With the rise of AI, an increasing number of researchers and practitioners have dedicated tremendous efforts to push the limit of machine intelligence. In particular, recent advancements in computer vision and multimedia have boosted the performance of many scientific tasks, as well as the user experience in real applications. Meanwhile, the ever-growing fashion industry reached a 2% share of the world’s GDP in 2018. Visual fashion computation is an emerging research area that studies computable fashion-related tasks in the visual domain, e.g., item / attributes / style recognition, landmark / pose estimation, cloth / human parsing, street-to-shop retrieval, popularity prediction, outfit compatibility, design and image synthesis.

The goal of this workshop is to call for a coordinated effort to understand the scenarios and challenges emerging in visual fashion computing with AI techniques, identify key tasks and evaluate the state of the art, showcase innovative methodologies and ideas, introduce large-scale real systems or applications, as well as propose new real-world datasets and discuss future directions. Fashion-related data of interest covers a wide spectrum, ranging from fashion products and the human body to design and art. We solicit manuscripts in all fields that explore the synergy of fashion computation and AI techniques.

Scope and Topics

We believe the workshop will offer a timely collection of research updates to benefit researchers and practitioners working in broad fields ranging from computer vision and multimedia to machine learning. To this end, we solicit original research and survey papers addressing the topics listed below (but not limited to):

  • AI technologies for fashion product analysis;
  • AI technologies for human analysis;
  • AI technologies for art and design;
  • AI technologies for fashion search and recommendation;
  • AI technologies for fashion content generation;
  • AI technologies for fashion-related image / video captioning;
  • Data analytics and demo systems for visual fashion computation;
  • Fashion related datasets and evaluation protocols;
  • AI-assisted or human-AI co-operated technologies for fashion computing;
  • Emerging new applications in visual fashion computation


Tian Gan ([email protected]), Shandong University, China

Wen-Huang Cheng ([email protected]), National Chiao Tung University, Taiwan

Kai-Lung Hua ([email protected]), National Taiwan University of Science and Technology, Taiwan

Klaus Schoeffmann ([email protected]), Klagenfurt University, Austria

Vladan Velisavljevic ([email protected]), University of Bedfordshire

Christian von der Weth ([email protected]), National University of Singapore, Singapore


The intimate presence of mobile devices in our daily life, such as smartphones and various wearable gadgets like smart watches, has dramatically changed the way we connect with the world around us. Users rely on mobile devices to maintain an always-on relation to information, personal and social networks, etc. Nowadays, in the era of the Internet-of-Things (IoT), these devices are further extended by smart sensors and actuators, which augment multimedia devices with additional data and possibilities. With a growing number of powerful embedded mobile sensors such as cameras, microphones, GPS, gyroscopes, accelerometers, digital compasses, and proximity sensors, a variety of data is available, enabling new sensing applications across diverse research domains comprising mobile media analysis, mobile information retrieval, mobile computer vision, mobile social networks, mobile human-computer interaction, mobile entertainment, mobile gaming, mobile healthcare, mobile learning, and mobile advertising. Regardless of the application fields, many issues and challenges brought by the emerging technologies for mobile multimedia still lie ahead and many research questions remain to be answered. For example, seamless user experience has been identified as one key factor in designing mobile multimedia applications for multiple form factors. Yet, its provision is challenging and requires effective integration of rich mobile sensors and multidisciplinary research, such as multimedia content adaptation and user behavior analysis. Also, the effective and efficient use of multimedia data on mobile devices, including convenient interaction and appropriate visualization, is still not fully solved. Small screen sizes, integrated cameras and microphones, and also the new gesture and haptic sensors, pose new challenges but at the same time provide new opportunities and innovative ways for multimedia interaction.
In addition, for power-saving purposes, application-driven energy management is an important technical consideration in mobile multimedia computing (e.g., how to offload computation-intensive tasks to cloud servers to save energy and extend battery lifetimes for mobile users?). Besides energy optimization, it is often also necessary to optimize runtime performance when dealing with multimedia data in a mobile scenario (e.g., by using specific programming techniques, or by offloading specific parts to the cloud). The workshop on Mobile Multimedia Computing (MMC 2019) aims to bring together researchers and professionals from worldwide academia and industry for showcasing, discussing, and reviewing the whole spectrum of technological opportunities, challenges, solutions, and emerging applications in mobile multimedia.

Scope and Topics

Topics of interest include, but are not limited to:

  • 2D/3D computer vision on mobile devices
  • Action/gesture/object/speech recognition with mobile sensors
  • Computational photography on mobile devices
  • Human computer interaction with mobile and wearable devices
  • Mobile multimedia indexing and retrieval
  • Mobile visual search
  • Mobile multimedia content adaptation and adaptive streaming
  • Mobile social signal processing
  • Mobile virtual and augmented reality
  • Multimedia data in the IoT
  • Multimedia Cloud Computing
  • Multi-modal and multi-user mobile sensing
  • Novel multimedia applications taking advantage of mobile devices
  • Power saving issues of mobile multimedia computing
  • Personalization, privacy and security in mobile multimedia
  • User behavior analysis of mobile multimedia applications
  • Ubiquitous computing/adaptation on mobile and wearable devices
  • Visualization of multimedia data on mobile devices
  • Other topics related to mobile multimedia computing


Dong Zhao ([email protected]), Associate Professor, Beijing University of Posts and Telecommunications

Chenqiang Gao ([email protected]), Professor, Chongqing University of Posts and Telecommunications

Jiayi Ma ([email protected]), Associate Professor, Wuhan University

Quan Zhou ([email protected]), Associate Professor, Nanjing University of Posts and Telecommunications

Ji Zhao ([email protected]), Senior Researcher, TuSimple

Yu Zhou ([email protected]), Assistant Professor, Beijing University of Posts and Telecommunications


Multimedia technology provides a valuable resource to enhance unmanned systems, e.g., robots, unmanned aerial vehicles and driverless cars. This workshop aims to bring together researchers, engineers, and industrial practitioners who are concerned with the applications of multimedia in unmanned systems. Joint efforts along this direction are expected to bridge the gap between multimedia technology and unmanned systems.

All the topics covered by the workshop are core issues in the development of multimedia technology and unmanned systems and have received increasing attention in recent years. We believe these topics will attract broad interest in the fields of multimedia and robotics, and hence a sufficient number of high-quality submissions.

Compared with related workshops, this workshop pays more attention to emerging issues of multimedia applications in unmanned systems from the perspective of industrial applications, including, but not limited to, multimedia data compilation, multimedia scene autonomous perception, multimedia scene understanding, UAV sensing networks, and core problems in real applications.

Scope and Topics

The scope of this workshop covers the various aspects of research on the application of multimedia technology in unmanned systems, namely, mobile robots, unmanned aerial vehicles and driverless cars. Topics of interest include, but are not limited to, the following fields:

  • Robot Vision.
  • Simultaneous Localization and Mapping.
  • 2D/3D Obstacle Avoidance.
  • Image/Video Understanding.
  • Deep Learning for Robot.
  • Image Matching.
  • UAV Sensing Network and Applications.
  • Multimedia data in the IoT.
  • Industrial applications of the multimedia technology for unmanned systems.


M. Shamim Hossain ([email protected]), King Saud University, KSA

Stefan Goebel ([email protected]), KOM, TU Darmstadt, Germany

Yin Zhang ([email protected]), Zhongnan University of Economics and Law, China


Today multimedia services and technologies play an important role in providing and managing smart health services to anyone, anywhere and anytime seamlessly. These services and technologies enable doctors and other health care professionals to have immediate access to smart-health information for efficient decision making as well as better treatment. Researchers are working on developing various multimedia tools, techniques and services to better support smart-health initiatives. In particular, works in smart-health record management, elderly health monitoring, and real-time access to medical images and video are of great interest.

Scope and Topics

This workshop aims to report high-quality research on recent advances in various aspects of e-health, more specifically on the state-of-the-art approaches, methodologies and systems in the design, development, deployment and innovative use of multimedia services, tools and technologies for smart health care. Authors are solicited to submit complete unpublished papers on topics of interest including, but not limited to, the following:

  • Edge-Cloud for Smart Healthcare
  • Deep learning approach for smart healthcare
  • Serious Games for health
  • Multimedia big data for health care applications
  • Adaptive exergames for health
  • Multimedia Enhanced Learning, Training & Simulation for Smart Health
  • Sensor and RFID technologies for Smart-health
  • Cloud-based smart health Services
  • Resource allocation for Media Cloud-assisted health care
  • IoT-Cloud for Smart Healthcare
  • Wearable health monitoring
  • Smart health service management
  • Context-aware Smart-Health services and applications
  • Elderly health monitoring
  • Collaborative Smart-Health
  • Haptics for Surgical/medical Systems
  • 5G Tactile Internet for Smart Health


Wei-Ta Chu ([email protected]), National Chung Cheng University

Norimichi Tsumura ([email protected]), Graduate School of Engineering, Chiba University

Shoji Yamamoto ([email protected]), Tokyo Metropolitan College of Industrial Technology

Toshihiko Yamasaki ([email protected]), The University of Tokyo


The MMArt-ACM workshop will solicit original contributions on diverse works covering multimedia artworks analysis and attractiveness computing. This workshop will be run with the following activities:

  • Oral presentation: High-quality contributions will be presented in the oral sessions. Each submission will be reviewed by at least three reviewers coming from diverse areas around the world.
  • Academic and industry keynote talks: We will arrange at least one keynote talk, either from the academic perspective or the industry perspective. An academic keynote speech presents state-of-the-art research results and proposes open technical questions. A keynote speaker invited from industry will show research opportunities and challenges, marketing, business models, and copyright issues.
  • Invited talks from the main conference: Authors having contributions related to multimedia artworks or attractiveness computing at the main conference may be invited to give talks at this workshop. This arrangement not only strengthens the MMArt-ACM workshop but also facilitates highly-interactive communication between the speaker and the audience.

Scope and Topics

Contributions on the analysis and applications of the following areas are solicited. Topics include, but are not limited to:

  • Creation: content synthesis and collaboration; creation of novel artworks or attractive content; connecting real-world art with digital artworks; attractive content creation.
  • Editing: content authoring, composition, summarization, and presentation; multimodality integration.
  • Indexing and retrieval: novel features and structure to index multimedia artworks or attractive content; retrieval interface and model; socially-aware analysis.
  • Methodology: machine learning for multimedia artworks or attractive content; classification and pattern recognition; generic model and heuristics in analysis.
  • Interaction: interaction on various devices; user in the loop of computation; human factors in artworks or attractiveness.
  • Evaluation: dataset development; evaluation of systems for multimedia artworks or attractive content; design of user study; limitation of the state-of-the-art.
  • Novel applications: novel application scenarios; development of novel challenges and perspectives.


Liangliang Ren, Department of Automation, Tsinghua University

Guangyi Chen, Department of Automation, Tsinghua University

Dr. Jiwen Lu (Contact Person) ([email protected]), Department of Automation, Tsinghua University, China


Human identities are an important information source in many high-level multimedia analysis tasks such as video summarization, semantic retrieval, interaction indexing, and scene understanding. The aim of this workshop is to bring together researchers in computer vision and multimedia to share ideas and propose solutions on how to address the many open issues in human identification, and to present new datasets that introduce new challenges in the field. Human identification in multimedia is a relatively new problem in multimedia analysis and has recently attracted the attention of many researchers in the field. Human identification is significant to many multimedia-related applications such as video surveillance, video search, human-computer interaction, and video summarization. Recent advances in feature representations, modeling, and inference techniques have led to significant progress in the field. The proposed workshop aims to explore recent progress in human identification with multimedia data by taking stock of the past five years of work in this field and evaluating different algorithms. The proposed workshop will help the community to understand the challenges and opportunities of human identification in multimedia for the next few years.

Scope and Topics

Topics of interest include (but are not limited to) the following streams:

  • Multimedia feature representation
    • Image feature representation
    • Video feature representation
    • Audio feature representation
    • Multiview feature representation
    • Multimodal feature representation
  • Statistical learning for human identification
    • Sparse learning for human identification
    • Dictionary learning for human identification
    • Manifold learning for human identification
    • Metric learning for human identification
    • Deep learning for human identification
  • Applications
    • Video surveillance
    • Multimedia search
    • Video summarization
    • Benchmark datasets
    • Comparative evaluations


Prof. Lifang Wu ([email protected]), Beijing University of Technology, China.

Prof. Jufeng Yang ([email protected]), Nankai University, China.

Prof. Rongrong Ji ([email protected]), Xiamen University, China.


With the rapid development of digital photography technology and the widespread popularity of social networks, people tend to express their opinions using images and videos together with text, resulting in a large volume of visual content. Managing, recognizing and retrieving such gigantic visual collections poses significant technical challenges. Visual emotion analysis of large-scale visual content is rather challenging because it involves multidisciplinary understanding of human perception and behavior. The development is constrained mainly by the affective gap and the subjectivity of emotion perceptions. Recently, great advancements in machine learning and artificial intelligence have made large-scale affective computing of visual content a possibility, which has received a lot of interest and attention from both academic and industrial research communities.

Scope and Topics

This workshop seeks original contributions reporting the most recent progress on different research directions and methodologies on visual emotion recognition and retrieval, as well as the wide applications. It targets a mixed audience of researchers and product developers from the multimedia community, and may draw attention of people in machine learning, psychology, computer vision, etc. The topics of interest include, but are not limited to:

  • Dominant emotion recognition
  • Discrete emotion distribution estimation
  • Continuous emotion distribution estimation
  • Personalized emotion perception prediction
  • Group emotion clustering and affective region detection
  • Weakly-supervised/unsupervised learning for emotion recognition
  • Few/one shot learning for emotion recognition
  • Deep learning and reinforcement learning for emotion recognition
  • Metric learning for emotion recognition
  • Multi-modal/multi-task learning for emotion recognition
  • Image retrieval incorporating emotion
  • Emotion based visual content summarization
  • Image captioning with emotion
  • Virtual reality, such as affective human-computer interaction
  • Applications in entertainment, education, psychology, health care, etc.


Weiyao Lin ([email protected]), Shanghai Jiao Tong University, China.

John See ([email protected]), Multimedia University, Malaysia.

Michael Ying Yang ([email protected]), University of Twente, the Netherlands.



With the rapid growth of video surveillance applications and services, the amount of surveillance video has become extremely "big", which makes human monitoring tedious and difficult. Therefore, there exists a huge demand for smart surveillance techniques which can perform monitoring in an automatic or semi-automatic way. Firstly, with the huge amount of surveillance video in storage, video analysis tasks such as event detection, action recognition, and video summarization are of increasing importance in applications including events-of-interest retrieval and abnormality detection. Secondly, with the fast increase of semantic data (e.g., objects' trajectories and bounding boxes) extracted by video analysis techniques, semantic data has become an essential data type in surveillance systems, introducing new challenging topics, such as efficient semantic data processing and semantic data compression, to the community. Thirdly, with the rapid shift from static, center-based processing to dynamic collaborative computing and processing among distributed video processing nodes or cameras, new challenges such as multi-camera joint analysis, human re-identification, and distributed video processing are emerging. These challenges require extending the existing approaches or exploring new feasible techniques.

Scope and Topics

This workshop is intended to provide a forum for researchers and engineers to present their latest innovations and share their experiences on all aspects of design and implementation of new surveillance video analysis and processing techniques. Topics of interest include, but are not limited to:

  • Event detection, action recognition, and activity analysis in surveillance videos
  • Multi-camera joint analysis and recognition
  • Object detection and tracking in surveillance videos
  • Recognition and parsing of crowded scenes
  • Human re-identification
  • Summarization and synopsis on surveillance videos
  • Surveillance scene parsing, segmentation, and analysis
  • Semantic data processing and compression in surveillance systems
  • Facial property analysis


Yang Yang ([email protected]), University of Electronic Science and Technology of China, China.

Yang Wang ([email protected]), Dalian University of Technology, China.

Xing Xu ([email protected]), University of Electronic Science and Technology of China, China.

Zi Huang ([email protected]), The University of Queensland, Australia.


Due to the explosive growth of multi-modal data (e.g., images, videos, blogs, sensor data, etc.) on the Internet, together with the urgent requirement of jointly understanding such heterogeneous data, research on multi-modal data, especially on bridging the semantic gap between visual and textual data, has attracted a huge amount of interest from the computer vision and natural language processing communities. Great efforts have been made to study the intersection of cross-media data, especially combining vision and language, and notable applications include (i) generating image descriptions using natural language, (ii) visual question answering, (iii) retrieval of images based on textual queries (and vice versa), (iv) generating images/videos from textual descriptions, (v) language grounding and many other related topics.

Though booming recently, this research remains challenging because reasoning about the connections between visual content and linguistic words is difficult. Semantic knowledge involves reasoning about the external knowledge of a word. Although reasoning ability is often claimed in recent studies, most “reasoning” simply uncovers latent connections between visual elements and textual/semantic facts during training on manually annotated datasets with a large number of image-text pairs. Furthermore, recent studies are often specific to certain datasets and lack generalization ability, i.e., the semantic knowledge obtained from a specific dataset cannot be directly transferred to other datasets, as different benchmarks may have different characteristics of their own. One potential solution is leveraging external knowledge resources (e.g., social-media sites, expert systems and Wikipedia) as an intermediate bridge for knowledge transfer. However, it remains unclear how to appropriately incorporate comprehensive knowledge resources for more effective knowledge-based reasoning and transfer across datasets. From a broad perspective of applications, integrating vision and language for knowledge reasoning and transfer has yet to be well explored in existing research.

Scope and Topics

This special issue targets researchers and practitioners from both academia and industry to explore recent advanced learning models and systems that address the challenges in cross-media analysis. It provides a forum to publish recent state-of-the-art research findings, methodologies, technologies and services in vision-language technology for practical applications. We invite original and high-quality submissions addressing all aspects of this field, which is closely related to multimedia search, multi-modal learning, cross-media analysis, cross-knowledge transfer and so on.

Topics of interest include, but are not limited to:

  • Deep learning methods for language and vision
  • Generative adversarial networks for cross-modal data analysis
  • Big data storage, indexing, and searching over multi-modal data
  • Transfer learning for vision and language
  • Cross-media analysis (retrieval, hashing, reasoning, etc)
  • Multi-modal learning and semantic representation learning
  • Learning multi-graph over cross-modal data
  • Generating image/video descriptions using natural language
  • Visual question answering/generation on images or videos
  • Retrieval of images based on textual queries (and vice versa)
  • Generating images/videos from textual descriptions
  • Language grounding


Lu Fang (primary contact) ([email protected]), Associate Professor, Tsinghua-Berkeley Shenzhen Institute, China.

David J. Brady ([email protected]), Professor, Duke University, USA.

Shenghua Gao ([email protected]), Assistant Professor, ShanghaiTech University, China.

Yuchen Guo ([email protected]), Postdoc, Tsinghua University, China.



Gigapixel videography, beyond the resolution of a single camera and human visual perception, plays an important role in capturing large-scale dynamic scenes with extremely high resolution for both macro and micro domains. Restricted by the spatial-temporal bandwidth product of the optical system, the size, weight, power and cost are central challenges in gigapixel video.

The UnstructureCam we designed, an end-to-end unstructured multi-scale camera system, shows the ability of real-time capture, dynamic adjustment of local-view cameras, and online warping for synthesizing gigapixel video. We take advantage of our UnstructureCam to develop the Gigapixel Video Dataset. The datasets we provide are all characterized by extremely high resolution, large scale, wide FOV and huge data throughput. We hope that our datasets will help researchers solve the corresponding computer vision tasks.

Scope and Topics

Gigapixel videography plays an important role in capturing large-scale dynamic scenes for both macro and micro domains. Benefiting from the recent progress of gigapixel camera design, the capture of gigapixel-level image/video is becoming more and more convenient. However, along with the emergence of gigapixel-level image/video, the corresponding computer vision tasks remain unsolved, due to the extremely high resolution, large scale, and huge data volume induced by the gigapixel camera.

To this end, we solicit original and ongoing research addressing the topics listed below (but not limited to):

  • Long-term multi-target object tracking
  • Large-scale generic object detection
  • Large-scale human action recognition
  • Large-scale face detection/recognition
  • Large-scale anomaly detection
  • Large-scale crowd counting
  • Gigapixel video enhancement
  • Gigapixel video compression and streaming
  • Gigapixel video depth estimation

Workshop Chairs

Jingdong Wang
Microsoft Research Asia, China
Susanne Boll
University of Oldenburg, Germany
Z. Jane Wang
University of British Columbia, Canada