Bio

I am currently an associate professor at CITIUS, University of Santiago de Compostela (Galicia, Spain). My research interests include parallel and distributed computing, Big Data technologies, programming models, and software optimization techniques for emerging architectures. I received a B.Sc. in physics and a Ph.D. in computer science (2006) from the University of Santiago de Compostela (Spain). I was a visiting postdoctoral researcher at University Carlos III de Madrid (Spain) and the University of Illinois at Urbana-Champaign (USA), and I also worked as a researcher and project manager at the Galicia Supercomputing Center (Spain).

Publications

Journals

  1. Efficient phylogenetic tree inference for massive taxonomic datasets: harnessing the power of a server to analyze 1 million taxa    
    César Piñeiro and Juan C. Pichel.
    GigaScience, Vol. 13, pages 1-12, 2024.
  2. QPU integration in OpenCL for heterogeneous programming   
    Jorge Vázquez-Pérez, César Piñeiro, Juan C. Pichel, Tomás F. Pena and Andrés Gómez.
    Journal of Supercomputing, 2024.
  3. An Unsupervised Perplexity-based Method for Boilerplate Removal       
    Marcos Fernández-Pichel, Manuel Prada-Corral, David E. Losada, Juan C. Pichel and Pablo Gamallo.
    Natural Language Engineering, Vol. 30, pages 132-149, 2024.
  4. BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale     
    César Piñeiro and Juan C. Pichel.
    GigaScience, Vol. 12, pages 1-12, 2023.
  5. A machine learning approach to model the impact of line edge roughness on gate-all-around nanowire FETs while reducing the carbon footprint
    Antonio García-Loureiro, Natalia Seoane, Julián G. Fernández, Enrique Comesaña and Juan C. Pichel.
    PLoS ONE, Vol. 18, Issue 7, pages 1-17, 2023.
  6. An Accurate Machine Learning Model to Study the Impact of Realistic Metal Grain Granularity on Nanosheet FETs
    Julián G. Fernández, Natalia Seoane, Enrique Comesaña, Juan C. Pichel and Antonio García-Loureiro.
    Solid State Electronics, pages 108710, 2023.
  7. A Multistage Retrieval System for Health-related Misinformation Detection   
    Marcos Fernández-Pichel, David E. Losada and Juan C. Pichel.
    Engineering Applications of Artificial Intelligence, Vol. 115, pages 1-17, 2022.
  8. A Unified Framework to Improve the Interoperability between HPC and Big Data Languages and Programming Models       
    César Piñeiro and Juan C. Pichel.
    Future Generation Computer Systems, Vol. 134, pages 123-139, 2022.
  9. Real-Time Focused Extraction of Social Media Users   
    Rodrigo Martínez-Castaño, David E. Losada and Juan C. Pichel.
    IEEE Access, Vol. 10, pages 42607-42622, 2022.
  10. A Big Data Platform for Real Time Analysis of Signs of Depression in Social Media   
    Rodrigo Martínez-Castaño, Juan C. Pichel and David E. Losada.
    Int. Journal of Environmental Research and Public Health, Vol. 17 (3), 2020.
  11. VeryFastTree: Speeding Up the Estimation of Phylogenies for Large Alignments through Parallelization and Vectorization Strategies   
    César Piñeiro, José M. Abuín and Juan C. Pichel.
    Bioinformatics, Vol. 36, Issue 17, pages 4658-4659, 2020.
  12. A Big Data Approach to Metagenomics for All-food-sequencing      
    Robin Kobus, José M. Abuín, André Müller, Sören Lukas Hellmann, Juan C. Pichel, Tomás F. Pena, Andreas Hildebrandt, Thomas Hankeln and Bertil Schmidt.
    BMC Bioinformatics, Vol. 21 (102), 2020.
  13. Ignis: An efficient and scalable multi-language Big Data framework   
    César Piñeiro, Rodrigo Martínez-Castaño and Juan C. Pichel.
    Future Generation Computer Systems, Vol. 105, pages 705-716, 2020.
  14. Sparse Matrix Classification on Imbalanced Datasets using Convolutional Neural Networks   
    Juan C. Pichel and Beatriz Pateiro-López.
    IEEE Access, Vol. 7, pages 82377-82389, 2019.
  15. PASTASpark: multiple sequence alignment meets Big Data   
    José M. Abuín, Tomás F. Pena and Juan C. Pichel.
    Bioinformatics, Vol. 33, Issue 18, pages 2948-2950, 2017.
  16. SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data   
    José M. Abuín, Juan C. Pichel, Tomás F. Pena and Jorge Amigo.
    PLoS ONE, Vol. 11, Issue 5, pages 1-21, 2016.
  17. Boosting Performance of a Statistical Machine Translation System Using Dynamic Parallelism   
    M. Fernández, Juan C. Pichel, José C. Cabaleiro and Tomás F. Pena.
    Journal of Computational Science, Vol. 13, pages 37-48, 2016.
  18. BigBWA: Approaching the Burrows-Wheeler Aligner to Big Data Technologies   
    José M. Abuín, Juan C. Pichel, Tomás F. Pena and Jorge Amigo.
    Bioinformatics, Vol. 31, Issue 24, pages 4003-4005, 2015.
  19. Power and Energy Implications of the Number of Threads Used on the Intel Xeon Phi   
    Oscar G. Lorenzo, Tomás F. Pena, José C. Cabaleiro, Juan C. Pichel, F.F. Rivera and D. S. Nikolopoulos.
    Annals of Multicore and GPU Programming, Vol. 2, Issue 1, pages 55-65, 2015.
  20. Análisis Morfosintáctico y Clasificación de Entidades Nombradas en un Entorno Big Data   
    Pablo Gamallo, Juan C. Pichel, Marcos García, José M. Abuín and Tomás F. Pena.
    Procesamiento del Lenguaje Natural, Vol. 53, pages 17-24, 2014.
  21. Using an Extended Roofline Model to Understand Data and Thread Affinities on NUMA Systems   
    Oscar G. Lorenzo, Tomás F. Pena, José C. Cabaleiro, Juan C. Pichel and Francisco F. Rivera.
    Annals of Multicore and GPU Programming, Vol. 1, Issue 1, pages 56-67, 2014.
  22. A Hardware Counter-Based Toolkit for the Analysis of Memory Accesses in SMPs
    Oscar G. Lorenzo, Tomás F. Pena, José C. Cabaleiro, Juan C. Pichel, Juan A. Lorenzo and Francisco F. Rivera.
    Concurrency and Computation: Practice and Experience, Vol. 26, Issue 6, pages 1328-1341, 2014.
  23. Using Sampled Information, Is It Enough for the SpMV Locality Optimization?   
    Juan C. Pichel, Juan A. Lorenzo, Dora B. Heras, Francisco F. Rivera and Tomás F. Pena.
    Concurrency and Computation: Practice and Experience, Vol. 26, Issue 1, pages 98-117, 2014.
  24. 3DyRM: A Dynamic Roofline Model Including Memory Latency Information
    Oscar G. Lorenzo, Tomás F. Pena, Juan C. Pichel, José C. Cabaleiro and Francisco F. Rivera.
    Journal of Supercomputing, Vol. 70, Issue 2, pages 696-708, 2014.
  25. Sparse Matrix–Vector Multiplication on the Single-Chip Cloud Computer Many-Core Processor   
    Juan C. Pichel and Francisco F. Rivera.
    Journal of Parallel and Distributed Computing, Vol. 73, Issue 12, pages 1539-1550, 2013.
  26. A Flexible and Dynamic Page Migration Infrastructure Based on Hardware Counters   
    Juan A. Lorenzo, Juan C. Pichel, Francisco F. Rivera, Jose C. Cabaleiro and Tomás F. Pena.
    Journal of Supercomputing, Vol. 65, Issue 2, pages 930-948, 2013.
  27. Optimization of Sparse Matrix-Vector Multiplication Using Reordering Techniques on GPUs   
    Juan C. Pichel, Francisco F. Rivera, Marcos Fernández and Aurelio Rodríguez.
    Microprocessors and Microsystems, Vol. 36, Issue 2, pages 65-77, 2012.
  28. Analyzing the Execution of Sparse Matrix-Vector Product on the Finisterrae SMP-NUMA System
    Juan C. Pichel, Juan A. Lorenzo, Dora B. Heras, José C. Cabaleiro and Tomás F. Pena.
    Journal of Supercomputing, Vol. 58, Issue 2, pages 195-205, 2011.
  29. Increasing the Locality of Iterative Methods and its Application to the Simulation of Semiconductor Devices   
    Juan C. Pichel, Dora B. Heras, José C. Cabaleiro, A. J. Garcia-Loureiro and Francisco F. Rivera.
    Int. Journal of High Performance Computing Applications, Vol. 24, Issue 2, pages 136-153, 2010.
  30. Increasing Data Reuse of Sparse Algebra Codes on Simultaneous Multithreading Architectures   
    Juan C. Pichel, Dora B. Heras, José C. Cabaleiro and Francisco F. Rivera.
    Concurrency and Computation: Practice and Experience, Vol. 21, Issue 15, pages 1838-1856, 2009.
  31. A Collective I/O Implementation Based on Inspector-Executor Paradigm
    David E. Singh, Florin Isaila, Juan C. Pichel and Jesús Carretero.
    Journal of Supercomputing, Vol. 47, Issue 1, pages 53-75, 2009.
  32. Image Segmentation Based on Merging of Sub-Optimal Segmentations   
    Juan C. Pichel, David E. Singh and Francisco F. Rivera.
    Pattern Recognition Letters, Vol. 27, Issue 10, pages 1105-1116, 2006.
  33. Performance Optimization of Irregular Codes Based on the Combination of Reordering and Blocking Techniques   
    Juan C. Pichel, Dora B. Heras, José C. Cabaleiro and Francisco F. Rivera.
    Parallel Computing, Vol. 31, Issue 8-9, pages 858-876, 2005.

Conferences

  1. MPI4All: Universal Binding Generation for MPI Parallel Programming      
    César Piñeiro, Álvaro Vázquez and Juan C. Pichel.
    24th International Conference on Computational Science (ICCS). Málaga, Spain, 2024.
  2. Large Language Models for Binary Health-Related Question Answering: A Zero- and Few-Shot Evaluation
    Marcos Fernández-Pichel, David E. Losada and Juan C. Pichel.
    24th International Conference on Computational Science (ICCS). Málaga, Spain, 2024.
  3. An Accurate Neural Network Model to Study Threshold Voltage Variability due to Metal Grain Granularity in Nanosheet FETs   
    Julián G. Fernández, Enrique Comesaña, Natalia Seoane, Juan C. Pichel and Antonio García-Loureiro.
    Joint International EuroSOI Workshop. Tarragona, Spain, 2023.
  4. CiTIUS at the TREC 2022 Health Misinformation Track   
    Marcos Fernández-Pichel, Manuel Prada-Corral, David E. Losada and Juan C. Pichel.
    Text Retrieval Conference (TREC). 2022.
  5. Social Minder: a tool for Social Media monitoring and its use for detecting COVID-19 misinformation   
    Marcos Fernández-Pichel, David E. Losada and Juan C. Pichel.
    Joint Conference of the Information Retrieval Communities in Europe (CIRCLE). Toulouse, France, 2022.
  6. CiTIUS at the TREC 2021 Health Misinformation Track   
    Marcos Fernández-Pichel, Manuel Prada-Corral, David E. Losada, Juan C. Pichel and Pablo Gamallo.
    Text Retrieval Conference (TREC). 2021.
  7. Comparing Traditional and Neural Approaches for detecting Health-related Misinformation   
    Marcos Fernández-Pichel, David E. Losada, Juan C. Pichel and David Elsweiler.
    Conference and Lab of the Evaluation Forum (CLEF). Bucharest, Romania, 2021.
  8. Reliability Prediction for Health-related Content: A Replicability Study   
    Marcos Fernández-Pichel, David E. Losada, Juan C. Pichel and David Elsweiler.
    European Conference on Information Retrieval (ECIR). Lucca, Italy, 2021.
  9. Colaboración entre docentes de una universidad alemana y una española para el desarrollo de seminarios prácticos acerca de la credibilidad de la información   
    Marcos Fernández-Pichel, David Elsweiler, David E. Losada and Juan C. Pichel.
    XXVII Jornadas sobre la Enseñanza Universitaria de la Informática (JENUI). Valencia, Spain, 2021.
  10. CiTIUS at the TREC 2020 Health Misinformation Track   
    Marcos Fernández-Pichel, David E. Losada, Juan C. Pichel and David Elsweiler.
    Text Retrieval Conference (TREC). Gaithersburg, USA, 2020.
  11. eXtream: a System for Real-time Monitoring of Dynamic Web Sources    
    Marcos Fernández-Pichel, Rodrigo Martínez-Castaño, David E. Losada and Juan C. Pichel.
    Joint Conference of the Information Retrieval Communities in Europe (CIRCLE). Samatan, France, July 2020.
  12. Dataflow Execution of Hierarchically Tiled Arrays   
    Chih-Chieh Yang, Juan C. Pichel and David A. Padua.
    European Conference on Parallel and Distributed Computing (Euro-Par). Göttingen, Germany, August 2019.
  13. LinguaKit: a Big Data-based multilingual tool for linguistic analysis and information extraction      
    Pablo Gamallo, Marcos Garcia, César Piñeiro, Rodrigo Martínez-Castaño and Juan C. Pichel.
    Int. Workshop on Advances in Natural Language Processing (ANLP). Valencia, Spain, October 2018.
  14. A New Approach for Sparse Matrix Classification Based on Deep Learning Techniques      
    Juan C. Pichel and Beatriz Pateiro-López.
    IEEE Cluster (CLUSTER). Belfast, UK, September 2018.
  15. Towards a Big Data Multi-language Framework using Docker Containers
    César Piñeiro, Rodrigo Martínez-Castaño and Juan C. Pichel.
    Jornadas Sarteco (JP). Teruel, Spain, September 2018.
  16. Building Python-Based Topologies for Massive Processing of Social Media Data in Real Time   
    Rodrigo Martínez-Castaño, Juan C. Pichel and David E. Losada.
    5th Spanish Conference in Information Retrieval (CERI). Zaragoza, Spain, June 2018.
  17. A Micromodule Approach for Building Real-Time Systems with Python-Based Models: Application to Early Risk Detection of Depression on Social Media   
    Rodrigo Martínez-Castaño, Juan C. Pichel, David E. Losada and Fabio Crestani.
    40th European Conference on Information Retrieval (ECIR). Grenoble, France, March 2018.
  18. Perldoop2: a Big Data-oriented source-to-source Perl-Java compiler     
    César Piñeiro, José M. Abuín and Juan C. Pichel.
    IEEE Int. Conference on Big Data Intelligence and Computing (DataCom). Orlando, USA, November 2017.
  19. Sentiment Analysis on Multilingual Tweets using Big Data Technologies   
    Rodrigo Martínez-Castaño, Juan C. Pichel and Pablo Gamallo.
    Jornadas Sarteco (JP). Salamanca, Spain, September 2016.
  20. Power and Energy Implications of the Number of Threads Used on the Intel Xeon Phi
    Oscar G. Lorenzo, Tomás F. Pena, José C. Cabaleiro, Juan C. Pichel, F.F. Rivera and D. S. Nikolopoulos.
    2nd Congress on Multicore and GPU Programming. Cáceres, Spain, March 2015.
  21. Perldoop: Efficient Execution of Perl Scripts on Hadoop Clusters   
    José M. Abuín, Juan C. Pichel, Tomás F. Pena, Pablo Gamallo and Marcos García.
    IEEE Int. Conference on Big Data (IEEE Big Data). Washington D.C., USA, October 2014.
  22. Thread Migration Techniques Based on Dynamic Roofline Models and Latency Information
    Oscar G. Lorenzo, Tomás F. Pena, José C. Cabaleiro, Juan C. Pichel and F.F. Rivera.
    XXV Jornadas de Paralelismo. Valladolid, Spain, September 2014.
  23. Multiobjective Optimization Technique Based on Monitoring Information to Increase the Performance of Thread Migration on Multicores
    Oscar G. Lorenzo, Tomás F. Pena, José C. Cabaleiro, Juan C. Pichel and F.F. Rivera.
    IEEE Cluster (CLUSTER). Madrid, Spain, September 2014.
  24. Hierarchically Tiled Array as a High-Level Abstraction for Codelets   
    Chih-Chieh Yang, Juan C. Pichel, Adam R. Smith and David A. Padua.
    4th Int. Workshop on Data-Flow Models for Extreme Scale Computing (DFM). Edmonton, Alberta, Canada, August 2014.
  25. DyRM: A Dynamic Roofline Model Based on Runtime Information
    Oscar G. Lorenzo, Tomás F. Pena, José C. Cabaleiro, Juan C. Pichel and Francisco F. Rivera.
    13th Int. Conference on Computational and Mathematical Methods in Science and Engineering (CMMSE). Almería, Spain, June 2013.
  26. Hardware Counters Based Analysis of Memory Accesses in SMPs
    Oscar G. Lorenzo, Tomás F. Pena, Jose C. Cabaleiro, Juan C. Pichel, Juan A. Lorenzo and Francisco F. Rivera.
    10th IEEE Int. Symposium on Parallel and Distributed Processing with Applications (ISPA). Leganés, Spain, July 2012.
  27. A Graphical Tool for Performance Analysis of Multicore Systems Based on the Roofline Model
    Francisco F. Rivera, R. Iglesias, Juan A. Lorenzo, Juan C. Pichel, Tomás F. Pena and Jose C. Cabaleiro.
    10th IEEE Int. Symposium on Parallel and Distributed Processing with Applications (ISPA). Leganés, Spain, July 2012.
  28. Experiences with the Sparse Matrix-Vector Multiplication on a Many-Core Processor      
    Juan C. Pichel and Francisco F. Rivera.
    21st Int. Heterogeneity in Computing Workshop (HCW, together with IPDPS). Shanghai, China, May 2012.
  29. Herramientas para la Monitorización de los Accesos a Memoria de Códigos Paralelos Mediante Contadores Hardware
    Oscar G. Lorenzo, Juan A. Lorenzo, Dora B. Heras, Juan C. Pichel and Francisco F. Rivera.
    XXII Jornadas de Paralelismo. La Laguna, Spain, September 2011.
  30. Study of Performance Issues on a SMP-NUMA System Using the Roofline Model
    Juan A. Lorenzo, Juan C. Pichel, Tomás F. Pena, Marcos Suarez and Francisco F. Rivera.
    Int. Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA). Las Vegas, USA, July 2011.
  31. A Study of Memory Access Patterns in Irregular Parallel Codes Using Hardware Counter-Based Tools
    Oscar G. Lorenzo, Juan A. Lorenzo, José C. Cabaleiro, Dora B. Heras, Marcos Suarez, and Juan C. Pichel.
    Int. Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA). Las Vegas, USA, July 2011.
  32. Lessons Learnt Porting Parallelisation Techniques for Irregular Codes to NUMA Systems   
    Juan A. Lorenzo, Juan C. Pichel, David LaFrance-Linden, Francisco F. Rivera and David E. Singh.
    18th Euromicro Conference on Parallel, Distributed and Network based Processing (PDP). Pisa, Italy, February 2010.
  33. On the Influence of Thread Allocation for Irregular Codes in NUMA Systems   
    Juan A. Lorenzo, Francisco F. Rivera, Petr Tuma and Juan C. Pichel.
    10th Int. Conf. on Parallel and Distributed Computing, Applications and Technologies (PDCAT). Hiroshima, Japan, December 2009.
  34. Thread Allocation Issues for Irregular Codes in the Finisterrae System
    Juan A. Lorenzo, Francisco F. Rivera, Dora B. Heras, José C. Cabaleiro, Tomás F. Pena, Juan C. Pichel and David E. Singh.
    XX Jornadas de Paralelismo. A Coruña, Galicia, Spain, September 2009.
  35. Evaluating Sparse Matrix-Vector Product on the FinisTerrae Supercomputer   
    Juan C. Pichel, Juan A. Lorenzo, Dora B. Heras and José C. Cabaleiro.
    9th Int. Conference on Computational and Mathematical Methods in Science and Engineering (CMMSE). Gijón, Spain, June 2009.
  36. Exploiting Data Compression in Collective I/O Techniques   
    Rosa Filgueira, David E. Singh, Juan C. Pichel and Jesús Carretero.
    IEEE Int. Conference on Cluster Computing. Tsukuba, Japan, September 2008.
  37. Reordering Algorithms for Increasing Locality on Multicore Processors   
    Juan C. Pichel, David E. Singh and Jesús Carretero.
    10th IEEE Int. Conference on High Performance Computing and Communications (HPCC). Dalian, China, September 2008.
  38. Data Locality Aware Strategy for Two-Phase Collective I/O   
    Rosa Filgueira, David E. Singh, Juan C. Pichel, Florin Isaila and Jesús Carretero.
    Int. Meeting High Performance Computing for Computational Science (VECPAR). Toulouse, France, June 2008.
  39. A Collective I/O Implementation Based on Inspector-Executor Paradigm   
    David E. Singh, Florin Isaila, Juan C. Pichel and Jesús Carretero.
    Int. Workshop on Scalable Data Management Applications and Systems (SDMAS). Las Vegas, USA, June 2007.
  40. A New Technique to Reduce False Sharing in Irregular Codes Based on Distance Functions   
    Juan C. Pichel, Dora B. Heras, José C. Cabaleiro and Francisco F. Rivera.
    8th Int. Symposium on Parallel Architectures, Algorithms and Networks (I-SPAN). pp. 306-311. Las Vegas, USA, December 2005.
  41. Mejora de la Localidad en SMPs: el Producto Matriz Dispersa-Vector como Caso de Estudio
    Juan C. Pichel, Dora B. Heras, José C. Cabaleiro, Marcos Boullón, David E. Singh and Francisco F. Rivera.
    XV Jornadas de Paralelismo. Almería, Spain, September 2004.
  42. Improving the Locality of the Sparse Matrix-Vector product on Shared Memory Multiprocessors   
    Juan C. Pichel, Dora B. Heras, José C. Cabaleiro and Francisco F. Rivera.
    12th Euromicro Conference on Parallel, Distributed and Network based Processing (PDP). A Coruña, Galicia, February 2004.
  43. Algoritmo Paralelo de Segmentación de Imágenes Basado en el Crecimiento Desacoplado de Regiones
    Juan C. Pichel, David E. Singh and Francisco F. Rivera.
    Conferencia Iberoamericana en Sistemas, Cibernética e Informática (CISCI). pp. 134-139. Orlando, USA, July 2002.

Preprints

  1. OMP4Py: a pure Python implementation of OpenMP     
    César Piñeiro and Juan C. Pichel
    arXiv:2411.14887, 2024.
  2. NetQIR: An Extension of QIR for Distributed Quantum Computing   
    Jorge Vázquez-Pérez, F. Javier Cardama, César Piñeiro, Tomás F. Pena, Juan C. Pichel and Andrés Gómez
    arXiv:2408.03712, 2024.
  3. Review of Distributed Quantum Computing. From single QPU to High Performance Quantum Computing   
    David Barral, F. Javier Cardama, Guillermo Díaz, Daniel Faílde, Iago F. Llovo, Mariamo Mussa Juane, Jorge Vázquez-Pérez, Juan Villasuso, César Piñeiro, Natalia Costas, Juan C. Pichel, Tomás F. Pena and Andrés Gómez
    arXiv:2404.01265, 2024.
  4. Search Engines, LLMs or Both? Evaluating Information Seeking Strategies for Answering Health Questions   
    Marcos Fernández-Pichel, Juan C. Pichel and David E. Losada
    arXiv:2407.12468, 2024.
  5. A unified framework to improve the interoperability between HPC and Big Data languages and programming models     
    César Piñeiro and Juan C. Pichel
    arXiv:2112.00467, 2021.
  6. Polypus: a Big Data Self-Deployable Architecture for Microblogging Text Extraction and Real-Time Sentiment Analysis     
    Rodrigo Martínez-Castaño, Juan C. Pichel and Pablo Gamallo
    arXiv:1801.03710, 2018.

Book chapters

  1. A Parallel Framework for Image Segmentation Using Region Based Techniques   
    Juan C. Pichel, David E. Singh and Francisco F. Rivera
    Vision Systems: Segmentation and Pattern Recognition, edited by Goro Obinata and Ashish Dutta, 2007.

Software

BigSeqKit

  

BigSeqKit is a parallel toolkit for manipulating FASTA/Q files at scale, designed with speed and scalability at its core. BigSeqKit takes advantage of an HPC-Big Data framework (IgnisHPC) to parallelize and optimize the commands included in seqkit. As a result, in most cases it is tens to hundreds of times faster than other state-of-the-art tools such as seqkit, samtools and pyfastx. At the same time, the tool is easy to install and use on any kind of hardware platform (single server or cluster). Routines in BigSeqKit can be used as a bioinformatics library or from the command line. To improve usability and ease adoption, BigSeqKit implements the same command interface as seqkit.
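To give a feel for the kind of FASTA manipulation that seqkit-style commands perform, here is a minimal serial sketch in plain Python. It is not BigSeqKit's API and does not use IgnisHPC; the function names `parse_fasta` and `length_stats` are hypothetical and chosen only for this illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)


def length_stats(records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Compute simple per-file statistics over sequence lengths."""
    lengths = [len(seq) for _, seq in records]
    if not lengths:
        return {"num_seqs": 0, "sum_len": 0, "min_len": 0, "max_len": 0}
    return {
        "num_seqs": len(lengths),
        "sum_len": sum(lengths),
        "min_len": min(lengths),
        "max_len": max(lengths),
    }


# Hypothetical two-record input, used only for illustration.
example = """>seq1 sample
ACGT
ACG
>seq2
TT
""".splitlines()

stats = length_stats(parse_fasta(example))
print(stats)
```

A parallel toolkit would distribute records like these across workers; this sketch only shows the per-record logic on a single core.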

Citation:
César Piñeiro and Juan C. Pichel. BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale. GigaScience, Vol. 12, 2023.

PyPlexity

    

This package provides a simple interface for applying perplexity filters to any text document. One possible use case is the removal of boilerplate (sentences with a high perplexity score): ads, incomplete or noisy text, and remnants of the navigation structure such as menus or navigation bars. It also provides a rough HTML tag cleaner and a bulk processor for WARC and HTML files, with distributed capabilities.
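The idea of filtering by perplexity can be sketched with a toy example. The code below is not PyPlexity's interface and does not reflect the language models it actually uses; it assumes a unigram model with add-one smoothing, and the names `train_unigram`, `perplexity` and `remove_boilerplate`, as well as the threshold value, are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def train_unigram(corpus: List[str]) -> Tuple[Counter, int]:
    """Count word frequencies in a small reference corpus."""
    counts = Counter(w for sentence in corpus for w in sentence.lower().split())
    return counts, sum(counts.values())


def perplexity(sentence: str, counts: Counter, total: int) -> float:
    """Perplexity of a sentence under an add-one smoothed unigram model."""
    words = sentence.lower().split()
    if not words:
        return float("inf")
    vocab = len(counts) + 1  # one extra slot reserved for unseen words
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / len(words))


def remove_boilerplate(sentences: List[str], counts: Counter, total: int,
                       threshold: float) -> List[str]:
    """Keep only sentences whose perplexity does not exceed the threshold."""
    return [s for s in sentences if perplexity(s, counts, total) <= threshold]


# Tiny illustrative corpus and threshold (both are assumptions).
counts, total = train_unigram(["the cat sat on the mat", "the dog sat on the rug"])
page = ["the cat sat on the rug", "home login share subscribe"]
kept = remove_boilerplate(page, counts, total, threshold=10.0)
print(kept)
```

Sentences made of words the model has rarely or never seen, such as menu labels, receive a high perplexity and are dropped, while text resembling the reference corpus is kept.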

Citation:
Marcos Fernández-Pichel, Manuel Prada-Corral, David E. Losada, Juan C. Pichel and Pablo Gamallo. An Unsupervised Perplexity-based Method for Boilerplate Removal. Natural Language Engineering, Vol. 30, 2024.

IgnisHPC

    

IgnisHPC is a framework whose main objective is to unify the execution of Big Data and HPC workloads in the same computing engine. IgnisHPC has native support for multi-language applications using both JVM and non-JVM-based languages; it currently supports C, C++, Python, Go and Java. Because MPI is its backbone technology, IgnisHPC allows MPI applications and libraries to be executed directly and efficiently within the framework. As a result, users can combine HPC tasks (using MPI) with Big Data tasks (using MapReduce operations) in the same multi-language code. The experimental evaluation demonstrates the benefits of our proposal in terms of performance and productivity with respect to other frameworks such as Spark. For example, on a 12-node cluster with 2 × Intel Xeon E5-2630v4 (2.2 GHz, 10 cores) per node, the experimental results show that:

Application | Times faster than Spark
Minebench | 3.87x [Python & C++], 1.26x [Python]
TeraSort | 1.76x [C++], 1.35x [Python]
K-Means | 1.94x [Python & C++]
PageRank | 1.10x [Python]
Transitive Closure | 1.12x [Python]

IgnisHPC is publicly available for the Big Data and HPC research community.

Citation:
César Piñeiro and Juan C. Pichel. A Unified Framework to Improve the Interoperability between HPC and Big Data Languages and Programming Models. Future Generation Computer Systems, Vol. 134, 2022.

VeryFastTree

  

VeryFastTree is a tool designed for efficient phylogenetic tree inference, specifically tailored to handle massive taxonomic datasets. It is a highly tuned implementation based on the FastTree-2 tool that takes advantage of parallelization and vectorization strategies to speed up the inference of phylogenies for huge alignments. As an example of its performance, VeryFastTree (v4.0, July 2023) can construct a tree from an ultra-large alignment of one million taxa in 36 hours on one server (two 32-core Intel Xeon Ice Lake 8352Y processors) using single-precision arithmetic, while our previous version (v3.0) and FastTree-2 each require more than 5 days. That is, VeryFastTree-4.0 is more than 3x faster than both VeryFastTree-3.0 and FastTree-2.
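The speedup figure follows directly from the reported run times. The short sketch below redoes that arithmetic, treating "more than 5 days" as a lower bound of exactly 5 days; the helper name `speedup` is hypothetical.

```python
HOURS_PER_DAY = 24


def speedup(baseline_hours: float, new_hours: float) -> float:
    """Ratio of the baseline run time to the new run time."""
    return baseline_hours / new_hours


# Lower bound for v3.0 and FastTree-2: "more than 5 days" taken as 5 days.
baseline_hours = 5 * HOURS_PER_DAY
v4_hours = 36  # reported run time of VeryFastTree v4.0

min_speedup = speedup(baseline_hours, v4_hours)
print(f"Speedup of at least {min_speedup:.2f}x")
```

Since the baseline tools need more than 120 hours, the ratio against 36 hours is above 3.3, which is consistent with the "more than 3x" statement.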

VeryFastTree is available as a package in Bioconda, MacPorts and Debian Linux distributions. It also has Python bindings.

Citations:
César Piñeiro and Juan C. Pichel. Efficient phylogenetic tree inference for massive taxonomic datasets: harnessing the power of a server to analyze 1 million taxa. GigaScience, Vol. 13, pages 1-12, 2024.
César Piñeiro, José M. Abuín and Juan C. Pichel. VeryFastTree: speeding up the estimation of phylogenies for large alignments through parallelization and vectorization strategies. Bioinformatics, Vol. 36, Issue 17, pages 4658-4659, 2020.

PASTASpark

  

PASTASpark is a tool that uses the Big Data engine Apache Spark to boost the performance of the alignment phase of PASTA (Practical Alignments using SATé and TrAnsitivity). PASTASpark guarantees scalability and fault tolerance, and makes it possible to obtain multiple sequence alignments (MSAs) from very large datasets in a reasonable time.

Citation:
José M. Abuín, Tomás F. Pena and Juan C. Pichel. PASTASpark: multiple sequence alignment meets Big Data. Bioinformatics, Vol. 33, Issue 18, pp. 2948-2950, 2017.

SparkBWA

  

SparkBWA is a tool that exploits the capabilities of a Big Data technology such as Apache Spark to boost the performance of one of the most widely adopted DNA sequence aligners, the Burrows-Wheeler Aligner (BWA).

Citation:
José M. Abuín, Juan C. Pichel, Tomás F. Pena and Jorge Amigo. SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data. PLoS ONE, Vol. 11, Issue 5, pp. 1-21, 2016.

BigBWA

  

BigBWA makes it possible to run the Burrows-Wheeler Aligner (BWA) on an Apache Hadoop cluster.

Citation:
José M. Abuín, Juan C. Pichel, Tomás F. Pena and Jorge Amigo. BigBWA: Approaching the Burrows-Wheeler Aligner to Big Data Technologies. Bioinformatics, Vol. 31, Issue 24, pp. 4003-4005, 2015.

Projects

Here is a list of some of the most recent research projects in which I am or have been involved:

C3HS: Content curation for consumer health search - Search and misinformation detection
Funded by Ministerio de Economía y Competitividad (PID2022-137061OB-C22)
Period: Sep 2023 - Aug 2026

HYBRIDS: Hybrid Intelligence to monitor, promote and analyse transformations in good democracy practices
Funded by Horizon Europe, Marie Skłodowska-Curie Actions (MSCA), Doctoral Networks, European Union (101073351)
Period: Jan 2023 - Dec 2026

Big-eRisk: Early Prediction of Personal Risks on Massive Data
Funded by Ministerio de Economía y Competitividad (PLEC2021-007662)
Period: Nov 2021 - Nov 2024

eRISK: Technologies for the early prediction of signs related with psychological disorders
Funded by Ministerio de Economía, Industria y Competitividad (RTI2018-093336-B-C21)
Period: Jan 2019 - Dec 2021

Contact

Juan Carlos Pichel
CITIUS (Universidade de Santiago de Compostela)
Rúa de Jenaro de la Fuente
15782 Santiago de Compostela (Spain)
juancarlos.pichel@usc.es
   +34 881816437