I noticed that the provided data has already been processed. Some of the h5ad/rds files don’t include raw counts, and genes have been filtered; in some datasets, only about 1200 genes were retained. I believe the analysis needs at least the raw counts for all genes. Could you check the data?
Thank you for reaching out! For some datasets, the original authors provided raw data which we processed in-house, while for others, only pre-processed data was made publicly available. In the latter case, we are unfortunately limited to what was available to us.
Update (April 17): We are internally going over each dataset to check which ones have only processed data available and which have both raw and processed data. We will update you next week.
This is also a problem for us: we rely on raw count data for our labeling. Also, is there a way to know how the data has been processed in the h5ad objects?
Thanks so much for your feedback, and sorry for the slow reply on our end; we wanted to take the time to properly gather and implement everything from Webinar 1 before getting back to you.
We have made raw counts available for all studies except infection_study_07, for which we unfortunately couldn’t arrange raw count data. Teams that had been assigned infection_study_07 have been given a different study to work with, and we have opened up infection_study_07 to everyone as a bonus. If you’re able to take a look at it, fantastic, but there is no pressure; it’s entirely optional.
To keep things comparable and reproducible across methods, we have applied a standardized QC pipeline to all datasets in the benchmark. Specifically, we applied the following steps: i) removed cells with fewer than 400 detected genes to exclude low-quality or empty droplets; ii) removed cells with total UMI counts ≤200 to exclude barcodes with insufficient sequencing depth; iii) identified mitochondrial genes by the “MT-” prefix and calculated per-cell mitochondrial read percentages; and iv) filtered out cells with ≥15% mitochondrial reads, since elevated mitochondrial content usually points to damaged or dying cells that have lost cytoplasmic RNA.
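The four steps above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the pipeline code that was actually run: it assumes a dense cells x genes matrix of raw counts and a matching list of gene symbols, and the function name `qc_filter` and its parameter names are hypothetical.

```python
import numpy as np


def qc_filter(counts, gene_names, min_genes=400, min_umis=200, max_mito_pct=15.0):
    """Return (keep_mask, mito_pct) for a dense cells x genes raw count matrix.

    Hypothetical helper: thresholds mirror the steps described in the post,
    but names and structure are assumptions made for illustration.
    """
    counts = np.asarray(counts)

    # i) number of genes with at least one count in each cell
    genes_per_cell = (counts > 0).sum(axis=1)

    # ii) total UMI count per cell
    umis_per_cell = counts.sum(axis=1)

    # iii) mitochondrial genes are identified by the "MT-" prefix
    is_mito = np.array([str(g).startswith("MT-") for g in gene_names], dtype=bool)
    mito_counts = counts[:, is_mito].sum(axis=1)
    # guard against division by zero for barcodes with no counts at all
    mito_pct = 100.0 * mito_counts / np.maximum(umis_per_cell, 1)

    # iv) combine the filters:
    #   drop cells with fewer than min_genes detected genes  -> keep >= min_genes
    #   drop cells with total UMIs <= min_umis               -> keep >  min_umis
    #   drop cells with mito percentage >= max_mito_pct      -> keep <  max_mito_pct
    keep = (
        (genes_per_cell >= min_genes)
        & (umis_per_cell > min_umis)
        & (mito_pct < max_mito_pct)
    )
    return keep, mito_pct
```

With an h5ad file, the matrix and gene symbols would typically come from the loaded object (for example, a densified `adata.X` and `adata.var_names`), and `counts[keep]` would then give the filtered cells. For large sparse matrices, a sparse-aware implementation would be preferable to this dense sketch.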