Since Android became the leading smartphone operating system, malware developers have invested considerable effort in crafting new threats, which they upload to the Google Play store and to third-party marketplaces. Companies and researchers now include the analysis of smartphone malware in their activities. Most of the time, the problem addressed is deciding whether an application should be considered malware. Nevertheless, once an application has been flagged as malicious, infected users ask for more technical explanations of the threat they have been exposed to. Dissecting a malware sample requires considerable effort from a security analyst, and companies need new tools to automate the analysis. From a research perspective, testing new ideas about malware analysis requires performing experiments on malware datasets. Compared to other operating systems, Android has a fast development cycle, with a new major release each year. Many malware samples no longer run on new versions of Android. Experiments from the literature that study only a few samples therefore quickly become outdated and non-reproducible. Working on larger datasets, built at the time of writing, may thus give more consistent experimental results. Using such datasets raises new challenges. First, because the behavior of the samples is unknown, the results of the experiments are difficult to evaluate. Second, the experiment itself may require a large amount of time, depending on the quality of the automation and the complexity of the analysis. Third, the protections that developers put into the malware decrease the quality of the results. This paper discusses these challenges and describes our efforts to build reliable, large-scale experiments.